Bug #10490
closed
watchdog stopped
0%
Description
On Friday 2015/07/24 around 17h55, on cctrreqsxrrotd, the watchdog stopped.
It seems there was no impact on the service, but a Nagios alert was issued and PEB restarted treqs.
Updated by Chambon Bernard over 9 years ago
More details :
The last DB update from watchdog :
2015-07-24 17:55:50,756 [WrapperSimpleAppMain] DEBUG MySQLBroker - Query: 'UPDATE heart_beat SET last_time = NOW()'
2015-07-24 17:55:50,756 [WrapperSimpleAppMain] INFO MySQLBroker - Mysql access duration (MySQLBroker.executeModification method) took 0 ms (0 s)
... app was restarted (by PEB) ...
2015-07-24 18:45:00,126 [WrapperSimpleAppMain] DEBUG MySQLBroker - Query: 'INSERT INTO heart_beat (pid, start_time, last_time) VALUES (?, NOW(), NOW())'
At the same time there were deadlocks on the MySQL DB, which is probably the reason why the watchdog stopped.
Looking at the code, I saw that if the watchdog fails:
1) MySQL exceptions (if any) are trapped, but no message is logged
2) on an exception, the code exits the loop
    ...
    while (this.cont) {
        LOGGER.debug("Sleeping for {} milliseconds", sleep);
        Thread.sleep(sleep);
        Watchdog.getInstance().heartBeat();
    }
    } catch (final InterruptedException e) {
What to do?
1) Get rid of the watchdog? Or keep it for the Nagios sensor?
2) Trap exceptions inside the watchdog loop and exit only after N exceptions, with an ad-hoc message to restart the app with JSW? (I'm working on that)
3) Try to get rid of the MySQL deadlocks (I'm also working on that)
Updated by Chambon Bernard over 9 years ago
- Status changed from New to In progress
Changing exception management for the watchdog functionality:
o Trap exceptions inside the watchdog loop
o Exit only after N exceptions, with an ad-hoc message for JSW usage (app restart)
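For illustration, the change described in this update could look roughly like the sketch below. It is not the actual treqs code: the `HeartBeat` interface stands in for `Watchdog.getInstance().heartBeat()`, and the names `WatchdogLoop`, `maxFailures` and the `FATAL_MARKER` text are hypothetical. The ticket does not say whether the N exceptions are consecutive or cumulative; the sketch assumes consecutive failures and resets the counter after each successful heartbeat.

```java
import java.util.ConcurrentModificationException;
import java.util.concurrent.atomic.AtomicInteger;

class WatchdogLoop {
    // Hypothetical stand-in for Watchdog.getInstance().heartBeat().
    interface HeartBeat {
        void beat() throws Exception;
    }

    // Hypothetical message; the real text that JSW reacts to is not given in the ticket.
    static final String FATAL_MARKER = "WATCHDOG FATAL: too many heartbeat failures";

    private volatile boolean cont = true;

    void stop() {
        cont = false;
    }

    // Runs until stopped, interrupted, or maxFailures consecutive failures.
    // Returns the number of consecutive failures seen when the loop ended.
    int run(HeartBeat heartBeat, long sleepMillis, int maxFailures) {
        int failures = 0;
        try {
            while (cont) {
                Thread.sleep(sleepMillis);
                try {
                    heartBeat.beat();
                    failures = 0; // a successful beat resets the counter
                } catch (Exception e) {
                    // Trap the failure inside the loop and log it instead of exiting silently.
                    failures++;
                    System.err.println("Heartbeat failed (" + failures + "/" + maxFailures + "): " + e);
                    if (failures >= maxFailures) {
                        System.err.println(FATAL_MARKER);
                        break;
                    }
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
        }
        return failures;
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        // Simulated heartbeat: succeeds twice, then fails on every call.
        int failures = new WatchdogLoop().run(() -> {
            if (calls.incrementAndGet() > 2) {
                throw new ConcurrentModificationException("simulated failure");
            }
        }, 1L, 3);
        System.out.println("calls=" + calls.get() + ", failures=" + failures);
    }
}
```

With this shape a single transient error (for example one MySQL deadlock) is logged and tolerated, and only a run of failures ends the loop with a message that an external supervisor such as JSW could act on.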
Updated by Chambon Bernard about 9 years ago
- Status changed from In progress to Resolved
The last 'watchdog stop' was due to a ConcurrentModificationException
that wasn't caught in the right place (outside the forever loop).
Changing to a higher-level exception + ad-hoc message for JSW fixes the bug.
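To illustrate why the exception level matters: `ConcurrentModificationException` is an unchecked `RuntimeException`, so a handler written for a narrower type lets it through. The sketch below is hypothetical (the class `CatchLevelDemo` and its methods are not from treqs, and the assumption that the original handler targeted `SQLException` is a guess based on the "Mysql exceptions" remark earlier in this ticket). It triggers a real `ConcurrentModificationException` by modifying a list while iterating over it.

```java
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CatchLevelDemo {
    // Simulated heartbeat that fails with a ConcurrentModificationException:
    // removing from an ArrayList during for-each iteration trips the fail-fast iterator.
    static void failingHeartBeat() throws SQLException {
        List<Integer> items = new ArrayList<>(Arrays.asList(1, 2, 3));
        for (Integer item : items) {
            items.remove(item);
        }
    }

    // Narrow handler (assumed): only SQLException is trapped, so the unchecked exception escapes.
    static String narrowCatch() {
        try {
            failingHeartBeat();
            return "ok";
        } catch (SQLException e) {
            return "trapped: " + e.getClass().getSimpleName();
        }
    }

    // Higher-level handler: Exception also covers RuntimeException subclasses.
    static String broadCatch() {
        try {
            failingHeartBeat();
            return "ok";
        } catch (Exception e) {
            return "trapped: " + e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(broadCatch());
        try {
            narrowCatch();
        } catch (RuntimeException e) {
            System.out.println("escaped: " + e.getClass().getSimpleName());
        }
    }
}
```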