Bug #10490


watchdog stopped

Added by Chambon Bernard over 6 years ago. Updated about 6 years ago.

On Friday 2015/07/24 around 17:55, on cctrreqsxrrotd, the watchdog stopped.
It seems there was no impact on the service, but a Nagios alert was issued and PEB restarted treqs.

Actions #1

Updated by Chambon Bernard over 6 years ago

More details:

The last DB update from the watchdog:

2015-07-24 17:55:50,756 [WrapperSimpleAppMain] DEBUG MySQLBroker - Query: 'UPDATE heart_beat SET last_time = NOW()'
2015-07-24 17:55:50,756 [WrapperSimpleAppMain] INFO  MySQLBroker - Mysql access duration (MySQLBroker.executeModification method) took 0 ms (0 s) 

The app was then restarted (by PEB):
2015-07-24 18:45:00,126 [WrapperSimpleAppMain] DEBUG MySQLBroker - Query: 'INSERT INTO heart_beat (pid, start_time, last_time) VALUES (?, NOW(), NOW())'

At the same time there were deadlocks on the MySQL DB, which is probably why the watchdog stopped.

Looking at the code, I saw that if the watchdog fails:
1) MySQL exceptions (if any) are trapped, but no message is logged
2) on an exception, the code exits the loop


        try {
            while (this.cont) {
                LOGGER.debug("Sleeping for {} milliseconds", sleep);
                // ... rest of the loop body elided
            }
        } catch (final InterruptedException e) {

What to do?
1) Get rid of the watchdog, or keep it for the Nagios sensor?
2) Trap exceptions inside the watchdog loop and exit only after N exceptions, with an ad-hoc message so that JSW restarts the app? (I'm working on that)
3) Try to get rid of the MySQL deadlock (I'm also working on that)

Actions #2

Updated by Chambon Bernard over 6 years ago

  • Status changed from New to In progress

Changing exception management for the watchdog functionality:
o Trap exceptions inside the watchdog loop
o Exit only after N exceptions, with an ad-hoc message for JSW usage (app restart)
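
The two points above could be sketched roughly as follows. This is only an illustration, not the actual treqs code: the class name WatchdogSketch, the Runnable heartBeat, the marker string WATCHDOG_FATAL, and the choice to count consecutive (rather than total) failures are all assumptions.

```java
/** Hypothetical watchdog loop; names and thresholds are illustrative, not taken from treqs. */
final class WatchdogSketch {
    private final Runnable heartBeat;   // assumed to run the heart_beat UPDATE
    private final int maxFailures;      // the "N" mentioned in this note
    private final long sleepMillis;
    private volatile boolean cont = true;

    WatchdogSketch(Runnable heartBeat, int maxFailures, long sleepMillis) {
        this.heartBeat = heartBeat;
        this.maxFailures = maxFailures;
        this.sleepMillis = sleepMillis;
    }

    void stop() {
        cont = false;
    }

    /** Returns the failure count that ended the loop, or 0 if stopped or interrupted. */
    int run() {
        int failures = 0;
        while (cont) {
            try {
                heartBeat.run();
                failures = 0; // a successful beat resets the counter (assumption)
            } catch (Exception e) {
                // Trapped inside the loop, so a single failure no longer ends the watchdog.
                failures++;
                System.err.println("Watchdog heartbeat failed ("
                        + failures + "/" + maxFailures + "): " + e);
                if (failures >= maxFailures) {
                    // Hypothetical marker line that a service wrapper could react to.
                    System.err.println("WATCHDOG_FATAL: too many failures, restart required");
                    return failures;
                }
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return 0;
            }
        }
        return 0;
    }
}
```

With a heartbeat that always throws and maxFailures set to 3, run() logs each of the three failures and then returns, instead of stopping silently on the first one.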

Actions #3

Updated by Chambon Bernard about 6 years ago

  • Status changed from In progress to Resolved

The last 'watchdog stop' was due to a ConcurrentModificationException
that wasn't caught in the right place (outside the forever loop).
Catching a higher-level exception and emitting an ad-hoc message for JSW fixes the bug.
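
For reference, java.util.ConcurrentModificationException is an unchecked exception (a RuntimeException), so a handler written only for checked exceptions such as InterruptedException does not cover it, while a handler for Exception does. A minimal sketch, where the class CmeDemo and its list are purely illustrative and unrelated to the treqs code:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class CmeDemo {
    /** Modifies a list while iterating over it, which makes the iterator fail fast. */
    static String provoke() {
        List<String> items = new ArrayList<>(Arrays.asList("a", "b", "c"));
        try {
            for (String item : items) {
                items.remove(item); // structural modification during iteration
            }
            return "no exception";
        } catch (Exception e) {
            // The higher-level handler also covers unchecked exceptions.
            return e.getClass().getSimpleName()
                    + (e instanceof RuntimeException ? " (unchecked)" : " (checked)");
        }
    }
}
```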

