Bug #10490

closed

watchdog stopped

Added by Chambon Bernard about 6 years ago. Updated about 6 years ago.

Status: Resolved
Priority: High
Assigned To: -
Category: -
Start date: 07/27/2015
Due date:
% Done: 0%
Estimated time:

Description

On Friday 2015/07/24 around 17:55, on cctrreqsxrrotd, the watchdog stopped.
It seems that there was no impact on the service, but a Nagios alert was issued and PEB restarted treqs.

Actions #1

Updated by Chambon Bernard about 6 years ago

More details:

The last DB update from the watchdog:


2015-07-24 17:55:50,756 [WrapperSimpleAppMain] DEBUG MySQLBroker - Query: 'UPDATE heart_beat SET last_time = NOW()'
2015-07-24 17:55:50,756 [WrapperSimpleAppMain] INFO  MySQLBroker - Mysql access duration (MySQLBroker.executeModification method) took 0 ms (0 s) 
...

 app was restarted (by PEB)
... 
2015-07-24 18:45:00,126 [WrapperSimpleAppMain] DEBUG MySQLBroker - Query: 'INSERT INTO heart_beat (pid, start_time, last_time) VALUES (?, NOW(), NOW())'

At the same time there was a deadlock on the MySQL DB, which is possibly the reason why the watchdog stopped?

Looking at the code, I saw that if the watchdog fails:
1) MySQL exceptions (if any) are trapped, but no message is logged
2) on any exception, the code exits the loop

    ...

            while (this.cont) {
                LOGGER.debug("Sleeping for {} milliseconds", sleep);
                Thread.sleep(sleep);
                Watchdog.getInstance().heartBeat();
            }
        } catch (final InterruptedException e) {

What to do?
1) Get rid of the watchdog, or keep it for the Nagios sensor?
2) Trap exceptions inside the watchdog loop and exit only after N exceptions, with an ad-hoc message so that JSW restarts the app? (I'm working on that)
3) Try to get rid of the MySQL deadlock (I'm also working on that)

Actions #2

Updated by Chambon Bernard about 6 years ago

  • Status changed from New to In progress

Changing exception management for the watchdog functionality:
o Trap exceptions inside the watchdog loop
o Exit only after N exceptions, with an ad-hoc message for JSW usage (app restart)
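
For illustration only, a minimal sketch of what such a loop could look like. This is not the actual treqs code: the class WatchdogLoop, the HeartBeat interface (a stand-in for Watchdog.getInstance().heartBeat()), the "WATCHDOG_FATAL" marker text and the choice to count consecutive failures (rather than total failures) are all assumptions.

```java
/** Hypothetical sketch: heartbeat loop that survives up to maxFailures consecutive failures. */
final class WatchdogLoop {

    /** Stand-in for Watchdog.getInstance().heartBeat(); assumed to throw only unchecked exceptions. */
    interface HeartBeat {
        void beat();
    }

    private final HeartBeat heartBeat;
    private final long sleepMillis;
    private final int maxFailures;
    private volatile boolean cont = true;

    WatchdogLoop(HeartBeat heartBeat, long sleepMillis, int maxFailures) {
        this.heartBeat = heartBeat;
        this.sleepMillis = sleepMillis;
        this.maxFailures = maxFailures;
    }

    /** Asks the loop to end after the current iteration. */
    void stop() {
        cont = false;
    }

    /**
     * Returns true if the loop ended normally (stop() or thread interruption),
     * false if it gave up after maxFailures consecutive heartbeat failures.
     */
    boolean run() {
        int consecutiveFailures = 0;
        while (cont) {
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // keep the interrupt flag for the caller
                return true;
            }
            try {
                // The catch is INSIDE the loop, so one failure does not end the watchdog.
                heartBeat.beat();
                consecutiveFailures = 0;
            } catch (RuntimeException e) {
                consecutiveFailures++;
                System.err.println("Watchdog heartbeat failed ("
                        + consecutiveFailures + "/" + maxFailures + "): " + e);
                if (consecutiveFailures >= maxFailures) {
                    // Hypothetical marker line that a supervisor such as JSW could react to.
                    System.err.println("WATCHDOG_FATAL: too many heartbeat failures, restart required");
                    return false;
                }
            }
        }
        return true;
    }
}
```

Catching RuntimeException here also covers exceptions such as ConcurrentModificationException, and every failure is logged before the loop decides whether to continue.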

Actions #3

Updated by Chambon Bernard about 6 years ago

  • Status changed from In progress to Resolved

The last 'watchdog stop' was due to a ConcurrentModificationException
that wasn't caught in the right place (outside the forever loop).
Catching a higher-level exception, plus an ad-hoc message for JSW, fixes the bug.
