Task #9815

closed

Merge Watchdog features in JSW

Added by Brinette Pierre-Emmanuel about 10 years ago. Updated almost 10 years ago.

Status: Closed
Priority: High
Assigned To:
Category: Server side
Start date: 05/05/2015
Due date:
% Done: 0%
Estimated time:

Description

Merge the treqs Watchdog capabilities into JSW.

Currently, the watchdog script is run as root by cron on cctres[xrootd] and checks the status in the database.

See the "/opt/jtreqs/bin/watchdog.sh" script on cctreqs:

# watchdog 
*/5 * * * * /opt/jtreqs/bin/watchdog.sh -db | tee -a /var/log/jtreqs/watchdog.log
Actions #1

Updated by Chambon Bernard about 10 years ago

  • Tracker changed from Bug to Task
Actions #2

Updated by Chambon Bernard about 10 years ago

  • Status changed from Assigned to In progress
I looked at the following example (restart requested by the watchdog on 30 April at around 9:00).

 in watchdog.log
2015-04-30_09:05:01 Stopping jTReqS: [FAILED]
2015-04-30_09:05:31 Starting jTReqS: [  OK  ]

 in jtreqs.log
2015-04-30 09:00:42,847 [Activator] WARN  f.i.c.s.t.c.activator.Activator - Activator Stopped
2015-04-30 09:01:06,500 [tape-1430376533639-7-KT000700-2884669] INFO  f.i.c.s.treqs.tools.Instantiator - Class to instantiate fr.in2p3.cc.storage.treqs.hsm.hpssJNI.HPSSJNIBridge
2015-04-30 09:05:31,629 [main] INFO  f.i.c.s.t.control.starter.Starter - Starting Server

 in the console
2015-04-22 10:36:34,260 [tape-1429691794228-2-IT769700-2879475] WARN  General error. Retrying /hpss/in2p3.fr/group/km3net/mc/prod/v4/JTE/km3_v4_anueNC_179.JTE.root
2015-04-30 09:00:42,847 [Activator] WARN  Activator Stopped
2015-04-30 09:01:16,929 [main] ERROR No heartbeat, the application is dying.
jTReqS-Server started
2015-04-30 09:05:31,694 [main] ERROR Version: jTReqS Server 1.6.1-SNAPSHOT

As a reminder, the Java code loops on
 while (cont) 
    Watchdog.getInstance().heartBeat();

 See the heartBeat() code ... in tools/Watchdog.java
 I would say that if the heartbeat is not updated, we end up in the else branch of "if (activator && dispatcher)", and so the console shows the message "No heartbeat, the application is dying".
  We could put this message in the JSW configuration to request a restart of TREQS,
  along these lines:

# Restart the JVM when the following message appears on stdout
wrapper.filter.trigger.2=No heartbeat, the application is dying
wrapper.filter.action.2=RESTART

# This property controls the number of seconds to pause between 
# a JVM exiting for any reason, and a new JVM being launched. 
# The default value is "5 seconds".
wrapper.restart.delay=30

To be tested!
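
To make the reasoning about heartBeat() concrete, here is a minimal Java sketch of the kind of check described in this comment. It is not the actual tools/Watchdog.java code: the class name HeartbeatSketch, the timestamp fields, the maxAgeMillis threshold and the explicit "now" parameter are all assumptions made for illustration. Only the "activator && dispatcher" condition and the console message come from this ticket.

```java
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical sketch of a heartbeat check; NOT the real jTReqS Watchdog.
 * All names except the printed message are assumptions.
 */
final class HeartbeatSketch {
    // Assumed threshold: maximum age (ms) of a component's last sign of life.
    private final long maxAgeMillis;
    // Assumed bookkeeping: last time each component reported in.
    private final AtomicLong activatorLastSeen;
    private final AtomicLong dispatcherLastSeen;

    HeartbeatSketch(long maxAgeMillis, long now) {
        this.maxAgeMillis = maxAgeMillis;
        this.activatorLastSeen = new AtomicLong(now);
        this.dispatcherLastSeen = new AtomicLong(now);
    }

    // Each component would call its method from its own loop.
    void activatorAlive(long now) { activatorLastSeen.set(now); }
    void dispatcherAlive(long now) { dispatcherLastSeen.set(now); }

    /**
     * Returns true when both components reported recently enough.
     * Otherwise prints the message that the wrapper filter is meant to match.
     */
    boolean heartBeat(long now) {
        boolean activator = now - activatorLastSeen.get() <= maxAgeMillis;
        boolean dispatcher = now - dispatcherLastSeen.get() <= maxAgeMillis;
        if (activator && dispatcher) {
            return true;
        } else {
            // Must contain the exact text configured in wrapper.filter.trigger.2.
            System.out.println("No heartbeat, the application is dying.");
            return false;
        }
    }
}
```

Passing the current time explicitly keeps the sketch easy to exercise without sleeping. In a real loop the caller would pass System.currentTimeMillis(), and the filter configured in JSW would see the printed line and restart the JVM.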

Actions #3

Updated by Chambon Bernard almost 10 years ago

  • Status changed from In progress to Closed