Wiki: Site monitoring

notes

Please note all files here are under the CeCILL license unless otherwise specified in the files.
The CeCILL license can be found in the repository and is compatible with the GNU GPL.
Not all files were written by the same people; some were simply copied from Nagios Exchange for packaging purposes.

package file descriptions (run each script with --help or -h for details):

tests/: test files, reserved for future tests (maybe)
check_adaptec_storman.pl: uses Adaptec's arcconf tool to check device status
check_cciss.sh: checks cciss devices for RAID faults. REQUIRES sudo
check_CE_endedjobs: checks the PBS accounting data in /var/spool/pbs/server_priv/accounting for black hole nodes. REQUIRES sudo
check_crk.sh: uses the crk-XX scripts to check for hidden processes (such as phalanx?)
check_generic.sh: wrapper script that lets you pass in another check script, giving a uniform monitoring command
check_gridbdii_cluster_freecpu.pl: attempts to monitor cluster usage through LDAP info, but needs rewriting
check_hidden_procs.sh: same as check_crk.sh, but uses a different, slower method
check_iptables_drop: checks whether packets were dropped on port X
check_iptables_inputchain.sh: checks that the firewall is running and contains "enough" rules
check_localhostcert: checks the remaining lifetime of (local) certificates and warns when they are about to expire
check_logtail_regexp: searches the last lines of a log for a regexp. MAY use sudo
check_mem.pl: checks memory and swap usage
check_netio.sh: DISPLAYS network interface status (useful for graphs)
check_nrpe_cmds: USES nrpe_listcmds (through NRPE) on the remote host to compare its NRPE commands with the Nagios ones and check for inconsistencies
check_partspace: checks partition usage. Use the standard check_disk instead
check_quattor: checks that the quattor daemons are running. REQUIRES sudo because of init script bugs
check_remotecert: attempts to check a certificate remotely by connecting to a service port
check_rw_filesystems.sh: checks most mounted filesystems on a node and bails out if a filesystem mounted RW is not writable. REQUIRES sudo
check_smart: checks all /dev/sd([a-z]|a[a-z]) devices for SMART status. REQUIRES sudo
check_ssh_knownhosts: checks an lcg-CE knownhosts file for out-of-date WN SSH keys, which helps avoid black holes
crk-0.8.7: see check_crk.sh
crk-0.8.8: see check_crk.sh
framework.sh: bash framework COPY/PASTED into most bash scripts here
Makefile: run make or make rpm to create a noarch rpm
nagios-grid-plugins.spec: spec file
nrpe_listcmds: lists all commands available to NRPE in the main config file
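
Nagios plugins conventionally report their state through the exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) together with one line of output. The following is a minimal sketch of that convention for a simple threshold check. It is not taken from any script in this package: threshold_status and its three arguments are hypothetical names chosen for illustration, and it assumes non-negative integer values where higher is worse.

```shell
#!/bin/sh
# Hypothetical helper (not part of this package): maps a measured value
# to a Nagios state using the conventional plugin exit codes
#   0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN

threshold_status() {
    # $1 = measured value, $2 = warning threshold, $3 = critical threshold
    # Reject missing or non-numeric arguments as UNKNOWN.
    for n in "$1" "$2" "$3"; do
        case "$n" in
            ''|*[!0-9]*)
                echo "UNKNOWN - expected three non-negative integers"
                return 3
                ;;
        esac
    done

    # Test the critical threshold first so it takes precedence over warning.
    if [ "$1" -ge "$3" ]; then
        echo "CRITICAL - value $1 >= $3"
        return 2
    elif [ "$1" -ge "$2" ]; then
        echo "WARNING - value $1 >= $2"
        return 1
    else
        echo "OK - value $1"
        return 0
    fi
}
```

A calling script would typically end with something like "threshold_status "$used_percent" 80 90; exit $?", where used_percent is whatever the check measured, so that the state reaches Nagios through the exit code.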

what must be monitored

| Node type | Test type | Who + link URL | Validated by |
| WNs | NFS mounts failing: check that files can be written and read, and in particular that VO_[VONAME]_SW_DIR is readable. | GRIF with scripting | Not validated |
| WNs | SSH keys: some modes of operation require unchallenged ssh between the WNs and the CE, or among the WNs for MPI. A first simple check is to verify that the WN can copy a file back to the CE. | GRIF with Nagios | Not validated |
| WNs | Processes running on each WN: check that the needed processes (ntpd, pbs, etc.) are running and that unwanted ones (rogue processes, stuck jobs) are not. | ? | Not validated |
| All service nodes | Host certificates expiring: make sure they get renewed in good time. | ? | Not validated |
| All nodes | CRLs expiring: this can cause failures for certificates from a single CA, which can be hard to diagnose. | ? | Not validated |
| All nodes | Filesystem in read-only mode. | GRIF with Nagios | Not validated |
| All nodes | Node crashes. | GRIF with Nagios | Not validated |
| All nodes | Disks becoming full, or nearly so: in particular check that jobs are not filling /tmp, the home directories or other scratch space. Also check that disks do not run out of inodes. | GRIF with Nagios | Not validated |
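
As an illustration of the "filesystem in read-only mode" item, here is a minimal sketch that lists the mount points currently mounted read-only. It takes a different, simpler approach than check_rw_filesystems.sh, which tests that filesystems are actually writable: this version only inspects the mount options. list_ro_mounts is a hypothetical helper name, and the input is assumed to be in the /proc/mounts format (device, mount point, type, options, dump, pass).

```shell
#!/bin/sh
# Hypothetical helper (not part of this package): reads a mount table in
# /proc/mounts format on stdin and prints the mount point of every entry
# whose option list contains the exact option "ro".

list_ro_mounts() {
    while read -r _dev mnt _type opts _rest; do
        # Wrap the option list in commas so that only a whole "ro" option
        # matches, and not substrings such as "errors=remount-ro".
        case ",$opts," in
            *,ro,*) echo "$mnt" ;;
        esac
    done
}
```

On a Linux node this could be run as "list_ro_mounts < /proc/mounts". Filesystems that are read-only on purpose (for example CD-ROM or squashfs mounts) are listed too, so a real check would need an exclusion list before raising an alert.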