Wiki: Site Monitoring
Notes
Please note all files here are under the CeCILL license unless otherwise specified in the files.
The CeCILL license can be found in the repository and is compatible with the GNU GPL.
Not all the files were written by the same people, and some were simply copied from Nagios Exchange for packaging purposes.
Package file descriptions (run each script with --help or -h for details):
tests/ | test files, for possible future tests |
check_adaptec_storman.pl | uses Adaptec's arcconf tool to check device status |
check_cciss.sh | checks cciss devices for RAID faults. REQUIRES sudo |
check_CE_endedjobs | checks PBS /var/spool/pbs/server_priv/accounting for black hole nodes. REQUIRES sudo |
check_crk.sh | uses crk-XX scripts to check for hidden processes (such as phalanx?) |
check_generic.sh | wrapper script that lets you send another check script and provides a uniform monitoring command |
check_gridbdii_cluster_freecpu.pl | attempt to monitor cluster usage through LDAP info, but needs rewriting |
check_hidden_procs.sh | same as check_crk, but using a different and slower method... |
check_iptables_drop | checks if packets were dropped on port X |
check_iptables_inputchain.sh | checks if firewall is running and contains "enough" rules |
check_localhostcert | checks the remaining lifetime of (local) certificates, and warns if they are about to expire |
check_logtail_regexp | searches the last lines of a log for a regexp. MAY use sudo. |
check_mem.pl | checks memory and swap usage |
check_netio.sh | DISPLAYS network interface status (useful for graphs) |
check_nrpe_cmds | USES nrpe_listcmds (through NRPE...) on the remote host to compare its NRPE commands with the Nagios ones, and checks for inconsistencies |
check_partspace | checks for partition usage - use standard check_disk instead. |
check_quattor | checks if quattor daemons are running. REQUIRES sudo, because of init script bugs |
check_remotecert | attempts to check a certificate remotely, by connecting to a service port |
check_rw_filesystems.sh | checks a node for most mounted filesystems, and bails out if a mounted RW filesystem is not writable. REQUIRES sudo |
check_smart | checks all /dev/sd[a-z] and /dev/sda[a-z] devices for SMART status. REQUIRES sudo |
check_ssh_knownhosts | checks an lcg-CE knownhosts file for out-of-date WN SSH keys - avoids black holes... |
crk-0.8.7 | see check_crk.sh |
crk-0.8.8 | see check_crk.sh |
framework.sh | bash framework COPY/PASTED in most bash scripts here |
Makefile | run make or make rpm to create a noarch rpm |
nagios-grid-plugins.spec | spec file |
nrpe_listcmds | lists all commands available to nrpe in the main config file |
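To illustrate the kind of test described for check_rw_filesystems.sh, here is a minimal Python sketch. It is not the packaged script's implementation. It assumes /proc/mounts-style input, and the names `rw_mountpoints`, `is_writable`, `check` and the `SKIPPED_TYPES` list are hypothetical, chosen for this example only. The exit codes follow the usual Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
import tempfile

# Usual Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Pseudo filesystems to skip. This list is an assumption made for the
# example; it is not taken from check_rw_filesystems.sh.
SKIPPED_TYPES = {"proc", "sysfs", "devpts", "tmpfs", "devtmpfs", "cgroup"}


def rw_mountpoints(mounts_text):
    """Return mount points mounted read-write, given /proc/mounts-style text."""
    points = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # ignore malformed or empty lines
        _device, mountpoint, fstype, options = fields[:4]
        if fstype in SKIPPED_TYPES:
            continue
        if "rw" in options.split(","):
            points.append(mountpoint)
    return points


def is_writable(directory):
    """Try to create a temporary file in directory (removed automatically)."""
    try:
        with tempfile.TemporaryFile(dir=directory):
            return True
    except OSError:
        return False


def check(mounts_text):
    """Return (exit_code, message) in the usual Nagios plugin style."""
    failed = [m for m in rw_mountpoints(mounts_text) if not is_writable(m)]
    if failed:
        return CRITICAL, "CRITICAL: not writable: " + ", ".join(failed)
    return OK, "OK: all read-write filesystems are writable"
```

A caller would typically pass the contents of /proc/mounts to `check`, print the message, and exit with the returned code. Writing to every mount point is presumably why the real script requires sudo.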
What should be monitored
Node type | Test type | Who + link URL | Validated by |
WNs | NFS mounts failing - check that files can be written and read. In particular, check that VO_[VONAME]_SW_DIR is readable. | GRIF with scripting | Not validated |
WNs | SSH keys - some modes of operation require unchallenged SSH between the WNs and the CE, or among the WNs for MPI. A simple first check is to verify that the WN can copy a file back to the CE. | GRIF with Nagios | Not validated |
WNs | Check the processes running on each WN - verify that the required processes (ntpd, pbs, etc.) are running, and that unwanted ones (rogue processes, stuck jobs) are not | ? | Not validated |
All service nodes | Host certificates expiring - make sure they get renewed in good time | ? | Not validated |
All nodes | CRLs expiring - this can cause failures for certificates from a single CA, which can be hard to diagnose | ? | Not validated |
All nodes | Filesystem in read-only mode | GRIF with Nagios | Not validated |
All nodes | Node crashes | GRIF with Nagios | Not validated |
All nodes | Disks becoming full, or nearly so. In particular, check that jobs are not filling /tmp, the home directories or other scratch space. Also check that disks don't run out of inodes. | GRIF with Nagios | Not validated |
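The last row asks for both disk space and inode checks. As a hedged illustration, the sketch below reads both from `os.statvfs` (Unix only) and maps the worse of the two to a Nagios-style status. The 80% and 90% thresholds and the function names are assumptions made for this example, not values used by GRIF or by the standard check_disk plugin.

```python
import os

# Usual Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def usage_percent(total, available):
    """Percentage in use; 0.0 when the filesystem reports no total."""
    if total <= 0:
        return 0.0
    return 100.0 * (total - available) / total


def classify(percent, warn=80.0, crit=90.0):
    """Map a usage percentage to a status (thresholds are example values)."""
    if percent >= crit:
        return CRITICAL
    if percent >= warn:
        return WARNING
    return OK


def check_path(path, warn=80.0, crit=90.0):
    """Check space and inode usage of the filesystem holding path."""
    st = os.statvfs(path)
    space = usage_percent(st.f_blocks, st.f_bavail)   # blocks usable by non-root
    inodes = usage_percent(st.f_files, st.f_favail)   # inodes usable by non-root
    status = max(classify(space, warn, crit), classify(inodes, warn, crit))
    message = "%s: %.1f%% space used, %.1f%% inodes used" % (path, space, inodes)
    return status, message
```

Calling `check_path("/tmp")` or the same function on a home or scratch directory covers the locations named in the table. Some filesystems report zero total inodes, which `usage_percent` treats as 0% used.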