Operations Dashboard

Global presentation

The dashboard is a tool for following and tracking problems at sites.
It is an integration platform that provides a synoptic display of several data sources:

  • Nagios, the monitoring tool and the official reference used to monitor sites
  • GOCDB, the static database of sites
  • GGUS, the EGI helpdesk

Tickets created by operators are part of the GGUS system. In summary, operations staff can use a single dashboard interface to track problems using results from the various monitoring tools, and to open or update trouble tickets.

We also use GOCDB to consolidate monitoring information with downtime information, and BDII to provide dynamic status information (storage usage, CPU usage, number of waiting and running jobs).

Dashboard access

When a user accesses the root URL of the ROD dashboard, the user's scope is systematically calculated and the credentials are automatically deduced from GOCDB. The user is then automatically redirected to the "highest" reachable view:

  • people with EGI-level responsibilities see the whole list of NGIs
  • a ROD operator reaches the list of sites of their own NGI
  • a site administrator is redirected to the overview of their own site

Please note that only sites and NGIs related to an event are displayed. An event here means any of the following: a downtime, a ROD ticket, a m/w retirement ticket, a COD item, a COD ticket, a notepad, or a Nagios notification.
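The redirection rule described above can be sketched as follows. This is only an illustration: the `Scope` levels, the `Role` structure and the view paths are hypothetical names chosen for the example, not the dashboard's actual code or URLs.

```python
from dataclasses import dataclass
from enum import IntEnum


class Scope(IntEnum):
    """Hypothetical scope levels, ordered from narrowest to widest."""
    SITE = 1
    NGI = 2
    EGI = 3


@dataclass(frozen=True)
class Role:
    scope: Scope
    target: str  # site or NGI name; unused for the EGI scope


def landing_view(roles: list[Role]) -> str:
    """Return the (hypothetical) path of the widest view the roles allow."""
    if not roles:
        raise PermissionError("no dashboard role found for this user")
    best = max(roles, key=lambda role: role.scope)
    if best.scope is Scope.EGI:
        return "/ngis"                          # whole list of NGIs
    if best.scope is Scope.NGI:
        return f"/ngi/{best.target}/sites"      # sites of the operator's NGI
    return f"/site/{best.target}/overview"      # single site overview
```

A user holding both a site role and an NGI role would land on the NGI view, because the widest scope wins.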

Alarm workflow

Nagios data

Data coming from Nagios probes is displayed in the "issue" or "overview" tabs. Alarms with OK status are automatically deleted by a regular clean-up (12h), unless they have been assigned, i.e. attached to a ticket. Clicking on the Nagios probe name column opens the history of all records in the lifecycle of the related alarm. Several views are available, such as table or chart.
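The clean-up rule can be illustrated with a minimal sketch. The `Alarm` fields and the `cleanup` function are hypothetical names, and the sketch models only the selection rule (OK and unassigned alarms are dropped), not the 12h scheduling.

```python
from dataclasses import dataclass


@dataclass
class Alarm:
    probe: str
    status: str             # e.g. "OK", "WARNING", "CRITICAL"
    assigned: bool = False  # True once the alarm is attached to a ticket


def cleanup(alarms: list[Alarm]) -> list[Alarm]:
    """Return the alarms that survive one clean-up pass.

    An alarm is dropped only if its status is OK and it is not assigned.
    """
    return [a for a in alarms if a.status != "OK" or a.assigned]
```

An OK alarm that is attached to a ticket survives the pass, as do all non-OK alarms.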

Close alarm

An operator can close an alarm using the "Close alarm" action; the alarm is then flagged with a non-OK flag. It is no longer visible in the 'operators' filter and is finally deleted by the clean-up. Please note that metrics are generated when a non-OK alarm is closed.

Tickets and alarms grouping

How to create a ticket

To have the Action dropdown menu available, go to a site-oriented view and select the "issues" or "overview" tab.
ROD tickets are related to a site, which is why you can't create tickets from an NGI-oriented view. Once alarms are assigned to a ticket, they are no longer visible in the operators filter. When a ticket is set to closed or verified from the dashboard, the associated group is automatically deleted and the related alarms are released.

Handle alarms grouping

On the ROD dashboard you can attach a set of alarms to a ticket. This alarm grouping has a life of its own: it can be updated (alarms added or removed) independently of the ticket status and entries. When you create a ticket for one or more alarms, an alarm grouping is automatically created and an "assigned" flag is set on each alarm. A ticket and its associated alarm grouping enable ROD operators to manage the alarms related to a given service, a given host or a given site without creating as many tickets as alarms.
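The relationship between a ticket, its alarm grouping and the "assigned" flag can be sketched as a small data structure. All class, method and status names below are hypothetical and serve only to illustrate the behaviour described above.

```python
from dataclasses import dataclass, field


@dataclass
class Alarm:
    alarm_id: int
    assigned: bool = False


@dataclass
class Ticket:
    ticket_id: int
    status: str = "open"
    group: dict[int, Alarm] = field(default_factory=dict)

    def attach(self, alarm: Alarm) -> None:
        """Add an alarm to the grouping and flag it as assigned."""
        alarm.assigned = True
        self.group[alarm.alarm_id] = alarm

    def detach(self, alarm: Alarm) -> None:
        """Remove an alarm from the grouping; ticket status is untouched."""
        if self.group.pop(alarm.alarm_id, None) is not None:
            alarm.assigned = False

    def finish(self, status: str) -> None:
        """Close or verify the ticket: empty the grouping, release alarms."""
        if status not in ("closed", "verified"):
            raise ValueError(f"unexpected final status: {status}")
        self.status = status
        for alarm in list(self.group.values()):
            self.detach(alarm)


def create_ticket(ticket_id: int, alarms: list[Alarm]) -> Ticket:
    """Create a ticket for one or more alarms, grouping them automatically."""
    ticket = Ticket(ticket_id)
    for alarm in alarms:
        ticket.attach(alarm)
    return ticket
```

Detaching an alarm leaves the ticket open, which mirrors the grouping being updated independently of the ticket status.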

Ticketing systems

Workflow and Helpdesk

The ROD dashboard handles several ticketing systems (T.S.) which are based on the same components. ROD tickets, MW tickets, COD tickets, Notepads and Handover are built on the same T.S. architecture, but have different components and configuration inside. The main components are:

  • Helpdesk: handles the connection and CRUD behaviour (GGUS, ops-portal DB, ...)
  • Workflow: describes the schema of the possible steps and the related templates

Each T.S. has a workflow that describes the possible steps for a ticket (the schema) and the matching templates. Steps are handled entirely in the dashboard and can be considered independent of what we call the helpdesk. This system allows us to customize almost everything on demand for a given ticketing system:

  • Step workflow
  • Message and subject templates for each Step
  • Form default values and validators (because the form is deduced automatically)
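As an illustration of a workflow made of steps with matching templates, here is a minimal sketch. The step names, the template wording and the `advance` function are assumptions made for the example; they are not the dashboard's actual configuration.

```python
from dataclasses import dataclass
from string import Template


@dataclass(frozen=True)
class Step:
    subject: Template            # subject template for this step
    message: Template            # message template for this step
    next_steps: tuple[str, ...]  # steps reachable from this one


# Hypothetical workflow for one ticketing system.
WORKFLOW = {
    "open": Step(
        Template("[$site] Problem detected"),
        Template("Alarms were raised on $site."),
        ("closed",),
    ),
    "closed": Step(
        Template("[$site] Ticket closed"),
        Template("The ticket for $site has been closed."),
        ("verified",),
    ),
    "verified": Step(
        Template("[$site] Ticket verified"),
        Template("The ticket for $site has been verified."),
        (),
    ),
}


def advance(current: str, target: str, **values: str) -> tuple[str, str]:
    """Check the transition against the schema and render the templates."""
    if target not in WORKFLOW[current].next_steps:
        raise ValueError(f"step '{target}' is not reachable from '{current}'")
    step = WORKFLOW[target]
    return step.subject.substitute(values), step.message.substitute(values)
```

Because the schema and templates live in one structure, a different ticketing system could reuse `advance` with its own `WORKFLOW` table.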

GGUS case

Since version 3.0.3, all GGUS tickets in a non-terminal GGUS state (i.e. neither verified nor unsolvable) are displayed in the dashboard operators view. ROD staff can close and verify tickets from the ROD dashboard. Both actions automatically release the assigned alarms.

HANDLING TICKETS FROM GGUS: If a ticket is set to 'close' AND set to 'verify' in the GGUS interface, the assigned alarms can't be released.
Consequently, a process cleans the orphan groups and releases the remaining assigned alarms automatically.
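The orphan clean-up can be sketched as follows. The sketch assumes that a group counts as orphaned once its ticket has reached a terminal GGUS state; the function name, the data layout and the ticket identifiers are hypothetical.

```python
# Terminal GGUS states named in this section.
TERMINAL_STATES = {"verified", "unsolvable"}


def release_orphans(
    groups: dict[str, list[dict]],
    ticket_states: dict[str, str],
) -> list[str]:
    """Delete groups whose ticket is in a terminal state.

    `groups` maps a ticket id to its list of alarms and is modified in
    place; `ticket_states` maps a ticket id to its current GGUS state.
    Returns the ids of the tickets whose groups were removed.
    """
    orphans = [tid for tid in groups if ticket_states.get(tid) in TERMINAL_STATES]
    for tid in orphans:
        for alarm in groups.pop(tid):
            alarm["assigned"] = False  # release the alarm
    return orphans
```

Groups attached to tickets that are still in progress are left untouched, so their alarms stay assigned.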