Data management working group

Welcome to the Data management working group page.
The objective of the group is to set up the rules and procedures for handling data in GRAND and, in the end, to produce one or several reference documents.
The main topics of the group are:

  • Data inventory and flow: identify all types of data (and formats) we have to manage, and determine what should be kept and referenced and what is available elsewhere and therefore need not be stored.
  • Data organization (format, level of splitting, directory grouping, naming conventions, rules on whether updates are allowed, ...)
  • Data structure (additional fields at file level, e.g. a unique identifier for all files related to the same observation/simulation, ...)
  • Versioning (versioning of the root file structure in files and of the code releases used to read/write data files, tags in code/libs and files, ...)
  • Database functionalities/Datamanager (DB content, use cases e.g. useful searches, web/python/? interface, methods to retrieve data, ...)
  • Data hosting (Where? Access rules/security/confidentiality? Access protocols: scp, ftp, irods, http, ...? Archiving and replication, ...)
  • Registering rules (data integrity check, validation process, ...)

Data management rules

Saved data

The following data will be officially registered:

  • Simulations
    1. raw output from simulators (tar+gzip of directory)
    2. GRAND root files after conversion
      The intermediate raw root files will be kept during a first stage for testing, but will be removed once the whole chain has been validated.
      Models used to run simulations are part of simulators and do not need to be saved.
  • Detectors (experiment)
    1. Raw data as bin files
    2. GRAND root files after conversion by gtot
  • Models
    1. Antenna response models (npy files)
      Presently, it seems that all extra information about the experiment is already stored in the raw root files. Some extra monitoring data (electromagnetic environment, atmosphere, ...) should become available at a later stage (it does not exist yet) and may be saved in a separate database dedicated to monitoring. Some data quality information, created during analysis, may also be stored at some stage, but this is not yet determined.

GRAND root file structure

Simulation data will contain at least: Trun, Trunefieldsim, Tshower and Tshowersim
Experiment data will contain at least: Trun, Tadc, Trawvoltage
Data will be stored in directories.
One directory will correspond to one observation run or to one simulation.
Each directory will contain one trun file describing the run parameters and some additional root files for the events.
To limit file size, files containing trees with traces may be split by event number (e.g. events 1-1000 in file 1, events 1001-2000 in file 2, etc.).
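
As an illustration of the splitting by event number described above, here is a minimal Python sketch. The function names and the default of 1000 events per file are assumptions taken from the example in the text, not a decided value.

```python
def event_range_for_file(file_index, events_per_file=1000):
    """Return (first, last) event numbers held by the 1-based file index.

    events_per_file=1000 mirrors the example in the text; it is an assumption.
    """
    if file_index < 1:
        raise ValueError("file_index starts at 1")
    first = (file_index - 1) * events_per_file + 1
    last = file_index * events_per_file
    return first, last


def file_index_for_event(event_number, events_per_file=1000):
    """Return the 1-based index of the file that holds a given event number."""
    if event_number < 1:
        raise ValueError("event_number starts at 1")
    return (event_number - 1) // events_per_file + 1
```

With these defaults, event 1001 falls in file 2, which covers events 1001 to 2000.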

Naming conventions

Raw filenames and directories :

  • Raw filenames should follow the pattern : [site]_[date]_[time]_RUN[run_number]_[mod]_[extra].bin

Where site is gp13, gaa or nancay.
date and time are YYYYMMDD and HHmmss (UTC).
mod is CD (Coincidence Data), MD (Minimal bias Data) or UD (Unit Data).
extra can be whatever the file producer finds useful (20db_du85, etc.).

CD: data corresponding to the central DAQ trigger (so-called Second Level Trigger or T3), i.e. several DU triggers (so-called First Level Trigger or T2) in coincidence.
MD: data recorded with an automatic, forced trigger (e.g. 20 Hz or 10 s).
UD: data corresponding to DU triggers (First Level Trigger or T2) that do not pass the central DAQ trigger (Second Level Trigger or T3).
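
The raw filename pattern can be checked with a small parser. This is only a sketch: the names RawName and parse_raw_filename are hypothetical, the example filename in the usage note is invented for illustration, and the choice to let extra absorb any remaining underscores (as in 20db_du85) is an assumption.

```python
from typing import NamedTuple

SITES = {"gp13", "gaa", "nancay"}  # sites listed in the text
MODES = {"CD", "MD", "UD"}         # modes listed in the text


class RawName(NamedTuple):
    site: str
    date: str        # YYYYMMDD (UTC)
    time: str        # HHmmss (UTC)
    run_number: int
    mod: str
    extra: str


def parse_raw_filename(filename: str) -> RawName:
    """Split [site]_[date]_[time]_RUN[run_number]_[mod]_[extra].bin into fields."""
    if not filename.endswith(".bin"):
        raise ValueError("raw files are expected to end with .bin")
    stem = filename[: -len(".bin")]
    # At most 5 splits: anything after the fifth underscore belongs to extra.
    parts = stem.split("_", 5)
    if len(parts) != 6:
        raise ValueError(f"expected 6 fields, got {len(parts)}")
    site, date, time, run, mod, extra = parts
    if site not in SITES:
        raise ValueError(f"unknown site: {site!r}")
    if not (run.startswith("RUN") and run[3:].isdigit()):
        raise ValueError(f"bad run field: {run!r}")
    if mod not in MODES:
        raise ValueError(f"unknown mod: {mod!r}")
    return RawName(site, date, time, int(run[3:]), mod, extra)
```

For instance, the invented name gp13_20231205_101530_RUN123_CD_20db_du85.bin would be read as run 123 of gp13 in CD mode with extra set to 20db_du85.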

  • When transferred to CC@IN2P3, the raw files should be stored in directories: [site]/raw/[YYYY]/[mm]/
    ex: /sps/grand/data/gp13/raw/2023/12/
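
The storage directory can be derived from the site and the date of the data. The sketch below assumes the /sps/grand/data prefix shown in the example. The function name storage_dir and its kind parameter are hypothetical; kind defaults to raw, and GrandRoot would be passed for converted files.

```python
from datetime import date
from pathlib import PurePosixPath


def storage_dir(site, day, kind="raw", root="/sps/grand/data"):
    """Build [root]/[site]/[kind]/[YYYY]/[mm] for a given site and date.

    root is taken from the example path in the text and is an assumption.
    """
    return PurePosixPath(root) / site / kind / f"{day.year:04d}" / f"{day.month:02d}"
```

For example, storage_dir("gp13", date(2023, 12, 5)) gives /sps/grand/data/gp13/raw/2023/12.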

Root filenames and directories :

When converted into GrandRoot format, the raw files will produce several root files (a dataset) grouped into a single directory corresponding to a run.

  • Dataset directory names will match the following structure:
    [sim|exp|mod]_[site]_[date]_[time]_RUN[run_number]_[mod]_[extra]_[serial].root

Where sim is for simulations, exp for experimental data and mod for models (it is not clear yet whether we will use it).
serial is an extra serial number to distinguish different versions of a run (in case we need to compare different processings, etc.).

e.g.
for experimental data : exp_nancay_20230531_123521_RUN25023_MD_test_1.root
for simulation : sim_gp300_20230420__RUN0__zhairesml_2.root
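
Because every field keeps its position even when empty, a dataset name can be parsed by splitting on the underscore. The following is a minimal sketch under that assumption; DatasetName and parse_dataset_name are hypothetical names, not part of the GRAND library.

```python
from typing import NamedTuple, Optional


class DatasetName(NamedTuple):
    kind: str                   # sim, exp or mod
    site: str
    date: str
    time: str                   # may be empty
    run_number: Optional[int]   # None when the run field is empty
    mod: str                    # may be empty
    extra: str
    serial: str


def parse_dataset_name(name: str) -> DatasetName:
    """Parse [sim|exp|mod]_[site]_[date]_[time]_RUN[run]_[mod]_[extra]_[serial].root."""
    stem = name[: -len(".root")] if name.endswith(".root") else name
    parts = stem.split("_")  # empty fields show up as empty strings
    if len(parts) != 8:
        raise ValueError(f"expected 8 fields, got {len(parts)}")
    kind, site, date, time, run, mod, extra, serial = parts
    if kind not in ("sim", "exp", "mod"):
        raise ValueError(f"unknown prefix: {kind!r}")
    if run:
        if not (run.startswith("RUN") and run[3:].isdigit()):
            raise ValueError(f"bad run field: {run!r}")
        run_number = int(run[3:])
    else:
        run_number = None
    return DatasetName(kind, site, date, time, run_number, mod, extra, serial)
```

Applied to the two examples above, the simulation name yields empty time and mod fields, while the experimental name yields all eight fields filled.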

  • Root files inside the directory will match the following structure:
    [grouptreename]_[events|run]_L[analysis level]_[serial].root for run and event trees (events is the range of events in the "event file", and run is the run_number for run trees),
    e.g.
    run_25023_L0_0001.root for initial run tree of run 25023
    shower_1-100_L1_0001.root for shower trees of events 1 to 100 at level 1.
  • When stored at CC@IN2P3 in Lyon, GrandRoot files should go to directories: [site]/GrandRoot/[YYYY]/[mm]/
    ex: /sps/grand/data/gaa/GrandRoot/2023/11/
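
The names of the root files inside a dataset directory can be parsed with a regular expression. This sketch assumes that the tree group name contains no underscore and that the analysis level and serial are purely numeric, as in the examples; TreeFile and parse_tree_filename are hypothetical names.

```python
import re
from typing import NamedTuple

# [grouptreename]_[events|run]_L[analysis level]_[serial].root
_TREE_FILE_RE = re.compile(
    r"^(?P<tree>[^_]+)_(?P<first>\d+)(?:-(?P<last>\d+))?_L(?P<level>\d+)_(?P<serial>\d+)\.root$"
)


class TreeFile(NamedTuple):
    tree: str
    first: int    # first event number, or the run number for run trees
    last: int     # last event number (equal to first when no range is given)
    level: int
    serial: str   # kept as text to preserve leading zeros


def parse_tree_filename(filename: str) -> TreeFile:
    """Split a file name such as shower_1-100_L1_0001.root into its fields."""
    match = _TREE_FILE_RE.match(filename)
    if match is None:
        raise ValueError(f"name does not follow the convention: {filename!r}")
    first = int(match.group("first"))
    last = int(match.group("last")) if match.group("last") else first
    return TreeFile(match.group("tree"), first, last,
                    int(match.group("level")), match.group("serial"))
```

With the examples above, shower_1-100_L1_0001.root gives events 1 to 100 at level 1, and run_25023_L0_0001.root gives run 25023 at level 0.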

Analysis level has the following convention :

Sim data (efield, shower) would start at L0; voltage and ADC generated from the efield without noise would also be L0.
Hardware data include noise, so ADC, RawVoltage, Voltage and reconstructed Efield coming from hardware would be L1.
ADC generated from sim with added noise would be L1 and would correspond to the ADC from hardware, as would the resulting L1 rawvoltage, voltage and reconstructed efield.

So we would have analysis chains:
sim : efield/shower_L0 --> voltage_L0 --> adc_L0 --> adc_L1 (added noise) --> voltage_L1 --> efield_L1 (reconstructed)
hardware : adc_L1 --> rawvoltage_L1 --> voltage_L1 --> efield_L1 (reconstructed)
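
The two chains can also be written down as data, which makes the level of each step easy to check. This is only an illustrative sketch: ANALYSIS_CHAINS, level_of and levels_along are hypothetical names, and the step labels are copied from the chains above.

```python
import re

# Step labels copied from the chains in the text.
ANALYSIS_CHAINS = {
    "sim": ["efield/shower_L0", "voltage_L0", "adc_L0",
            "adc_L1", "voltage_L1", "efield_L1"],
    "hardware": ["adc_L1", "rawvoltage_L1", "voltage_L1", "efield_L1"],
}


def level_of(step):
    """Extract the analysis level from a step label such as 'adc_L1'."""
    match = re.search(r"_L(\d+)$", step)
    if match is None:
        raise ValueError(f"no analysis level in {step!r}")
    return int(match.group(1))


def levels_along(chain):
    """Return the analysis level of each step of the named chain, in order."""
    return [level_of(step) for step in ANALYSIS_CHAINS[chain]]
```

In the sim chain the level changes from 0 to 1 at the step where noise is added, while the hardware chain is at level 1 throughout.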

Important remarks

It is very important to respect the structure of the various patterns so that file names can be parsed reliably. If a parameter is not available, it should not be written, BUT the underscores delimiting its position MUST be kept (e.g. sim_gp300_20230420__0__zhairesml_2: no [time] and no [mod], hence the double _).

Rules

Once created and registered, a file is no longer modified. New analyses generate new files.
When creating files/dirs, the naming convention has to be respected, i.e. if a field doesn't exist the placeholder is left blank but we keep the _ to allow automatic parsing of names.
_ is not allowed in extra fields or in the analysis level.
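
These rules can be enforced when a name is built. The sketch below follows the dataset naming pattern given earlier, keeps empty placeholders and rejects underscores inside any field. The function name build_dataset_name and its argument names are hypothetical.

```python
def build_dataset_name(kind, site, date="", time="", run_number=None,
                       mod="", extra="", serial=""):
    """Join the fields with '_' and keep blank placeholders for missing ones."""
    run = f"RUN{run_number}" if run_number is not None else ""
    fields = [kind, site, date, time, run, mod, extra, str(serial)]
    for field in fields:
        # An underscore inside a field would break automatic parsing.
        if "_" in field:
            raise ValueError(f"'_' is not allowed inside a field: {field!r}")
    return "_".join(fields) + ".root"
```

For example, build_dataset_name("sim", "gp300", "20230420", run_number=0, extra="zhairesml", serial=2) leaves time and mod blank and therefore produces two double underscores.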

Useful links:

GRAND Root file format : https://box.in2p3.fr/index.php/s/ipmk4XZP87pjRnr
Github GRAND : https://github.com/grand-mother/grand
GRAND wiki@iap : https://www2-internet.iap.fr/grand/wikigrand/
Root file browser (by Jean Marc Colley) : https://github.com/luckyjim/BROOT