Second meeting (27062023)¶
Data splitting and naming (iteration over previous meeting decisions)¶
The [date] field will be split into [date]_[time], so it will be: YYYYMMDD_hhmmss.
We should store antenna response models in the same way we store simulations and observations, and use the prefix "mod" for their directories.
Thus the directory names will be: [sim|exp|mod]_[site]_[date]_[time]_[extra]_[serial]
For models, it's not obvious that the site is relevant, but we could keep it to maintain the same naming convention. Possibly in the future there should be different antenna models for each antenna (depending on the version of the antenna, etc.).
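As a purely illustrative sketch (the helper name and the example site are hypothetical, nothing here is an agreed API), such a directory name could be assembled like this:

```python
from datetime import datetime

def build_dir_name(kind, site, when, extra="", serial="0000"):
    """Assemble a directory name following the proposed convention:
    [sim|exp|mod]_[site]_[date]_[time]_[extra]_[serial].
    A blank field stays empty but its underscore is kept."""
    assert kind in ("sim", "exp", "mod")
    date = when.strftime("%Y%m%d")   # YYYYMMDD
    time = when.strftime("%H%M%S")   # hhmmss
    return "_".join([kind, site, date, time, extra, serial])

# e.g. "sim_gp13_20230627_140000__0001"
print(build_dir_name("sim", "gp13", datetime(2023, 6, 27, 14, 0), serial="0001"))
```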
Files will be split on the number of events (1000 for now) rather than on file size.
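A minimal sketch of that splitting, assuming events come as a plain iterable (the chunk size and the commented-out writer are only placeholders):

```python
from itertools import islice

EVENTS_PER_FILE = 1000  # current choice, may be revisited

def chunk_events(events, size=EVENTS_PER_FILE):
    """Yield successive lists of at most `size` events."""
    it = iter(events)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

# Each chunk would then go to its own file, e.g.
# for serial, chunk in enumerate(chunk_events(all_events)):
#     write_events_file(serial, chunk)   # write_events_file is hypothetical
```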
There will be at least a TRun file in each directory. The other TTrees should either be stored in separate files (one tree per file) or grouped (efield+shower in one file, voltage in another, ...), but the definition of the groups is not clear yet and needs more reflection (Lech?)!
When creating files/dirs the naming convention has to be respected, i.e. if a field does not exist its placeholder is left blank but the _ is kept, to allow automatic parsing of the names.
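A minimal parsing sketch under that rule (the field names are the ones proposed above; the function itself is hypothetical):

```python
def parse_dir_name(name):
    """Split a directory name into its fields; blank fields come back
    as empty strings because the underscores are always present."""
    fields = ["kind", "site", "date", "time", "extra", "serial"]
    parts = name.split("_")
    if len(parts) != len(fields):
        raise ValueError(f"unexpected number of fields in {name!r}")
    return dict(zip(fields, parts))

# parse_dir_name("sim_gp13_20230627_140000__0001")
# -> {'kind': 'sim', 'site': 'gp13', 'date': '20230627',
#     'time': '140000', 'extra': '', 'serial': '0001'}
```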
The "analysis levels" need to be determined. At least L0 is for raw datas. Other levels should correspond to some data quality levels or steps in cleaning the datas (trigger, etc...). The discussion will continue offline on this topic.
The names of files inside a directory should be modified by adding an [extra] or [user] field between the analysis level and the serial. This could make it easier to identify the provider or the content of a file. The risk is ending up with "tricky and meaningless" tags! More discussion/reflection about this is needed.
Following the history of files (i.e. which files and trees were used to produce a file/tree) is tricky. There are two possible approaches:
- Have a mechanism to keep track of links/dependencies between files/trees and keep all the files in the same directory (see the sketch after this list). Pro: everything in the same place, no duplication of data, clean way to proceed. Con: complexity; imagining and implementing it would take work (not sure we have the resources).
- Duplicate files into different directories and have a single dependency within a directory. Pro: apparently simple to implement. Con: duplication of data, which can quickly lead to an explosion of data storage.
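Purely as an illustration of the first approach (nothing here is decided; the JSON sidecar, its keys and the example file names are assumptions), the dependencies of a produced file could be recorded next to it:

```python
import json
from pathlib import Path

def record_parents(product, parents):
    """Write a small JSON sidecar listing the files/trees a product
    was built from (assumed scheme, not an agreed format)."""
    sidecar = Path(str(product) + ".parents.json")
    sidecar.write_text(json.dumps(
        {"product": str(product), "parents": [str(p) for p in parents]},
        indent=2))

# record_parents("voltage_L1_0001.root",
#                ["efield_L0_0001.root", "shower_L0_0001.root"])  # names are hypothetical
```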
We will iterate offline on this topic.
Versioning¶
Lech suggests looking into the possibility of using a server-side script in git to produce/increment a kind of tag each time a commit touching the ROOT file structure is made. This tag would then be written into the ROOT files so that we know which version of the code was used to produce them. The advantage is that this process can be automated. Further investigation is needed (Francois and Lech).
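As a rough sketch of the file-writing side (the tag format and where it would be stored in the ROOT file are not decided), the producing code could simply ask git for the current tag and carry it along:

```python
import subprocess

def current_data_format_version():
    """Return the most recent git tag reachable from HEAD
    (falls back to the commit hash if no tag exists)."""
    return subprocess.check_output(
        ["git", "describe", "--tags", "--always"],
        text=True).strip()

# The returned string would then be written into the ROOT file
# (for instance alongside the run tree) at production time.
```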
Hosting¶
Data hosting can be provided by CC-IN2P3 (we already have a data management plan @CC-IN2P3). One or two other sites should be identified for replication/backup. One possibility could be in China @PMO. Ramesh will start discussions to see if it's possible.