Hello everyone, I’m writing this post to ask for feedback on a dataset structure that I intend to adopt.
Context
In our lab we are doing a study on dog behavior. In this study each dog is a subject, and each session corresponds to a different scenario. One session could be the owner talking to the dog and petting it for 5 minutes, another could be the owner giving a treat to the dog, etc.
At the moment we have built an experimental setup for acquiring the following kinds of data for each dog subject:
- ecg + accelerometer + gyroscope data: saved in a single tsv file, with the timestamp as the first column.
- video data: 4 video files per session in mp4 format. Each video is recorded by a camera located in a different corner of the experiment room.
- audio data: 2 audio files per session in wav format. Each recording comes from a different microphone: one on the dog and one on the owner.
- annotations from veterinarians and the owner: each person's annotations are stored in a separate tsv file with 3 columns: timestamp, valence, arousal. Since several people annotate each session, there will be multiple annotation files.
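To make the annotation format concrete, here is a minimal sketch of reading one annotator's file. It assumes the header row is exactly `timestamp`, `valence`, `arousal` and that all three columns are numeric; the sample values and the `read_annotations` helper are hypothetical and only for illustration.

```python
import csv
import io
from typing import NamedTuple


class Annotation(NamedTuple):
    timestamp: float  # assumed numeric (e.g. seconds); the unit is not fixed here
    valence: float
    arousal: float


def read_annotations(handle):
    """Parse a tab-separated annotation file into typed rows."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        Annotation(float(r["timestamp"]), float(r["valence"]), float(r["arousal"]))
        for r in reader
    ]


# Hypothetical sample content, not real data from the study.
example = "timestamp\tvalence\tarousal\n0.0\t0.2\t0.5\n1.0\t0.4\t0.6\n"
rows = read_annotations(io.StringIO(example))
```

In practice the same function could be called with an open file handle for each annotator's tsv file.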
Some lab members are planning to add eeg measurements of the dogs in the future, so I think it would be best to start using the BIDS standard as soon as possible. Because of this I went through all the documentation online with the goal of creating an appropriate dataset structure for our data.
Proposed dataset
The conclusion I reached is that all the data we are currently gathering should be put inside the /beh folder.
The reason is that this page states the following: “In addition to logs from behavioral experiments performed alongside imaging data acquisitions, one MAY also include data from experiments performed with no neural recordings. The results of those experiments MAY be stored in the beh”
Based on this I have come up with the following dataset structure (for simplicity, the json sidecars for some of the files are not listed here):
DOG_BEHAVIOR_DATASET/
├── .bidsignore
├── dataset_description.json
├── README.md
├── CITATION.cff
├── LICENSE
├── CHANGES
├── participants.tsv
├── participants.json
└── sub-01/
    ├── sub-01_sessions.tsv
    ├── sub-01_sessions.json
    └── ses-scen01/
        ├── sub-01_ses-scen01_scans.tsv
        ├── sub-01_ses-scen01_scans.json
        └── beh/
            ├── sub-01_ses-scen01_task-treat_acq-owner01_events.tsv
            ├── sub-01_ses-scen01_task-treat_acq-expert01_events.tsv
            ├── sub-01_ses-scen01_task-treat_acq-expert02_events.tsv
            ├── sub-01_ses-scen01_task-treat_acq-expert03_events.tsv
            ├── ...
            ├── sub-01_ses-scen01_task-treat_recording-cam1_video.mp4
            ├── sub-01_ses-scen01_task-treat_recording-cam2_video.mp4
            ├── sub-01_ses-scen01_task-treat_recording-cam3_video.mp4
            ├── sub-01_ses-scen01_task-treat_recording-cam4_video.mp4
            ├── sub-01_ses-scen01_task-treat_recording-dogmic_audio.wav
            ├── sub-01_ses-scen01_task-treat_recording-ownermic_audio.wav
            └── sub-01_ses-scen01_task-treat_acq-ecg+acc+gyro_beh.tsv
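As a sanity check on the naming, here is a minimal sketch that assembles file names from entity-label pairs and flags labels that are not purely alphanumeric. It assumes, based on my reading of the BIDS specification, that labels may contain only letters and digits; the helper names `build_name` and `invalid_labels` are hypothetical and not part of any BIDS tool.

```python
import re

# Assumption: BIDS labels are restricted to letters and digits.
LABEL_RE = re.compile(r"^[A-Za-z0-9]+$")


def build_name(entities, suffix, extension):
    """Join (entity, label) pairs, a suffix and an extension into a file name."""
    parts = [f"{key}-{label}" for key, label in entities]
    return "_".join(parts + [suffix]) + extension


def invalid_labels(entities):
    """Return the (entity, label) pairs whose label is not alphanumeric."""
    return [(key, label) for key, label in entities if not LABEL_RE.match(label)]


annotation = [("sub", "01"), ("ses", "scen01"), ("task", "treat"), ("acq", "owner01")]
physio = [("sub", "01"), ("ses", "scen01"), ("task", "treat"), ("acq", "ecg+acc+gyro")]

print(build_name(annotation, "events", ".tsv"))
print(invalid_labels(physio))
```

Under that assumption the `+` characters in `acq-ecg+acc+gyro` would be flagged, so a label such as `ecgaccgyro` might be a safer choice.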
Questions
- Do you think this is a good structure? Am I missing some details, or have I perhaps overlooked a new BEP?
- Video and audio data cannot be put in beh/. Should I make the validator ignore them using the .bidsignore file? (Putting audio and video data in sourcedata/ does not seem appropriate to me, since some of our neural-network analyses will use them directly rather than as source data that has to be processed.)
- The ecg+motion tsv file has the timestamp in the first column and the related data in the other columns. This file should have the suffix _beh, right?
- Since we are working with a multi-camera setup, is ‘recording-cam1’ the preferred entity, or should I use ‘acq-’?
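For the .bidsignore question, here is a rough sketch of the two patterns I have in mind and which file names they would match. It uses Python's `fnmatch` as an approximation of the gitignore-style matching that, as far as I understand, .bidsignore follows; the validator's actual matching rules may differ, and `is_ignored` is a hypothetical helper.

```python
from fnmatch import fnmatch

# Candidate .bidsignore entries (one pattern per line in the real file).
BIDSIGNORE_PATTERNS = ["*_video.mp4", "*_audio.wav"]


def is_ignored(filename, patterns=BIDSIGNORE_PATTERNS):
    """Approximate check: does any pattern match this file name?"""
    return any(fnmatch(filename, pattern) for pattern in patterns)


names = [
    "sub-01_ses-scen01_task-treat_recording-cam1_video.mp4",
    "sub-01_ses-scen01_task-treat_recording-dogmic_audio.wav",
    "sub-01_ses-scen01_task-treat_acq-owner01_events.tsv",
]
ignored = [name for name in names if is_ignored(name)]
```

With these patterns only the video and audio files would be skipped, while the events file would still be validated.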