Hello everyone,
I need some advice regarding the publication of data in the BIDS format.
We are planning to publish a collection of EEG recordings from BCI experiments on OpenNeuro.org. The collection is rather big: about 60 subjects, most of them with multiple recording sessions, recorded at a 5000 Hz sampling frequency. As you can imagine, the raw dataset consumes a lot of storage, i.e. several terabytes.
As such, the dataset is quite difficult to work with in many environments, and releasing it as-is would probably also cause accessibility problems because of slow download speeds for many people. In any case, people working with the dataset usually downsample it first to something in the range of 100-1000 Hz, depending on the use case.
We would like to make the data as accessible as possible while still allowing analyses in the higher frequency ranges, should people be interested in them. Therefore, our current plan is to release multiple downsampled versions, e.g. 250 Hz, 500 Hz, and 1000 Hz, with download sizes ranging from 50 to 300 GB. Would you consider this an acceptable approach?
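For context, producing those versions would look something like this sketch, assuming the recordings can be read with MNE-Python (the reader and file names here are hypothetical):
# Sketch only: write the planned downsampled versions with MNE-Python.
import mne

raw = mne.io.read_raw_brainvision("sub-01_ses-01_task-bci_eeg.vhdr",  # hypothetical file
                                  preload=True)

for sfreq in (250, 500, 1000):
    # resample() applies an anti-aliasing low-pass filter before decimating
    down = raw.copy().resample(sfreq=sfreq)
    down.save(f"sub-01_ses-01_task-bci_{sfreq}Hz_raw.fif", overwrite=True)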
We are also wondering how to organise such a dataset according to BIDS. Would all these versions be considered "derivatives", or are they all "raw data"? Other than decimating them, no processing was applied to the data.
It seems a waste of resources to release the original source files, as probably no one would use them given the availability of the smaller versions …
I would really appreciate any input on this matter!
Greetings
Felix
Therefore, our current plan is to release multiple downsampled versions, e.g. 250 Hz, 500 Hz, and 1000 Hz, with download sizes ranging from 50 to 300 GB. Would you consider this an acceptable approach?
I think that's fine, but strictly speaking the files would probably still count as "derivative", because downsampling involves some filtering as well (including all the "degrees of freedom" that come with it). Having said that, I think nobody is going to stop you from sharing the downsampled data as if it were the raw data and shipping 3 or 4 versions of it. If you decide to do that, I think it's very important to be explicit and transparent about it, and to crosslink the versions of the data, for example in the BIDS README file.
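If you do label a version as a derivative, note that dataset_description.json has GeneratedBy and SourceDatasets fields meant for exactly this kind of crosslinking. A minimal sketch (the name, accession number, and version here are placeholders):
import json

# Minimal derivative dataset_description.json; all values are placeholders.
description = {
    "Name": "BCI EEG downsampled to 1000 Hz",
    "BIDSVersion": "1.8.0",
    "DatasetType": "derivative",
    "GeneratedBy": [{
        "Name": "MNE-Python resample",
        "Description": "Anti-alias filtered downsampling from 5000 Hz to 1000 Hz",
    }],
    "SourceDatasets": [{
        "URL": "https://openneuro.org/datasets/ds00WXYZ",
        "Version": "1.0.0",
    }],
}

with open("dataset_description.json", "w") as f:
    json.dump(description, f, indent=2)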
I'm not sure to what extent OpenNeuro would support this, though (i.e., storing essentially identical versions of the same dataset, just with different sampling frequencies).
As such, the dataset is quite difficult to work with in many environments, and releasing it as-is would probably also cause accessibility problems because of slow download speeds for many people.
OpenNeuro is accessible via DataLad, which might be of interest to you: via that route, users can pick the specific files they want to download instead of having to download everything. However, this is arguably for "advanced users".
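For what it's worth, that selective route can also be scripted. A small sketch using the DataLad Python API (the accession number is hypothetical):
import datalad.api as dl

# Cloning only fetches lightweight metadata; file content stays remote
ds = dl.clone(source="https://github.com/OpenNeuroDatasets/ds00WXYZ",
              path="ds00WXYZ")

# Retrieve the content for a single subject instead of the whole dataset
ds.get("sub-01")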
It seems a waste of resources to release the original source files, as probably no one would use them given the availability of the smaller versions …
I would make the original raw data available as well. You never know who might want to use it.
@sappelhoff Thanks a lot for your insights!
I am aware of DataLad, but in my experience a lot of people still rely on the ability to download full archives through their browser. Most tutorials and default workflows, including the default download page of OpenNeuro and the DataLad QuickStart, assume that you want to download the whole dataset, which would include all the different versions if they were put inside a derivatives/ folder.
Out of the options offered when you hit the download button on OpenNeuro, only the DataLad/git-annex solution provides a way to selectively download subfolders, and that isn't obvious either: you would need to visit the documentation to learn about that feature.
For now, I conclude that providing separate dataset versions seems the most accessible solution.
Does anyone know where I could inquire about that? I couldn't find any official policy regarding this so far.
I guess as long as I don't hit any storage restrictions, I can do that.
@franklin and/or @effigies would know about this.
At the moment, OpenNeuro only hosts "raw" datasets. We don't really have a policy for downsamplings, as this isn't really a motivating use case in MRI…
Will bring this up for discussion.
Just as a follow-up:
- From a principled standpoint, we would prefer people store the raw data, for the reasons @sappelhoff mentioned: provenance and the ability to make alternative downsampling choices.
- In the future, we will be supporting derivative datasets that can store clear links to the original dataset, at which point more manageable derivatives would make sense to host.
- One thing you can do until we support these derivative datasets is to host them on GIN. Supposing your raw dataset is ds00WXYZ on OpenNeuro:
datalad create -c yoda ds00WXYZ-1kHz
cd ds00WXYZ-1kHz
mkdir sourcedata
datalad install -d . -s https://github.com/OpenNeuroDatasets/ds00WXYZ \
sourcedata/ds00WXYZ
# Check that .gitattributes match the sourcedata .gitattributes
# Copy/downsample files from the original as needed
# Store any code for automating this in code/
datalad create-sibling-gin -s gin heilerich/ds00WXYZ-1kHz
datalad push --to gin --data anything
A derivative organized in that way should be pretty close to what we would expect when we do support them.
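The copy/downsample step in the comments above could look something like this sketch, assuming MNE-Python and BrainVision recordings (the paths, target rate, and the pybv-backed BrainVision export are all assumptions):
from pathlib import Path
import mne

# Sketch of the "copy/downsample" step; assumes BrainVision recordings
# and requires pybv for the BrainVision export.
src = Path("sourcedata/ds00WXYZ")
for vhdr in sorted(src.rglob("*_eeg.vhdr")):
    raw = mne.io.read_raw_brainvision(vhdr, preload=True)
    raw.resample(sfreq=1000)  # anti-alias filter + decimation to 1 kHz
    out = vhdr.relative_to(src)  # mirror the BIDS layout at the top level
    out.parent.mkdir(parents=True, exist_ok=True)
    mne.export.export_raw(out, raw, fmt="brainvision", overwrite=True)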
Thanks! I will look into hosting on GIN in that case.