Making FS/FSL output publicly available

datalad

#1

Hey Yarik,

Hope you are doing well. I wanted to check how you organize your FS/FSL output with datalad. Would you create a new top-level dir with the subjects in it, or a sub-dir of the datalad subjects dir? I ask because I would like to share these via datalad. Let me know if there is anything you would like.

It's a lot of output (I have been saying 1PB for quotes, with 60% in Glacier), so I am still looking for options other than S3 for both storage and transfer. I think Satra mentioned something potentially coming a couple of months ago? COS seems to have some project hosting features that would be useful beyond downloads, but you are limited to 5GB. Are there any other services you think might be helpful? I know the SDSC tries to be competitive to some degree for UC people. Any NERSC or other options I might not know about?

Are there other datastores that aren’t in datalad yet that I could add after filling out some form, so that the data is available to users and the results are publicly shareable? HCP is my first concern, but UK Biobank and others are available now too. Who is best to ask about these kinds of legal questions?

Thanks in advance for your time. I still love everything you are doing.

Cheers,

-Morgan


#2

Hi Morgan! Long time no see… actually I do not think we ever actually saw each other face to face, where are you now?

As for how to organize output with datalad: the major aspect is the number of files. Git itself is not particularly well suited to holding more than 10k files in a single repo, and that becomes a limiting factor that suggests a split strategy.
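
To get a feel for whether a given output tree needs splitting, a quick count of files per subject directory can help. This is a minimal sketch in plain shell, assuming one directory per subject under some output root. The function name `count_files_per_dir` is hypothetical, and the 10k default is only the rough figure mentioned above, not a hard git limit.

```shell
#!/bin/sh
# Count regular files below each immediate subdirectory of a root and flag
# the ones above a threshold. Meant for a plain output tree before it goes
# into datalad: annexed files are typically symlinks, which -type f skips.
count_files_per_dir() {
    root=$1
    threshold=${2:-10000}   # rough figure from the post, adjust to taste
    for d in "$root"/*/; do
        [ -d "$d" ] || continue          # no subdirectories: glob stays literal
        # Skip any .git directory so repository internals are not counted.
        n=$(find "$d" -name .git -prune -o -type f -print | wc -l | tr -d '[:space:]')
        name=$(basename "$d")
        if [ "$n" -gt "$threshold" ]; then
            echo "$name $n split"
        else
            echo "$name $n ok"
        fi
    done
}
```

Calling it as `count_files_per_dir /path/to/freesurfer_output` prints one line per subject directory with its file count and a `split`/`ok` hint.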

In general, output organization might be dictated by the input structure. If it is all in BIDS, then having a subdataset per analysis under derivatives/ (e.g. derivatives/freesurfer) would be the most logical. If it is not BIDS, derivatives/ could become a simple outputs/. BTW, we started a template at https://github.com/myyoda/template where we hope to outline such strategies and provide some guidance for such cases. It is at an early stage though.

If the number of subjects is large (you are talking about 1PB of data, so I assume it is large not just in volume but also in number of subjects), then you might end up with a sub-dataset per subject within each of those output datasets. Something like the following commands would establish that structure:

    $> datalad create --text-no-annex /tmp/testds
    [INFO   ] Creating a new annex repo at /tmp/testds
    create(ok): /tmp/testds (dataset)

    hopa:~
    $> cd /tmp/testds

    (git-annex)hopa:/tmp/testds[master]
    $> mkdir outputs

    (git-annex)hopa:/tmp/testds[master]
    $> datalad create -d . outputs/freesurfer
    [INFO   ] Creating a new annex repo at /tmp/testds/outputs/freesurfer
    add(ok): outputs/freesurfer (dataset) [added new subdataset]
    add(notneeded): .gitmodules (file) [already included in the dataset]
    add(notneeded): outputs/freesurfer (dataset) [nothing to add from /tmp/testds/outputs/freesurfer]
    save(ok): /tmp/testds (dataset)
    create(ok): outputs/freesurfer (dataset)
    action summary:
      add (notneeded: 2, ok: 1)
      create (ok: 1)
      save (ok: 1)

    (git-annex)hopa:/tmp/testds[master]
    $> datalad create -d . outputs/freesurfer/sub-00001
    [INFO   ] Creating a new annex repo at /tmp/testds/outputs/freesurfer/sub-00001
    add(ok): outputs/freesurfer/sub-00001 (dataset) [added new subdataset]
    add(notneeded): outputs/freesurfer/.gitmodules (file) [already included in the dataset]
    add(notneeded): outputs/freesurfer/sub-00001 (dataset) [nothing to add from /tmp/testds/outputs/freesurfer/sub-00001]
    add(notneeded): outputs/freesurfer (dataset) [already known subdataset]
    save(ok): /tmp/testds/outputs/freesurfer (dataset)
    save(ok): /tmp/testds (dataset)
    create(ok): outputs/freesurfer/sub-00001 (dataset)
    action summary:
      add (notneeded: 3, ok: 1)
      create (ok: 1)
      save (ok: 2)

    (git-annex)hopa:/tmp/testds[master]
    $> datalad ls -r .
    .                              [annex]  master  ✗ 2017-12-21/11:41:55  ✓
    outputs/freesurfer             [annex]  master  ✗ 2017-12-21/11:41:55  ✓
    outputs/freesurfer/sub-00001   [annex]  master  ✗ 2017-12-21/11:41:54  ✓

I would be happy to chat at some point to get to know more details.

With 1PB of data, finding a single “free” provider who would fit it all is “tricky” to say the least :wink: If you have funds, then anything could work, I guess. It should also be possible to spread it across multiple providers, e.g. to publish some sub-datasets to figshare (we now have an export_to_figshare plugin to upload a tarball to figshare and link it back), but I am not sure how far that would scale (I know there are limits per dataset, not sure if there are per account :wink: ). Supporting COS is planned, but I am not sure when we will get there.

Licensing is yet another tricky topic: I am not sure if you could freely share HCP derivatives, and clearly not UK Biobank, since they, I guess, still hope to become rich by selling the data they acquired on public funds :wink: As for whom to ask: probably the original providers of the datasets.


#3

I think it was OHBM 2017 when someone asked me if UK Biobank data was free. My answer was that there is a small fee of ~£2,000* for the data, which is basically used to pay for the computing resources and to guarantee the long-term maintenance of the resource.

The poor guy got very serious and apologised for not having any money on him at that very moment. I had to explain that I was not personally collecting the money. The money has to be paid to UK Biobank, which is a registered charity (and therefore, a non-profit organisation).

In retrospect, I think I missed the chance to get a free dinner by asking for a small retainer :smiley:

*Just for the imaging data, it would be £0.02 per subject. Hardly a way to get rich, to be honest.


#4

That was my point: you can’t get rich from charging for access to this data, but by charging a fee, regardless of how small, you restrict availability and reuse. If it is just a matter of being able to guarantee archival and sustainability, upload to some government/institution/initiative-supported data archival resource, or better, to as many of them as possible :wink:


#5

My only point was against the idea of having a fee to get rich.

Despite my support of open data and open science, there are good reasons to have higher restriction levels in UK Biobank*. @KirstieJane gave an incredible talk at OHBM this year about access to UK Biobank, and she went through most of those good reasons.

I like citing her because:

  • She saved me the (future) effort of explaining how to access the data :slight_smile:
  • She did it way better than I would have been able to do.
  • She is not directly involved in UK Biobank, so she is not biased.

I hope the talk is available online soon.

*There are good reasons. I do not necessarily think those reasons are good enough.


#6

Thank you for the endorsement @Fidel! I’m glad you like the talk and I agree it does make sense that it comes from someone not involved in UK Biobank.

Here are the slides if anyone’s looking for them. They aren’t very clear (I don’t write down my notes so you don’t get to hear what I said on any of these slides!) but maybe you’ll get the points :smile_cat:

I think @yarikoptic, @Fidel and I would all agree that it would be GREAT if governments/institutions supported long-term data archiving. It’s one of the reasons I’m so excited that the Turing Institute is housed in the British Library! But, at least in the UK, we don’t have these services coordinated for big neuroimaging/biobank projects yet.


#7

My only point was against the idea of having a fee to get rich.

That was my point too, I guess just expressed differently :wink:
And yeah, I saw a photo of Kirstie giving that talk, and will definitely listen to the talk when the videos become available. Kudos @KirstieJane


#8

I still think I should have asked the guy for a £50 retainer.