'mounting' the NDA S3 buckets as a filesystem

Is anyone else out there trying to ‘see’ the contents of their packages in the NDA without downloading or copying to another S3 location? Even with one S3 bucket, S3FS is painfully slow. With data for subjects spread across multiple buckets, I don’t think it would even be possible to use S3FS as-is. Dare I ask if anyone has gone so far as to tweak something like an overlay filesystem to read data from the NDA buckets but write the results of any analysis to another location?

Sincerely, Petra

I believe NDA disables the S3 ListBucket operation. You need to use the S3 prefixes from the NDA manifest to know what you are getting from the bucket.
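To make that concrete, here’s a rough boto3 sketch of the difference — everything in it is illustrative (the bucket name, the object key, and the assumption that you’ve already obtained temporary NDA-issued AWS credentials):

```python
import boto3
from botocore.exceptions import ClientError

# Illustrative only: assumes you've already exchanged your NDA login for
# temporary AWS credentials; none of these values are real.
s3 = boto3.client(
    "s3",
    aws_access_key_id="TEMP_ACCESS_KEY",
    aws_secret_access_key="TEMP_SECRET_KEY",
    aws_session_token="TEMP_SESSION_TOKEN",
)

# With ListBucket disabled, enumerating keys is expected to fail...
try:
    s3.list_objects_v2(Bucket="NDAR_Central_1", Prefix="submission_12345/")
except ClientError as err:
    print("listing denied:", err.response["Error"]["Code"])

# ...but fetching an object whose full key you already know (taken from
# the package manifest) can still succeed.
obj = s3.get_object(
    Bucket="NDAR_Central_1",  # hypothetical bucket
    Key="submission_12345/sub-01/anat/sub-01_T1w.nii.gz",  # hypothetical key
)
print(obj["ContentLength"], "bytes")
```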

Thanks, Satra.

Just wondering out loud whether disabling the S3 ListBucket operation will affect other methods developed to work with the buckets (FSx, for example, was next on my list of things to look into). Using a more visually intuitive presentation of the manifest to peer into the organization of one’s package seems a lot simpler and more lightweight for arranging any read/write calls.
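Purely as a sketch of that “more visually intuitive presentation” idea: once you have the S3 links out of a package manifest, a few lines of Python can render them as a directory-style tree. The links below are made up.

```python
# Hypothetical S3 links of the kind listed in an NDA package manifest.
links = [
    "s3://NDAR_Central_1/submission_12345/sub-01/anat/sub-01_T1w.nii.gz",
    "s3://NDAR_Central_1/submission_12345/sub-01/func/sub-01_task-rest_bold.nii.gz",
    "s3://NDAR_Central_2/submission_67890/sub-02/anat/sub-02_T1w.nii.gz",
]

def build_tree(paths):
    """Nest bucket/key components into a dict-of-dicts."""
    tree = {}
    for p in paths:
        node = tree
        for part in p.removeprefix("s3://").split("/"):  # Python 3.9+
            node = node.setdefault(part, {})
    return tree

def print_tree(node, indent=0):
    for name, child in sorted(node.items()):
        print("  " * indent + name)
        print_tree(child, indent + 1)

print_tree(build_tree(links))
```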

In case anyone else out there wants to try: to easily grab the S3 prefixes for an HCP package, I’ve used the downloadcmd Python script from the command line (a short manifest-parsing sketch follows the list):

  1. Create a package (without associated files) using the NDA query tool and instructions.
  2. Run “pip install nda-tools”.
  3. Run “downloadcmd 1234568 -dp -u username -p password -d /place/you/want/your/manifest” and wait for results.
  4. Find the file that has ‘manifest’ in its name.
  5. See the S3 links for all of the data in your package.
  6. See that behavioral data (if you included it in your package) sits at the root of the download directory and is not behind an S3 link.
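
If it helps, here is a minimal sketch of step 5. It assumes the manifest is a plain-text file whose rows contain s3:// URLs; the filename below is made up, so substitute whatever file step 4 turned up:

```python
import re

# Hypothetical path; use the file from step 4 that has 'manifest' in its name.
manifest_path = "/place/you/want/your/manifest/datastructure_manifest.txt"

s3_links = []
with open(manifest_path) as fh:
    for line in fh:
        # Grab anything on the row that looks like an S3 URL.
        s3_links.extend(re.findall(r"s3://\S+", line))

print(f"{len(s3_links)} S3 links found")
for link in s3_links[:5]:
    print(link)
```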

I look forward to gaining the datalad intuition to appropriately wrap this process into a well annotated research object. :slight_smile:


I’m working to help a researcher get the set of minimally processed imaging data from ABCD onto a cluster filesystem. The researcher wants to process the data locally, not in AWS, and because of the download restrictions this will of course take a very long time. I contacted the NDA help desk and they suggested we consider getting credits to process it in the cloud. Does anyone know if exceptions are ever made to the maximum speeds for downloading the whole dataset? It looks like the restrictions are per user; do people pull subsets of the dataset down in parallel to speed things up? Am I asking a ridiculous question? Thanks!
Adam

I only have two cents, but you can have them: I believe the NDA just increased the threshold to 20 TB of download per user per month, which should help a little. Relatedly (though this is unclear to me), there might be some sort of networking bottleneck to getting all the parallel downloads running (I asked a similar question about the downloadcmd tool today in a help desk ticket).
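On the parallel-download point, here’s a generic sketch of the pattern. To be clear, this is not how downloadcmd works internally; it just assumes you’ve already obtained presigned HTTPS URLs for your package files, which is where the hypothetical list below would come from:

```python
import os
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlretrieve

# Hypothetical presigned URLs; in practice these would come from the NDA
# API / package manifest for your own package.
urls = [
    "https://example-bucket.s3.amazonaws.com/sub-01_T1w.nii.gz?X-Amz-Signature=abc",
    "https://example-bucket.s3.amazonaws.com/sub-02_T1w.nii.gz?X-Amz-Signature=def",
]

out_dir = "/scratch/nda_download"
os.makedirs(out_dir, exist_ok=True)

def fetch(url):
    # Name the local file after the URL path, dropping the query string.
    name = url.split("?")[0].rsplit("/", 1)[-1]
    dest = os.path.join(out_dir, name)
    urlretrieve(url, dest)
    return dest

# A handful of workers; note that per-user throttling still applies
# to the aggregate, no matter how many streams you open.
with ThreadPoolExecutor(max_workers=10) as pool:
    for path in pool.map(fetch, urls):
        print("done:", path)
```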

We have downloaded the raw data and DCAN lab’s minimally preprocessed data onto our cluster. I didn’t do the download, so I pinged those who did to get their experience. I’ll post anything I learn here.

If you’d like minimally preprocessed data, @atrefo, I’d recommend you look into the DCAN lab’s collection 3165. They have their own downloader with a parallelization option. I think you can get BIDS-ified input (raw) data as well.

According to the RA I just pinged, the download of the minimally preprocessed dataset took “a few days”, but we also had a few failed tries until we got our ducks in a row.

Thanks for the $0.02, @petra. We are trying to get the DCAN minimally processed image data and also the fmriresults01 dataset, which are about 60 TB each, so 6 months is a long wait. From a login node on Compute Canada’s beluga cluster it looks like I have access to 20 MB/sec download speeds. I started with a single serial download and that was topping out at about 20 MB/sec. I then fired up a 10-core parallel download and those streams are topping out at about 2 MB/sec each. If I don’t get throttled, it should take about 30 days.
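For anyone sanity-checking that estimate, the back-of-the-envelope math (assuming decimal units and a sustained aggregate of 20 MB/sec) works out like this:

```python
# Rough transfer-time estimate for one ~60 TB dataset.
dataset_bytes = 60e12         # ~60 TB, decimal units
rate_bytes_per_sec = 20e6     # 10 streams x ~2 MB/sec each

days = dataset_bytes / rate_bytes_per_sec / 86400
print(f"{days:.0f} days")    # -> 35 days, i.e. roughly a month
```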

@dmoracze, thanks for replying to my question! (Sorry for the delayed response.) The DCAN lab’s collection is exactly what I’m working on getting first. Thanks for the link to the DCAN downloader; I came across it a while ago, but there are a number of downloaders (Java, Python) and ways to use them (package with referenced data or not), so for a while I was pretty confused about which was the right way to grab the data. I don’t think the DCAN lab’s downloader does an end run around the NDA’s throttling policy, does it? Did your RA say they had to restart the download with a different NDA account or something like that? I suppose I’ll find out on November 2nd.