Counterparts to DEAP for other datasets

Does anyone know if other NIMH data resources, such as the NDA, have analysis tools similar to DEAP, either already released or currently in development? My lab is working with them too, and having a comparable tool would be fantastic. Thank you!

Similar in what sense? In the ‘GLM models of any data that fits into the Nsubjects x 65000 variables RDS’ sense? Or similar in the ‘we’ve figured out how to talk to the NDA database APIs’ sense?

Either would be helpful. No one in our lab has worked with data sets this large before, so right now we are trying to find cloud-based software or another system that would let us quickly and easily subset the NDA data, run models, and so on, which is something DEAP seems to handle well. The API part would be helpful too, since it would save some of the background programming work and make handling the data more intuitive. But if there is already a system in place that handles other large data sets like the NDA one in a systematic, easy-to-replicate way, regardless of the APIs, that would be a good place to start from. If that makes any sense?

My own table of ‘NDA bookmarks’ includes their Data Dictionary UI as well as their list of repositories and APIs that can work with these things in various ways (that page is useful, but I’m really looking for documentation on the data dictionary API that the requests line in the second cell of the Jupyter notebook here talks to). NDA-tools is my first place to look for updates on supported features.
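For what it’s worth, here is a minimal sketch of the kind of call that notebook cell makes with requests. The base URL, endpoint paths, and JSON field names are my assumptions from poking at the NDA API docs, so check them against the official documentation before relying on this.

```python
# Hedged sketch: query the NDA data dictionary API with requests.
# Endpoint paths and JSON field names are assumptions; verify against
# the NDA's own API documentation.
import requests

BASE = "https://nda.nih.gov/api/datadictionary/v2"  # assumed base URL

# List all data structures (the 'tables') the dictionary knows about.
resp = requests.get(f"{BASE}/datastructure", headers={"Accept": "application/json"})
resp.raise_for_status()
structures = resp.json()
print(f"{len(structures)} structures in the dictionary")

# Pull the element (variable) definitions for one structure, e.g. socdem01.
resp = requests.get(f"{BASE}/datastructure/socdem01", headers={"Accept": "application/json"})
resp.raise_for_status()
socdem = resp.json()
for element in socdem.get("dataElements", [])[:5]:
    print(element.get("name"), "-", element.get("description"))
```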

Variables (called elements in the NDA) are loosely organized into tables (called structures) of data from published batteries (like the NIH Toolbox instruments), but things like ‘socdem01’ (sociodemographics) and ‘medhx01’ (medical history) are more of a collection of everything contributors have ever needed to deposit. As soon as you have a package number, you can download the text files for any of those structures to your computer or to an EC2 instance in the cloud using downloadcmd (see other thread).
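If it helps, here is a rough sketch of kicking off that download from Python once you have a package number. The -dp/-d flags are what I remember from nda-tools’ downloadcmd and may differ between versions, so confirm with downloadcmd --help before running anything.

```python
# Hedged sketch: call nda-tools' downloadcmd for a whole package.
# Flag names are assumptions from memory; check `downloadcmd --help`.
import subprocess

package_id = "1234567"  # hypothetical package number from the NDA packaging step

subprocess.run(
    [
        "downloadcmd",
        "-dp", package_id,           # the package to download
        "-d", "/data/nda_package",   # target directory (local disk or an EC2 volume)
    ],
    check=True,  # raise if downloadcmd exits with an error
)
```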

After downloading the ‘flat’ rectangular data, you can go in a couple of different directions.
One direction would have you merge all of the behavioral data back into one super ‘mega’ table of all the behavioral measurements, which you can send to your favorite statistical analysis software package and work with in the usual way (I’m WAY oversimplifying the number of steps involved in creating anything close to as awesome as DEAP, but Wes Thomson did mention that the merge code is available somewhere in their set of repositories). Aside: ABCD has its very own set of behavioral data structures, so you’ll have to tweak the merge for GUID/date/etc.-based joins with non-ABCD NDA data.
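To make that direction a bit more concrete, here is a minimal sketch under a few assumptions: that the structure files downloadcmd drops are tab-delimited text with a second header row of element descriptions (which you skip), and that they share GUID/date columns named something like subjectkey and interview_date (with eventname added for ABCD structures). The column names and paths are placeholders, not the actual merge code Wes mentioned.

```python
# Hedged sketch of the 'mega table' direction with pandas.
# Assumptions: tab-delimited NDA structure files whose second row holds
# element descriptions, and shared key columns 'subjectkey'/'interview_date'.
import pandas as pd

def read_structure(path):
    # Row 0: element short names; row 1: long descriptions (skip it).
    return pd.read_csv(path, sep="\t", skiprows=[1], low_memory=False)

socdem = read_structure("/data/nda_package/socdem01.txt")  # placeholder paths
medhx = read_structure("/data/nda_package/medhx01.txt")

keys = ["subjectkey", "interview_date"]  # add "eventname" for ABCD structures
mega = socdem.merge(medhx, on=keys, how="outer", suffixes=("_socdem", "_medhx"))

mega.to_csv("/data/nda_package/mega_table.csv", index=False)
```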

The other direction is where the S3 complication comes in: the big data (e.g. the actual images) are pointed to via a manifest of S3 links. But you can’t scroll through the list of these files before opening them, the way you would with a mounted filesystem on your work computer, because ‘listbuckets’ is disabled at the NDA for security reasons. I’ve been told a number of times that DataLad gets around this problem, which is why I’m reading their manual ahead of the ABCD-ReproNim lecture in which it will be introduced. To ‘touch’ any one of the files pointed to by this manifest, you have to retrieve it through some sort of ‘get’ statement. You could do this as needed with datalad get and datalad drop, or you could use a tool to download your very own copy of everything in the package (which may take weeks/months to finish and $$$$ to store). Apparently the DCAN lab has a tool that does this quickly (see other thread).
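Here is what I understand the ‘get as needed’ workflow to look like through DataLad’s Python API: clone a dataset that tracks the S3-backed files, get the content of one file when you need it, then drop it to free the space. The dataset URL and file path below are hypothetical placeholders, not real ABCD/NDA locations, and the NDA credential handling is left out entirely.

```python
# Hedged sketch of datalad get / datalad drop via the Python API.
# The URL and file path are hypothetical placeholders.
import datalad.api as dl

dl.clone(
    source="https://example.org/some-nda-datalad-dataset",  # hypothetical dataset
    path="/data/abcd_datalad",
)
ds = dl.Dataset("/data/abcd_datalad")

target = "sub-XXXX/ses-baseline/anat/T1w.nii.gz"  # hypothetical file inside the dataset
ds.get(target)   # fetch the actual bytes (e.g. from S3) on demand
# ... run whatever analysis needs the file ...
ds.drop(target)  # give the local space back; the pointer stays in the dataset
```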

Once you have something that your code can treat as a filesystem mounted on a computer or supercomputer (I’m oversimplifying and probably misrepresenting concepts, so hopefully someone will correct me), NITRC-CE has a lot of analysis tools already put together as AMIs (images you can load onto an EC2) to take over from there. Brainlife.io is pretty stinkin’ awesome too and looks more container-focused, but I couldn’t tell you much about the infrastructure right this second, other than that getting it hooked up to the NDA data/permissions behemoth hadn’t happened yet in January or February 2020 when I first heard about it. Project week for enrolled students is also likely to be particularly illuminating.

This is pretty much everything I know, so hopefully others will chime in, or even better, point to a curated list somewhere.