I am in the process of creating a superdataset with 500k subjects. Each subject is a subdataset populated with relevant data. The issue is that after creating about 32k subdatasets, creating each new subject subdataset takes a long time (from almost 1 hr up to 3 hrs), and this time seems to keep increasing as more subjects are added to the superdataset. This makes the process very slow and inefficient. Is there any way to speed up the creation of new subdatasets?
The largest number of subdatasets ever attempted (that I am aware of) is 50k, so 500k is uncharted territory.
I think it would be best to head over to https://github.com/datalad/datalad/issues and document how you are approaching this, and what exactly is happening.
The first thing to understand is whether you already have all the data populating a single directory of the superdataset. 500k items in a single directory is a tough challenge for many filesystems (even with DataLad out of the picture). Which filesystem and operating system are you on?
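To get a rough sense of the raw filesystem cost independent of DataLad, you could time a plain scan of the superdataset's root directory. This is a generic sketch (the path in the usage comment is a placeholder, not anything from your setup):

```python
import os
import time

def count_entries(path):
    """Count directory entries and time the scan, without stat'ing each one."""
    start = time.perf_counter()
    with os.scandir(path) as it:
        n = sum(1 for _ in it)
    elapsed = time.perf_counter() - start
    return n, elapsed

# Example usage (placeholder path):
# n, t = count_entries("/data/superdataset")
# print(f"{n} entries listed in {t:.2f}s")
```

If even this plain listing is slow at 32k entries, the bottleneck is the filesystem itself rather than DataLad's bookkeeping.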
What code and commands are you executing exactly?
Is 1h-3h the time to add a single subdataset? Is there a sharp transition, or a gradual slowdown?
That being said, the upcoming 0.14 release has numerous performance improvements for large datasets.
And on a more conceptual level: consider whether there are meaningful intermediate categorizations of your subjects. If you build a flat 500k-subdataset dataset, cloning/deploying it will be a challenge for any consuming system for filesystem reasons alone. If you introduce an additional dataset level, the joint superdataset becomes much more manageable.
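One common way to introduce such a level (a hypothetical sketch, not a DataLad-prescribed layout; the `grp-` naming and prefix length are arbitrary choices for illustration) is to shard subjects into intermediate group datasets keyed by an ID prefix, so that no single directory ever holds more than a bounded number of entries:

```python
def bucketed_path(subject_id, prefix_len=3):
    """Map a subject ID to a two-level path, e.g. 'sub-123456' -> 'grp-123/sub-123456'.

    With 6-digit sequential IDs and a 3-digit prefix, each group directory
    holds at most 1000 subjects, and the superdataset tracks only the group
    datasets instead of 500k direct subdatasets.
    """
    digits = subject_id.split("-", 1)[1]
    return f"grp-{digits[:prefix_len]}/{subject_id}"
```

Each `grp-*` directory could itself be a subdataset (created once with `datalad create -d .`), with the individual subject subdatasets registered in their group rather than directly in the top-level superdataset.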
Thanks for the reply. I created this issue and explained the project with all the steps and more information. We may need to redefine the whole project structure, and I'd like to have the DataLad team's suggestions to increase the capability of our data management and consumption.