HCP dataset - Google Colab running out of RAM

Hi all,

Our group is working on the HCP dataset and we’re having a problem with Google Colab. We run out of RAM and Colab crashes. Colab allocates 12 GB of RAM, and when we use it all we don’t see any pop-up notification offering to upgrade the RAM! I wonder if anyone has the same problem and if there’s a way to solve it.

Hello!

We are having the same problem! Did you solve this?

Hey,
Nope. Haven’t figured it out yet :confused:

Can you post your notebook with public view permissions?

Sure!
https://colab.research.google.com/drive/1VChnlFbLwev9vUOA1uJxk0aMksnIRSUH?usp=sharing

Seems like the option to upgrade to 25 GB for free is not there anymore (probably due to the upcoming Colab Pro). Perhaps you can look at methods for lazy reading, i.e., only loading a chunk of the data at a time.
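
For example, NumPy can memory-map an array saved as a .npy file, so only the slices you actually index get read into RAM. A minimal sketch; the file name and array layout here are placeholders, not the actual HCP file structure:

import numpy as np

# memory-map the file instead of reading the whole array into RAM
ts = np.load("subj01_rest.npy", mmap_mode="r")

# only the chunk you slice out is actually read from disk
# (the axis layout depends on how the data were saved)
chunk = np.array(ts[:, :100])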

1 Like

I had this problem too, and the only ‘solution’ I found so far was running it as an .ipynb on my own computer rather than through Google Colab. This does remove the collaborative aspect, though, and you would still need the required RAM and a Python installation on your own computer.

1 Like

Yeah, I tried that as well. I have the same problem with my local computer (not enough RAM). I guess the best option would be to chunk the data 🤷🏼‍♀️

The full notebook executes on Colab when we test it without using up all the RAM (it uses slightly less than half after running both the example rest and task analyses).

It’s not obvious from the notebook you shared where it’s running out of memory.
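
One way to narrow that down is to print the free RAM at a few points in the notebook; psutil is usually available on Colab. A rough sketch (the labels are just examples):

import psutil

def print_free_ram(label):
  # report how much RAM is still free at this point in the notebook
  free_gb = psutil.virtual_memory().available / 1e9
  print(f"{label}: {free_gb:.1f} GB free")

print_free_ram("before loading rest timeseries")
# ... run the loading / analysis cell ...
print_free_ram("after loading rest timeseries")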

One thing that I notice is that doing


import pickle
pickle.dump(timeseries_rest, open( "file2save", "wb" ))

is going to leave the file handle open. I don’t know how much extra memory that keeps alive or prevents from being garbage collected, but it is the main thing I see that’s different from the code we share.

Try doing

import pickle
# the "with" block closes the file automatically once the dump finishes
with open("file2save", "wb") as f:
  pickle.dump(timeseries_rest, f)

This is a good practice even if it doesn’t solve your specific issue.

2 Likes

This part will load the data from all subjects and hold it all in memory at the same time. If your analysis is per-subject, it might be better to load the data for each subject, run the analysis for one subject’s data and store the results, then load the next subject, etc. The results are probably much smaller than the raw data and can be stored for all subjects.

timeseries_rest = []
for subject in subjects:
  ts_concat = load_timeseries(subject, "rest")
  timeseries_rest.append(ts_concat)

timeseries_rest_array = np.array(timeseries_rest)

Something like this instead:

all_results = []
for subject in subjects:
  ts = load_timeseries(subject, "rest")
  results = do_some_analysis(ts)
  all_results.append(results)

make_plots(all_results)
2 Likes