Example Code for Batch (analyzing multiple subjects)

I am running a representational similarity analysis and have written a script for analyzing a single subject’s data. The script extracts the neural signals and calculates the coefficient score for each condition for each subject. How can I set up something like a batch process to automatically run this for each subject’s data saved in a folder? I am using Python, and the main module is Nilearn, if that helps.

Thank you so much!

Best,
Jacob


To be clear, I want to run it for each subject based on the subject’s ID. If you can give me an example or direct me to the right link for further discussion, I would really appreciate it.

One method would be to use python multiprocessing:

import multiprocessing
import os

subs = []  # fill in with your list of subject IDs, e.g. ['sub-001', 'sub-002']

def your_function_here(sub):
    # Your single-subject analysis goes here
    ...

if __name__ == '__main__':
    # Spawn one worker per available CPU core and map the function over subjects
    with multiprocessing.Pool(os.cpu_count()) as proc_pool:
        run_parallel = proc_pool.map(your_function_here, subs)

Another is to use a Slurm job array (if you are on an HPC, for example). This, at least the way I usually do it, involves three scripts.

The submission script (submit_job_array.sh)

NOTE: the script below presumes that there is a directory containing folders named after subjects (such as a BIDS root directory). Replace the all-capitalized variables to match your needs.

#!/bin/bash
pushd $DIRECTORY_WHERE_SUBJECT_FOLDERS_ARE
subjs=($(ls sub*/ -d)) # Get list of subject directory names
subjs=("${subjs[@]///}") # Remove the lagging slash
popd

# take the length of the array
# this will be useful for indexing later
len=$(expr ${#subjs[@]} - 1) # len - 1 because 0 index
echo Spawning ${#subjs[@]} sub-jobs. # print to command line

sbatch --array=0-$len run_python_single_subject.sh ${subjs[@]}
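This first script is the only one you launch yourself (e.g. with bash submit_job_array.sh); it then uses sbatch to spawn one array task per subject, each of which runs the second script below.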

The script that calls python for the subject (run_python_single_subject.sh)

Adjust the sbatch header as needed based on the requirements of your processing

#!/bin/bash
#SBATCH --time=1:00:00
#SBATCH --mem=8GB
#SBATCH --cpus-per-task=8
#SBATCH -J my_analysis

args=($@)
subjs=(${args[@]:0})
sub=${subjs[${SLURM_ARRAY_TASK_ID}]}
echo $sub

python $YOUR_PYTHON_FUNCTION.py $sub

Your python function

With this, your function will look mostly like it already does, but you will need to make sure the subject name passed in to Python is parsed as a command-line argument. So you will need to add the following:

import argparse
parser = argparse.ArgumentParser(description='PUT WHAT YOUR FUNCTION DOES HERE')
parser.add_argument('sub', metavar='Subject', type=str, nargs='+', help='Subject name')
args = parser.parse_args()
sub = args.sub[0]
#PUT YOUR FUNCTION HERE

And then put the rest of your function there.

If you are on an HPC, I would prefer this second method, because you can more explicitly control how many resources are devoted to each job.

Hope this helps,
Steven

Thanks for always being so helpful, Steven! I will take a look at this and adapt it for my purposes. I guess the .sh approach might be a good one, as I used shell scripts to run another type of analysis a long time ago; it might take some time to refresh my memory, but I will definitely give it a try.

Hi Steven, is there any example code online that I can take a look at for python multiprocess?

Did the code I provide for that not work?

I haven’t tried it out yet, but I am sure it will work. I just wanted to have more options and learn new things.

Note also that joblib is a rather user-friendly solution for parallel code execution.

https://joblib.readthedocs.io/en/latest/parallel.html
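For example, a minimal sketch (assuming the same your_function_here and subs as in the multiprocessing example above; n_jobs=8 is just an illustrative choice):

from joblib import Parallel, delayed

# Run the single-subject function over all subjects on 8 parallel workers.
# your_function_here and subs are assumed to be defined as in the
# multiprocessing example earlier in this thread.
results = Parallel(n_jobs=8)(delayed(your_function_here)(sub) for sub in subs)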

Hi Steven,

Just a follow-up question. So these three sbatch scripts should be written separately as .py files and run separately, correct? I guess the first and second scripts won’t create any outputs in the subject data folders, but after running the Python function script there should be some output for me, right?

Also, where can I get this? Is it from the first script?

In the third script, would this be the Python analysis script I have been working on?

Thanks in advance!

No, the first two are bash .sh files, the third is your python function (with the additional stuff I wrote so that you could pass in a subject from the command line).

No, the only thing you run is the first script, which calls upon the second script (which calls upon your python function).

You actually don’t change that. In a Slurm job array, the array task number is automatically assigned to that specific variable name.

Yes. If you need to change the name of the variable sub to match what you’ve already written, you can do that.

Got it, thank you so much! I guess what I need to do is adapt the first .sh file to my data path, and then copy over the additions from the third script to the very beginning of my analysis Python script. By the way, I guess as long as these three are saved in the same folder, I don’t need to worry about specifying the paths of these scripts?

Yes, and as I explained, the code assumes that in the specified path there are folders called sub-XXX, with one folder per subject. If that’s not the case, you will have to find another identifier or method for importing the subject names. There are plenty of ways of doing that (one example is sketched below), and Google will be your friend.
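For instance, a minimal sketch in Python (the data_root path is hypothetical; adjust the sub-* pattern to however your folders are actually named):

import glob
import os

# Hypothetical root directory that holds the per-subject folders
data_root = '/path/to/your/data'

# Collect the folder names matching sub-* and keep only the folder basenames,
# giving a list of subject IDs such as ['sub-001', 'sub-002', ...]
subs = sorted(os.path.basename(p)
              for p in glob.glob(os.path.join(data_root, 'sub-*'))
              if os.path.isdir(p))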

Yes

Yes, and your terminal should be in that directory too before running.

I don’t know about your memory quotas on your supercomputer, how many jobs you will be running, or how computationally intensive these jobs are, but update the SBATCH header in the second script to match your needs.

Yes, we have this format for each subject on a server

I think so, as we have a server and a pipeline for analyzing the data; I guess the terminal is in that data directory.
Just out of curiosity, what if I have data on my own computer? How can I set up the terminal in my folder directory?

Thanks again, Steven!

By terminal I mean your command line, i.e. the shell prompt where you type commands.

You use cd PATH_GOES_HERE to move the terminal into a folder.

It depends on the operating system. Mac and Linux have similar syntax (cd); Windows is different (I don’t use Windows, so you can Google it if necessary). However, I doubt you have the Slurm job scheduler (used to parallelize jobs and allocate supercomputer resources to them) on your own personal computer, so it doesn’t really matter.

Explaining on this thread how terminals work is probably beginning to branch too far from the topic, so I’ll just leave this tutorial here.

Oh, got it, that makes sense. Yes, I use the Mac terminal, so I can just cd into wherever I need.

Just to reiterate, for the Slurm job array option to work, the data need to be on the supercomputer and you will need to use a terminal based in the supercomputer (e.g. by SSH-ing into your supercomputer from your local Mac terminal).

Yes, thanks for this reminder, Steven. I guess I should have access to use this. We have a university-wide server; I guess it is an HPC, as I have submitted Slurm jobs for analyses before.

Thanks I will take a look at it!

Hi Steven, I have a follow-up question, hope you don’t mind. I am very new to Python, and for convenience I downloaded some data onto my own computer while writing the script. As such, in my analysis script I defined a single subject’s data as follows:

import os

data_path_self = '/Users/jd/nilearn_data/Self_postproc/func'
func_self_filename = os.path.join(data_path_self, 'sub-023sess001_task-CupsSelf_space-MNI152NLin2009cAsym_desc-preproc_bold_postproc.nii.gz')

I wondered whether, if I am using the three scripts you provided above, I should change something in how I define the path and subject ID so that the data can be processed and analyzed properly when using sbatch.

Here are some data in the subject data folder:
sub-003sess001_task-CupsSelf_space-MNI152NLin2009cAsym_desc-preproc_bold.nii.gz
sub-003sess001_task-CupsParent_space-MNI152NLin2009cAsym_desc-preproc_bold.nii.gz
sub-003sess001_task-CupsPeer_space-MNI152NLin2009cAsym_desc-preproc_bold.nii.gz

I guess I need to define some pattern to represent the data I want to use, instead of a specific subject’s data file name, so that the script can identify the files and iterate over them?
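If it helps, here is a hypothetical sketch of that idea, assuming the subject ID is passed into the script through the argparse snippet above; the path and filename template come from the examples earlier in the thread and should be adjusted to match the actual files (e.g. with or without the _postproc suffix):

import os

# 'sub' would normally come from the argparse snippet above, e.g. sub = args.sub[0];
# it is hard-coded here only for illustration
sub = 'sub-003sess001'

data_path_self = '/Users/jd/nilearn_data/Self_postproc/func'  # or the equivalent path on the server

# Build the filename from the subject ID instead of hard-coding a specific file
func_self_filename = os.path.join(
    data_path_self,
    f'{sub}_task-CupsSelf_space-MNI152NLin2009cAsym_desc-preproc_bold_postproc.nii.gz')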