Problems setting up Tractoflow on an HPC system

Hi everyone,

I’m very interested in testing the performance of TractoFlow on my perinatal stroke dataset. I’m working on setting up the tool on my local high-performance computing (HPC) system, but I’m running into some issues. I’ve done my best to follow the documented instructions, but I may have misunderstood some of the wording or missed implied steps. I have installed Nextflow and built the TractoFlow Singularity image; I believe everything is in order there, but I’m not sure how to verify that until my datasets process properly.
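For reference, the only checks I’ve managed so far are very basic ones (the image path below is just where I put my build):

# confirm Nextflow runs under the cluster's Java module
module load java
nextflow -version

# confirm the built image file exists and has a plausible size
ls -lh ~/tractoflow/tractoflow.img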

My attempts to run test subjects have been unsuccessful so far, and I’d like input from more experienced users who can point out what may be driving my errors. Here’s the script I’m using to submit jobs:

#!/bin/sh

#SBATCH --nodes=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=16000
#SBATCH --time=48:00:00
module load java
module load fsl
module load mrtrix
module load ants

nextflow -c /home/blgeerae/tractoflow/tractoflow-2.0.1/nextflow.config run ~/tractoflow/tractoflow-2.0.1/main.nf --root ~/tractoflow_test_subjects --dti_shells "0 750" --fodf_shells "0 750" -with-singularity ~/tractoflow/tractoflow.img -resume

Processing errors occur immediately after the job begins; the error output generally looks like this:

executor > local (6)
[56/9aaed3] process > README [100%] 1 of 1, failed: 1
[69/e860e0] process > Denoise_DWI [ 50%] 1 of 2, failed: 1
[0c/582e82] process > Bet_Prelim_DWI [ 33%] 1 of 3, failed: 1
[56/9aaed3] NOTE: Process README (README) terminated with an error exit status (255) -- Execution is retried (1)
[5f/334fbf] NOTE: Process Denoise_DWI (01-1023) terminated with an error exit status (255) -- Execution is retried (1)
[99/42c7a8] NOTE: Process Bet_Prelim_DWI (10-1005) terminated with an error exit status (255) -- Execution is retried (1)

executor > local (7)
[56/9aaed3] process > README [100%] 1 of 1, failed: 1
[69/e860e0] process > Denoise_DWI [ 50%] 1 of 2, failed: 1
[f5/3336b4] process > Bet_Prelim_DWI [ 50%] 2 of 4, failed: 2
[56/9aaed3] NOTE: Process README (README) terminated with an error exit status (255) -- Execution is retried (1)
[5f/334fbf] NOTE: Process Denoise_DWI (01-1023) terminated with an error exit status (255) -- Execution is retried (1)
[99/42c7a8] NOTE: Process Bet_Prelim_DWI (10-1005) terminated with an error exit status (255) -- Execution is retried (1)
[49/c0690d] NOTE: Process Bet_Prelim_DWI (03-3147) terminated with an error exit status (255) -- Execution is retried (1)

It looks to me like TractoFlow is unable to find the relevant commands, perhaps? I had hoped that my "module load ___" lines would be enough to make the required tools available to TractoFlow, but I'm not sure how to troubleshoot this on a remote cluster.

If anyone can provide me with suggestions or comment on installation steps I might have missed, I would really appreciate your help!

Hi blgeerae,

Did you try to load the Singularity dependency with module load singularity? If not, can you try loading it? If Singularity is not available on your cluster and you have sudo access, you can install it via NeuroDebian (https://tractoflow-documentation.readthedocs.io/en/latest/installation/before_install.html#id3).
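For example, something like this should tell you whether your cluster provides a Singularity module (module names vary between sites, so treat this as a sketch):

# list any Singularity modules the cluster provides
module avail singularity

# if one is listed, load it and confirm the version before calling nextflow
module load singularity
singularity --version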

If you have other errors, send me a link to your cluster's website if it has one.

Guillaume

Thank you for the response! I tried adding "module load singularity" to my script (Singularity 3.4.1 is available on my cluster). Rather than failing in about a minute, the job now fails after roughly 15 minutes, with error output that looks similar to before. Any other ideas I can try will be greatly appreciated!

Here’s the link to my cluster’s website (if I understand you right):

https://hpc.ucalgary.ca/quickstart/arc

If I can provide any other details please let me know, I’m happy to help but only marginally experienced, so not sure what to look for.

Thanks for the details. You seem to have a different issue. Can you copy and paste the following information for me:

The content of /home/blgeerae/tractoflow/tractoflow-2.0.1/nextflow.config
In your output, you should normally see a line like this: [99/42c7a8] NOTE: Process Bet_Prelim_DWI (10-1005) terminated with an error exit status (255) -- Execution is retried (1).

In this line, the characters 99/42c7a8 are part of a path.

For your last run, can you take this path and insert it into the following command: cat work/YOUR_PATH*/.command.log (don't forget the * after the path)? Then copy and paste the output into this discussion. If you are not able to retrieve this information, just paste the TractoFlow output here and I will give you the exact command directly.
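As a concrete example using the 99/42c7a8 hash from above (your hashes will differ from run to run):

cat work/99/42c7a8*/.command.log   # combined output of that process
cat work/99/42c7a8*/.command.sh    # the exact script Nextflow tried to execute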

Guillaume

Hi again Guillaume,

Here’s the contents of my nextflow.config file:

process {
publishDir = {"./results/$sid/$task.process"}
scratch = true
errorStrategy = { task.attempt <= 3 ? 'retry' : 'ignore' }
maxRetries = 3
maxErrors = -1
stageInMode = 'copy'
stageOutMode = 'rsync'
tag = { "$sid" }
afterScript = 'sleep 1'
}

params {
//Global options//
b0_thr_extract_b0=10
dwi_shell_tolerance=20

//**Preliminary DWI brain extraction**//
    dilate_b0_mask_prelim_brain_extraction=5
    bet_prelim_f=0.16

//**Denoise dwi (dwidenoise in Mrtrix3)**//
    run_dwi_denoising=true
    extent=7

//**Topup**//
    run_topup=true
    config_topup="b02b0.cnf"
    encoding_direction="y"
    dwell_time=0.062
    prefix_topup="topup_results"

//**Eddy**//
    run_eddy=true
    eddy_cmd="eddy_openmp"
    bet_topup_before_eddy_f=0.16
    use_slice_drop_correction=true

//**Final DWI BET**//
    bet_dwi_final_f=0.16

//**Resample T1**//
    run_resample_t1=true
    t1_resolution=1
    t1_interpolation="lin"

//**Normalize DWI**//
    fa_mask_threshold=0.4

//**Resample DWI**//
    run_resample_dwi=true
    dwi_resolution=1
    dwi_interpolation="lin"

//**Segment tissues**//
    number_of_tissues=3

//**Compute fiber response function (frf)**//
    fa=0.7
    min_fa=0.5
    min_nvox=300
    roi_radius=20
    set_frf=false
    manual_frf="15,4,4"

//**Mean fiber response function (frf)**//
    mean_frf=true

//**Compute fODF metrics**//
    sh_order=8
    basis="descoteaux07"
    fodf_metrics_a_factor=2.0
    relative_threshold=0.1
    max_fa_in_ventricle=0.1
    min_md_in_ventricle=0.003

//**Seeding mask**//
    wm_seeding=true

//**PFT tracking**//
    compress_streamlines=true
    algo="prob"
    seeding="npv"
    nbr_seeds=10
    random=0
    step=0.1
    theta=20
    sfthres=0.1
    sfthres_init=0.5
    min_len=20
    max_len=200
    particles=15
    back=2
    front=1
    compress_value=0.2

//**Number of processes per tasks**//
    processes_brain_extraction_t1=4
    processes_denoise_dwi=4
    processes_denoise_t1=4
    processes_eddy=1
    processes_fodf=4
    processes_registration=4

//**Template T1 path**//
    template_t1="/human-data/mni_152_sym_09c/t1"

//**Output directory**//
    output_dir=false

//**Process control**//
    processes = false

Mean_FRF_Publish_Dir = "./results/Mean_FRF"
Readme_Publish_Dir = "./results/Readme"

}

env {
ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS=1
MRTRIX_NTHREADS=1
OMP_NUM_THREADS=1
OPENBLAS_NUM_THREADS=1
}

if(params.output_dir) {
process.publishDir = {"$params.output_dir/$sid/$task.process"}
params.Mean_FRF_Publish_Dir = "${params.output_dir}/Mean_FRF"
params.Readme_Publish_Dir = "${params.output_dir}/Readme"
}

if(params.processes) {
if(params.processes > Runtime.runtime.availableProcessors()) {
throw new RuntimeException("Number of processes higher than available CPUs.")
}
else if(params.processes < 1) {
throw new RuntimeException("When set, number of processes must be >= 1 " +
"and smaller or equal to the number of CPUs.")
}
else {
executor.$local.cpus = params.processes
}
}

singularity {
runOptions='--bind /home/blgeerae/tractoflow_test_subjects/'
}

I changed one setting in this config file: within the "singularity {" section at the very bottom, I changed runOptions='--nv' to runOptions='--bind …' as shown in the block quote above. I made this change based on the 'How to launch Tractoflow' section of the documentation, hoping to point TractoFlow to the location of the disk holding my data.

Next, here's the output produced by "cat work/84/a9e7a6*/.command.log":

nxf-scratch-dir fc25:/tmp/nxf.VENjHsN978
WARNING: passwd file doesn't exist in container, not updating
WARNING: group file doesn't exist in container, not updating
FATAL: shell /bin/sh doesn't exist in container

Since I've got a folder set up with 9 test datasets, I had many different work paths to choose from; I just picked the last one in the list.

I've got the full output from the TractoFlow job, but since I'm unable to upload files I'll hold off on pasting the whole thing here for now; it's quite a long log.

Thank you!

Bryce

Hello Bryce,

To be sure, I will run TractoFlow on my side with your Singularity version. It is the first time we have seen this error here, but we saw a similar issue reported on the Singularity GitHub; that one was due to a bug in the Singularity version. I will come back after my test.

Guillaume

Perfect, thank you so much for your continued assistance! I’ll look forward to hearing from you again with your processing results.

Bryce

Hi,

Just a quick additional question: it seems that /bin/sh doesn't exist in your Singularity container, which is extremely weird.

On which OS, and with which version of the singularity engine did you build your container?

Also, could you try launching your processing with our prebuilt container? It is available here: http://scil.usherbrooke.ca/en/containers_list/ with the Tractoflow 2.0.0 tag.

That would help us see where the issue comes from.

As an additional side note, normally, if you use the singularity container, FSL, Mrtrix and Ants should already be included, so you wouldn’t need to module load them.
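For instance, a quick way to confirm that would be something along these lines (the image path and the exact binary names here are just examples):

# check that a few expected binaries are visible inside the container
singularity exec ~/tractoflow/tractoflow.img which dwidenoise eddy_openmp antsRegistration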

Cheers!

I completely agree! I’m not sure why or how /bin/sh wouldn’t exist. Perhaps I’ve gotten part of the command or config files wrong.

When I built the singularity image, I used singularity 3.4.1 on my HPC system, which runs CentOS 7.

I’ve got the tractoflow 2.0.0 container with me but I could use some guidance on how to structure the command to use the container. Here’s the command I’ve used:

nextflow -c singularity.conf run tractoflow-2.0.1/main.nf --root ~/tractoflow_test_subjects --dti_shells "0 750" --fodf_shells "0 750" -with-singularity tractoflow_2.0.0_8b39aee_2019_04_26.img -resume

singularity.conf has one line:

singularity.runOptions="--bind /home/blgeerae"

I'm sure that "run tractoflow-2.0.1/main.nf" isn't correct, but if I replace it with the sample command "tractoflow/main.nf", nextflow exits with the error "-- Make sure exists a GitHub repository at this address 'https://github.com/nextflow-io/tractoflow'". So I don't think I've got that part right. Any insight will be appreciated!

Maybe building on CentOS creates the difference. We normally build the containers on Ubuntu.

Note that a container built on Ubuntu can work on CentOS (and therefore, on your HPC system).

As for the error about tractoflow/main.nf, did you git clone or get a local copy of Tractoflow on your HPC, directly in the directory where your command is executed? Nextflow will try to look in the tractoflow directory for the main.nf file, and if it doesn’t exist, will try to fetch it from Nextflow’s Github, which is not the correct location.

In this case, the first thing to do is to validate that you have a local copy of Tractoflow, and ideally to use the full path so that Nextflow looks in the correct place.
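Concretely, something like this (paths adapted to your setup) should remove any ambiguity about which main.nf is being used:

# verify the local copy actually contains main.nf
ls /home/blgeerae/tractoflow-2.0.1/main.nf

# then point Nextflow at it with the full path
nextflow run /home/blgeerae/tractoflow-2.0.1/main.nf --help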

Hi again,

Thanks, I think I understand a little better now. Based on TractoFlow's documentation I didn't use git clone; instead I acquired the pipeline via wget https://github.com/scilus/tractoflow/archive/2.0.1.zip && unzip 2.0.1.zip. I've since used the same approach to get version 2.0.0 and tested both versions. The errors remain the same; a summary for tractoflow-2.0.0 is below:

Current error description

I’ve moved all relevant tractoflow files into the same directory and spelled out the full paths within the command like so:

module load singularity
nextflow -c /home/blgeerae/tractoflow-2.0.0/singularity.conf run /home/blgeerae/tractoflow-2.0.0/main.nf --root /home/blgeerae/tractoflow_test_subjects --dti_shells "0 750" --fodf_shells "0 750" -with-singularity /home/blgeerae/tractoflow-2.0.0/tractoflow_2.0.0_8b39aee_2019_04_26.img -resume

I’m executing that command from the /home/blgeerae/tractoflow-2.0.0/ directory, although I imagine that shouldn’t matter. This is resulting in the same error output, example:


Template T1: /human-data/mni_152_sym_09c/t1

Input: /home/blgeerae/tractoflow_test_subjects
executor > local (16)
[e3/83e327] process > Denoise_DWI [ 75%] 3 of 4, failed: 3
executor > local (16)
[e3/83e327] process > Denoise_DWI [100%] 4 of 4, failed: 4
[89/e6193e] process > README [100%] 4 of 4, failed: 4 ✔
[e1/090ba8] process > Bet_Prelim_DWI [100%] 4 of 4, failed: 4
[92/773184] process > Denoise_T1 [100%] 4 of 4, failed: 4 ✔
Pipeline completed at: Fri May 01 15:23:44 MDT 2020
Execution status: failed
Execution duration: 17.2s
[df/9c3af5] NOTE: Process README (README) terminated with an error exit status (255) -- Execution is retried (1)
[dc/7db30b] NOTE: Process README (README) terminated with an error exit status (255) -- Execution is retried (2)
[67/47ca2b] NOTE: Process Denoise_T1 (01-1006) terminated with an error exit status (255) -- Execution is retried (1)
[a7/d7c9c1] NOTE: Process Bet_Prelim_DWI (01-1006) terminated with an error exit status (255) -- Execution is retried (1)
[0a/414484] NOTE: Process Denoise_DWI (01-1006) terminated with an error exit status (255) -- Execution is retried (1)
[ae/c997e9] NOTE: Process README (README) terminated with an error exit status (255) -- Execution is retried (3)

cat work/df/9c3af5*/.command.log outputs the same information as before:

[blgeerae@arc2 tractoflow-2.0.0]$ cat work/df/9c3af5*/.command.log
nxf-scratch-dir arc2:/tmp/nxf.suDDOLk4iU
WARNING: passwd file doesn't exist in container, not updating
WARNING: group file doesn't exist in container, not updating
FATAL: shell /bin/sh doesn't exist in container

Current thoughts / questions

I think I need a better understanding of the purpose of a Singularity container; I'm completely unfamiliar with Singularity. I would have assumed that the entire TractoFlow package, along with its dependencies, was contained within the Singularity image. If so, why do I need a local copy of TractoFlow as well? Further, does FATAL: shell /bin/sh doesn't exist in container mean that there is an issue within the container I've acquired, or is this a problem on my HPC's end? For the record, I have both manually downloaded the container from http://scil.usherbrooke.ca/en/containers_list/ and tried wget http://scil.usherbrooke.ca/containers_list/tractoflow_2.0.0_8b39aee_2019-04-26.img. Same issue with both.
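In case it helps with diagnosing, here is the kind of check I've been imagining would separate a container problem from an HPC problem (just my guess at a useful command, with the filename from my run):

# if the image were intact, this should list a shell inside the container
singularity exec tractoflow_2.0.0_8b39aee_2019_04_26.img ls -l /bin/sh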

Thanks so much,

Bryce

Hi all,

I've got good news to share! With the ongoing help of my local HPC support team, I've got TractoFlow up and running! The root of the problem we have been discussing here was an incomplete copy of the TractoFlow Singularity container. I had tried a couple of methods to get the container onto my HPC and thought everything was fine, but it turned out that each method resulted in an incomplete copy. Using the git clone method finally gave me a complete container. Singularity v3.4 is working fine with the container now; the only other problem I had to resolve was that there was no copy of rsync on my work nodes. Once I added rsync to a local directory and put that directory on the PATH in my job script, TractoFlow began to run correctly.
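For anyone who runs into the same thing, these are the checks I plan to do on future container downloads (the expected size or checksum would have to come from wherever you obtained the image):

# compare the local file size with what the download source reports;
# a truncated transfer shows up immediately
ls -lh tractoflow_2.0.0_8b39aee_2019-04-26.img

# if a checksum is available for the image (or can be computed from a
# known-good copy), verify it after every transfer
sha256sum tractoflow_2.0.0_8b39aee_2019-04-26.img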

I’m finally running a trial of perinatal stroke datasets to see how Tractoflow performs! Looking forward to the results.

Thank you again, very much, for helping me through this process.

Hi Bryce,

Good news! If you have any other issues, don't hesitate to post them on Neurostars.

Cheers !

Hi @blgeerae,

Can you mark this thread as solved?

Thank you in advance
Arnaud