Running Tractoflow in HPC multi-node

Summary of what happened:

I’m trying to run Tractoflow on an HPC cluster managed by SLURM, using multiple nodes. I changed the nextflow.config file to include executor = 'slurm'.

Command used:

#!/bin/bash
#SBATCH --account=haslab
#SBATCH --job-name=tractoflow-all-run
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --time=48:00:00
#SBATCH --output=/projects/ricarlojo/Tractoflow-adap-workspace/run_%j_n2_cpu.out  # std out
#SBATCH --error=/projects/ricarlojo/Tractoflow-adap-workspace/run_%j_n2_cpu.err   # std err

export OMP_NUM_THREADS=8
export ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS=8
export NXF_VER=21.10.6
export MPLCONFIGDIR=/projects/ricarlojo/Tractoflow-workspace/Matlab-w


nextflow run /projects/ricarlojo/tractoflow-adapt/main.nf --input /home/ricarlojo/HCP_files/T1w \
-with-singularity /projects/ricarlojo/scilus_1.6.0.sif --output_dir /projects/ricarlojo/Tractoflow-adap-workspace/results-n2 -with-tower

Version:

v.2.4.3

Environment:

Singularity

Relevant log outputs:

[b4/361e0f] NOTE: Error submitting process 'N4_T1 (S2)' for execution -- Execution is retried (2)
[9a/fadafa] NOTE: Error submitting process 'N4_T1 (S1)' for execution -- Execution is retried (2)
[b7/5ac631] NOTE: Error submitting process 'Denoise_DWI (S2)' for execution -- Execution is retried (3)
[ed/209420] NOTE: Error submitting process 'Denoise_DWI (S1)' for execution -- Execution is retried (3)
[18/3471df] NOTE: Error submitting process 'README (README)' for execution -- Execution is retried (3)
[5a/f68310] NOTE: Error submitting process 'Denoise_DWI (S3)' for execution -- Error is ignored
[b5/a45e3d] NOTE: Error submitting process 'N4_T1 (S3)' for execution -- Execution is retried (3)
[17/ec3d8b] NOTE: Error submitting process 'N4_T1 (S2)' for execution -- Execution is retried (3)
[0b/127b50] NOTE: Error submitting process 'N4_T1 (S1)' for execution -- Execution is retried (3)
[b5/9910bc] NOTE: Error submitting process 'Denoise_DWI (S2)' for execution -- Error is ignored
[e5/4f1aa4] NOTE: Error submitting process 'Denoise_DWI (S1)' for execution -- Error is ignored
[8b/471a2e] NOTE: Error submitting process 'README (README)' for execution -- Error is ignored
[27/ba8bec] NOTE: Error submitting process 'N4_T1 (S3)' for execution -- Error is ignored
[b3/505148] NOTE: Error submitting process 'N4_T1 (S2)' for execution -- Error is ignored
[2a/ffb959] NOTE: Error submitting process 'N4_T1 (S1)' for execution -- Error is ignored

Hello @Ricardo_A ,

I would need to know what’s behind these errors.
Can you check one of them? Please run this command:

cat work/18/3471df*/.command.err

Can you also give me the singularity version you have with this command line:

singularity version

Thank you
Arnaud

Hello @abore,

I checked the work folder recursively and it seems that the error file is not being created for any subprocess, just the .command.run and the .command.sh files.

My singularity version is 4.1.2-1.el9.

Thank you
Ricardo

Hello @Ricardo_A ,

This is the first time I have seen something like this; I have always found some *.err files.
Another thing: I’m not using Singularity this way anymore. I’m using it through Apptainer, whose version is now close to 1.3.x. I don’t know if that makes a difference.

Can you share your .nextflow.log?
I hope we can get more info from this file. Another option would be to try a single node and remove executor = 'slurm' from nextflow.config.

Hope it helps
Best,
Arnaud

Hi @abore,

I managed to execute the standard pipeline, so I think the problem was within the scheduler.

The requested file looks as follows for each process it was trying to launch:


Jan-20 21:32:56.643 [main] DEBUG nextflow.executor.ExecutorFactory - << taskConfig executor: slurm
Jan-20 21:32:56.644 [main] DEBUG nextflow.executor.ExecutorFactory - >> processorType: 'slurm'
Jan-20 21:32:56.653 [main] DEBUG nextflow.executor.ExecutorFactory - << taskConfig executor: slurm
Jan-20 21:32:56.653 [main] DEBUG nextflow.executor.ExecutorFactory - >> processorType: 'slurm'
Jan-20 21:32:56.662 [main] DEBUG nextflow.executor.ExecutorFactory - << taskConfig executor: slurm
Jan-20 21:32:56.663 [main] DEBUG nextflow.executor.ExecutorFactory - >> processorType: 'slurm'
Jan-20 21:32:56.669 [main] DEBUG nextflow.executor.ExecutorFactory - << taskConfig executor: slurm
Jan-20 21:32:56.669 [main] DEBUG nextflow.executor.ExecutorFactory - >> processorType: 'slurm'
Jan-20 21:32:56.678 [main] DEBUG nextflow.executor.ExecutorFactory - << taskConfig executor: slurm
Jan-20 21:32:56.678 [main] DEBUG nextflow.executor.ExecutorFactory - >> processorType: 'slurm'
Jan-20 21:32:56.681 [main] DEBUG nextflow.executor.ExecutorFactory - << taskConfig executor: slurm
Jan-20 21:32:56.681 [main] DEBUG nextflow.executor.ExecutorFactory - >> processorType: 'slurm'
Jan-20 21:32:56.684 [main] DEBUG nextflow.executor.ExecutorFactory - << taskConfig executor: slurm
Jan-20 21:32:56.685 [main] DEBUG nextflow.executor.ExecutorFactory - >> processorType: 'slurm'
Jan-20 21:32:56.687 [main] DEBUG nextflow.executor.ExecutorFactory - << taskConfig executor: slurm
Jan-20 21:32:56.688 [main] DEBUG nextflow.executor.ExecutorFactory - >> processorType: 'slurm'
Jan-20 21:32:56.691 [Task submitter] DEBUG nextflow.executor.GridTaskHandler - Failed to submit process Denoise_DWI (S1) > exit: 1; workDir: /projects/ricarlojo/Tractoflow-adap-workspace/results-n2/work/0c/8afc41303ff109a13dcce4900b8baf
sbatch: error: You have to specify an account. Usage of default accounts is forbidden.
sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified
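
In case it helps someone else, the submission failures and the sbatch messages that follow them can be pulled out of the log with something like the sketch below. The function name show_submit_errors is only illustrative, and it assumes the log is the .nextflow.log in the launch directory unless another path is given.

```shell
#!/bin/bash
# Print every "Failed to submit process" line from a Nextflow log,
# plus the two lines that follow it (here, the sbatch error messages).
# show_submit_errors is a hypothetical helper name, not part of Nextflow.
show_submit_errors() {
    local log_file="${1:-.nextflow.log}"   # default: log in the current directory
    grep -A2 'Failed to submit process' "$log_file"
}

# Usage (run from the directory where nextflow was launched):
#   show_submit_errors
#   show_submit_errors /path/to/.nextflow.log
```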

You are right, it was definitely an issue with the way you submitted your command with SLURM.
Tell me if you get any Tractoflow-related errors.
Best,
Arnaud
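
For anyone landing here with the same sbatch error: the #SBATCH --account line in the launch script only applies to the job that runs Nextflow itself. With executor = 'slurm', each process is submitted through its own sbatch call, so the account has to be passed to those as well. One possible way, sketched below and not tested on this cluster, is Nextflow's clusterOptions process setting. The file name slurm_account.config is an arbitrary choice for illustration, and haslab is the account from the script above.

```shell
#!/bin/bash
# Write a small extra Nextflow config that forwards the SLURM account
# to every task submission. The file name is an assumption; any name works.
cat > slurm_account.config <<'EOF'
process {
    executor = 'slurm'
    // Extra options added to each sbatch call made by Nextflow
    clusterOptions = '--account=haslab'
}
EOF

# Then add the file to the original command with -c, for example:
#   nextflow run /projects/ricarlojo/tractoflow-adapt/main.nf \
#       -c slurm_account.config \
#       ...remaining options unchanged...
```

The same two settings could instead go directly into the existing nextflow.config; a separate file just keeps the cluster-specific part apart from the pipeline defaults.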