Running Tractoflow in HPC multi-node

Summary of what happened:

I’m trying to run Tractoflow on an HPC cluster managed by SLURM, using multiple nodes, after changing the nextflow.config file to include executor = 'slurm'.

Command used:

#SBATCH --account=haslab
#SBATCH --job-name=tractoflow-all-run
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --time=48:00:00
#SBATCH --output=/projects/ricarlojo/Tractoflow-adap-workspace/run_%j_n2_cpu.out  # std out
#SBATCH --error=/projects/ricarlojo/Tractoflow-adap-workspace/run_%j_n2_cpu.err   # std err

export NXF_VER=21.10.6
export MPLCONFIGDIR=/projects/ricarlojo/Tractoflow-workspace/Matlab-w

nextflow run /projects/ricarlojo/tractoflow-adapt/ --input /home/ricarlojo/HCP_files/T1w \
-with-singularity /projects/ricarlojo/scilus_1.6.0.sif --output_dir /projects/ricarlojo/Tractoflow-adap-workspace/results-n2 -with-tower





Relevant log outputs:

[b4/361e0f] NOTE: Error submitting process 'N4_T1 (S2)' for execution -- Execution is retried (2)
[9a/fadafa] NOTE: Error submitting process 'N4_T1 (S1)' for execution -- Execution is retried (2)
[b7/5ac631] NOTE: Error submitting process 'Denoise_DWI (S2)' for execution -- Execution is retried (3)
[ed/209420] NOTE: Error submitting process 'Denoise_DWI (S1)' for execution -- Execution is retried (3)
[18/3471df] NOTE: Error submitting process 'README (README)' for execution -- Execution is retried (3)
[5a/f68310] NOTE: Error submitting process 'Denoise_DWI (S3)' for execution -- Error is ignored
[b5/a45e3d] NOTE: Error submitting process 'N4_T1 (S3)' for execution -- Execution is retried (3)
[17/ec3d8b] NOTE: Error submitting process 'N4_T1 (S2)' for execution -- Execution is retried (3)
[0b/127b50] NOTE: Error submitting process 'N4_T1 (S1)' for execution -- Execution is retried (3)
[b5/9910bc] NOTE: Error submitting process 'Denoise_DWI (S2)' for execution -- Error is ignored
[e5/4f1aa4] NOTE: Error submitting process 'Denoise_DWI (S1)' for execution -- Error is ignored
[8b/471a2e] NOTE: Error submitting process 'README (README)' for execution -- Error is ignored
[27/ba8bec] NOTE: Error submitting process 'N4_T1 (S3)' for execution -- Error is ignored
[b3/505148] NOTE: Error submitting process 'N4_T1 (S2)' for execution -- Error is ignored
[2a/ffb959] NOTE: Error submitting process 'N4_T1 (S1)' for execution -- Error is ignored

Hello @Ricardo_A ,

I would need to know what’s behind these errors.
Can you check one of them? Please run this command:

cat work/18/3471df*/.command.err

Can you also give me the singularity version you have with this command line:

singularity version

Thank you

Hello @abore,

I checked the work folder recursively and it seems that the error file is not being created for any subprocess, just the and the files.

My singularity version is 4.1.2-1.el9.

Thank you

Hello @Ricardo_A ,

This is the first time I have seen something like this; I have always found some *.err files.
Another thing: I’m no longer using Singularity this way. I use it through Apptainer, whose current version is close to 1.3.x. I don’t know whether that makes a difference.

Can you share your .nextflow.log?
I hope we can get more information from this file. Another option would be to try a single node and remove executor = 'slurm' from nextflow.config.

Hope it helps

Hi @abore,

I managed to execute the standard pipeline, so I think the problem was within the scheduler.

The requested file looks as follows for each process it was trying to launch:

Jan-20 21:32:56.643 [main] DEBUG nextflow.executor.ExecutorFactory - << taskConfig executor: slurm
Jan-20 21:32:56.644 [main] DEBUG nextflow.executor.ExecutorFactory - >> processorType: 'slurm'
Jan-20 21:32:56.653 [main] DEBUG nextflow.executor.ExecutorFactory - << taskConfig executor: slurm
Jan-20 21:32:56.653 [main] DEBUG nextflow.executor.ExecutorFactory - >> processorType: 'slurm'
Jan-20 21:32:56.662 [main] DEBUG nextflow.executor.ExecutorFactory - << taskConfig executor: slurm
Jan-20 21:32:56.663 [main] DEBUG nextflow.executor.ExecutorFactory - >> processorType: 'slurm'
Jan-20 21:32:56.669 [main] DEBUG nextflow.executor.ExecutorFactory - << taskConfig executor: slurm
Jan-20 21:32:56.669 [main] DEBUG nextflow.executor.ExecutorFactory - >> processorType: 'slurm'
Jan-20 21:32:56.678 [main] DEBUG nextflow.executor.ExecutorFactory - << taskConfig executor: slurm
Jan-20 21:32:56.678 [main] DEBUG nextflow.executor.ExecutorFactory - >> processorType: 'slurm'
Jan-20 21:32:56.681 [main] DEBUG nextflow.executor.ExecutorFactory - << taskConfig executor: slurm
Jan-20 21:32:56.681 [main] DEBUG nextflow.executor.ExecutorFactory - >> processorType: 'slurm'
Jan-20 21:32:56.684 [main] DEBUG nextflow.executor.ExecutorFactory - << taskConfig executor: slurm
Jan-20 21:32:56.685 [main] DEBUG nextflow.executor.ExecutorFactory - >> processorType: 'slurm'
Jan-20 21:32:56.687 [main] DEBUG nextflow.executor.ExecutorFactory - << taskConfig executor: slurm
Jan-20 21:32:56.688 [main] DEBUG nextflow.executor.ExecutorFactory - >> processorType: 'slurm'
Jan-20 21:32:56.691 [Task submitter] DEBUG nextflow.executor.GridTaskHandler - Failed to submit process Denoise_DWI (S1) > exit: 1; workDir: /projects/ricarlojo/Tractoflow-adap-workspace/results-n2/work/0c/8afc41303ff109a13dcce4900b8baf
sbatch: error: You have to specify an account. Usage of default accounts is forbidden.
sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified

You are right, it was definitely an issue with the way you submitted your command with SLURM.
Let me know if you get any Tractoflow-related errors.
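
For anyone hitting the same sbatch error: the #SBATCH --account line in the submission script applies only to the wrapper job itself. With executor = 'slurm', Nextflow calls sbatch separately for every process, and those calls carry no account unless one is configured. A minimal sketch of one way to pass it, assuming the haslab account from the script above is also valid for the task jobs and that the site accepts the plain --account flag (neither is confirmed in this thread):

```groovy
// nextflow.config (sketch, not the official Tractoflow configuration)
process {
    // Submit each process as its own SLURM job
    executor = 'slurm'

    // Extra options appended to every sbatch call made by Nextflow.
    // 'haslab' is taken from the #SBATCH --account line above; whether a
    // partition or other flags are also required is site-specific.
    clusterOptions = '--account=haslab'
}
```

With this approach the nextflow run command only needs to stay alive as a lightweight launcher, so requesting several nodes for the wrapper job is probably unnecessary; the per-task resources come from each process's own cpus, memory, and time settings.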