What controls the speed of job submission to the SLURM scheduler? The more jobs I want to submit, the longer each one takes to be placed in the queue (note: I'm talking about submission to the queue, not execution from it).
I'm running a Jupyter notebook server that was given 6 virtual cores. Is it a coincidence that only about 6 jobs are ever running on the rest of the cluster at any one time, even though it's not always exactly 6? There are hundreds of free cores on the cluster.
It's not a question of resources: restarting the server with 12 cores and 100 GB of RAM instead of 6 and 40 GB makes no difference. It still takes exactly 8 seconds for 2 jobs to be submitted (with the same timestamp), then another 8 seconds before two more jobs get submitted.
Changing max_jobs from 500 to 10 makes no difference.
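For reference, this is roughly how max_jobs was being set. This is a sketch, assuming nipype's standard SLURM plugin and a workflow object `wf`; the sbatch_args value is a hypothetical placeholder:

```python
# Sketch of the submission call, assuming nipype's SLURM plugin.
# max_jobs caps how many jobs the plugin keeps in flight at once;
# lowering it from 500 to 10 changed nothing here.
wf.run(
    plugin="SLURM",
    plugin_args={
        "max_jobs": 10,              # tried 500 and 10, no difference
        "sbatch_args": "--mem=4G",   # hypothetical resource request
    },
)
```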
The cluster partition is irrelevant (all nodes or an interactive one).
Running the notebook interactively or headless is irrelevant.
Possibly of relevance:
“Nipype determines the hash of the input state of a node. If any input contains strings that represent files on the system path, the hash evaluation mechanism will determine the timestamp or content hash of each of those files. Thus any node with an input containing huge dictionaries (or lists) of file names can cause serious performance penalties.”
wf.config['execution']['hash_method'] = 'timestamp'
Setting hash_method made no difference. In fact, changing it to hash_method = 'abc' does not raise an error, so this must be beside the point, i.e. it's never called.
Job submission speed is a strong function of the in-memory size of the node's inputs, not even of the number of iterables it needs to process: 20 MB of inputs leads to roughly 7 s per job submission (I will try to produce a minimal working example).
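A rough stdlib-only sketch of why input size matters: before a job can be submitted, the node's input state has to be captured (serialized and hashed), and that cost scales with its in-memory size. The input sizes and the pickle call here are illustrative assumptions, not nipype's exact code path:

```python
import pickle
import time

# Hypothetical stand-ins for node inputs of different sizes.
small_input = b"x" * (1024 * 1024)       # ~1 MB
large_input = b"x" * (20 * 1024 * 1024)  # ~20 MB

def serialize_cost(obj):
    """Time how long it takes to pickle an input, a rough proxy for the
    per-job work of capturing a node's input state before submission."""
    t0 = time.perf_counter()
    pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    return time.perf_counter() - t0

print(f"~1 MB input:  {serialize_cost(small_input):.4f} s")
print(f"~20 MB input: {serialize_cost(large_input):.4f} s")
```

The absolute times here are far below 7 s, so serialization alone can't be the whole story, but the scaling with input size is the point.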
A single infosource iterable feeding a function node is sufficient to cause the delay. A join node on the function node's iterables doesn't seem to slow down submission of the function nodes, which is the major bottleneck step.
For certain, the size of the variable passed as input to the first node is the bottleneck on submission speed: a 20 MB variable takes several seconds before a job can be queued. Passing a file name and loading the variable within the node is faster, but still not millisecond-level fast, or even 1-second fast (regardless of the hash method used, though in that case the hash method actually is called).
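To illustrate the gap between the two hashing strategies, here is a stdlib-only sketch on a ~20 MB file. The file size and the md5 digest are assumptions for illustration: a timestamp hash is a single stat call and is size-independent, while a content hash still has to read the whole file:

```python
import hashlib
import os
import tempfile
import time

payload = b"x" * (20 * 1024 * 1024)  # ~20 MB stand-in for the real variable

# Write the variable to disk once, so the node receives a path instead.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(payload)

# Content hash: must read and digest all 20 MB.
t0 = time.perf_counter()
with open(path, "rb") as f:
    content_hash = hashlib.md5(f.read()).hexdigest()
content_time = time.perf_counter() - t0

# Timestamp hash: one stat call plus a digest of a short string,
# independent of file size.
t0 = time.perf_counter()
timestamp_hash = hashlib.md5(str(os.path.getmtime(path)).encode()).hexdigest()
timestamp_time = time.perf_counter() - t0

print(f"content hash:   {content_time:.4f} s")
print(f"timestamp hash: {timestamp_time:.6f} s")
os.remove(path)
```

Since even the cheap timestamp hash leaves submission well above millisecond speed, the remaining per-job latency presumably comes from elsewhere in the submission path (e.g. writing the batch script and invoking sbatch).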