
Sourced from CECI's Slurm tutorial.


Slurm Quick Start Tutorial

Resource sharing on a supercomputer dedicated to technical and/or scientific computing is often organized by a piece of software called a resource manager or job scheduler. Users submit jobs, which are scheduled and allocated resources (CPU time, memory, etc.) by the resource manager.

Slurm is a resource manager and job scheduler designed to do just that, and much more. It was originally created by people at the Livermore Computing Center, and has grown into full-fledged open-source software backed by a large community, commercially supported by the original developers, and installed on many of the Top500 supercomputers.

Gathering information

sinfo

Slurm offers many commands you can use to interact with the system. For instance, the sinfo command gives an overview of the resources offered by the cluster, while the squeue command shows to which jobs those resources are currently allocated.

By default, sinfo lists the partitions that are available. A partition is a set of compute nodes (computers dedicated to... computing) grouped logically. Typical examples include partitions dedicated to batch processing, debugging, post processing, or visualization.


# sinfo
PARTITION AVAIL TIMELIMIT NODES STATE  NODELIST
batch     up     infinite     2 alloc  node[8-9]
batch     up     infinite     6 idle   node[10-15]
debug*    up        30:00     8 idle   node[0-7]


In the above example, we see two partitions, named batch and debug. The latter is the default partition, as it is marked with an asterisk. All nodes of the debug partition are idle, while two nodes of the batch partition are in use.

The command sinfo can output the information in a node-oriented fashion with the argument -N.


# sinfo -N -l
NODELIST    NODES PARTITION STATE  CPUS MEMORY TMP_DISK WEIGHT FEATURES REASON
node[0-1]       2 debug*    idle      2   3448    38536     16 (null)   (null)
node[2,4-7]     5 debug*    idle      2   3384    38536     16 (null)   (null)
node3           1 debug*    idle      2   3394    38536     16 (null)   (null)
node[8-9]       2 batch     allocated 2    246    82306     16 (null)   (null)
node[10-15]     6 batch     idle      2    246    82306     16 (null)   (null)


Note that with the -l argument, more information about the nodes is provided: number of CPUs, memory, temporary disk (also called scratch space), node weight (an internal parameter specifying preferences for allocation when there are multiple possibilities), features of the nodes (such as the processor type) and the reason, if applicable, for which a node is down.

You can actually specify precisely what information you would like sinfo to output by using its --format argument. For more details, have a look at the command manpage with man sinfo.
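
For instance, the following invocation reproduces roughly the columns of the default output shown above; the format specifiers are described in the manpage (%P is the partition, %a its availability, %l the time limit, %D the node count, %t the node state and %N the node list):


# sinfo --format="%P %.5a %.10l %.6D %.6t %N"
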

squeue

The squeue command shows the list of jobs which are currently running (they are in the RUNNING state, noted as ‘R’) or waiting for resources (noted as ‘PD’, short for PENDING).


# squeue
JOBID PARTITION NAME USER ST  TIME  NODES NODELIST(REASON)
12345     debug job1 dave  R   0:21     4 node[9-12]
12346     debug job2 dave PD   0:00     8 (Resources)
12348     debug job3 ed   PD   0:00     4 (Priority)


The above output shows one running job, whose name is job1 and whose jobid is 12345. The jobid is a unique identifier used by many Slurm commands when actions must be taken on one particular job. For instance, to cancel job job1, you would use scancel 12345. TIME is the time the job has been running so far, NODES is the number of nodes allocated to the job, and NODELIST lists the nodes that have been allocated to running jobs. For pending jobs, that column gives the reason why the job is pending. In the example, job 12346 is pending because the requested resources (CPUs, or other) are not available in sufficient amounts, while job 12348 is waiting for job 12346, whose priority is higher, to run. Each job is indeed assigned a priority depending on several parameters, detailed in the section on Slurm priorities. Note that the priority of pending jobs can be obtained with the sprio command.

There are many switches you can use to filter the output: by user with --user, by partition with --partition, by state with --state, etc. As with the sinfo command, you can choose what you want squeue to output with the --format parameter.
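
For instance, to list only the pending jobs of user dave from the example above:


# squeue --user=dave --state=PENDING
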

Creating a job

Now the question is: How do you create a job?

A job consists of two parts: resource requests and job steps. Resource requests describe the required number of CPUs, the expected duration, the amounts of RAM or disk space, etc. Job steps describe the tasks that must be performed and the software that must be run.

The typical way of creating a job is to write a submission script. A submission script is a shell script, e.g. a Bash script, whose comments, if they are prefixed with SBATCH, are understood by Slurm as parameters describing resource requests and other submission options. You can get the complete list of parameters from the sbatch manpage, man sbatch.


Important
The SBATCH directives must appear at the top of the submission file, before any other line except for the very first line, which should be the shebang (e.g. #!/bin/bash).


The script itself is a job step. Other job steps are created with the srun command.

For instance, the following script, hypothetically named submit.sh,


#!/bin/bash
#
#SBATCH --job-name=test
#SBATCH --output=res.txt
#
#SBATCH --ntasks=1
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=100

srun hostname
srun sleep 60


would request one CPU for 10 minutes, along with 100 MB of RAM, in the default queue. When started, the job would run a first job step, srun hostname, which launches the UNIX command hostname on the node on which the requested CPU was allocated. Then, a second job step starts the sleep command. Note that the --job-name parameter allows giving a meaningful name to the job and the --output parameter defines the file to which the output of the job must be sent.

Once the submission script is written properly, you need to submit it to Slurm through the sbatch command, which, upon success, responds with the jobid attributed to the job. (The dollar sign below is the shell prompt.)


$ sbatch submit.sh
sbatch: Submitted batch job 99999999


The job then enters the queue in the PENDING state. Once resources become available and the job has the highest priority, an allocation is created for it and it goes to the RUNNING state. If the job completes correctly, it goes to the COMPLETED state; otherwise, it is set to the FAILED state.
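
You can follow these state transitions for a particular job with the --job (or -j) switch of squeue; the jobid below is the one returned by sbatch above:


$ squeue -j 99999999
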

Interestingly, you can get near-realtime information about your running program (memory consumption, etc.) with the sstat command, by running sstat -j jobid. You can select what you want sstat to output with the --format parameter. Refer to the manpage, man sstat, for more information.
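
As a short sketch, the following asks sstat for the maximum memory footprint and average CPU time of the job steps (the field names are taken from the sstat manpage):


$ sstat -j 99999999 --format=JobID,MaxRSS,AveCPU
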

Upon completion, the output file contains the result of the commands run in the script file. In the above example, you can see it with the cat res.txt command.

This example illustrates a serial job, which uses a single CPU on a single node. It does not take advantage of multi-processor nodes or of the multiple compute nodes available in a cluster. The next sections explain how to create parallel jobs.

Going parallel

There are several ways a parallel job, one whose tasks are run simultaneously, can be created:

  • by running a multi-process program (SPMD paradigm, e.g. with MPI)
  • by running a multithreaded program (shared memory paradigm, e.g. with OpenMP or pthreads)
  • by running several instances of a single-threaded program (so-called embarrassingly parallel paradigm or a job array)
  • by running one master program controlling several slave programs (master/slave paradigm)

In the Slurm context, a task is to be understood as a process. So a multi-process program is made of several tasks. By contrast, a multithreaded program is composed of only one task, which uses several CPUs.

Tasks are requested/created with the --ntasks option, while CPUs, for the multithreaded programs, are requested with the --cpus-per-task option. Tasks cannot be split across several compute nodes, so requesting several CPUs with the --cpus-per-task option will ensure all CPUs are allocated on the same compute node. By contrast, requesting the same amount of CPUs with the --ntasks option may lead to several CPUs being allocated on several distinct compute nodes.
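
As a minimal illustration of the difference, both request blocks below ask for 8 CPUs in total, but only the second guarantees that they all sit on the same compute node:


#SBATCH --ntasks=8            # 8 tasks, possibly spread over several nodes

#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8     # 1 task with 8 CPUs on a single node
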

More submission script examples

Here are some quick sample submission scripts. For more detailed information, make sure to have a look at the Slurm FAQ and to follow our training sessions. There is also an interactive Script Generation Wizard you can use to help you in submission script creation.

Message passing example (MPI)


#!/bin/bash
#
#SBATCH --job-name=test_mpi
#SBATCH --output=res_mpi.txt
#
#SBATCH --ntasks=4
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=100

module load OpenMPI
srun hello.mpi


This script requests four cores on the cluster for 10 minutes, with 100 MB of RAM per core. Assuming hello.mpi was compiled with MPI support, srun will create four instances of it on the nodes allocated by Slurm.

You can try the above example by downloading the example hello world program from Wikipedia (name it for instance wiki_mpi_example.c), and compiling it with


module load openmpi
mpicc wiki_mpi_example.c -o hello.mpi
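

Once compiled, submit the MPI script shown earlier with sbatch; the file name submit_mpi.sh is purely illustrative:


$ sbatch submit_mpi.sh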


The res_mpi.txt file should contain something like


0: We have 4 processors
0: Hello 1! Processor 1 reporting for duty
0: Hello 2! Processor 2 reporting for duty
0: Hello 3! Processor 3 reporting for duty


Shared memory example (OpenMP)


#!/bin/bash
#
#SBATCH --job-name=test_omp
#SBATCH --output=res_omp.txt
#
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=100

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./hello.omp


The job will be run in an allocation where four cores have been reserved on the same compute node.

You can try it by using the hello world program from Wikipedia (name it for instance wiki_omp_example.c) and compiling it with


gcc -fopenmp wiki_omp_example.c -o hello.omp


The res_omp.txt file should contain something like


Hello World from thread 0
Hello World from thread 3
Hello World from thread 1
Hello World from thread 2
There are 4 threads


Embarrassingly parallel workload example

This setup is useful for problems based on random draws (e.g. Monte-Carlo simulations). In such cases, if you have four programs each drawing 1000 random samples and combine their output afterwards (with another program), you get the equivalent of drawing 4000 samples.
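
A sketch of that idea as a job array, assuming a hypothetical program mc_sim that takes a random seed as an argument (here derived from the array task id; %a in the output name is replaced by Slurm with the task id):


#!/bin/bash
#
#SBATCH --job-name=test_mc
#SBATCH --output=res_mc_%a.txt
#
#SBATCH --ntasks=1
#SBATCH --array=1-4

# mc_sim and its --seed option are illustrative placeholders
srun ./mc_sim --seed=$SLURM_ARRAY_TASK_ID
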

Another typical use of this setting is a parameter sweep. In this case the same computation is carried out several times by a given code, each run differing only in the initial value of some high-level parameter. An example could be the optimisation of an integer-valued parameter through range scanning:


#!/bin/bash
#
#SBATCH --job-name=test_emb_arr
#SBATCH --output=res_emb_arr.txt
#
#SBATCH --ntasks=1
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=100
#
#SBATCH --array=1-8

srun ./my_program.exe $SLURM_ARRAY_TASK_ID


In that configuration, the command my_program.exe will be run eight times, in eight distinct jobs, each time with a different argument passed through the environment variable SLURM_ARRAY_TASK_ID, which Slurm sets to a value ranging from 1 to 8.

The same idea can be used to process several data files. Each instance of the program must be passed a different file to read, based upon the value set in the $SLURM_* environment variable. For instance, assuming there are exactly eight files in /path/to/data, we can create the following script:


#!/bin/bash
#
#SBATCH --job-name=test_emb_arr
#SBATCH --output=res_emb_arr.txt
#
#SBATCH --ntasks=1
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=100
#
#SBATCH --array=0-7   # Bash arrays are zero-indexed: eight files map to indices 0-7

FILES=(/path/to/data/*)

srun ./my_program.exe ${FILES[$SLURM_ARRAY_TASK_ID]}


In this case, eight jobs will be submitted, each passing a different filename from the array FILES[] as an argument to my_program.exe.

Note that the same recipe can be used with a numerical argument that is not simply an integer sequence, by defining an array ARGS[] containing the desired values; again, the --array range must match the zero-based indices (e.g. --array=0-6 for the seven values below):


ARGS=(0.05 0.25 0.5 1 2 5 100)

srun ./my_program.exe ${ARGS[$SLURM_ARRAY_TASK_ID]}



Warning
If the running time of your program is small, say ten minutes or less, creating a job array will incur a lot of overhead and you should consider packing your jobs.


Packed jobs example

The srun command has a (rather counter-intuitively-named) argument --exclusive that allows scheduling independent processes inside a Slurm job allocation. As the documentation states:

This option can also be used when initiating more than one job step within an existing resource allocation, where you want separate processors to be dedicated to each job step. If sufficient processors are not available to initiate the job step, it will be deferred. This can be thought of as providing a mechanism for resource management to the job within its allocation.

As an example, the following job submission script will ask Slurm for 8 CPUs, then it will run the myprog program 1000 times with arguments from 1 to 1000. But with the -n1 --exclusive option, it will ensure that at any point in time at most 8 instances are effectively running, each being allocated one CPU.


#! /bin/bash
#
#SBATCH --ntasks=8
for i in {1..1000}
do
   srun -n1 --exclusive ./myprog $i &
done
wait


The for-loop can be replaced with GNU parallel, if installed on your system:


parallel -P $SLURM_NTASKS srun -n1 --exclusive ./myprog ::: {1..1000}


Similarly, many files can be processed with one job submission script. The following script will run myprog for every file in /path/to/data, at most 8 at a time, each using one CPU.


#! /bin/bash
#
#SBATCH --ntasks=8
for file in /path/to/data/*
do
   srun -n1 --exclusive ./myprog $file &
done
wait


Here again the for-loop can be replaced with another command, xargs (the -type f option restricts find to regular files, so the directory itself is not passed to myprog):


find /path/to/data -type f -print0 | xargs -0 -n1 -P $SLURM_NTASKS srun -n1 --exclusive ./myprog


Master/slave program example


#!/bin/bash
#
#SBATCH --job-name=test_ms
#SBATCH --output=res_ms.txt
#
#SBATCH --ntasks=4
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=100

srun --multi-prog multi.conf


With the file multi.conf containing, for example, the following:


0      echo I am the Master
1-3    echo I am slave %t


The above instructs Slurm to create four tasks (or processes), one running echo I am the Master, and the other three running echo I am slave %t. The %t placeholder will be replaced with the task id. This is typically used in a producer/consumer setup where one program (the master) creates computing tasks for the other programs (the slaves) to perform.
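
The same configuration format works for real programs; a hypothetical sketch where task 0 runs a master binary and tasks 1 to 3 run slave binaries that receive their rank through %t (the binary names and the --rank option are illustrative):


0      ./master
1-3    ./slave --rank=%t
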

Upon completion of the above job, file res_ms.txt will contain


I am slave 2
I am slave 3
I am slave 1
I am the Master


though not necessarily in the same order.

Hybrid jobs

You can mix multi-processing (MPI) and multi-threading (OpenMP) in the same job, simply like this:


#! /bin/bash
#
#SBATCH --ntasks=8
#SBATCH --cpus-per-task=4
module load OpenMPI
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun ./myprog


or even a job array of hybrid jobs:


#! /bin/bash
#
#SBATCH --array=1-10
#SBATCH --ntasks=8
#SBATCH --cpus-per-task=4
module load OpenMPI
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun ./myprog $SLURM_ARRAY_TASK_ID


GPU jobs

Some clusters have GPUs (see the cluster page).

To see if the cluster has GPUs, check the generic resources of the compute nodes.


# sinfo  -o "%P %.10G %N"
PARTITION       GRES NODELIST
Def*          (null) lmWn[001-112]
PostP          gpu:1 lmPp[001-003]


The sinfo output shows 3 nodes with a GPU in the post-processing partition.

If you want to claim a GPU for your job, you need to specify the GRES (Generic Resource Scheduling) parameter in your job script. Please note that GPUs are only available in a specific partition, whose name depends on the cluster.


#SBATCH --partition=PostP
#SBATCH --gres=gpu:1


A sample job file requesting a node with a GPU could look like this:


#!/bin/bash
#SBATCH --job-name=example
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --ntasks-per-node=1
#SBATCH --time=1:00:00
#SBATCH --mem-per-cpu=1000
#SBATCH --partition=PostP
#SBATCH --gres=gpu:1

module load application/version

executable input.dat


Interactive jobs

Slurm jobs are normally batch jobs in the sense that they are run unattended. If you want to have a direct view on your job, for tests or debugging, you have two options.

If you simply need an interactive Bash session on a compute node, with the same environment as the batch jobs, run the following command:


srun --pty bash


Doing that, you are submitting a 1-CPU, default memory, default duration job that will return a Bash prompt when it starts.

If you need more flexibility, you will need to use the salloc command. The salloc command accepts the same parameters as sbatch as far as resource requirements are concerned. Once the allocation is granted, you can use srun the same way you would in a submission script.
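
A minimal sketch of such an interactive session (the resource values are illustrative); leaving the shell with exit releases the allocation:


$ salloc --ntasks=4 --time=10:00
$ srun hostname
$ exit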