The SLURM job manager

Aether is managed by the SLURM workload manager. This page lists the most commonly used SLURM commands.

This page contains content originally created by Harvard FAS Research Computing and adapted by us under the Creative Commons Attribution-NonCommercial 4.0 International License. For more information, visit https://rc.fas.harvard.edu/about/attribution/ .

Quickstart

Task                          SLURM    SLURM Example
Submit a batch serial job     sbatch   sbatch runscript.sh
Run a command interactively   srun     srun mycommand
Start a shell on the cluster  srun     srun --pty -t 10 --mem 1000 /bin/bash
Kill a job                    scancel  scancel 999999
View status of queues         squeue   squeue -u janedoe
Check current job by id       sacct    sacct -j 999999

Documentation

  • Beginner’s quick start guide

  • Rosetta stone (introduction for users familiar with other batch schedulers like SGE)

  • You can access the official command reference via the man pages, i.e., by running man <command>. For example, try the commands

    man sbatch
    man squeue
    man scancel
    

Interactive jobs

Though batch submission is the best way to take full advantage of Aether's compute power, interactive foreground jobs can also be run. These can be useful for things like:

  • Iterative data exploration at the command line

  • Interactive “console tools” like R and IPython

  • Software development and compiling

An interactive job differs from a batch job in that it is initiated with the srun command instead of sbatch. For example, this command:

srun -p all --pty --mem 500 -t 0-06:00 /bin/bash

will start a command-line shell (/bin/bash) on the all partition with 500 MB of RAM for 6 hours; 1 core on 1 node is assumed, as these parameters (-n1 -N1) were left out. When the interactive session starts, you will notice that you are no longer on a login node, but rather on one of the compute nodes dedicated to this partition. The --pty option allows the session to act like a standard terminal. In a pinch, you can also run an application directly instead of a shell, though this is discouraged due to problems setting up bash environment variables.

Note

The process started by the srun command will inherit the environment variables from your shell. This means that you have to load the modules required for your job in the shell, before the call to srun.
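A minimal sketch of this workflow is shown below; the module name netcdf is only a placeholder for whatever modules your job actually needs:

# Load the required modules in your login shell first ...
module load netcdf        # placeholder module name; load the modules your job needs
# ... then start the interactive session; srun inherits the environment
srun --pty -p interactive -t 0-01:00 --mem 1000 /bin/bash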

Running an interactive shell via SLURM

As mentioned above, it is possible to run a shell via SLURM, i.e., on one of the compute nodes. For example, for the bash shell, you can run

srun --pty -p interactive -c 4 -t 0-08:00 --mem 4000 /bin/bash

Submitting batch jobs using the sbatch command

The main way to run jobs on Aether is by submitting a script with the sbatch command. The command to submit a job is as simple as:

sbatch runscript.sh

The commands specified in the runscript.sh file will then be run on the first available compute node that fits the resources requested in the script. sbatch returns immediately after submission; commands are not run as foreground processes and won’t stop if you disconnect from Aether.
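For example, sbatch reports the ID of the newly submitted job, which you can then use with the monitoring commands described below (the job ID shown here is just an illustrative value):

$ sbatch runscript.sh
Submitted batch job 999999
$ squeue -j 999999        # check the status of the submitted job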

A typical submission script, in this case using the hostname command to get the computer name, will look like this:

#!/bin/bash
#SBATCH --ntasks=1                # Number of tasks (see below)
#SBATCH --cpus-per-task=1         # Number of CPU cores per task
#SBATCH --nodes=1                 # Ensure that all cores are on one machine
#SBATCH --time=0-00:05            # Runtime in D-HH:MM
#SBATCH --partition=all           # Partition to submit to
#SBATCH --mem=100                 # Memory pool for all cores (see also --mem-per-cpu)
#SBATCH --output=hostname_%j.out  # File to which STDOUT will be written
#SBATCH --error=hostname_%j.err   # File to which STDERR will be written
#SBATCH --mail-type=END           # Type of email notification- BEGIN,END,FAIL,ALL
#SBATCH --mail-user=ajk@123.com   # Email to which notifications will be sent

hostname

In general, the script is composed of 3 parts.

  • the #!/bin/bash line allows the script to be run as a bash script

  • the #SBATCH lines are technically bash comments, but they set various parameters for the SLURM scheduler

  • the command line itself.

The #SBATCH lines shown above set key parameters.

Note

It is important to keep all #SBATCH lines together and at the top of the script; no bash code or variable assignments should appear until after the #SBATCH lines.

The SLURM system copies many environment variables from your current session to the compute host where the script is run including PATH and your current working directory. As a result, you can specify files relative to your current location (e.g. ./project/myfiles/myfile.txt).

#SBATCH --ntasks=1

This line sets the number of tasks that you’re requesting. Make sure that your code can use multiple cores before requesting more than one. When running MPI code, --ntasks should be set to the number of MPI processes (i.e., the same number you give as the -n parameter to mpirun). When running OpenMP code (i.e., without MPI), you do not need to set this option (if you do set it, set it to --ntasks=1). If this parameter is omitted, SLURM assumes --ntasks=1.
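For illustration, a minimal MPI sketch; the executable ./my_mpi_prog is a hypothetical placeholder and a suitable MPI module is assumed to be loaded:

#SBATCH --ntasks=28               # number of MPI processes (ranks)
#SBATCH --cpus-per-task=1         # MPI without threading: one core per task

mpirun -n $SLURM_NTASKS ./my_mpi_prog   # -n matches --ntasks; placeholder executable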

#SBATCH --cpus-per-task=1

This line sets the number of CPU cores per task that you’re requesting. Make sure that your code can use multiple cores before requesting more than one. When running MPI code, --cpus-per-task should be set to 1. When running OpenMP code, --cpus-per-task should be set to the value of the environment variable OMP_NUM_THREADS, i.e., the number of threads that you want to use. If this parameter is omitted, SLURM assumes --cpus-per-task=1.
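For illustration, a minimal OpenMP sketch; the executable ./my_omp_prog is a hypothetical placeholder:

#SBATCH --ntasks=1                # OpenMP without MPI: a single task
#SBATCH --cpus-per-task=8         # number of OpenMP threads

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK   # match the number of allocated cores
./my_omp_prog                                 # placeholder executable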

#SBATCH --nodes=1

This line requests that the cores are all on one node. Only change this to >1 if you know your code uses a message-passing protocol like MPI. SLURM makes no assumptions about this parameter: if you request more than one task (--ntasks > 1) and you forget this parameter, your job may be scheduled across several nodes, and unless your job is MPI (multi-node) aware, it will run slowly, as it is oversubscribed on the master node while wasting the resources allocated on the other(s).

#SBATCH --time=5

This line specifies the running time for the job in minutes. You can also use the convenient format D-HH:MM. If your job runs longer than the value you specify here, it will be cancelled. Jobs have a maximum run time of 7 days on Aether, though extensions can be requested. There is no penalty for over-requesting time.
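For example, the following specifications are equivalent ways to request a five-minute run time:

#SBATCH --time=5          # minutes
#SBATCH --time=0-00:05    # D-HH:MM
#SBATCH --time=00:05:00   # HH:MM:SS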

Todo

check SLURM defaults for sbatch -tN

Note

If this parameter is omitted, on any partition your job will be given the default run time of 10 minutes.

#SBATCH --partition=highmem

This line specifies the SLURM partition (a.k.a. queue) under which the script will be run. The highmem partition is good for jobs with extra-high memory requirements.

#SBATCH --mem=100

The Aether cluster requires that you specify the amount of memory (in MB) that you will be using for your job. Accurate specifications allow jobs to be run with maximum efficiency on the system. There are two main options, --mem-per-cpu and --mem. The --mem option specifies the total memory pool for one or more cores, and is the recommended option to use. If you must do work across multiple compute nodes (e.g., MPI code), then you must use the --mem-per-cpu option, as this allocates the specified amount for each of the cores you've requested, whether they are on one node or on multiple nodes. If this parameter is omitted, the smallest amount is allocated, usually 100 MB, and chances are good that your job will be killed, as it will likely exceed this amount.
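As an illustration (the numbers are example values only), both of the following request roughly 4 GB for a 4-core job:

# Option 1: total memory pool, recommended for single-node jobs
#SBATCH --mem=4000            # 4000 MB shared by all cores of the job

# Option 2: per-core memory, required for multi-node (e.g., MPI) jobs
#SBATCH --mem-per-cpu=1000    # 1000 MB for each requested core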

#SBATCH --output=hostname_%j.out

This line specifies the file to which standard out will be appended. If a relative file name is used, it will be relative to your current working directory. The %j in the filename will be substituted by the jobID at runtime. If this parameter is omitted, any output will be directed to a file named slurm-<jobid>.out in the current directory.

#SBATCH --error=hostname_%j.err

This line specifies the file to which standard error will be appended. SLURM submission and processing errors will also appear in this file. The %j in the filename will be substituted by the jobID at runtime. If this parameter is omitted, standard error is merged with standard output into the default slurm-<jobid>.out file in the current directory.

Todo

check if slurm defaults for stdout and stderr are correct

#SBATCH --mail-type=END

Because jobs are processed in the “background” and can take some time to run, it is useful to send an email message when the job has finished (--mail-type=END). Email can also be sent for other processing stages (BEGIN, FAIL) or for all of them (ALL).

#SBATCH --mail-user=ajk@123.com

The email address to which the --mail-type messages will be sent. By default, the e-mail will be sent to the address the user specified when signing up for an Aether account.

SLURM partitions

Partition     Node range   Default/Max time (h)  Default/Max #cores per node  Default/Max mem per CPU (MB)
all           node[01-16]  12/48                 1/28                         4520/9128
all           node[17-56]  12/48                 1/28                         4520/4520
interactive*  node[01-16]  1/6                   1/28                         4520/9128
interactive*  node[17-56]  1/6                   1/28                         4520/4520

interactive (marked with * above) is the default partition. This means that jobs submitted without specifying a partition via the option -p, --partition=<partition_name> will run on the interactive partition. On this partition, a job cannot occupy more than a single node.
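For example, to submit a batch job to the all partition rather than the default:

sbatch --partition=all runscript.sh

Note that command-line options passed to sbatch take precedence over the corresponding #SBATCH lines in the script.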

Topology

Aether is housed in two racks, with the 56 compute nodes split equally between them. The nodes are interconnected by an Intel Omni-Path network, with one Omni-Path 100 Gbit/s switch per rack. However, there is a 1:2 bandwidth blocking ratio for traffic between the two rack switches. To avoid this, your job should fit within a single rack (at most 28 nodes or 784 cores). SLURM lets you control where parallel jobs are placed with the option --switches=<count>[@<maxtime>], where count is the maximum number of switches and maxtime is the maximum time to wait for that number of switches. maxtime can be specified as “minutes”, “minutes:seconds”, “hours:minutes:seconds”, “days-hours”, “days-hours:minutes” or “days-hours:minutes:seconds”.

Below is an example of a job requesting to be allocated nodes connected to the same switch. It will wait at most 60 minutes before accepting nodes spread across more than one switch.

#SBATCH --switches=1@60

Most common SLURM commands

Submitting job scripts

The following example script specifies a partition, time limit, memory allocation and number of cores. All your scripts should specify values for these four parameters. You can also set additional parameters, as shown, such as the job name, output files and email notification. This script performs a simple task: it generates a file of random numbers and then sorts it. A detailed explanation of the script is available here.

#!/bin/bash
#SBATCH --ntasks=1                         # Number of tasks
#SBATCH --cpus-per-task=1                  # Number of CPU cores per task
#SBATCH --nodes=1                          # Ensure that all cores are on one machine
#SBATCH --time=0-00:05                     # Runtime in D-HH:MM
#SBATCH --partition=all                    # Partition to submit to
#SBATCH --mem=100                          # Memory pool for all cores (see also --mem-per-cpu)
#SBATCH --output=hostname_%j.out           # File to which STDOUT will be written
#SBATCH --error=hostname_%j.err            # File to which STDERR will be written
#SBATCH --mail-type=END                    # Type of email notification- BEGIN,END,FAIL,ALL
#SBATCH --mail-user=myemail@uni-bremen.de  # Email to which notifications will be sent

for i in {1..100000}; do
  echo $RANDOM >> SomeRandomNumbers.txt
done

sort SomeRandomNumbers.txt

Todo

Add default values for SBATCH parameters

Now you can submit your job with the command:

sbatch myscript.sh

Information on jobs

List all current jobs for a user:

squeue -u <username>

List all running jobs for a user:

squeue -u <username> -t RUNNING

List all pending jobs for a user:

squeue -u <username> -t PENDING

List priority order of jobs for the current user (you) in a given partition:

showq-slurm -o -U -q <partition>

List all current jobs in the highmem partition for a user:

squeue -u <username> -p highmem

List detailed information for a job (useful for troubleshooting):

scontrol show jobid -dd <jobid>

List status info for a currently running job:

sstat --format=AveCPU,AvePages,AveRSS,AveVMSize,JobID -j <jobid> --allsteps

Once your job has completed, you can get additional information that was not available during the run. This includes run time, memory used, etc.

To get statistics on completed jobs by jobID:

sacct -j <jobid> --format=JobID,JobName,MaxRSS,Elapsed

To view the same information for all jobs of a user:

sacct -u <username> --format=JobID,JobName,MaxRSS,Elapsed

Controlling jobs

To cancel one job:

scancel <jobid>

To cancel all the jobs for a user:

scancel -u <username>

To cancel all the pending jobs for a user:

scancel -t PENDING -u <username>

To cancel one or more jobs by name:

scancel --name myJobName

To pause a particular job:

scontrol hold <jobid>

To resume a particular job:

scontrol resume <jobid>

To requeue (cancel and rerun) a particular job:

scontrol requeue <jobid>

Job arrays and useful commands

As shown in the commands above, it's easy to refer to one job by its job ID, or to all your jobs via your username. What if you want to refer to a subset of your jobs? The answer is to submit your job set as a job array. Then you can use the job array ID to refer to the whole set when running SLURM commands. See the official SLURM job array documentation for further information.
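For illustration, a minimal job array sketch; the index range 1-10 and the task command are placeholders:

#SBATCH --array=1-10                # run 10 array tasks with indices 1..10
#SBATCH --output=array_%A_%a.out    # %A = array job ID, %a = array task index

echo "Processing array task $SLURM_ARRAY_TASK_ID"   # placeholder workload

Each array task is then addressed as <jobid>_<index>, which is the form used by the scancel example below.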

To cancel an indexed job in a job array:

scancel <jobid>_<index>

e.g.

scancel 1234_4

Advanced commands

The following commands work for individual jobs and for job arrays, and allow easy manipulation of large numbers of jobs. You can combine these commands with the parameters shown above to provide great flexibility and precision in job control. (Note that all of these commands are entered on one line)

Suspend all running jobs for a user (takes into account job arrays):

squeue -ho %A -u <username> -t R | xargs -n 1 scontrol suspend

Resume all suspended jobs for a user:

squeue -o "%.18A %.18t" -u <username> | awk '{if ($2 =="S"){print $1}}' | xargs -n 1 scontrol resume

After resuming, check if any are still suspended:

squeue -ho %A -u $USER -t S | wc -l

The following is useful if your group has its own queue and you want to quickly see utilization.

lsload | grep -E 'Hostname|<partition>'

Example for the all partition:

lsload | grep -E 'Hostname|all'

Hostname Cores InUse Ratio Load  Mem Alloc State
node17   64    60    100.0 12.01 262 261   ALLOCATED
node18   64    64    100.0 12.00 262 240   ALLOCATED
node19   64    40    100.0 12.00 262 261   ALLOCATED
  • Note that while node19 has free cores, all its memory is in use, so those cores are necessarily idle.

  • node18 has a little free memory but all the cores are in use.

  • The scheduler will shoot for 100% utilization, but jobs are generally stochastic, beginning and ending at different times with unpredictable amounts of CPU and RAM released and requested.