The SLURM job manager¶
Aether is managed by the SLURM job scheduler. This page provides a list of the most commonly used SLURM commands.
This page contains content originally created by Harvard FAS Research Computing and adapted by us under the Creative Commons Attribution-NonCommercial 4.0 International License. For more information, visit https://rc.fas.harvard.edu/about/attribution/ .
Quickstart¶
Task | SLURM | SLURM Example
---|---|---
Submit a batch serial job | sbatch | sbatch runscript.sh
Run a command interactively | srun | srun --pty -p interactive --mem 500 -t 0-06:00 <command>
Start a shell on the cluster | srun | srun -p all --pty --mem 500 -t 0-06:00 /bin/bash
Kill a job | scancel | scancel <jobid>
View status of queues | squeue | squeue -u <username>
Check current job by id | sacct | sacct -j <jobid>
Documentation¶
Rosetta stone (introduction for users familiar with other batch schedulers like SGE)
You can access the official command reference using the man pages, i.e., with the command man <command>. For example, try the commands man sbatch, man squeue and man scancel.
Interactive jobs¶
Though batch submission is the best way to take full advantage of the compute power in Aether, foreground, interactive jobs can also be run. These can be useful for things like:
Iterative data exploration at the command line
Interactive “console tools” like R and iPython
Software development and compiling
An interactive job differs from a batch job in that it is initiated with the srun command instead of sbatch. This command:
srun -p all --pty --mem 500 -t 0-06:00 /bin/bash
will start a command line shell (/bin/bash) on the all partition with 500 MB of RAM for 6 hours; 1 core on 1 node is assumed, as these parameters (-n1 -N1) were left out. When the interactive session starts, you will notice that you are no longer on a login node, but rather on one of the compute nodes dedicated to this queue. The --pty option allows the session to act like a standard terminal. In a pinch, you can also run an application directly, though this is discouraged due to problems setting up bash environment variables.
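For example, a direct run of a program without starting a shell first could look like this (./myprogram is only a placeholder for your own executable):
srun -p all --mem 500 -t 0-01:00 ./myprogram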
Note
The process started by the srun
command will inherit the environment
variables from your shell. This means that you have to load the modules
required for your job in the shell, before the call to srun
.
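For example, assuming your program needs an environment module (the module name my_software is only a placeholder; load whatever your job actually requires):
module load my_software                             # load required modules in the login shell first
srun -p all --pty --mem 500 -t 0-06:00 /bin/bash    # the interactive session inherits them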
Running an interactive shell via SLURM¶
As mentioned above, it is possible to run a shell via SLURM, i.e., on one of the compute nodes. For example, for the bash shell, you can run
srun --pty -p interactive -c 4 -t 0-08:00 --mem 4000 /bin/bash
Submitting batch jobs using the sbatch command¶
The main way to run jobs on Aether is by submitting a script with the
sbatch
command. The command to submit a job is as simple as:
sbatch runscript.sh
The commands specified in the runscript.sh
file will then be run on the
first available compute node that fits the resources requested in the
script. sbatch
returns immediately after submission; commands are not run as
foreground processes and won’t stop if you disconnect from Aether.
A typical submission script, in this case using the hostname
command to get
the computer name, will look like this:
#!/bin/bash
#SBATCH --ntasks=1 # Number of tasks (see below)
#SBATCH --cpus-per-task=1 # Number of CPU cores per task
#SBATCH --nodes=1 # Ensure that all cores are on one machine
#SBATCH --time=0-00:05 # Runtime in D-HH:MM
#SBATCH --partition=all # Partition to submit to
#SBATCH --mem=100 # Memory pool for all cores (see also --mem-per-cpu)
#SBATCH --output=hostname_%j.out # File to which STDOUT will be written
#SBATCH --error=hostname_%j.err # File to which STDERR will be written
#SBATCH --mail-type=END # Type of email notification- BEGIN,END,FAIL,ALL
#SBATCH --mail-user=ajk@123.com # Email to which notifications will be sent
hostname
In general, the script is composed of 3 parts:
the #!/bin/bash line, which allows the script to be run as a bash script
the #SBATCH lines, which are technically bash comments but set various parameters for the SLURM scheduler
the command line itself
The #SBATCH lines shown above set key parameters.
Note
It is important to keep all #SBATCH lines together and at the top of the script; no bash code or variable settings should be done until after the #SBATCH lines.
The SLURM system copies many environment variables from your current session to
the compute host where the script is run including PATH and your current working
directory. As a result, you can specify files relative to your current location
(e.g. ./project/myfiles/myfile.txt
).
#SBATCH --ntasks=1
This line sets the number of tasks that you're requesting. Make sure that your code can use multiple cores before requesting more than one. When running MPI code, --ntasks should be set to the number of MPI processes (i.e., the same number you give as the -n parameter to mpirun). When running OpenMP code (i.e., without MPI), you do not need to set this option (if you do set it, set it to --ntasks=1). If this parameter is omitted, SLURM assumes --ntasks=1.
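As a sketch, a small MPI job script could look like this, assuming my_mpi_program is your own MPI executable and any required MPI module has already been loaded:
#!/bin/bash
#SBATCH --ntasks=4                # 4 MPI processes (same number passed to mpirun -n)
#SBATCH --cpus-per-task=1         # 1 core per MPI process
#SBATCH --time=0-00:10            # Runtime in D-HH:MM
#SBATCH --partition=all           # Partition to submit to
#SBATCH --mem-per-cpu=1000        # Memory per allocated core in MB
mpirun -n 4 ./my_mpi_program      # -n matches --ntasks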
#SBATCH --cpus-per-task=1
This line sets the number of CPU cores per task that you're requesting. Make sure that your code can use multiple cores before requesting more than one. When running MPI code, --cpus-per-task should be set to 1. When running OpenMP code, --cpus-per-task should be set to the value of the environment variable OMP_NUM_THREADS, i.e., the number of threads that you want to use. If this parameter is omitted, SLURM assumes --cpus-per-task=1.
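Similarly, a minimal sketch for an OpenMP (non-MPI) program using 8 threads (my_openmp_program is only a placeholder for your own executable):
#!/bin/bash
#SBATCH --ntasks=1                            # a single task for OpenMP code
#SBATCH --cpus-per-task=8                     # 8 OpenMP threads
#SBATCH --nodes=1                             # all cores on one machine
#SBATCH --time=0-00:30                        # Runtime in D-HH:MM
#SBATCH --partition=all
#SBATCH --mem=4000                            # total memory pool in MB
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK   # match the thread count to the allocation
./my_openmp_program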
#SBATCH --nodes=1
This line requests that the cores are all on one node. Only change this to >1 if you know your code uses a message passing protocol like MPI. SLURM makes no assumptions on this parameter: if you request more than one task (--ntasks > 1) and you forget this parameter, your job may be scheduled across nodes; unless your job is MPI (multinode) aware, it will then run slowly, as it is oversubscribed on the master node and wasting resources on the other(s).
#SBATCH --time=5
This line specifies the running time for the job in minutes. You can also use the convenient format D-HH:MM. If your job runs longer than the value you specify here, it will be cancelled. Jobs have a maximum run time of 7 days on Aether, though extensions can be done. There is no penalty for over-requesting time.
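For example, using the D-HH:MM format:
#SBATCH --time=2-12:00            # 2 days and 12 hours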
Todo
check SLURM defaults for sbatch -t N
Note
If this parameter is omitted on any partition, your job will be given the default of 10 minutes.
#SBATCH --partition=highmem
This line specifies the SLURM partition (a.k.a. queue) under which the script will be run. The highmem partition is good for jobs with extra-high memory requirements.
#SBATCH --mem=100
The Aether cluster requires that you specify the amount of memory (in MB) that you will be using for your job. Accurate specifications allow jobs to be run with maximum efficiency on the system. There are two main options, --mem-per-cpu and --mem. The --mem option specifies the total memory pool for one or more cores, and is the recommended option to use. If you must do work across multiple compute nodes (e.g., MPI code), then you must use the --mem-per-cpu option, as this will allocate the amount specified for each of the cores you have requested, whether they are on one node or multiple nodes. If this parameter is omitted, the smallest amount is allocated, usually 100 MB, and chances are good that your job will be killed as it will likely go over this amount.
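As a sketch, the two options look like this in a job script (the values are only illustrative; use one option or the other, not both):
#SBATCH --mem=4000            # single-node job: 4000 MB shared by all cores of the job
#SBATCH --mem-per-cpu=2000    # multi-node (e.g. MPI) job: 2000 MB for each allocated core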
#SBATCH --output=hostname_%j.out
This line specifies the file to which standard output will be appended. If a relative file name is used, it will be relative to your current working directory. The %j in the filename will be substituted by the job ID at runtime. If this parameter is omitted, any output will be directed to a file named SLURM-JOBID.out in the current directory.
#SBATCH --error=hostname_%j.err
This line specifies the file to which standard error will be appended. SLURM submission and processing errors will also appear in this file. The %j in the filename will be substituted by the job ID at runtime. If this parameter is omitted, any output will be directed to a file named SLURM-JOBID.out in the current directory.
Todo
check if slurm defaults for stdout and stderr are correct
#SBATCH --mail-type=END
Because jobs are processed in the background and can take some time to run, it is useful to send an email message when the job has finished (--mail-type=END). Email can also be sent for other processing stages (BEGIN, FAIL) or for all of these events (ALL).
#SBATCH --mail-user=ajk@123.com
The email address to which the
--mail-type
messages will be sent. By default, the e-mail will be sent to the address the user specified when signing up for an Aether account.
SLURM partitions¶
Partition | node range | Default/Max time (h) | Default/Max #cores per node | Default/Max Mem per CPU (MB)
---|---|---|---|---
all | node[01-16] | 12/48 | 1/28 | 4520/9128
all | node[17-56] | 12/48 | 1/28 | 4520/4520
interactive* | node[01-16] | 1/6 | 1/28 | 4520/9128
interactive* | node[17-56] | 1/6 | 1/28 | 4520/4520
interactive is the default partition. This means that jobs submitted without specifying the partition via the option -p, --partition=<partition_name> will run on the interactive partition. On this partition, a job cannot occupy more than a single node.
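For example, to send a batch job to the all partition instead of the default interactive partition, pass the option on the command line (using the example script from above):
sbatch -p all runscript.sh
or set it inside the script with #SBATCH --partition=all.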
Topology¶
Aether is placed in two racks, with the 56 compute nodes equally split between the two. The nodes are interconnected by an Intel Omni-Path network, with one Omni-Path 100 Gigabit switch per rack. However, there is a 1:2 bandwidth blocking ratio when going between the rack switches. To avoid that, your job should fit within a single rack (maximum 28 nodes or 784 cores).
SLURM can control where parallel jobs are placed via the option --switches=count@[maxtime], where count is the number of switches and maxtime is the maximum time to wait for that number of switches. maxtime should be specified in one of the following ways: “minutes”, “minutes:seconds”, “hours:minutes:seconds”, “days-hours”, “days-hours:minutes” or “days-hours:minutes:seconds”.
Below is an example of a job requesting to be allocated nodes connected to the same switch. It will wait for a maximum of 60 minutes before accepting nodes spread across more than one switch.
#SBATCH --switches=1@60
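As another illustrative sketch, a multi-node request can be combined with a switch constraint using one of the time formats listed above, here waiting up to 12 hours for nodes on a single switch (the node count is only an example):
#SBATCH --nodes=8
#SBATCH --switches=1@12:00:00     # 1 switch, wait up to 12 hours (hours:minutes:seconds)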
Most common SLURM commands¶
Submitting job scripts¶
The following example script specifies a partition, time limit, memory allocation and number of cores. All your scripts should specify values for these four parameters. You can also set additional parameters, such as job name, output file and email notification. This script performs a simple task: it generates a file of random numbers and then sorts it. A detailed explanation of the script is available here.
#!/bin/bash
#SBATCH --ntasks=1 # Number of tasks
#SBATCH --cpus-per-task=1 # Number of CPU cores per task
#SBATCH --nodes=1 # Ensure that all cores are on one machine
#SBATCH --time=0-00:05 # Runtime in D-HH:MM
#SBATCH --partition=all # Partition to submit to
#SBATCH --mem=100 # Memory pool for all cores (see also --mem-per-cpu)
#SBATCH --output=hostname_%j.out # File to which STDOUT will be written
#SBATCH --error=hostname_%j.err # File to which STDERR will be written
#SBATCH --mail-type=END # Type of email notification- BEGIN,END,FAIL,ALL
#SBATCH --mail-user=myemail@uni-bremen.de # Email to which notifications will be sent
for i in {1..100000}; do
    echo $RANDOM >> SomeRandomNumbers.txt
done
sort SomeRandomNumbers.txt
Todo
Add default values for SBATCH parameters
Now you can submit your job with the command:
sbatch myscript.sh
Information on jobs¶
List all current jobs for a user:
squeue -u <username>
List all running jobs for a user:
squeue -u <username> -t RUNNING
List all pending jobs for a user:
squeue -u <username> -t PENDING
List priority order of jobs for the current user (you) in a given partition:
showq-slurm -o -U -q <partition>
List all current jobs in the highmem partition for a user:
squeue -u <username> -p highmem
List detailed information for a job (useful for troubleshooting):
scontrol show jobid -dd <jobid>
List status info for a currently running job:
sstat --format=AveCPU,AvePages,AveRSS,AveVMSize,JobID -j <jobid> --allsteps
Once your job has completed, you can get additional information that was not available during the run. This includes run time, memory used, etc.
To get statistics on completed jobs by jobID:
sacct -j <jobid> --format=JobID,JobName,MaxRSS,Elapsed
To view the same information for all jobs of a user:
sacct -u <username> --format=JobID,JobName,MaxRSS,Elapsed
Controlling jobs¶
To cancel one job:
scancel <jobid>
To cancel all the jobs for a user:
scancel -u <username>
To cancel all the pending jobs for a user:
scancel -t PENDING -u <username>
To cancel one or more jobs by name:
scancel --name myJobName
To hold a particular job (prevent a pending job from being scheduled):
scontrol hold <jobid>
To release a held job:
scontrol release <jobid>
To requeue (cancel and rerun) a particular job:
scontrol requeue <jobid>
Job arrays and useful commands¶
As shown in the commands above, it's easy to refer to one job by its job ID, or to all your jobs via your username. What if you want to refer to a subset of your jobs? The answer is to submit your job set as a job array. Then you can use the job array ID to refer to the set when running SLURM commands, for example as sketched below.
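A minimal sketch of a job array with 10 tasks (the array range and command are only illustrative); each array task can find its own index in the SLURM_ARRAY_TASK_ID environment variable:
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --time=0-00:05            # Runtime in D-HH:MM
#SBATCH --partition=all
#SBATCH --mem=100
#SBATCH --array=1-10              # submit 10 array tasks with indices 1..10
#SBATCH --output=array_%A_%a.out  # %A = job array ID, %a = array index
echo "Processing array index $SLURM_ARRAY_TASK_ID"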
To cancel an indexed job in a job array:
scancel <jobid>_<index>
e.g.
scancel 1234_4
Advanced commands¶
The following commands work for individual jobs and for job arrays, and allow easy manipulation of large numbers of jobs. You can combine these commands with the parameters shown above to provide great flexibility and precision in job control. (Note that all of these commands are entered on one line)
Suspend all running jobs for a user (takes into account job arrays):
squeue -ho %A -u <username> -t R | xargs -n 1 scontrol suspend
Resume all suspended jobs for a user:
squeue -o "%.18A %.18t" -u <username> | awk '{if ($2 =="S"){print $1}}' | xargs -n 1 scontrol resume
After resuming, check if any are still suspended:
squeue -ho %A -u $USER -t S | wc -l
The following is useful if your group has its own queue and you want to quickly see utilization.
lsload | grep -E 'Hostname|<partition>'
Example for the all partition:
lsload | grep -E 'Hostname|all'
Hostname Cores InUse Ratio Load Mem Alloc State
node17 64 60 100.0 12.01 262 261 ALLOCATED
node18 64 64 100.0 12.00 262 240 ALLOCATED
node19 64 40 100.0 12.00 262 261 ALLOCATED
Note that while node19 has free cores, all of its memory is in use, so those cores are necessarily idle. node18 has a little free memory, but all of its cores are in use.
The scheduler will aim for 100% utilization, but jobs are generally stochastic, beginning and ending at different times with unpredictable amounts of CPU and RAM released and requested.