Quickstart Guide

This guide provides the basic information needed to get up and running on Aether for simple command line access. If you’d like more detailed information, each section has a link to fuller documentation.

This page contains content originally created by Harvard FAS Research Computing and adapted by us under the Creative Commons Attribution-NonCommercial 4.0 International License. For more information, visit https://rc.fas.harvard.edu/about/attribution/ .

Get an Aether account

Before you can access Aether you need to request an account by contacting lamos_it+aether@groups.uni-bremen.de.

Use a terminal to ssh to aether.iup.uni-bremen.de

For command line access to Aether, connect to aether.iup.uni-bremen.de using ssh. If you are running Linux or macOS, open a terminal and type ssh USERNAME@aether.iup.uni-bremen.de, where USERNAME is the name you were assigned when you received your account. Enter the password you were given by our staff.
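
For example, with USERNAME replaced by your own account name:

ssh USERNAME@aether.iup.uni-bremen.de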

For Windows computers, you will need to download PuTTY.

Determine where your files will be stored

There are two storage locations that are important to consider when running software on Aether:

  1. /home/USERNAME: /home is an NFS filesystem with no parallel access. It can be backed up regularly, and users are advised to store all files that are not easily reproducible (code, analyses, …) there.

  2. /mnt/beegfs: /mnt/beegfs is a high-performance, parallel BeeGFS filesystem. We recommend using this filesystem as your primary working area (i.e., for all compute jobs), as it is highly optimized for cluster use. Use it for processing large files, but be aware that this volume is not backed up. You have your own folder at /mnt/beegfs/user/USERNAME.
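
For example, once logged in you can change into your personal BeeGFS folder (replace USERNAME with your own account name):

cd /mnt/beegfs/user/USERNAME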

Warning

Do not use your home directory for significant computation. This degrades performance for everyone on the cluster.

For details on the different types of storage and how to obtain more, see the storage page.

Transfer any files you may need

If you’re using a Unix-like terminal such as the macOS Terminal or a Linux xterm, you’ll want to use rsync or scp for transferring data:

scp my_data_file USERNAME@aether.iup.uni-bremen.de:

This will transfer the data into the root of your home directory.
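
rsync works in a similar way and can resume interrupted transfers; a sketch that copies a folder straight into your BeeGFS working area (the folder name my_data_folder is purely illustrative):

rsync -av my_data_folder USERNAME@aether.iup.uni-bremen.de:/mnt/beegfs/user/USERNAME/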

Windows users can use the graphical WinSCP tool, or directly access Aether’s storage system as described in Access from IUP-UB to Aether storage.

Note

If you are outside the IUP-UB network, you should first connect to the ZfN VPN service and use ports 54203 and 54204 (see SSH access for details).
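
Assuming these ports are used for the ssh connection itself (the SSH access page is authoritative), you would pass one of them explicitly with -p, for example:

ssh -p 54203 USERNAME@aether.iup.uni-bremen.de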

Familiarize yourself with proper conduct on Aether

Aether is a fairly large system of shared resources. While much effort is made to ensure that you can do your work in relative isolation, some rules must be followed to avoid interfering with other users’ work.

The most important rule on Aether is to avoid performing computations on the login nodes. Once you’ve logged in, you must either submit a batch processing script or start an interactive session (see below). Any significant processing (high memory requirements, long running time, etc.) that is attempted on the login nodes will be killed.

Determine what software you’d like to load and run

An enhanced module system called Lmod is used on Aether to control the run-time environment for individual applications. To find out what modules are available, you can use the module avail command. By itself, module avail will print out the entire list of packages. To find a specific tool, use the module spider command.
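
For example (the package name netcdf is only an illustration and may be provided under a different name on Aether):

module avail
module spider netcdf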

Todo

fix module-query examples

Once you’ve determined what software you would like to use, load the module:

module load MODULENAME

where MODULENAME is the specific software you want to use. You can use module unload MODULENAME to unload a module. To see what modules you have loaded type module list. This is very helpful information to provide when you submit help tickets.
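
A short worked example, again using the illustrative name netcdf:

module load netcdf
module list
module unload netcdf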

For details on finding and using modules effectively, see the Software on Aether page.

For details on running software on the Aether cluster, including graphical applications, see the module page.

Run a batch job …

The Aether cluster is managed by a batch job control system called SLURM. Tools that you want to run are embedded in a command script and the script is submitted to the job control system using an appropriate SLURM command.

For a simple example that just prints the hostname of the compute node it runs on, create a file called hostname.slurm with the following content:

#!/bin/bash
#SBATCH -n 1 # Number of cores requested
#SBATCH -N 1 # Number of nodes requested
#SBATCH -t 15 # Runtime in minutes
#SBATCH -p all # Partition to submit to
#SBATCH --mem=100 # Memory per node in MB (see also --mem-per-cpu)
#SBATCH --open-mode=append # log files will not be reset at the start of each (requeued) run
#SBATCH -o hostname_%j.out # Standard output goes to this file
#SBATCH -e hostname_%j.err # Standard error goes to this file

hostname

Then submit this job script to SLURM:

sbatch hostname.slurm
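
sbatch responds with the ID assigned to your job; the output typically looks like this (the number is illustrative):

Submitted batch job 839209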

When command scripts are submitted, SLURM looks at the resources you’ve requested and waits until an acceptable compute node is available on which to run it. Once the resources are available, it runs the script as a background process (i.e., you don’t need to keep your terminal open while it is running), returning the output and error streams to the locations designated by the script.

You can monitor the progress of your job using the squeue -j JOBID command, where JOBID is the ID returned by SLURM when you submit the script. The output of this command will indicate if your job is PENDING, RUNNING, COMPLETED, FAILED etc. If the job is completed, you can get the output from the file specified by the -o option. If there are errors, they should appear in the file specified by the -e option.
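
For example, to check on a job with ID 839209 (the number is illustrative):

squeue -j 839209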

If you need to terminate a job, the scancel command can be used (JOBID is the number returned when the job is submitted):

scancel JOBID

SLURM-managed resources are divided into partitions (known as queues in other batch processing systems). Normally, you will be using the all partition.
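
To see which partitions exist and what their limits are, you can run sinfo (the exact columns and values depend on the cluster configuration):

sinfo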

For more information about SLURM, please see The SLURM job manager page. Specifically, there is information about partitions and running jobs.

… or an interactive job

Batch jobs are great for long-running, computationally intensive data processing. However, many activities like one-off scripts, graphics and visualization, and exploratory analysis do not work well in a batch system, but are too resource intensive to be done on a login node.

You can start an interactive session using a specific flavor of the srun command:

srun -p interactive --pty --mem 500 -t 0-06:00 /bin/bash

srun is like sbatch, but it runs synchronously (i.e. it does not return until the job is finished). The example starts a job on the “interactive” partition, with pseudo-terminal mode on (--pty), a memory allocation of 500 MB RAM (--mem 500), and for 6 hours (-t in D-HH:MM format). It also assumes one core on one node. The final argument is the command that you want to run. In this case you’ll just get a shell prompt on a compute host. Now you can run any normal Linux commands without taking up resources on a login node. Make sure you choose a reasonable amount of memory (--mem) for your session.
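
If you need more than the default single core, you can add the same options used in batch scripts; a sketch requesting 4 cores on one node (the values are illustrative):

srun -p interactive --pty -n 4 -N 1 --mem 2000 -t 0-06:00 /bin/bash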

Getting further help

If you have any trouble running jobs on Aether, first check The SLURM job manager page, which is quite comprehensive. Then, if your questions aren’t answered there, feel free to contact us at lamos_it+aether@groups.uni-bremen.de. Tell us the ID of the job in question, and provide the script you ran as well as the error and output files. The output of module list is helpful, too.

A note on requesting memory (--mem or --mem-per-cpu)

In SLURM you must declare how much memory your job needs using the --mem or --mem-per-cpu options. By default SLURM assumes you need 4520 MB per CPU. If you don’t request enough, the job can be terminated, often without very useful information (error files may show segfaults, file write errors, and other downstream symptoms). If you request too much, it can increase your wait time (it’s harder to allocate a lot of memory than a little) and crowd out other users’ jobs. Please also note that your job’s priority is calculated based on the portion of the computing resources (memory and CPU) it requests and the computing resources your previous jobs have already consumed. Large memory requests therefore lower your job’s priority, and the scheduler may take a long time to fit your job into the cluster.
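
For example, to override the default in a job script you might request a specific per-CPU amount (the value is purely illustrative; pick one that matches your program’s actual needs):

#SBATCH --mem-per-cpu=2000 # Memory per CPU in MB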

You can view the runtime and actual memory usage of a job after a stated start time (specified using --starttime or -S; the default is 00:00:00 of the current day):

sacct -j JOBID --format=JobID,JobName,ReqMem,MaxRSS,Elapsed --starttime YYYY-MM-DD[THH:MM[:SS]]

where JOBID is the numeric job ID of a past job:

$ sacct -j 839209 --format=JobID,JobName,ReqMem,MaxRSS,Elapsed --starttime 2019-09-05T00:00:00
       JobID    JobName     ReqMem     MaxRSS    Elapsed
------------ ---------- ---------- ---------- ----------
839209       showlimits     4520Mc              00:00:31
839209.batch      batch     4520Mc      2136K   00:00:31

The .batch portion of the job is usually what you’re looking for, but the output may vary. This job had a maximum memory footprint of about 2MB, and took a little over 30 seconds to run.