Updated 2022-11-22

Using Slurm on Phoenix

Tip

Visit our conversion guide to convert your PBS scripts from prior to October 2022 to Slurm scripts. To learn more about the move to Slurm, visit our Phoenix Slurm Conversion page.

View this very useful guide from SchedMD for additional Slurm commands and options beyond those listed below. Further guidelines on more advanced scripts are in the user documentation on this page. The following sections are covered in detail on this page; click a link to navigate:

  1. Informational Commands
  2. Job Accounting
  3. Job Submission
  4. Job Submission Examples

Informational Commands

squeue

Use squeue to check job status for pending (PD) and running (R) jobs. Many options are available to include with the command, including these:

  • Add -j <job number> to show information about specific jobs. Separate multiple job numbers with a comma.
  • Add -u <username> to show jobs belonging to a specific user, e.g., -u gburdell3.
  • Add -A <charge account> to see jobs belonging to a specific charge account, e.g., -A gts-gburdell3.
  • Add -p <partition> to see jobs submitted to a specific partition, e.g., -p cpu-small.
  • Add -q <QOS> to see jobs submitted to a specific QOS, e.g., -q inferno.
  • Run man squeue or visit the squeue documentation page for more options.

sacct

After a job has completed, use sacct to find information about it. Many of the same options for squeue are available.

  • Add -j <job number> to find information about specific jobs.
  • Add -u <username> to see all jobs belonging to a specific user.
  • Add -A <charge account> to see jobs belonging to a specific charge account, e.g., -A gts-gburdell3.
  • Add -X to show information only about the allocation, rather than the steps inside it.
  • Add -S <time> to list only jobs after a specified time. Multiple time formats are accepted, including YYYY-MM-DD[THH:MM[:SS]], e.g., 2022-08-01T19:05:23.
  • Add -o <fields> to specify which columns of data should appear in the output. Run sacct --helpformat to see a list of available fields.
  • Run man sacct or visit the sacct documentation page for more options.
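
As a rough illustration of how these options combine, the sketch below assembles a sacct command line for one user's jobs since a given time. The helper name sacct_since_cmd and the chosen output fields (JobID, JobName, State, Elapsed) are assumptions made for illustration. The function only prints the command, so it can be inspected before being run on a login node.

```shell
# Hypothetical helper: print (not run) a sacct command for one user's jobs
# after a given time, with one line per allocation (-X).
sacct_since_cmd() {
    user=$1
    since=$2    # e.g. 2022-08-01T19:05:23
    printf 'sacct -X -u %s -S %s -o JobID,JobName,State,Elapsed\n' "$user" "$since"
}

sacct_since_cmd gburdell3 2022-08-01T19:05:23
```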

scancel

To cancel a job, run scancel <job number>, e.g., scancel 1440 to cancel job 1440. You can use squeue to find the job number first.

pace-check-queue

The pace-check-queue utility provides an overview of current utilization of each partition's nodes. Use the name of a specific QOS or partition as the input, e.g., pace-check-queue inferno (for QOS) or pace-check-queue cpu-small (for partition). On Slurm clusters, utilized and allocated local disk (including percent utilization) are not available.

  • Add -s to see all features of each node in the partition.
  • Add -c to color-code the "Accepting Jobs?" column.


pace-job-summary

The pace-job-summary utility provides a high-level overview of a job processed on the cluster. Usage is simple:

[gburdell3@login-phoenix-slurm-1 ~]$ pace-job-summary
Usage: `pace-job-summary <JobID>`

Output example:

[gburdell3@login-phoenix-slurm-1 ~]$ pace-job-summary 2836
---------------------------------------
Begin Slurm Job Summary for 2836
Query Executed on 2022-08-17 at 18:21:33
---------------------------------------
Job ID:     2836
User ID:    gburdell3
Account:    gts-gburdell3
Job name:   SlurmPythonExample
Resources:  cpu=4,mem=4G,node=1
Rsrc Used:  cput=00:00:08,vmem=0.8M,walltime=00:00:02,mem=0.0M,energy_used=0
Exit Code:  0:0
Partition:  cpu-small
QOS:        inferno
Nodes:      atl1-1-01-011-4-2
---------------------------------------
Batch Script for 2836
---------------------------------------
#!/bin/bash
#SBATCH -JSlurmPythonExample                    # Job name
#SBATCH --account=gts-gburdell3                  # charge account
#SBATCH -N1 -n4                                 # Number of nodes and cores required
#SBATCH --mem-per-cpu=1G                        # Memory per core
#SBATCH -t15                                    # Duration of the job (Ex: 15 mins)
#SBATCH -qinferno                               # QOS Name
#SBATCH -oReport-%j.out                         # Combined output and error messages file
#SBATCH --mail-type=BEGIN,END,FAIL              # Mail preferences
#SBATCH --mail-user=gburdell3@gatech.edu        # E-mail address for notifications
cd $SLURM_SUBMIT_DIR                            # Change to working directory

module load anaconda3/2022.05                   # Load module dependencies
srun python test.py                             # Example Process
---------------------------------------

pace-quota

To find your Phoenix-Slurm charge accounts, run pace-quota while logged into the Phoenix-Slurm cluster. Most charge accounts will be of the form gts-<PI username>[-<descriptor>], e.g., gts-gburdell3 for researchers in Prof. Burdell's group using the free tier account.

Running pace-quota will also report on utilization of your storage allocations.

Job Accounting

Phoenix's accounting system is based on the most significant processing unit on the compute node:

  • On CPU and CPU-SAS nodes, charge rates are based on CPU-hours (total number of procs * walltime) allocated for the job
  • On GPU-V100, GPU-RTX6000, and GPU-A100 nodes, charge rates are based on GPU-hours (total number of GPUs * walltime) allocated for the job
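
The arithmetic behind both bullets can be sketched as follows. The function name charged_hours is hypothetical, and the result is only the number of CPU-hours or GPU-hours. The credits actually charged also depend on the rate for the node class, which is not reproduced here.

```shell
# Hours charged for a job: processors (or GPUs) multiplied by walltime.
# Arguments: number of procs or GPUs, walltime in minutes.
# awk is used because shell arithmetic is integer-only.
charged_hours() {
    awk -v n="$1" -v minutes="$2" 'BEGIN { printf "%.2f\n", n * minutes / 60 }'
}

charged_hours 4 15    # 4 cores for 15 minutes -> 1.00 CPU-hours
charged_hours 2 90    # 2 GPUs for 90 minutes  -> 3.00 GPU-hours
```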

The rates for each of the node classes can be found in this table using the [GEN] lines, and a description for each of the compute node classes on Phoenix-Slurm can be found on this page.

When submitting a job, the account to which the job should be charged must be specified using the -A flag, either on the command line or as part of the Slurm batch file. The scheduler will verify that the account has sufficient funds available to run the full length of the job before accepting it, and a corresponding lien will be placed on the account once the job starts running. If the job finishes early, the excess funds will be released.

To see the accounts to which you can submit jobs and their current balances, run the pace-quota command and read the "Job Charge Account Balances" section:

[puser32@login-phoenix-slurm-3 ~]$ pace-quota
...
====================================================================================================
                                    Job Charge Account Balances
====================================================================================================
Name                           Balance    Reserved   Available
gts-gburdell3-CODA20           291798.90  3329.35    288469.67
gts-gburdell3-phys             241264.01  69.44      241194.66
gts-gburdell3                  41.72      0.00       41.72

The balance column shows the current total based on completed transactions; the reserved column lists the sum of liens based on running jobs; and the available column displays the total funds available for new job submissions.
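
If you want to read one of these values from a script, a minimal sketch is shown below. It assumes the four-column layout of the sample output above (Name, Balance, Reserved, Available). The function name available_for is hypothetical, and the sample text stands in for real pace-quota output.

```shell
# Print the Available column for one account from pace-quota-style text on stdin.
# Assumes four whitespace-separated columns: Name Balance Reserved Available.
available_for() {
    awk -v acct="$1" '$1 == acct && NF == 4 { print $4 }'
}

# Sample rows copied from the output above; on the cluster, pace-quota
# itself could be piped into the function instead.
available_for gts-gburdell3 <<'EOF'
Name                           Balance    Reserved   Available
gts-gburdell3-CODA20           291798.90  3329.35    288469.67
gts-gburdell3-phys             241264.01  69.44      241194.66
gts-gburdell3                  41.72      0.00       41.72
EOF
```

Matching on the exact account name keeps gts-gburdell3 from also matching gts-gburdell3-CODA20.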

There are several types of accounts currently available to researchers; the appropriate choice depends on your computational needs and the preferences of the researcher responsible for the account.

The following table summarizes the various accounts that may be available on Phoenix.

Account Name Syntax | Description | Example
gts-<PI UID> | An institute-sponsored account that provides 10k CPU hours on a base CPU-192GB node, although the credits can be used on any node class. These credits reset on the 1st of the month. | gts-gburdell3
gts-<PI UID>-CODA20 | Account for 2020 hardware refresh with the move to Coda; credits are determined based on 5 years of cycles on the refreshed hardware that was determined based on the "equivalent or better" refreshment rubric using the SPECfp_rate benchmark | gts-gburdell3-CODA20
gts-<PI UID>-FY20PhaseN | Account for compute resources purchased in FY20; credited with the maximum of FY20 expenditures or credits equivalent to 5 years of cycles on the purchased hardware | gts-gburdell3-FY20Phase2
gts-<PI UID>-<group> | PI-specific child account for a shared (multi-PI or school-owned) account. This is the account to which jobs should be charged. Depending on the arrangement made by the shared account's managers, there may be a fixed value assigned to each PI, or you may have access to the full shared balance. The visible balance may be a total lifetime value or a value reset each month, depending on the managers' preference. | gts-gburdell3-phys
gts-<group>-CODA20 | Parent account for a (multi-PI) shared account; child accounts are either allocated a fixed percentage of deposits to this account or draw down from the parent balance. This unseen account cannot be used for job submissions, but instead provides the funds from which child accounts may draw. | gts-phys-CODA20
gts-<PI UID>-<custom> | Account opened in Phoenix on the postpaid billing model. PIs are billed based on actual usage each month and may set limits if preferred. | gts-gburdell3-paid
gts-<PI UID>-<custom> | Account opened in Phoenix on the prepaid billing model for state funds. PIs deposit funds in advance. | gts-gburdell3-startup

Job Submission

QOS on the Phoenix-Slurm Cluster

There are two QOS levels on the Phoenix-Slurm cluster: inferno and embers. Although the two QOS levels provide access to the same resource pool, the job policies are quite different. Jobs are automatically assigned to the appropriate nodes based on the requested resources.

Inferno: The Primary QOS

Inferno is the main, and default, QOS for the Phoenix cluster. Jobs in this QOS will consume account credits, but will benefit from a larger job limit, higher priority, and longer wallclock limits. This QOS should be the main production mechanism for workflow, as jobs here will start sooner and cannot be pre-empted. For jobs in the inferno QOS, the following policies apply:

  • Base priority = 250,000
  • Max jobs per user = 500
  • Max eligible jobs per user = 500*
  • Wallclock limit = The minimum of the following:
    • 21 days for CPU resources (e.g. CPU-192GB or CPU-768GB-SAS node classes)
    • 3 days for GPU resources (e.g. GPU-192GB-V100 or GPU-384GB-RTX6000 node classes)
    • 264,960 CPU-hours ÷ Number of Requested Processors
  • To submit jobs to the inferno QOS, you can use the -q inferno flag or omit it when submitting jobs, as this is the default QOS for all jobs.

Note

The scheduler will reject a job if its requested walltime exceeds 264,960 CPU-hours ÷ Number of Requested Processors. If your job is listed as complete with no output, please check that nodes * cores per node * walltime < 264,960 processor-hours.
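
To check a request before submitting, the inferno wallclock rule can be worked out as in the sketch below. The function name max_walltime_minutes is hypothetical. The calculation uses only the limits stated above and integer arithmetic, so results are rounded down to whole minutes.

```shell
# Largest walltime (in minutes) allowed by the inferno limits stated above.
# Arguments: number of processors, then "cpu" or "gpu" for the resource type.
max_walltime_minutes() {
    procs=$1
    kind=${2:-cpu}
    # 264,960 CPU-hours divided among the requested processors, in minutes.
    by_cpu_hours=$(( 264960 * 60 / procs ))
    if [ "$kind" = gpu ]; then
        cap=$(( 3 * 24 * 60 ))     # 3 days
    else
        cap=$(( 21 * 24 * 60 ))    # 21 days
    fi
    if [ "$by_cpu_hours" -lt "$cap" ]; then
        echo "$by_cpu_hours"
    else
        echo "$cap"
    fi
}

max_walltime_minutes 48 cpu      # 30240 (the 21-day cap applies)
max_walltime_minutes 1024 cpu    # 15525 (about 10.8 days)
```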

Embers: The Backfill QOS

Embers is the backfill QOS on the Phoenix cluster - jobs submitted here are meant to take advantage of opportunistic cycles remaining after the inferno jobs have been accommodated. Jobs submitted here have a small job limit, the lowest priority, and shorter wallclock limits. Additionally, jobs submitted to this QOS are eligible for pre-emption; after the first hour that the job is running, if an inferno job is waiting for the resources being consumed, the running embers job will be killed. You can resubmit the job if you would like to try again. However, while jobs submitted to this QOS still require an associated account to run, no credits will be consumed from the account. As such, jobs in the embers QOS should not be critical workflow that faces an imminent deadline. For jobs in the embers QOS, the following policies apply:

  • Base priority = 0
  • Max jobs per user = 50
  • Max eligible jobs per user = 1
  • Wallclock limit = 8 hours
  • Eligible for preemption after 1 hour
  • To submit jobs to the embers QOS, use the -q embers flag when submitting jobs.

Tip

The embers QOS is ideal for exploratory work as you develop, compile, and debug your applications.

Additional Constraints on Running Jobs

In addition to the above per-job limits, the scheduler is also configured with the following limits on concurrently running jobs to provide balanced utilization of the resource by all. These limits apply to jobs submitted in the inferno QOS. Jobs that violate these limits will be held in the queue until currently running jobs complete and the total number of utilized processors and GPUs, and the remaining CPU-time, fall below the thresholds.

  • Per-charge-account processors = 6000
  • Per-user GPUs = 32
  • Per-charge-account CPU-time = 300,000 CPU-hours

Job Submission Examples

Interactive Jobs

A Slurm interactive job reserves resources on compute nodes to use interactively.

We recommend using salloc to allocate resources. At minimum, charge account (--account or -A) and QOS (--qos or -q, inferno or embers) are required to start an interactive job. Additionally, the number of nodes (--nodes or -N), CPU cores (--ntasks-per-node for cores per node or -n for total cores), and wall time requested (--time or -t using the format D-HH:MM:SS for days, hours, minutes, and seconds) may also be designated. Run man salloc or visit the salloc documentation page for more options.

In this example, use salloc to allocate 1 node with 4 cores for an interactive job using the gts-gburdell3 account on the inferno QOS:

[gburdell3@login-phoenix-slurm-1 ~]$ salloc -A gts-gburdell3 -qinferno -N1 --ntasks-per-node=4 -t1:00:00
salloc: Pending job allocation 1464
salloc: job 1464 queued and waiting for resources

After the pending status, your job will start once resources are granted, and you will see the following prompt:

[gburdell3@login-phoenix-slurm-1 ~]$ salloc -A gts-gburdell3 -qinferno -N1 --ntasks-per-node=4 -t1:00:00
salloc: Granted job allocation 1464
salloc: Waiting for resource configuration
salloc: Nodes atl1-1-02-007-30-2 are ready for job
---------------------------------------
Begin Slurm Prolog: Oct-07-2022 16:10:49
Job ID:    1464
User ID:   gburdell3
Account:   gts-gburdell3
Job name:  interactive
Partition: cpu-small
QOS:       inferno
---------------------------------------
[gburdell3@atl1-1-02-007-30-2 ~]$

Once resources are available for the job, you should be automatically logged into an interactive job on a compute node with the resources requested from the login node. Here, in this interactive session, use srun with hostname:

[gburdell3@atl1-1-02-007-30-2 ~]$ srun hostname
atl1-1-02-007-30-2.pace.gatech.edu
atl1-1-02-007-30-2.pace.gatech.edu
atl1-1-02-007-30-2.pace.gatech.edu
atl1-1-02-007-30-2.pace.gatech.edu

Note that there are 4 instances of the compute node hostname because we requested 1 node with 4 cores. To exit the interactive job, you can wait for the allotted time to expire in your session (in this example, 1 hour) or you can exit manually using exit:

[gburdell3@atl1-1-02-007-30-2 ~]$ exit
exit
salloc: Relinquishing job allocation 1464
salloc: Job allocation 1464 has been revoked.

Batch Jobs

Write a Slurm script as a plain text file, then submit it with the sbatch command. Any computationally-intensive command should be prefixed with srun for best performance using Slurm.

  • (Required) Start the script with #!/bin/bash.
  • (Required) Include a charge account with #SBATCH -A <account>. To find your Phoenix-Slurm charge accounts, run pace-quota while logged into the Phoenix-Slurm cluster.
  • Select a QOS, either inferno (paid) or embers (free backfill), with #SBATCH -q <QOS>. If a QOS is not provided, the inferno QOS will be selected.
  • Name a job with #SBATCH -J <job name>.
  • Include resource requests:
    • For requesting cores, we recommend 1 of 2 options:
      1. #SBATCH -n or #SBATCH --ntasks specifies the number of cores for the entire job. The default is 1 core.
      2. #SBATCH -N specifies the number of nodes, combined with #SBATCH --ntasks-per-node, which specifies the number of cores per node. For GPU jobs, #SBATCH --ntasks-per-node does not need to be specified because the default is 6 cores per GPU for RTX6000 or 12 cores per GPU for V100.
    • For requesting memory, we recommend 1 of 2 options:
      1. For CPU-only jobs, use #SBATCH --mem-per-cpu=<request with units>, which specifies the amount of memory per core. To request all the memory on a node, include #SBATCH --mem=0. The default is 1 GB/core.
      2. For GPU jobs, use #SBATCH --mem-per-gpu=<request with units>, which specifies the amount of memory per GPU.
  • Request walltime with #SBATCH -t. Job walltime requests (#SBATCH -t) should use the format D-HH:MM:SS for days, hours, minutes, and seconds requested. Alternatively, include just an integer that represents minutes. The default is 1 hour.
  • Name your output file, which will include both STDOUT and STDERR, with #SBATCH -o <file name>.
  • If you would like to receive email notifications, include #SBATCH --mail-type=NONE,BEGIN,END,FAIL,ARRAY_TASKS,ALL with only the conditions you prefer.
    • If you wish to use a non-default email address, add #SBATCH --mail-user=<preferred email>.
  • When listing commands to run inside the job, any computationally-intensive command should be prefixed with srun for best performance.
  • Run man sbatch or visit the sbatch documentation page for more options.
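
The directives listed above can be assembled into a small template, as in the sketch below. The helper name make_sbatch and its argument order are hypothetical. It covers only a simple CPU job with 1 GB per core, and the generated text should be reviewed before submission.

```shell
# Hypothetical helper: print a minimal batch script for a CPU-only job.
# Arguments: charge account, QOS, total cores, walltime in minutes, command.
make_sbatch() {
    cat <<EOF
#!/bin/bash
#SBATCH -A $1
#SBATCH -q $2
#SBATCH -n $3
#SBATCH --mem-per-cpu=1G
#SBATCH -t $4
#SBATCH -o Report-%j.out
cd \$SLURM_SUBMIT_DIR
srun $5
EOF
}

make_sbatch gts-gburdell3 embers 4 15 "python test.py"
```

Redirecting the output to a file, e.g. make_sbatch ... > job.sbatch, gives a script that can then be submitted with sbatch.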

Basic Python Example

  • The guide will focus on providing a full runthrough of loading software and submitting a job on Phoenix-Slurm
  • In this guide, we'll load anaconda3 and run a simple Python script, submitting to the inferno QOS

While logged into Phoenix-Slurm, use a text editor such as nano, vi, or emacs to create the following python script, call it test.py

#simple test script
result = 2 ** 2
print("Result of 2 ^ 2: {}".format(result))

Now, create a job submission script SlurmPythonExample.sbatch with the commands below:

#!/bin/bash
#SBATCH -JSlurmPythonExample                    # Job name
#SBATCH --account=gts-gburdell3                 # charge account
#SBATCH -N1 --ntasks-per-node=4                 # Number of nodes and cores per node required
#SBATCH --mem-per-cpu=1G                        # Memory per core
#SBATCH -t15                                    # Duration of the job (Ex: 15 mins)
#SBATCH -qinferno                               # QOS Name
#SBATCH -oReport-%j.out                         # Combined output and error messages file
#SBATCH --mail-type=BEGIN,END,FAIL              # Mail preferences
#SBATCH --mail-user=gburdell3@gatech.edu        # E-mail address for notifications
cd $SLURM_SUBMIT_DIR                            # Change to working directory

module load anaconda3/2022.05                   # Load module dependencies
srun python test.py                             # Example Process
  • Use the pace-quota command to find the name of your charge account(s).
  • Make sure that test.py and SlurmPythonExample.sbatch are in the same folder. It is important that you submit the job from this directory. $SLURM_SUBMIT_DIR is a variable that contains the path of the directory from which the job was submitted.
  • An account name is required to submit a job on the Phoenix-Slurm cluster. Usage is charged to this account.
  • module load anaconda3/2022.05 loads anaconda3, which includes python.
  • srun python test.py runs the python script. srun runs the program as many times as specified by the -n or --ntasks option. If we have just python test.py, then the program will run only once.

You can submit the script by running sbatch SlurmPythonExample.sbatch from command line. For checking job status, use squeue -u gburdell3. For deleting a job, use scancel <jobid>. Once the job is completed, you'll see a Report-<jobid>.out file, which contains the results of the job. It will look something like this:

#Output file
---------------------------------------
Begin Slurm Prolog: Oct-07-2022 16:10:04
Job ID:    1470
User ID:   gburdell3
Account:   gts-gburdell3
Job name:  SlurmPythonExample
Partition: cpu-small
QOS:       inferno
---------------------------------------
Result of 2 ^ 2: 4
Result of 2 ^ 2: 4
Result of 2 ^ 2: 4
Result of 2 ^ 2: 4
---------------------------------------
Begin Slurm Epilog: Oct-07-2022 16:10:06
Job ID:        1470
Array Job ID:  _4294967294
User ID:       gburdell3
Account:       gts-gburdell3
Job name:      SlurmPythonExample
Resources:     cpu=4,mem=4G,node=1
Rsrc Used:     cput=00:00:12,vmem=8K,walltime=00:00:03,mem=0,energy_used=0
Partition:     cpu-small
QOS:           inferno
Nodes:         atl1-1-02-007-30-2
---------------------------------------

MPI Jobs

Warning

Do not use mpirun or mpiexec with Slurm. Use srun instead.

You may want to run Message Passing Interface (MPI) jobs, which utilize a message-passing standard designed for parallel computing on the cluster.

In this set of examples, we will compile "hello world" MPI code from MPI Tutorial and run the program using srun.

To set up our environment for both MPI job examples, use the following steps to create a new directory and download the MPI code:

[gburdell3@login-phoenix-slurm-1 ~]$ mkdir slurm_mpi_example
[gburdell3@login-phoenix-slurm-1 ~]$ cd slurm_mpi_example
[gburdell3@login-phoenix-slurm-1 slurm_mpi_example]$ wget https://raw.githubusercontent.com/mpitutorial/mpitutorial/gh-pages/tutorials/mpi-hello-world/code/mpi_hello_world.c

Interactive MPI Example

For running MPI in Slurm using an interactive job, follow the steps for Interactive Jobs to enter an interactive session:

  • First, as in the interactive job example, use salloc to allocate 2 nodes with 4 cores each for an interactive job using the gts-gburdell3 account on the inferno QOS:
[gburdell3@login-phoenix-slurm-1 ~]$ salloc -A gts-gburdell3 -qinferno -N2 --ntasks-per-node=4 -t1:00:00
salloc: Pending job allocation 1471
salloc: job 1471 queued and waiting for resources
  • Next, after the pending status, your job will start once resources are granted, with the following prompt:
salloc: job 1471 has been allocated resources
salloc: Granted job allocation 1471
salloc: Waiting for resource configuration
salloc: Nodes atl1-1-02-007-30-2,atl1-1-02-018-24-2 are ready for job
---------------------------------------
Begin Slurm Prolog: Oct-07-2022 16:10:09
Job ID:    1471
User ID:   gburdell3
Account:   gts-gburdell3
Job name:  interactive
Partition: cpu-small
QOS:       inferno
---------------------------------------
[gburdell3@atl1-1-02-007-30-2 ~]$
  • Next, within your interactive session and in the slurm_mpi_example directory created earlier with the mpi_hello_world.c example code, load the relevant modules and compile the MPI code using mpicc:
[gburdell3@atl1-1-02-007-30-2 ~]$ cd slurm_mpi_example
[gburdell3@atl1-1-02-007-30-2 slurm_mpi_example]$ module load gcc/10.3.0 mvapich2/2.3.6
[gburdell3@atl1-1-02-007-30-2 slurm_mpi_example]$ mpicc mpi_hello_world.c -o mpi_hello_world
  • Next run the MPI job using srun:
[gburdell3@atl1-1-02-007-30-2 slurm_mpi_example]$ srun mpi_hello_world
  • Finally, the following should be output from this interactive MPI example:
Hello world from processor atl1-1-02-007-30-2.pace.gatech.edu, rank 0 out of 8 processors
Hello world from processor atl1-1-02-007-30-2.pace.gatech.edu, rank 2 out of 8 processors
Hello world from processor atl1-1-02-007-30-2.pace.gatech.edu, rank 3 out of 8 processors
Hello world from processor atl1-1-02-018-24-2.pace.gatech.edu, rank 4 out of 8 processors
Hello world from processor atl1-1-02-018-24-2.pace.gatech.edu, rank 7 out of 8 processors
Hello world from processor atl1-1-02-007-30-2.pace.gatech.edu, rank 1 out of 8 processors
Hello world from processor atl1-1-02-018-24-2.pace.gatech.edu, rank 5 out of 8 processors
Hello world from processor atl1-1-02-018-24-2.pace.gatech.edu, rank 6 out of 8 processors
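
One way to confirm that the ranks were spread across both nodes as requested is to count the output lines per host. The sketch below assumes the "Hello world from processor <host>, rank ..." line format shown above. The function name ranks_per_host is hypothetical.

```shell
# Count MPI ranks per host from hello-world output lines on stdin.
ranks_per_host() {
    awk '/^Hello world from processor/ {
             sub(/,$/, "", $5)    # strip the trailing comma after the hostname
             count[$5]++
         }
         END { for (h in count) print h, count[h] }' | sort
}

# Two sample lines from the output above; in a session one could run
# "srun mpi_hello_world | ranks_per_host" instead.
ranks_per_host <<'EOF'
Hello world from processor atl1-1-02-007-30-2.pace.gatech.edu, rank 0 out of 8 processors
Hello world from processor atl1-1-02-018-24-2.pace.gatech.edu, rank 4 out of 8 processors
EOF
```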

Batch MPI Example

For running MPI in Slurm using a batch job, follow the steps in Batch Jobs and Basic Python Example to set up and run a batch job.

  • First, in the slurm_mpi_example directory created earlier with the mpi_hello_world.c example code, create a file named SlurmBatchMPIExample.sbatch with the following content:
#!/bin/bash
#SBATCH -JSlurmBatchMPIExample                  # Job name
#SBATCH --account=gts-gburdell3                 # charge account
#SBATCH -N2 --ntasks-per-node=4                 # Number of nodes and cores per node required
#SBATCH --mem-per-cpu=1G                        # Memory per core
#SBATCH -t1:00:00                               # Duration of the job (Ex: 1 hour)
#SBATCH -qinferno                               # QOS Name
#SBATCH -oReport-%j.out                         # Combined output and error messages file
#SBATCH --mail-type=BEGIN,END,FAIL              # Mail preferences
#SBATCH --mail-user=gburdell3@gatech.edu        # E-mail address for notifications

cd $HOME/slurm_mpi_example                      # Change to working directory created in $HOME

# Compile MPI Code
module load gcc/10.3.0 mvapich2/2.3.6
mpicc mpi_hello_world.c -o mpi_hello_world

# Run MPI Code
srun mpi_hello_world
  • This batch file combines the configuration for the Slurm batch job submission, the compilation for the MPI code, and running the MPI job using srun.
  • Next run the MPI batch job using sbatch in the slurm_mpi_example directory:
[gburdell3@login-phoenix-slurm-1 ~]$ cd slurm_mpi_example
[gburdell3@login-phoenix-slurm-1 slurm_mpi_example]$ sbatch SlurmBatchMPIExample.sbatch
Submitted batch job 1473
  • This example should not take long, but it may take time to run depending on how busy the Slurm queue is.

  • Finally, after the batch MPI job example has run, the following should be output in the file created in the same directory named Report-<job id>.out:

---------------------------------------
Begin Slurm Prolog: Oct-07-2022 16:10:09
Job ID:    1473
User ID:   gburdell3
Account:   gts-gburdell3
Job name:  SlurmBatchMPIExample
Partition: cpu-small
QOS:       inferno
---------------------------------------
Hello world from processor atl1-1-02-007-30-2.pace.gatech.edu, rank 1 out of 8 processors
Hello world from processor atl1-1-02-018-24-2.pace.gatech.edu, rank 4 out of 8 processors
Hello world from processor atl1-1-02-007-30-2.pace.gatech.edu, rank 2 out of 8 processors
Hello world from processor atl1-1-02-007-30-2.pace.gatech.edu, rank 3 out of 8 processors
Hello world from processor atl1-1-02-007-30-2.pace.gatech.edu, rank 0 out of 8 processors
Hello world from processor atl1-1-02-018-24-2.pace.gatech.edu, rank 5 out of 8 processors
Hello world from processor atl1-1-02-018-24-2.pace.gatech.edu, rank 7 out of 8 processors
Hello world from processor atl1-1-02-018-24-2.pace.gatech.edu, rank 6 out of 8 processors
---------------------------------------
Begin Slurm Epilog: Oct-07-2022 16:10:11
Job ID:        1473
Array Job ID:  _4294967294
User ID:       gburdell3
Account:       gts-gburdell3
Job name:      SlurmBatchMPIExample
Resources:     cpu=8,mem=8G,node=2
Rsrc Used:     cput=00:00:16,vmem=1104K,walltime=00:00:02,mem=0,energy_used=0
Partition:     cpu-small
QOS:           inferno
Nodes:         atl1-1-02-007-30-2,atl1-1-02-018-24-2
---------------------------------------

Array Jobs

To submit a number of identical jobs without having to drive the submission with an external script, use SLURM's array jobs feature.

Note

There is a maximum limit of 500 jobs (queued plus running) per user on Phoenix.

  • A job array can be submitted simply by adding #SBATCH --array=x-y to the job script where x and y are the array bounds. A job array can also be specified at the command line with sbatch --array=x-y job_script.sbatch
  • A job array will then be created with a number of independent jobs a.k.a. array tasks that correspond to the defined array.
  • SLURM's job array handling is very versatile. Instead of a task range, you can provide a comma-separated list of task numbers, for example to rerun a few failed tasks from a previously completed job array: sbatch --array=4,8,15,16,23,42 job_script.sbatch. Command line options override options in the script, so the script can be left unchanged.

Limiting the number of tasks that run at once

To throttle a job array by keeping only a certain number of tasks active at a time use the %N suffix where N is the number of active tasks. For example #SBATCH -a 1-200%5 will produce a 200 task job array with only 5 tasks active at any given time.

Note that although the symbol used is the % sign, N is the actual number of tasks allowed to run at once, not a percentage.
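
Before submitting, it can help to count how many tasks an array specification describes, for instance to compare against the per-user job limit. The sketch below is one way to do that. The function name array_task_count is hypothetical. It handles comma-separated values and x-y ranges with an optional %N suffix, and does not handle step syntax.

```shell
# Count the tasks described by an --array specification.
# Handles comma-separated values and x-y ranges, with an optional %N throttle.
array_task_count() {
    spec=${1%%[%]*}        # drop the %N throttle suffix, if any
    total=0
    old_ifs=$IFS
    IFS=,
    for part in $spec; do
        case $part in
            *-*) total=$(( total + ${part#*-} - ${part%-*} + 1 )) ;;
            *)   total=$(( total + 1 )) ;;
        esac
    done
    IFS=$old_ifs
    echo "$total"
}

array_task_count '1-200%5'           # 200
array_task_count '4,8,15,16,23,42'   # 6
```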

Using scontrol to modify throttling of running array jobs

If you want to change the number of simultaneous tasks of an active job, you can use scontrol: scontrol update ArrayTaskThrottle=<count> JobId=<jobID> e.g. scontrol update ArrayTaskThrottle=50 JobId=123456.

Set ArrayTaskThrottle=0 to eliminate any limit.

Note

Reducing the "ArrayTaskThrottle" count on a running job array will not affect the tasks that have already entered the "RUNNING" state. It will only prevent new tasks from starting until the number of running tasks drops below the new lower threshold.

Naming output and error files

SLURM uses the %A and %a replacement strings for the master job ID and task ID, respectively.

For example:

#SBATCH --output=Array_test.%A_%a.out
#SBATCH --error=Array_test.%A_%a.error

The error log is optional as both types of logs can be written to the 'output' log:

#SBATCH --output=Array_test.%A_%a.log

Note

If you only use %A in the log all array tasks will try to write to a single file. The performance of the run will approach zero asymptotically. Make sure to use both %A and %a in the log file name specification.
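
To preview the file names a pattern will produce, the substitution can be imitated outside of SLURM. This is only an illustration: the function name expand_log_name is hypothetical, and it handles just the %A and %a replacement strings.

```shell
# Show the file name that a pattern containing %A and %a would produce.
# Arguments: pattern, master job ID, array task ID.
expand_log_name() {
    printf '%s\n' "$1" | sed -e "s/%A/$2/g" -e "s/%a/$3/g"
}

expand_log_name 'Array_test.%A_%a.out' 1479 3    # Array_test.1479_3.out
expand_log_name 'Array_test.%A.out' 1479 3       # Array_test.1479.out for every task
```

The second call shows why a pattern without %a is a problem: every task ID maps to the same name.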

Using the array ID Index

SLURM will provide a $SLURM_ARRAY_TASK_ID variable to each task. It can be used inside the job script to handle input and output files for that task.

One common application of array jobs is to run many input files. While it is easy if the files are numbered as in the example above, this is not needed. If for example you have a folder of 100 files that end in .txt, you can use the following approach to get the name of the file for each task automatically:

file=$(ls *.txt | sed -n ${SLURM_ARRAY_TASK_ID}p)
myscript -in $file
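
An alternative that avoids parsing ls output, and so tolerates file names containing spaces, is to walk the glob directly. This is a sketch only: the function name nth_txt_file is hypothetical, and SLURM_ARRAY_TASK_ID is set by hand here to stand in for the value SLURM provides inside an array task.

```shell
# Print the Nth .txt file (1-based) in a directory, in sorted glob order.
# Arguments: directory, task ID. Returns 1 if there are fewer than N files.
nth_txt_file() {
    dir=$1
    id=$2
    i=0
    for f in "$dir"/*.txt; do
        [ -e "$f" ] || continue    # skip the literal pattern when nothing matches
        i=$(( i + 1 ))
        if [ "$i" -eq "$id" ]; then
            printf '%s\n' "$f"
            return 0
        fi
    done
    return 1
}

# Demonstration with three empty files in a temporary directory.
demo_dir=$(mktemp -d)
touch "$demo_dir/a.txt" "$demo_dir/b.txt" "$demo_dir/c.txt"
SLURM_ARRAY_TASK_ID=2
file=$(nth_txt_file "$demo_dir" "$SLURM_ARRAY_TASK_ID")
echo "$file"                       # path ending in b.txt
rm -r "$demo_dir"
```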

If, alternatively, you use an input file (e.g. 'input.list') with a list of samples/datasets (one per line) to process you can pick an item from the list as follows:

# Bash arrays are zero-indexed: use --array=0-(N-1) here, or subtract 1 from the task ID
SAMPLE_LIST=($(<input.list))
SAMPLE=${SAMPLE_LIST[${SLURM_ARRAY_TASK_ID}]}
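
If the array range starts at 1, selecting by line number keeps task 1 on the first entry of the list. The following is a minimal sketch using a temporary file in place of input.list. The function name nth_line is hypothetical, and SLURM_ARRAY_TASK_ID is set by hand for the demonstration.

```shell
# Print line N (1-based) of a list file, so task 1 gets the first sample.
# Arguments: list file, task ID.
nth_line() {
    sed -n "${2}p" "$1"
}

# Demonstration with a temporary list standing in for input.list.
list=$(mktemp)
printf '%s\n' sampleA sampleB sampleC > "$list"
SLURM_ARRAY_TASK_ID=1
SAMPLE=$(nth_line "$list" "$SLURM_ARRAY_TASK_ID")
echo "$SAMPLE"    # sampleA
rm "$list"
```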

Running many short tasks

While SLURM array jobs make it easy to run many similar tasks, if each task is short (seconds or even a few minutes), array jobs quickly bog down the scheduler and more time is spent managing jobs than actually doing any work for you. This also negatively impacts other users.

If you have hundreds or thousands of tasks, it is unlikely that a simple array job is the best solution. That does not mean that array jobs are not helpful in these cases, but that a little more thought needs to go into them for efficient use of the resources.

As an example, let's imagine I have 500 runs of a program to do, with each run taking about 30 seconds to complete. Rather than running an array job with 500 tasks, it would be much more efficient to run 5 tasks where each completes 100 runs. Here's a sample script to accomplish this by combining array jobs with bash loops. Create a submit file named SlurmArrayExample.sbatch inside the directory slurm_array_example with the following content:

#!/bin/bash
#SBATCH --job-name=SlurmArrayExample        # Job name
#SBATCH --account=gts-gburdell3             # charge account
#SBATCH --mail-type=ALL                     # Mail events (NONE, BEGIN, END, FAIL, ARRAY_TASKS, ALL)
#SBATCH --mail-user=gburdell3@gatech.edu    # Where to send mail
#SBATCH --nodes=1                           # Use one node
#SBATCH --ntasks=1                          # Run a single task
#SBATCH --mem-per-cpu=1gb                   # Memory per processor
#SBATCH --time=00:10:00                     # Time limit hrs:min:sec
#SBATCH --output=Report_%A-%a.out           # Standard output and error log
#SBATCH --array=1-5                         # Array range
# This is an example script that combines array tasks with
# bash loops to process many short runs. Array jobs are convenient
# for running lots of tasks, but if each task is short, they
# quickly become inefficient, taking more time to schedule than
# they spend doing any work and bogging down the scheduler for
# all users.

#Set the number of runs that each SLURM task should do
PER_TASK=100

# Calculate the starting and ending values for this task based
# on the SLURM task and the number of runs per task.
START_NUM=$(( ($SLURM_ARRAY_TASK_ID - 1) * $PER_TASK + 1 ))
END_NUM=$(( $SLURM_ARRAY_TASK_ID * $PER_TASK ))

# Print the task and run range
echo This is task $SLURM_ARRAY_TASK_ID, which will do runs $START_NUM to $END_NUM

# Run the loop of runs for this task.
for (( run=START_NUM; run<=END_NUM; run++ )); do
  echo This is SLURM task $SLURM_ARRAY_TASK_ID, run number $run
  #Do your stuff here
  #e.g. run test.py as array job from the Basic Python Example section above
  srun python test.py
done

date
  • When the submit file is ready, run the array job using sbatch from the slurm_array_example directory:
[gburdell3@login-phoenix-slurm-1 ~]$ cd slurm_array_example
[gburdell3@login-phoenix-slurm-1 slurm_array_example]$ sbatch SlurmArrayExample.sbatch
Submitted batch job 1479
  • After the array job has run, output like the following should appear in the file Report_<job id>-<array id>.out, created in the same directory:
[gburdell3@login-phoenix-slurm-1 slurm_array_example]$ vi Report_1479-1.out
---------------------------------------
Begin Slurm Prolog: Oct-07-2022 16:10:12
Job ID:    1480
User ID:   gburdell3
Account:   gts-gburdell3
Job name:  SlurmArrayExample
Partition: cpu-small
QOS:       inferno
---------------------------------------
This is task 1, which will do runs 1 to 100
Result of 2 ^ 2: 4
This is SLURM task 1, run number 2
Result of 2 ^ 2: 4
This is SLURM task 1, run number 3
Result of 2 ^ 2: 4
This is SLURM task 1, run number 4
...
...
---------------------------------------
Begin Slurm Epilog: Oct-07-2022 16:10:46
Job ID:        1480
Array Job ID:  1479_1
User ID:       gburdell3
Account:       gts-gburdell3
Job name:      SlurmArrayExample
Resources:     cpu=1,mem=1G,node=1
Rsrc Used:     cput=00:00:33,vmem=4512K,walltime=00:00:33,mem=4500K,energy_used=0
Partition:     cpu-small
QOS:           inferno
Nodes:         atl1-1-02-007-30-2
---------------------------------------
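
The range arithmetic in the script above assumes that the total number of runs divides evenly among the array tasks. A minimal sketch for the uneven case follows. TOTAL_RUNS and NUM_TASKS are hypothetical names that are not part of the script above, and the fallback task ID exists only so the snippet can be tried outside of a job.

```shell
#!/bin/bash
TOTAL_RUNS=450   # hypothetical total number of runs (not a multiple of PER_TASK)
PER_TASK=100     # runs handled by each array task

# Slurm sets SLURM_ARRAY_TASK_ID inside an array job; default to the last task here.
SLURM_ARRAY_TASK_ID=${SLURM_ARRAY_TASK_ID:-5}

# Number of array tasks needed, rounded up (would be used as --array=1-5 here).
NUM_TASKS=$(( (TOTAL_RUNS + PER_TASK - 1) / PER_TASK ))

# Same start/end calculation as in the script above.
START_NUM=$(( (SLURM_ARRAY_TASK_ID - 1) * PER_TASK + 1 ))
END_NUM=$(( SLURM_ARRAY_TASK_ID * PER_TASK ))

# Clamp the final task so it does not run past the last real run.
if (( END_NUM > TOTAL_RUNS )); then
  END_NUM=$TOTAL_RUNS
fi

echo "Task ${SLURM_ARRAY_TASK_ID} of ${NUM_TASKS} will do runs ${START_NUM} to ${END_NUM}"
```

With these values the last task handles only the remaining 50 runs instead of looping over run numbers that do not exist.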

Warning

Each array task is assigned its own unique job ID, but both the Job ID and the Array Job ID shown in the prolog and epilog are valid IDs when querying the results.

Deleting job arrays and tasks

To delete all of the tasks of an array job, use scancel with the job ID:

scancel 123456

To delete a single task, add the task ID:

scancel 123456_1
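
scancel also accepts a bracketed range of task IDs. The sketch below builds such a target and only prints the command by default; TASK_RANGE, TARGET, and DRY_RUN are hypothetical names introduced for illustration, and 123456 is the placeholder job ID used above.

```shell
#!/bin/bash
JOB_ID=123456        # placeholder array job ID, as in the commands above
TASK_RANGE="1-3"     # hypothetical range of task IDs to cancel

# Keep the target quoted so the shell does not treat the brackets as a glob pattern.
TARGET="${JOB_ID}_[${TASK_RANGE}]"

# DRY_RUN defaults to 1 so the command is only printed.
# Set DRY_RUN=0 on the cluster to actually cancel the tasks.
if [ "${DRY_RUN:-1}" = "1" ]; then
  echo "scancel ${TARGET}"
else
  scancel "${TARGET}"
fi
```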

GPU Jobs

Note

The default GPU type is the Nvidia Tesla V100 GPU. If you want to use an Nvidia Quadro RTX 6000 GPU, you must specify RTX_6000 (e.g. --gres=gpu:RTX_6000:1) or RTX6000 (e.g. -C RTX6000) as a GPU type.

Let's take a look at running a TensorFlow example on a GPU resource. A test example is available in the $TENSORFLOWGPUROOT directory.

Interactive GPU Example

For running GPUs in Slurm using an interactive job, follow the steps for Interactive Jobs to enter an interactive session:

GPUs on Slurm-Phoenix
  • Note that the GPU resource can be requested 2 different ways with salloc in interactive mode. For both approaches, the <gpu type> is optional, but on Phoenix you can specify one of several GPU types: Nvidia Tesla V100 16GB or 32GB GPU (V100), Nvidia Quadro RTX 6000 (RTX_6000), or Nvidia Tensor Core A100 40GB GPU (A100). If you need to request a GPU with a specific memory size (e.g. Nvidia Tesla V100 16GB or 32GB), you will need to use the -G, --gpus option with -C, --constraint:
    • --gres=gpu:<gpu type>:<number of gpus per node>. This specifies the number of GPUs per node.
      • --gres=gpu:V100:2 - Allocates 2 Nvidia Tesla V100 GPUs per node
      • --gres=gpu:RTX_6000:2 - Allocates 2 Nvidia Quadro RTX 6000 GPUs per node
      • --gres=gpu:A100:2 - Allocates 2 Nvidia Tensor Core A100 GPUs per node
    • -G, --gpus=<gpu type>:<total number of gpus>. This specifies the total number of GPUs for the job. Slurm requires a minimum of 1 GPU per node, so the total number of GPUs requested must be greater than or equal to the number of nodes requested.
      • -C, --constraint. This specifies features that include GPU type (including GPUs with a specific memory size, e.g. -C V100-32GB) and SAS storage (e.g. -C localSAS). For GPUs, the following -C, --constraint values are available:
        • -C V100-16GB - Nvidia Tesla V100 16GB
        • -C V100-32GB - Nvidia Tesla V100 32GB
        • -C gpu-v100 - Nvidia Tesla V100 16GB or 32GB
        • -C RTX6000 - Nvidia Quadro RTX 6000
        • -C gpu-rtx6000 - Nvidia Quadro RTX 6000
        • -C A100-40GB - Nvidia Tensor Core A100 40GB
        • -C gpu-a100 - Nvidia Tensor Core A100
      • With Slurm, users can also use the following variations of --gpus* for greater control over how GPUs are allocated:
        • --gpus-per-node=<gpu type>:<number of gpus> - Specify the number of GPUs required for the job on each node in the job resource allocation. More information for this option can be found for salloc or sbatch.
        • --gpus-per-socket=<gpu type>:<number of gpus> - Specify the number of GPUs required for the job on each socket in the job resource allocation. More information for this option can be found for salloc or sbatch.
        • --gpus-per-task=<gpu type>:<number of gpus> - Specify the number of GPUs required for the job on each task in the job resource allocation. More information for this option can be found for salloc or sbatch.
  • The scheduler is configured to assign a default number of cores per GPU, which depends on the GPU type, so there is no need to specify --ntasks-per-node in your request.
    • The defaults are 6 (RTX 6000), 12 (V100), and 32 (A100) cores per GPU.
  • We strongly recommend using --mem-per-gpu=<memory allocated per GPU> to allocate memory per GPU.
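
As an illustration of the second approach, the sketch below combines -G with -C to ask for one 32GB V100. It is written as sbatch directives so that it also runs as a plain bash script outside of a job; the same options can be passed to salloc. The job name and the GPUS_VISIBLE variable are made up for this example, and CUDA_VISIBLE_DEVICES is assumed to be set by Slurm for GPU allocations, which is typical but depends on the cluster configuration.

```shell
#!/bin/bash
#SBATCH -JGPURequestSketch          # Job name (hypothetical)
#SBATCH -Agts-gburdell3             # Charge account
#SBATCH -N1                         # One node
#SBATCH -G1 -C V100-32GB            # One GPU in total, restricted to 32GB V100s
#SBATCH --mem-per-gpu=12G           # Memory per GPU
#SBATCH -t15                        # Duration of the job (15 mins)
#SBATCH -qinferno                   # QOS name

# CUDA_VISIBLE_DEVICES is typically set by Slurm when GPUs are allocated.
# Fall back to "none" so the script also runs outside of a job.
GPUS_VISIBLE=${CUDA_VISIBLE_DEVICES:-none}
echo "GPUs visible to this job: ${GPUS_VISIBLE}"

# List the allocated GPU model and memory if the NVIDIA tools are on the PATH.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=name,memory.total --format=csv \
    || echo "nvidia-smi could not query a GPU"
fi
```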

  • First, start a Slurm interactive session with GPUs using the following command, which allocates 1 node with an Nvidia Tesla V100 GPU. Note you do not need to specify --ntasks-per-node because 6 cores (RTX6000), 12 cores (V100), or 32 cores (A100) are assigned per GPU by default.

[gburdell3@login-phoenix-slurm-1 ~]$ salloc -A gts-gburdell3 -N1 --mem-per-gpu=12G -qinferno -t0:15:00 --gres=gpu:V100:1
salloc: Pending job allocation 1484
salloc: job 1484 queued and waiting for resources
  • Next, once resources are granted, your job leaves the pending state and starts with the following prompt:
salloc: Granted job allocation 1484
salloc: Waiting for resource configuration
salloc: Nodes atl1-1-03-006-33-0 are ready for job
---------------------------------------
Begin Slurm Prolog: Oct-07-2022 16:10:57
Job ID:    1484
User ID:   gburdell3
Account:   gts-gburdell3
Job name:  interactive
Partition: gpu-v100
QOS:       inferno
---------------------------------------
[gburdell3@atl1-1-03-006-33-0 ~]$
  • Next, within your interactive session, load the tensorflow-gpu module and run the testgpu.py example:
[gburdell3@atl1-1-03-006-33-0 ~]$ cd slurm_gpu_example
[gburdell3@atl1-1-03-006-33-0 slurm_gpu_example]$ module load tensorflow-gpu/2.9.0
(/usr/local/pace-apps/manual/packages/tensorflow-gpu/2.9.0) [gburdell3@atl1-1-03-006-33-0 slurm_gpu_example]$ srun python $TENSORFLOWGPUROOT/testgpu.py gpu 1000
  • Finally, the sample output from the interactive session should be:
(/usr/local/pace-apps/manual/packages/tensorflow-gpu/2.9.0) [gburdell3@atl1-1-03-006-33-0 slurm_gpu_example]$ srun python $TENSORFLOWGPUROOT/testgpu.py gpu 1000
2022-10-07 16:34:20.000892: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2022-10-07 16:34:29.749228: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-10-07 16:34:30.358799: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 30987 MB memory:  -> device: 0, name: Tesla V100-PCIE-32GB, pci bus id: 0000:3b:00.0, compute capability: 7.0
Num GPUs Available:  1






tf.Tensor(250106050.0, shape=(), dtype=float32)
Shape: (1000, 1000) Device: /gpu:0
Time taken: 0:00:01.312392






(/usr/local/pace-apps/manual/packages/tensorflow-gpu/2.9.0) [gburdell3@atl1-1-03-006-33-0 slurm_gpu_example]$

Batch GPU Example

For running GPUs in Slurm using a batch job, follow the steps in Batch Jobs and Basic Python Example to set up and run a batch job:

  • First, create a directory named slurm_gpu_example:
[gburdell3@login-phoenix-slurm-1 ~]$ mkdir slurm_gpu_example
  • Next, create a batch script named SlurmBatchGPUExample.sbatch with the following content:
#!/bin/bash
#SBATCH -JGPUExample                                # Job name
#SBATCH -Agts-gburdell3                             # Charge account
#SBATCH -N1 --gres=gpu:V100:1                       # Number of nodes and GPUs required
#SBATCH --mem-per-gpu=12G                           # Memory per gpu
#SBATCH -t15                                        # Duration of the job (Ex: 15 mins)
#SBATCH -qinferno                                   # QOS name
#SBATCH -oReport-%j.out                             # Combined output and error messages file
#SBATCH --mail-type=BEGIN,END,FAIL                  # Mail preferences
#SBATCH --mail-user=gburdell3@gatech.edu            # e-mail address for notifications

cd $HOME/slurm_gpu_example                          # Change to working directory created in $HOME

module load tensorflow-gpu/2.9.0                    # Load module dependencies
srun python $TENSORFLOWGPUROOT/testgpu.py gpu 1000  # Run test example
  • Note that the GPU resource can be requested 2 different ways with sbatch in batch mode. The details of GPU resources for batch jobs are similar to those described for interactive jobs here.

  • Next, run the GPU batch job using sbatch in the slurm_gpu_example directory:

[gburdell3@login-phoenix-slurm-1 ~]$ cd slurm_gpu_example
[gburdell3@login-phoenix-slurm-1 slurm_gpu_example]$ sbatch SlurmBatchGPUExample.sbatch
Submitted batch job 1491
  • Finally, after the batch GPU job example has run, output like the following should appear in the file Report-<job id>.out, created in the same directory:
---------------------------------------
Begin Slurm Prolog: Oct-07-2022 16:10:07
Job ID:    1491
User ID:   gburdell3
Account:   gts-gburdell3
Job name:  GPUExample
Partition: gpu-v100
QOS:       inferno
---------------------------------------
2022-10-07 16:37:14.541726: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2022-10-07 16:37:24.629080: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-10-07 16:37:25.227169: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 30987 MB memory:  -> device: 0, name: Tesla V100-PCIE-32GB, pci bus id: 0000:3b:00.0, compute capability: 7.0
Num GPUs Available:  1






tf.Tensor(249757090.0, shape=(), dtype=float32)
Shape: (1000, 1000) Device: /gpu:0
Time taken: 0:00:01.302872






---------------------------------------
Begin Slurm Epilog: Oct-07-2022 16:10:26
Job ID:        1491
Array Job ID:  _4294967294
User ID:       gburdell3
Account:       gts-gburdell3
Job name:      GPUExample
Resources:     cpu=12,gres/gpu:v100=1,mem=12G,node=1
Rsrc Used:     cput=00:03:48,vmem=120K,walltime=00:00:19,mem=0,energy_used=0
Partition:     gpu-v100
QOS:           inferno
Nodes:         atl1-1-03-006-33-0
---------------------------------------

Local Disk Jobs

Every Phoenix node has local disk storage available for temporary use in a job, which is automatically cleared upon job completion. Some applications can benefit from this storage for faster I/O than network storage (home, project, and scratch). Standard Phoenix nodes have 1 TB of NVMe storage, while the "localSAS" nodes have 8 TB of SAS storage.

  • Use the ${TMPDIR} variable in your Slurm script or interactive session to access the temporary directory for your job on local disk, which is automatically created for every job.
  • When requesting a partial node, guarantee availability of local disk space with #SBATCH --tmp=<size>[units, default MB].
  • To request a node with SAS storage, add #SBATCH -C localSAS.
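
A minimal sketch of a job that stages its work on local disk is shown below. The job name, file names, and the seq/sort commands are placeholders standing in for a real workload; SLURM_SUBMIT_DIR is the directory the job was submitted from, and the fallbacks exist only so the script can also be tried outside of a job.

```shell
#!/bin/bash
#SBATCH -JLocalDiskSketch           # Job name (hypothetical)
#SBATCH -Agts-gburdell3             # Charge account
#SBATCH -N1 -n4                     # Partial node: 4 tasks on 1 node
#SBATCH --tmp=10G                   # Guarantee 10 GB of local disk space
#SBATCH -t15                        # Duration of the job (15 mins)
#SBATCH -qinferno                   # QOS name

# Results are copied back to the submit directory (or a temp dir outside a job).
RESULT_DIR=${SLURM_SUBMIT_DIR:-$(mktemp -d)}

# Inside a job, TMPDIR is the per-job directory on local disk.
# Create a private subdirectory there (or under /tmp outside a job).
SCRATCH=$(mktemp -d "${TMPDIR:-/tmp}/localdisk_sketch.XXXXXX")

# Placeholder workload: write an intermediate file, then reduce it to a result.
seq 1 1000 > "${SCRATCH}/intermediate.txt"
sort -n -r "${SCRATCH}/intermediate.txt" | head -n 1 > "${SCRATCH}/result.txt"

# Local disk is cleared when the job ends, so copy results back before exiting.
cp "${SCRATCH}/result.txt" "${RESULT_DIR}/result.txt"
RESULT=$(cat "${RESULT_DIR}/result.txt")
echo "Largest value: ${RESULT} (saved in ${RESULT_DIR})"
```

Keeping intermediate files on local disk and copying back only the final result limits traffic to network storage.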

AMD CPU Jobs

The new Phoenix-Slurm cluster also provides nodes with AMD CPUs (Dual AMD Epyc 7713 CPUs @ 2.0 GHz for a total of 128 cores/node) as part of PACE's goal to provide heterogeneous compute resources for our users.

  • To request a node with an AMD CPU, add #SBATCH -C amd.
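
A short sketch of a batch script that requests an AMD node and reports the CPU it landed on is shown below. The job name, task count, and time limit are placeholders, and reading /proc/cpuinfo assumes a Linux node.

```shell
#!/bin/bash
#SBATCH -JAMDSketch                 # Job name (hypothetical)
#SBATCH -Agts-gburdell3             # Charge account
#SBATCH -N1 -n4                     # 4 tasks on 1 node
#SBATCH -C amd                      # Request a node with AMD CPUs
#SBATCH -t5                         # Duration of the job (5 mins)
#SBATCH -qinferno                   # QOS name

# Read the CPU model from /proc/cpuinfo when it is available (Linux only).
if [ -r /proc/cpuinfo ]; then
  CPU_MODEL=$(grep -m1 'model name' /proc/cpuinfo | cut -d: -f2-)
fi
CPU_MODEL=${CPU_MODEL:-unknown}

# On an AMD node the model string is expected to mention AMD EPYC.
echo "Running on ${HOSTNAME:-unknown host} with CPU:${CPU_MODEL}"
```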