Updated 2023-07-05
Using Slurm on ICE¶
Accessing Computational Resources via Jobs¶
Request resources from the scheduler to be assigned space on a compute node. For all types of job submissions, the scheduler will assign space to you when it becomes available. Batch and interactive jobs both wait in the same queues for available space.
Tip
Visit our conversion guide to convert PBS scripts written before May 2023 into Slurm scripts.
Tip
For graphical interactive jobs, including Jupyter notebooks, use Open OnDemand.
See SchedMD's quick-reference guide for additional Slurm commands and options beyond those listed below. Further guidance on more advanced scripts appears in the user documentation on this page. The sections below cover each topic in detail; click a heading link to navigate.
Informational Commands ¶
squeue¶
Use squeue to check job status for pending (PD) and running (R) jobs. Many options are available to include with the command, including these:
- Add -j <job number> to show information about specific jobs. Separate multiple job numbers with a comma.
- Add -u <username> to show jobs belonging to a specific user, e.g., -u gburdell3.
- Add -p <partition> to see jobs submitted to a specific partition, e.g., -p ice-cpu.
- Add -q <QOS> to see jobs submitted to a specific QOS, e.g., -q coc-grade.
- Run man squeue or visit the squeue documentation page for more options.
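As a hypothetical combined query (the username is illustrative), the flags above can be stacked, here filtering one user's jobs on the ice-cpu partition by state; this requires a login session on the cluster:

```shell
$ squeue -u gburdell3 -p ice-cpu -t PD,R
```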
sacct¶
After a job has completed, use sacct to find information about it. Many of the same options as squeue are available:
- Add -j <job number> to find information about specific jobs.
- Add -u <username> to see all jobs belonging to a specific user.
- Add -X to show information only about the allocation, rather than the steps inside it.
- Add -S <time> to list only jobs after a specified time. Multiple time formats are accepted, including YYYY-MM-DD[THH:MM[:SS]], e.g., 2022-08-01T19:05:23.
- Add -o <fields> to specify which columns of data should appear in the output. Run sacct --helpformat to see a list of available fields.
- Run man sacct or visit the sacct documentation page for more options.
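As a sketch combining these options (the username and date are illustrative), the following would list a user's allocations since August 1 with a chosen set of columns; it must be run on the cluster:

```shell
$ sacct -u gburdell3 -X -S 2022-08-01 -o JobID,JobName,Partition,State,Elapsed,ExitCode
```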
scancel¶
To cancel a job, run scancel <job number>, e.g., scancel 1440 to cancel job 1440. You can use squeue to find the job number first.
pace-check-queue¶
The pace-check-queue utility provides an overview of the current utilization of each partition's nodes. Use the name of a specific partition as the input, e.g., pace-check-queue ice-cpu. On Slurm clusters, utilized and allocated local disk (including percent utilization) are not available.
- Add -s to see all features of each node in the partition.
- Add -c to color-code the "Accepting Jobs?" column.
pace-job-summary¶
The pace-job-summary utility provides a high-level overview of a job processed on the cluster. Usage is as follows:
$ pace-job-summary
Usage: `pace-job-summary <JobID>`
Output example:
$ pace-job-summary 2836
---------------------------------------
Begin Slurm Job Summary for 2836
Query Executed on 2022-08-17 at 18:21:33
---------------------------------------
Job ID: 2836
User ID: gburdell3
Job name: SlurmPythonExample
Resources: cpu=4,mem=4G,node=1
Rsrc Used: cput=00:00:08,vmem=0.8M,walltime=00:00:02,mem=0.0M,energy_used=0
Exit Code: 0:0
Partition: ice-cpu
Nodes: atl1-0-00-000-0-0
---------------------------------------
Batch Script for 2836
---------------------------------------
#!/bin/bash
#SBATCH -JSlurmPythonExample # Job name
#SBATCH -N1 -n4 # Number of nodes and cores required
#SBATCH --mem-per-cpu=1G # Memory per core
#SBATCH -t15 # Duration of the job (Ex: 15 mins)
#SBATCH -oReport-%j.out # Combined output and error messages file
#SBATCH --mail-type=BEGIN,END,FAIL # Mail preferences
#SBATCH --mail-user=gburdell3@gatech.edu # E-mail address for notifications
cd $SLURM_SUBMIT_DIR # Change to working directory
module load anaconda3/2022.05 # Load module dependencies
srun python test.py # Example Process
---------------------------------------
Job Submission ¶
Each job can request a maximum of 512 CPU hours and 16 GPU hours. The maximum walltime is 8 hours, unless further restricted by the CPU hour or GPU hour maximum.
Jobs that do not include a resource request will receive 1 core and 1 GB of memory/core for 1 hour.
Assignment of partitions and QOSs is generally handled automatically on ICE, so there's no need to specify them in most cases. All nodes are accessible to all students on ICE, and priority for different subsets of nodes is handled behind the scenes.
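As a sketch, a script that states these defaults explicitly (rather than relying on them) would carry directives like the following, mirroring the values above; the job name is hypothetical:

```bash
#!/bin/bash
#SBATCH -Jdefaults-example    # hypothetical job name
#SBATCH -n1                   # 1 core (the default)
#SBATCH --mem-per-cpu=1G      # 1 GB of memory per core (the default)
#SBATCH -t1:00:00             # 1 hour of walltime (the default)
srun hostname                 # trivial placeholder command
```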
(Optional) Students in multiple courses¶
Jobs submitted by students enrolled in multiple ICE courses, including both CoC and non-CoC courses, will default to prioritizing CoC nodes, but all ICE nodes are always accessible to all students. Optionally, you can add -q pace-ice to your sbatch or salloc directives to prioritize non-CoC nodes for work associated with another course. Adding -q coc-ice will maintain priority for CoC nodes. Use of these flags is optional.
Grading Priority¶
Instructors and TAs have access to a high-priority QOS for grading assignments. For CoC courses, add -q coc-grade to sbatch or salloc directives. For other courses, add -q pace-grade. Unlike ordinary jobs, these jobs can run for 12 hours of walltime, requesting a maximum of 768 CPU hours and 24 GPU hours each. Each instructor or TA can submit up to 10 jobs at a time to the grading QOS.
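For example, a TA grading a CoC course might submit a grading job as follows (the script name is hypothetical):

```shell
$ sbatch -q coc-grade grade_hw1.sbatch
```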
Job Submission Examples ¶
Interactive Jobs¶
Interactive jobs allow interactive use, so you can work "live" and provide additional input as your computations run. Please use interactive jobs instead of the login nodes for intensive computations. ICE offers both command-line interactive jobs and graphical interactive jobs with Open OnDemand (including Jupyter). Graphical interactive jobs are required if you need a graphical user interface (GUI).
We recommend using salloc to allocate resources for a command-line interactive job on a compute node. This is ideal for avoiding overuse of the login node while compiling and running test codes.
The number of nodes (--nodes or -N), CPU cores (--ntasks-per-node for cores per node or -n for total cores), and walltime requested (--time or -t, using the format D-HH:MM:SS for days, hours, minutes, and seconds) may be designated. Run man salloc or visit the salloc documentation page for more options.
In this example, use salloc to allocate 1 node with 4 cores for an interactive job:
$ salloc -N1 --ntasks-per-node=4 -t1:00:00
salloc: Pending job allocation 1464
salloc: job 1464 queued and waiting for resources
Once resources are granted, your job will start with the following prompt:
$ salloc -N1 --ntasks-per-node=4 -t1:00:00
salloc: Granted job allocation 1464
salloc: Waiting for resource configuration
salloc: Nodes atl1-1-02-007-30-2 are ready for job
---------------------------------------
Begin Slurm Prolog: Oct-07-2022 16:10:49
Job ID: 1464
User ID: gburdell3
Job name: interactive
Partition: ice-cpu
---------------------------------------
[gburdell3@atl0 ~]$
Once resources are available, you will be automatically logged into an interactive job on a compute node with the requested resources. Here, in this interactive session, use srun with hostname:
[gburdell3@atl0 ~]$ srun hostname
atl0.pace.gatech.edu
atl0.pace.gatech.edu
atl0.pace.gatech.edu
atl0.pace.gatech.edu
Note that there are 4 instances of the node hostname because we requested 1 node with 4 cores. To exit the interactive job, you can wait for the allotted time in your session to expire (in this example, 1 hour), or you can exit manually by running exit:
[gburdell3@atl0 ~]$ exit
exit
salloc: Relinquishing job allocation 1464
salloc: Job allocation 1464 has been revoked.
Batch Jobs¶
Batch jobs are for "submit and forget" workflows. Batch jobs are ideal for larger (many CPU) and longer (many hour) computations.
Write a Slurm script as a plain text file, then submit it with the sbatch command. Any computationally-intensive command should be prefixed with srun for best performance under Slurm.
- On PACE, you can use a text editor such as nano, vi, or emacs to create a plain text file. For beginners, nano is recommended. Type nano to launch it, or nano <filename> to open an existing file or create a new one.
- (Required) Start the script with #!/bin/bash.
- Name a job with #SBATCH -J <job name>.
- Include resource requests:
  - For requesting cores, we recommend 1 of 2 options:
    - #SBATCH -n or #SBATCH --ntasks specifies the number of cores for the entire job. The default is 1 core.
    - #SBATCH -N specifies the number of nodes, combined with #SBATCH --ntasks-per-node, which specifies the number of cores per node.
  - For requesting memory, we recommend 1 of 2 options:
    - For CPU-only jobs, use #SBATCH --mem-per-cpu=<request with units>, which specifies the amount of memory per core. To request all the memory on a node, include #SBATCH --mem=0. The default is 4 GB/core.
    - For GPU jobs, you can instead use #SBATCH --mem-per-gpu=<request with units>, which specifies the amount of memory per GPU.
- Request walltime with #SBATCH -t. Walltime requests should use the format D-HH:MM:SS for days, hours, minutes, and seconds. Alternatively, include just an integer that represents minutes. The default is 1 hour.
- Name your output file, which will include both STDOUT and STDERR, with #SBATCH -o <file name>.
- If you would like to receive email notifications, include #SBATCH --mail-type=<conditions>, where <conditions> is a comma-separated list drawn from NONE, BEGIN, END, FAIL, ARRAY_TASKS, and ALL, with only the conditions you prefer.
  - If you wish to use a non-default email address, add #SBATCH --mail-user=<preferred email>.
- When listing commands to run inside the job, any computationally-intensive command should be prefixed with srun for best performance.
- Run man sbatch or visit the sbatch documentation page for more options.
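The walltime bullet above notes that -t accepts D-HH:MM:SS or a bare integer of minutes. As a quick illustration (this is a hypothetical helper, not a PACE or Slurm utility), the accepted formats can be normalized to minutes like so:

```bash
#!/usr/bin/env bash
# Hypothetical helper: convert a walltime string in one of the formats accepted
# by #SBATCH -t (D-HH:MM:SS, HH:MM:SS, MM:SS, or a bare integer of minutes)
# to total minutes, rounding any seconds up.
walltime_to_minutes() {
  local t=$1 d=0 h=0 m=0 s=0 a b c
  if [[ $t == *-* ]]; then d=${t%%-*}; t=${t#*-}; fi   # split off the day count
  IFS=: read -r a b c <<< "$t"
  if [[ -n $c ]]; then h=$a; m=$b; s=$c                # HH:MM:SS
  elif [[ -n $b ]]; then m=$a; s=$b                    # MM:SS
  else m=$a; fi                                        # bare minutes
  echo $(( 10#$d*1440 + 10#$h*60 + 10#$m + (10#$s + 59) / 60 ))
}

walltime_to_minutes 1-02:30:00   # prints 1590
walltime_to_minutes 8:00:00      # prints 480
walltime_to_minutes 15           # prints 15
```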
Basic Python Example¶
- This guide provides a full runthrough of loading software and submitting a job.
- We'll load anaconda3 and run a simple Python script.
While logged into ICE, use a text editor such as nano, vi, or emacs to create the following Python script, and call it test.py:
#simple test script
result = 2 ** 2
print("Result of 2 ^ 2: {}".format(result))
Now, create a job submission script SlurmPythonExample.sbatch with the commands below:
#!/bin/bash
#SBATCH -JSlurmPythonExample # Job name
#SBATCH -N1 --ntasks-per-node=4 # Number of nodes and cores per node required
#SBATCH --mem-per-cpu=1G # Memory per core
#SBATCH -t15 # Duration of the job (Ex: 15 mins)
#SBATCH -oReport-%j.out # Combined output and error messages file
#SBATCH --mail-type=BEGIN,END,FAIL # Mail preferences
#SBATCH --mail-user=gburdell3@gatech.edu # E-mail address for notifications
cd $SLURM_SUBMIT_DIR # Change to working directory
module load anaconda3 # Load module dependencies
srun python test.py # Example Process
- Make sure that test.py and SlurmPythonExample.sbatch are in the same folder. It is important that you submit the job from this directory. $SLURM_SUBMIT_DIR is a variable that contains the path of the directory from which the job is submitted.
- module load anaconda3 loads Anaconda, which includes Python.
- srun python test.py runs the Python script. srun runs the program as many times as specified by the -n or --ntasks option. With just python test.py, the program will run only once.
You can submit the script by running sbatch SlurmPythonExample.sbatch from the command line. To check job status, use squeue -u gburdell3. To delete a job, use scancel <jobid>. Once the job has completed, you'll see a Report-<jobid>.out file, which contains the results of the job. It will look something like this:
#Output file
---------------------------------------
Begin Slurm Prolog: Oct-07-2022 16:10:04
Job ID: 1470
User ID: gburdell3
Job name: SlurmPythonExample
Partition: ice-cpu
---------------------------------------
Result of 2 ^ 2: 4
Result of 2 ^ 2: 4
Result of 2 ^ 2: 4
Result of 2 ^ 2: 4
---------------------------------------
Begin Slurm Epilog: Oct-07-2022 16:10:06
Job ID: 1470
Array Job ID: _4294967294
User ID: gburdell3
Job name: SlurmPythonExample
Resources: cpu=4,mem=4G,node=1
Rsrc Used: cput=00:00:12,vmem=8K,walltime=00:00:03,mem=0,energy_used=0
Partition: ice-cpu
Nodes: atl0
---------------------------------------
Choosing a CPU Architecture ¶
The cluster provides nodes with either Intel or AMD CPUs. By default, jobs are assigned to the first available resource.
- To request a node with an Intel CPU, add #SBATCH -C intel.
- To request a node with an AMD CPU, add #SBATCH -C amd.
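For example, a hypothetical interactive request pinned to Intel nodes might look like this (core count and walltime are arbitrary for the sketch):

```shell
$ salloc -N1 -n4 -t1:00:00 -C intel
```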
MPI Jobs ¶
Warning
Do not use mpirun or mpiexec with Slurm. Use srun instead.
You may want to run Message Passing Interface (MPI) jobs, which utilize a message-passing standard designed for parallel computing on the cluster.
In this set of examples, we will compile "hello world" MPI code from MPI Tutorial and run the program using srun.
To set up our environment for both MPI job examples, create a new directory and download the MPI code:
$ mkdir slurm_mpi_example
$ cd slurm_mpi_example
$ wget https://raw.githubusercontent.com/mpitutorial/mpitutorial/gh-pages/tutorials/mpi-hello-world/code/mpi_hello_world.c
Interactive MPI Example¶
For running MPI in Slurm using an interactive job, follow the steps for Interactive Jobs to enter an interactive session:
- First, as in the interactive job example, use salloc to allocate 2 nodes with 4 cores each for an interactive job:
$ salloc -N2 --ntasks-per-node=4 -t1:00:00
salloc: Pending job allocation 1471
salloc: job 1471 queued and waiting for resources
- Next, once resources are granted, your job will start with the following prompt:
salloc: job 1471 has been allocated resources
salloc: Granted job allocation 1471
salloc: Waiting for resource configuration
salloc: Nodes atl0,atl1 are ready for job
---------------------------------------
Begin Slurm Prolog: Oct-07-2022 16:10:09
Job ID: 1471
User ID: gburdell3
Job name: interactive
Partition: ice-cpu
---------------------------------------
[gburdell3@atl0 ~]$
- Next, within your interactive session, in the slurm_mpi_example directory created earlier with the mpi_hello_world.c example code, load the relevant modules and compile the MPI code using mpicc:
$ cd slurm_mpi_example
$ module load gcc/10.3.0 mvapich2/2.3.6
$ mpicc mpi_hello_world.c -o mpi_hello_world
- Next, run the MPI job using srun:
$ srun mpi_hello_world
- Finally, the following should be output from this interactive MPI example:
Hello world from processor atl0, rank 0 out of 8 processors
Hello world from processor atl0, rank 2 out of 8 processors
Hello world from processor atl0, rank 3 out of 8 processors
Hello world from processor atl1, rank 4 out of 8 processors
Hello world from processor atl1, rank 7 out of 8 processors
Hello world from processor atl0, rank 1 out of 8 processors
Hello world from processor atl1, rank 5 out of 8 processors
Hello world from processor atl1, rank 6 out of 8 processors
Batch MPI Example¶
For running MPI in Slurm using a batch job, follow the steps in Batch Jobs and Basic Python Example to set up and run a batch job.
- First, in the slurm_mpi_example directory created earlier with the mpi_hello_world.c example code, create a file named SlurmBatchMPIExample.sbatch with the following content:
#!/bin/bash
#SBATCH -JSlurmBatchMPIExample # Job name
#SBATCH -N2 --ntasks-per-node=4 # Number of nodes and cores per node required
#SBATCH --mem-per-cpu=1G # Memory per core
#SBATCH -t1:00:00 # Duration of the job (Ex: 1 hour)
#SBATCH -oReport-%j.out # Combined output and error messages file
#SBATCH --mail-type=BEGIN,END,FAIL # Mail preferences
#SBATCH --mail-user=gburdell3@gatech.edu # E-mail address for notifications
cd $HOME/slurm_mpi_example # Change to working directory created in $HOME
# Compile MPI Code
module load gcc/10.3.0 mvapich2/2.3.6
mpicc mpi_hello_world.c -o mpi_hello_world
# Run MPI Code
srun mpi_hello_world
- This batch file combines the configuration for the Slurm batch job submission, the compilation of the MPI code, and running the MPI job using srun.
- Next, run the MPI batch job using sbatch in the slurm_mpi_example directory:
$ cd slurm_mpi_example
$ sbatch SlurmBatchMPIExample.sbatch
Submitted batch job 1473
- This example should not take long, but it may take time to start depending on how busy the cluster is.
- Finally, after the batch MPI job example has run, the following output should appear in a file named Report-<job id>.out in the same directory:
---------------------------------------
Begin Slurm Prolog: Oct-07-2022 16:10:09
Job ID: 1473
User ID: gburdell3
Job name: SlurmBatchMPIExample
Partition: ice-cpu
---------------------------------------
Hello world from processor atl0, rank 0 out of 8 processors
Hello world from processor atl0, rank 2 out of 8 processors
Hello world from processor atl0, rank 3 out of 8 processors
Hello world from processor atl1, rank 4 out of 8 processors
Hello world from processor atl1, rank 7 out of 8 processors
Hello world from processor atl0, rank 1 out of 8 processors
Hello world from processor atl1, rank 5 out of 8 processors
Hello world from processor atl1, rank 6 out of 8 processors
---------------------------------------
Begin Slurm Epilog: Oct-07-2022 16:10:11
Job ID: 1473
Array Job ID: _4294967294
User ID: gburdell3
Job name: SlurmBatchMPIExample
Resources: cpu=8,mem=8G,node=2
Rsrc Used: cput=00:00:16,vmem=1104K,walltime=00:00:02,mem=0,energy_used=0
Partition: ice-cpu
Nodes: atl0,atl1
---------------------------------------
GPU Jobs ¶
Note
By default, your job will be assigned to the first available Nvidia GPU. If you want to use a specific Nvidia architecture, or if you wish to use an AMD GPU, you must specify the type.
Requesting GPUs¶
- Note that the GPU resource can be requested in 2 different ways. For both approaches, the <gpu type> is optional and needed only when a specific architecture is required.
  - --gres=gpu:<gpu type>:<number of gpus per node> specifies GPUs per node: the number provided is the number of GPUs on each node.
  - -G, --gpus=<gpu type>:<total number of gpus> specifies GPUs per job: the number provided is the total number of GPUs. Slurm requires a minimum of 1 GPU per node, so the total number of GPUs requested must be greater than or equal to the number of nodes requested.
Examples for requesting 1 GPU:
- Nvidia Tesla V100 (maximum 4 per node)
  - --gres=gpu:V100:1 or -G V100:1 for any V100
  - --gres=gpu:1 -C V100-16GB or -G1 -C V100-16GB for a V100 with 16 GB of memory
  - --gres=gpu:1 -C V100-32GB or -G1 -C V100-32GB for a V100 with 32 GB of memory
- Nvidia Quadro Pro RTX6000 (note the underscore in some syntax; maximum 4 per node)
  - --gres=gpu:RTX_6000:1 or -G RTX_6000:1 or --gres=gpu:1 -C RTX6000 or -G 1 -C RTX6000
- Nvidia A40 (maximum 2 per node, with AMD CPUs)
  - --gres=gpu:A40:1 or -G A40:1 or --gres=gpu:1 -C A40 or -G 1 -C A40
- Nvidia A100 (maximum 2 per node, with AMD CPUs)
  - --gres=gpu:A100:1 or -G A100:1 for any A100
  - --gres=gpu:1 -C A100-40GB or -G 1 -C A100-40GB for an A100 with 40 GB of memory
  - --gres=gpu:1 -C A100-80GB or -G 1 -C A100-80GB for an A100 with 80 GB of memory
- AMD MI210 (maximum 2 per node, with AMD CPUs)
  - --gres=gpu:MI210:1 or -G MI210:1 or --gres=gpu:1 -C MI210 or -G 1 -C MI210
  - For details, see Using AMD GPUs
Memory can be requested with --mem-per-cpu or --mem-per-gpu.
With Slurm, users can also take advantage of the following variations of --gpus* for greater control over how GPUs are allocated:
- --gpus-per-node=<gpu type>:<number of gpus> specifies the number of GPUs required on each node in the job resource allocation. More information for this option can be found for salloc or sbatch.
- --gpus-per-socket=<gpu type>:<number of gpus> specifies the number of GPUs required on each socket in the job resource allocation. More information for this option can be found for salloc or sbatch.
- --gpus-per-task=<gpu type>:<number of gpus> specifies the number of GPUs required for each task in the job resource allocation. More information for this option can be found for salloc or sbatch.
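To illustrate the difference between the per-node and per-job styles described above, the directives might appear in a script as follows; the node and GPU counts are arbitrary choices for the sketch, and a real script would use one style or the other:

```bash
# Per-node style: 1 V100 on each of 2 nodes (2 GPUs total).
#SBATCH -N2 --gres=gpu:V100:1

# Per-job style: 2 V100s total for the job; with 2 nodes,
# Slurm places at least 1 GPU on each node.
#SBATCH -N2 --gpus=V100:2
```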
Let's run a TensorFlow example on a GPU resource. A test example is provided in the $TENSORFLOWGPUROOT directory.
Interactive GPU Example¶
For running GPUs in Slurm using an interactive job, follow the steps for Interactive Jobs to enter an interactive session:
- First, start a Slurm interactive session with GPUs using the following command, allocating 1 node with an Nvidia Tesla V100 GPU:
$ salloc -N1 --mem-per-gpu=12G -t0:15:00 --gres=gpu:V100:1 --ntasks-per-node=6
salloc: Pending job allocation 1484
salloc: job 1484 queued and waiting for resources
- Next, once resources are granted, your job will start with the following prompt:
salloc: Granted job allocation 1484
salloc: Waiting for resource configuration
salloc: Nodes atl0 are ready for job
---------------------------------------
Begin Slurm Prolog: Oct-07-2022 16:10:57
Job ID: 1484
User ID: gburdell3
Job name: interactive
Partition: ice-gpu
---------------------------------------
[gburdell3@atl0 ~]$
- Next, within your interactive session, load the tensorflow-gpu module and run the testgpu.py example:
$ cd slurm_gpu_example
$ module load tensorflow-gpu/2.9.0
$ srun python $TENSORFLOWGPUROOT/testgpu.py gpu 1000
- Finally, the sample output from the interactive session should be:
$ srun python $TENSORFLOWGPUROOT/testgpu.py gpu 1000
2022-10-07 16:34:20.000892: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2022-10-07 16:34:29.749228: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-10-07 16:34:30.358799: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 30987 MB memory: -> device: 0, name: Tesla V100-PCIE-32GB, pci bus id: 0000:3b:00.0, compute capability: 7.0
Num GPUs Available: 1
tf.Tensor(250106050.0, shape=(), dtype=float32)
Shape: (1000, 1000) Device: /gpu:0
Time taken: 0:00:01.312392
$
Batch GPU Example¶
For running GPUs in Slurm using a batch job, follow the steps in Batch Jobs and Basic Python Example to set up and run a batch job:
- First, create a directory named slurm_gpu_example:
$ mkdir slurm_gpu_example
- Next, create a batch script named SlurmBatchGPUExample.sbatch with the following content:
#!/bin/bash
#SBATCH -JGPUExample # Job name
#SBATCH -N1 --gres=gpu:V100:1 --ntasks-per-node=6 # Number of nodes, GPUs, and cores required
#SBATCH --mem-per-gpu=12G # Memory per gpu
#SBATCH -t15 # Duration of the job (Ex: 15 mins)
#SBATCH -oReport-%j.out # Combined output and error messages file
#SBATCH --mail-type=BEGIN,END,FAIL # Mail preferences
#SBATCH --mail-user=gburdell3@gatech.edu # e-mail address for notifications
cd $HOME/slurm_gpu_example # Change to working directory created in $HOME
module load tensorflow-gpu/2.9.0 # Load module dependencies
srun python $TENSORFLOWGPUROOT/testgpu.py gpu 1000 # Run test example
- Note that the GPU resource can be requested in 2 different ways with sbatch in batch mode. The details for GPU resources in batch jobs are similar to those for interactive jobs above.
- Next, run the GPU batch job using sbatch in the slurm_gpu_example directory:
$ cd slurm_gpu_example
$ sbatch SlurmBatchGPUExample.sbatch
Submitted batch job 1491
- Finally, after the batch GPU job example has run, the following output should appear in a file named Report-<job id>.out in the same directory:
---------------------------------------
Begin Slurm Prolog: Oct-07-2022 16:10:07
Job ID: 1491
User ID: gburdell3
Job name: GPUExample
Partition: gpu-v100
---------------------------------------
2022-10-07 16:37:14.541726: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2022-10-07 16:37:24.629080: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-10-07 16:37:25.227169: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 30987 MB memory: -> device: 0, name: Tesla V100-PCIE-32GB, pci bus id: 0000:3b:00.0, compute capability: 7.0
Num GPUs Available: 1
tf.Tensor(249757090.0, shape=(), dtype=float32)
Shape: (1000, 1000) Device: /gpu:0
Time taken: 0:00:01.302872
---------------------------------------
Begin Slurm Epilog: Oct-07-2022 16:10:26
Job ID: 1491
Array Job ID: _4294967294
User ID: gburdell3
Job name: GPUExample
Resources: cpu=12,gres/gpu:v100=1,mem=12G,node=1
Rsrc Used: cput=00:03:48,vmem=120K,walltime=00:00:19,mem=0,energy_used=0
Partition: ice-gpu
Nodes: atl0
---------------------------------------
Using AMD GPUs ¶
- The AMD GPUs can be monitored with the rocm-smi command.
- When compiling for these GPUs, it is essential to specify the architecture, or an error will occur. With the hipcc compiler, use hipcc --offload-arch=gfx90a.
- An example vectoradd_hip.cpp code can be found on AMD's site. make can be used if preferred; CMake can also be used.
Local Disk Jobs ¶
Every ICE compute node has local disk storage available for temporary use in a job, which is automatically cleared upon job completion. Some applications can benefit from this storage for faster I/O than network storage (home and scratch). Most ICE CPU nodes and some GPU nodes have large NVMe local disks, while a few have SAS storage. See ICE resources for details.
- Use the ${TMPDIR} variable in your Slurm script or interactive session to access the temporary directory for your job on local disk, which is automatically created for every job.
- When requesting a partial node, guarantee availability of local disk space with #SBATCH --tmp=<size>[units, default MB].
- To request a node with SAS storage, add #SBATCH -C localSAS.
- To request a node with NVMe storage, add #SBATCH -C localNVMe.
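Putting these pieces together, a sketch of a batch script that stages data through the job-local directory might look like this; the input, output, and program names are hypothetical:

```bash
#!/bin/bash
#SBATCH -Jlocal-disk-example            # hypothetical job name
#SBATCH -N1 -n4 -t1:00:00
#SBATCH --tmp=100G                      # guarantee 100 GB of local disk
#SBATCH -oReport-%j.out

cp $HOME/inputs/data.in ${TMPDIR}/      # stage input onto fast local disk
cd ${TMPDIR}
srun my_program data.in                 # hypothetical I/O-heavy program
cp ${TMPDIR}/results.out $HOME/outputs/ # copy results back before the job ends;
                                        # ${TMPDIR} is cleared automatically afterward
```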