Updated 2023-07-05

Using Slurm on ICE

Accessing Computational Resources via Jobs

Request resources from the scheduler to be assigned space on a compute node. For all types of job submissions, the scheduler will assign space to you when it becomes available. Batch and interactive jobs both wait in the same queues for available space.

Tip

Visit our conversion guide to convert PBS scripts written before May 2023 to Slurm scripts.

Tip

For graphical interactive jobs, including Jupyter notebooks, use Open OnDemand.

View this very useful guide from SchedMD for additional Slurm commands and options beyond those listed below. Further guidelines on more advanced scripts are available in the user documentation. The sections below are covered in detail on this page; click a link to navigate:

  1. Informational Commands
  2. Job Submission
  3. Job Submission Examples

Informational Commands

squeue

Use squeue to check job status for pending (PD) and running (R) jobs. Many options can be added to the command, including the following:

  • Add -j <job number> to show information about specific jobs. Separate multiple job numbers with a comma.
  • Add -u <username> to show jobs belonging to a specific user, e.g., -u gburdell3.
  • Add -p <partition> to see jobs submitted to a specific partition, e.g., -p ice-cpu.
  • Add -q <QOS> to see jobs submitted to a specific QOS, e.g., -q coc-grade.
  • Run man squeue or visit the squeue documentation page for more options.
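
For example, to list your own jobs on the ice-cpu partition (using the example username from this guide), you might combine the options above like this:

$ squeue -u gburdell3 -p ice-cpu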

sacct

After a job has completed, use sacct to find information about it. Many of the same options for squeue are available.

  • Add -j <job number> to find information about specific jobs.
  • Add -u <username> to see all jobs belonging to a specific user.
  • Add -X to show information only about the allocation itself, rather than the steps inside it.
  • Add -S <time> to list only jobs after a specified time. Multiple time formats are accepted, including YYYY-MM-DD[THH:MM[:SS]], e.g., 2022-08-01T19:05:23.
  • Add -o <fields> to specify which columns of data should appear in the output. Run sacct --helpformat to see a list of available fields.
  • Run man sacct or visit the sacct documentation page for more options.
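
For example, to list only the allocations for your jobs since August 1, 2022, with a few common output columns (JobID, JobName, Partition, State, and Elapsed are standard sacct field names):

$ sacct -X -u gburdell3 -S 2022-08-01 -o JobID,JobName,Partition,State,Elapsed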

scancel

To cancel a job, run scancel <job number>, e.g., scancel 1440 to cancel job 1440. You can use squeue to find the job number first.
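
scancel also accepts a username, which cancels all of that user's jobs at once, e.g.:

$ scancel -u gburdell3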

pace-check-queue

The pace-check-queue utility provides an overview of current utilization of each partition's nodes. Use the name of a specific partition as the input, e.g., pace-check-queue ice-cpu. On Slurm clusters, utilized and allocated local disk (including percent utilization) are not available.

  • Add -s to see all features of each node in the partition.
  • Add -c to color-code the "Accepting Jobs?" column.

pace-job-summary

The pace-job-summary utility provides a high-level overview of a job processed on the cluster. Usage is as follows:

$ pace-job-summary
Usage: `pace-job-summary <JobID>`

Output example:

$ pace-job-summary 2836
---------------------------------------
Begin Slurm Job Summary for 2836
Query Executed on 2022-08-17 at 18:21:33
---------------------------------------
Job ID:     2836
User ID:    gburdell3
Job name:   SlurmPythonExample
Resources:  cpu=4,mem=4G,node=1
Rsrc Used:  cput=00:00:08,vmem=0.8M,walltime=00:00:02,mem=0.0M,energy_used=0
Exit Code:  0:0
Partition:  ice-cpu
Nodes:      atl1-0-00-000-0-0
---------------------------------------
Batch Script for 2836
---------------------------------------
#!/bin/bash
#SBATCH -JSlurmPythonExample                    # Job name
#SBATCH -N1 -n4                                 # Number of nodes and cores required
#SBATCH --mem-per-cpu=1G                        # Memory per core
#SBATCH -t15                                    # Duration of the job (Ex: 15 mins)
#SBATCH -oReport-%j.out                         # Combined output and error messages file
#SBATCH --mail-type=BEGIN,END,FAIL              # Mail preferences
#SBATCH --mail-user=gburdell3@gatech.edu        # E-mail address for notifications
cd $SLURM_SUBMIT_DIR                            # Change to working directory

module load anaconda3/2022.05                   # Load module dependencies
srun python test.py                             # Example Process
---------------------------------------

Job Submission

Each job can request a maximum of 512 CPU hours and 16 GPU hours. The maximum walltime is 8 hours, unless further restricted by the CPU hour or GPU hour maximum.

Jobs that do not include a resource request will receive 1 core and 1 GB of memory/core for 1 hour.

Assignment of partitions and QOSs is generally handled automatically on ICE, so there's no need to specify them in most cases. All nodes are accessible to all students on ICE, and priority for different subsets of nodes is handled behind the scenes.

(Optional) Students in multiple courses

Jobs submitted by students enrolled in multiple ICE courses, including both CoC and non-CoC courses, default to prioritizing CoC nodes, but all ICE nodes are always accessible to all students. You can add -q pace-ice to your sbatch or salloc directives to prioritize non-CoC nodes for work associated with another course, or -q coc-ice to maintain priority for CoC nodes. Use of these flags is optional.
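
For example, to prioritize non-CoC nodes for a batch job (the script name here is only illustrative):

$ sbatch -q pace-ice my_homework.sbatch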

Grading Priority

Instructors and TAs have access to a high-priority QOS for grading assignments. For CoC courses, add -q coc-grade to sbatch or salloc directives. For other courses, add -q pace-grade. Unlike ordinary jobs, these jobs can run for 12 hours of walltime, requesting a maximum of 768 CPU hours and 24 GPU hours each. Each instructor or TA can submit up to 10 jobs at a time to the grading QOS.
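
For example, an interactive grading session prioritizing CoC nodes could be started with salloc (the core count here is only illustrative, and the walltime uses the full 12-hour limit):

$ salloc -q coc-grade -N1 --ntasks-per-node=4 -t12:00:00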

Job Submission Examples

Interactive Jobs

Interactive jobs let you work "live" on a compute node and provide additional input as your computations run. Please use interactive jobs instead of the login nodes for intensive computations. ICE offers both command-line interactive jobs and graphical interactive jobs with Open OnDemand (including Jupyter). Graphical interactive jobs are required if you need a graphical user interface (GUI).

We recommend using the salloc command to allocate resources for a command-line interactive job, which lets you work on the command line of a compute node. This is ideal for avoiding overuse of the login nodes while compiling and running test code.

The number of nodes (--nodes or -N), CPU cores (--ntasks-per-node for cores per node or -n for total cores), and wall time requested (--time or -t using the format D-HH:MM:SS for days, hours, minutes, and seconds) may be designated. Run man salloc or visit the salloc documentation page for more options.

In this example, use salloc to allocate 1 node with 4 cores for an interactive job:

$ salloc -N1 --ntasks-per-node=4 -t1:00:00
salloc: Pending job allocation 1464
salloc: job 1464 queued and waiting for resources

After a period in pending status, your job will start once resources are granted, with a prompt like the following:

$ salloc -N1 --ntasks-per-node=4 -t1:00:00
salloc: Granted job allocation 1464
salloc: Waiting for resource configuration
salloc: Nodes atl1-1-02-007-30-2 are ready for job
---------------------------------------
Begin Slurm Prolog: Oct-07-2022 16:10:49
Job ID:    1464
User ID:   gburdell3
Job name:  interactive
Partition: ice-cpu
---------------------------------------
[gburdell3@atl0 ~]$

Once resources are available for the job, you will be automatically moved from the login node into an interactive session on a compute node with the requested resources. In this interactive session, use srun with hostname:

[gburdell3@atl0 ~]$ srun hostname
atl0.pace.gatech.edu
atl0.pace.gatech.edu
atl0.pace.gatech.edu
atl0.pace.gatech.edu

Note that there are 4 instances of the node hostname because we requested 1 node with 4 cores. To exit the interactive job, you can wait for the allotted time to expire in your session (in this example, 1 hour) or you can exit manually using exit:

[gburdell3@atl0 ~]$ exit
exit
salloc: Relinquishing job allocation 1464
salloc: Job allocation 1464 has been revoked.

Batch Jobs

Batch jobs are for "submit and forget" workflows. Batch jobs are ideal for larger (many CPU) and longer (many hour) computations.

Write a Slurm script as a plain text file, then submit it with the sbatch command. Any computationally-intensive command should be prefixed with srun for best performance using Slurm.

  • On PACE, you can use a text editor such as nano, vi, or emacs to create a plain text file. For beginners, nano is recommended. Type the command nano to launch it. Type nano <filename> to open an existing file or create a new one.
  • (Required) Start the script with #!/bin/bash.
  • Name a job with #SBATCH -J <job name>.
  • Include resource requests:
    • For requesting cores, we recommend 1 of 2 options:
      1. #SBATCH -n or #SBATCH --ntasks specifies the number of cores for the entire job. The default is 1 core.
      2. #SBATCH -N specifies the number of nodes, combined with #SBATCH --ntasks-per-node, which specifies the number of cores per node.
    • For requesting memory, we recommend 1 of 2 options:
      1. For CPU-only jobs, use #SBATCH --mem-per-cpu=<request with units>, which specifies the amount of memory per core. To request all the memory on a node, include #SBATCH --mem=0. The default is 4 GB/core.
      2. For GPU jobs, you can instead use #SBATCH --mem-per-gpu=<request with units>, which specifies the amount of memory per GPU.
  • Request walltime with #SBATCH -t, using the format D-HH:MM:SS for days, hours, minutes, and seconds requested. Alternatively, include just an integer that represents minutes. The default is 1 hour.
  • Name your output file, which will include both STDOUT and STDERR, with #SBATCH -o <file name>.
  • If you would like to receive email notifications, include #SBATCH --mail-type=<conditions>, listing only the conditions you prefer from NONE, BEGIN, END, FAIL, ARRAY_TASKS, and ALL.
    • If you wish to use a non-default email address, add #SBATCH --mail-user=<preferred email>.
  • When listing commands to run inside the job, any computationally-intensive command should be prefixed with srun for best performance.
  • Run man sbatch or visit the sbatch documentation page for more options.
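
The directives above can be assembled into a minimal script skeleton. The sketch below uses the alternative forms described above (a total core count with -n, a full-node memory request with --mem=0, and a D-HH:MM:SS walltime); the job name and final command are placeholders, and a complete worked example follows in the next section:

#!/bin/bash
#SBATCH -Jskeleton_example                      # Job name (placeholder)
#SBATCH -N1 -n8                                 # 1 node, 8 cores total
#SBATCH --mem=0                                 # Request all memory on the node
#SBATCH -t0-02:00:00                            # Walltime of 0 days, 2 hours
#SBATCH -oReport-%j.out                         # Combined output and error messages file
#SBATCH --mail-type=END,FAIL                    # Mail preferences
cd $SLURM_SUBMIT_DIR                            # Change to working directory

srun ./my_program                               # Placeholder for your computationally-intensive command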

Basic Python Example

  • This guide provides a full runthrough of loading software and submitting a job.
  • In this example, we'll load anaconda3 and run a simple Python script.

While logged into ICE, use a text editor such as nano, vi, or emacs to create the following Python script and name it test.py:

#simple test script
result = 2 ** 2
print("Result of 2 ^ 2: {}".format(result))

Now, create a job submission script SlurmPythonExample.sbatch with the commands below:

#!/bin/bash
#SBATCH -JSlurmPythonExample                    # Job name
#SBATCH -N1 --ntasks-per-node=4                 # Number of nodes and cores per node required
#SBATCH --mem-per-cpu=1G                        # Memory per core
#SBATCH -t15                                    # Duration of the job (Ex: 15 mins)
#SBATCH -oReport-%j.out                         # Combined output and error messages file
#SBATCH --mail-type=BEGIN,END,FAIL              # Mail preferences
#SBATCH --mail-user=gburdell3@gatech.edu        # E-mail address for notifications
cd $SLURM_SUBMIT_DIR                            # Change to working directory

module load anaconda3                           # Load module dependencies
srun python test.py                             # Example Process
  • Make sure that test.py and SlurmPythonExample.sbatch are in the same folder, and submit the job from that directory. $SLURM_SUBMIT_DIR is a variable containing the path to the directory from which the job was submitted.
  • module load anaconda3 loads anaconda3, which includes python.
  • srun python test.py runs the Python script. srun runs the program as many times as specified by the -n or --ntasks option. If the line were just python test.py, the program would run only once.

You can submit the script by running sbatch SlurmPythonExample.sbatch from the command line. To check job status, use squeue -u gburdell3. To delete a job, use scancel <jobid>. Once the job has completed, you'll see a Report-<jobid>.out file containing the results of the job. It will look something like this:

#Output file
---------------------------------------
Begin Slurm Prolog: Oct-07-2022 16:10:04
Job ID:    1470
User ID:   gburdell3
Job name:  SlurmPythonExample
Partition: ice-cpu
---------------------------------------
Result of 2 ^ 2: 4
Result of 2 ^ 2: 4
Result of 2 ^ 2: 4
Result of 2 ^ 2: 4
---------------------------------------
Begin Slurm Epilog: Oct-07-2022 16:10:06
Job ID:        1470
Array Job ID:  _4294967294
User ID:       gburdell3
Job name:      SlurmPythonExample
Resources:     cpu=4,mem=4G,node=1
Rsrc Used:     cput=00:00:12,vmem=8K,walltime=00:00:03,mem=0,energy_used=0
Partition:     ice-cpu
Nodes:         atl0
---------------------------------------

Choosing a CPU Architecture

The cluster provides nodes with either Intel or AMD CPUs. By default, jobs are assigned to the first available resource.

  • To request a node with an Intel CPU, add #SBATCH -C intel.
  • To request a node with an AMD CPU, add #SBATCH -C amd.
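
For example, to request an Intel node for the interactive session shown earlier, add the constraint to the salloc command:

$ salloc -N1 --ntasks-per-node=4 -t1:00:00 -C intel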

MPI Jobs

Warning

Do not use mpirun or mpiexec with Slurm. Use srun instead.

You may want to run Message Passing Interface (MPI) jobs, which utilize a message-passing standard designed for parallel computing on the cluster.

In this set of examples, we will compile "hello world" MPI code from MPI Tutorial and run the program using srun.

To set up our environment for both MPI job examples, follow these steps to create a new directory and download the MPI code:

$ mkdir slurm_mpi_example
$ cd slurm_mpi_example
$ wget https://raw.githubusercontent.com/mpitutorial/mpitutorial/gh-pages/tutorials/mpi-hello-world/code/mpi_hello_world.c

Interactive MPI Example

For running MPI in Slurm using an interactive job, follow the steps for Interactive Jobs to enter an interactive session:

  • First, as in the interactive job example, use salloc, this time allocating 2 nodes with 4 cores each for an interactive job:
$ salloc -N2 --ntasks-per-node=4 -t1:00:00
salloc: Pending job allocation 1471
salloc: job 1471 queued and waiting for resources
  • Next, after a period in pending status, your job will start once resources are granted, with a prompt like the following:
salloc: job 1471 has been allocated resources
salloc: Granted job allocation 1471
salloc: Waiting for resource configuration
salloc: Nodes atl0,atl1 are ready for job
---------------------------------------
Begin Slurm Prolog: Oct-07-2022 16:10:09
Job ID:    1471
User ID:   gburdell3
Job name:  interactive
Partition: ice-cpu
---------------------------------------
[gburdell3@atl0 ~]$
  • Next, within your interactive session and in the slurm_mpi_example directory created earlier with the mpi_hello_world.c example code, load the relevant modules and compile the MPI code using mpicc:
$ cd slurm_mpi_example
$ module load gcc/10.3.0 mvapich2/2.3.6
$ mpicc mpi_hello_world.c -o mpi_hello_world
  • Next, run the MPI job using srun:
$ srun mpi_hello_world
  • Finally, the following should be output from this interactive MPI example:
Hello world from processor atl0, rank 0 out of 8 processors
Hello world from processor atl0, rank 2 out of 8 processors
Hello world from processor atl0, rank 3 out of 8 processors
Hello world from processor atl1, rank 4 out of 8 processors
Hello world from processor atl1, rank 7 out of 8 processors
Hello world from processor atl0, rank 1 out of 8 processors
Hello world from processor atl1, rank 5 out of 8 processors
Hello world from processor atl1, rank 6 out of 8 processors

Batch MPI Example

For running MPI in Slurm using a batch job, follow the steps in Batch Jobs and Basic Python Example to set up and run a batch job.

  • First, in the slurm_mpi_example directory created earlier with the mpi_hello_world.c example code, create a file named SlurmBatchMPIExample.sbatch with the following content:
#!/bin/bash
#SBATCH -JSlurmBatchMPIExample                  # Job name
#SBATCH -N2 --ntasks-per-node=4                 # Number of nodes and cores per node required
#SBATCH --mem-per-cpu=1G                        # Memory per core
#SBATCH -t1:00:00                               # Duration of the job (Ex: 1 hour)
#SBATCH -oReport-%j.out                         # Combined output and error messages file
#SBATCH --mail-type=BEGIN,END,FAIL              # Mail preferences
#SBATCH --mail-user=gburdell3@gatech.edu        # E-mail address for notifications

cd $HOME/slurm_mpi_example                      # Change to working directory created in $HOME

# Compile MPI Code
module load gcc/10.3.0 mvapich2/2.3.6
mpicc mpi_hello_world.c -o mpi_hello_world

# Run MPI Code
srun mpi_hello_world
  • This batch file combines the configuration for the Slurm batch job submission, the compilation for the MPI code, and running the MPI job using srun.
  • Next, run the MPI batch job using sbatch in the slurm_mpi_example directory:
$ cd slurm_mpi_example
$ sbatch SlurmBatchMPIExample.sbatch
Submitted batch job 1473
  • This example should not take long to run, though it may wait in the queue depending on how busy the cluster is.

  • Finally, after the batch MPI job example has run, the following output should appear in a file named Report-<job id>.out created in the same directory:

---------------------------------------
Begin Slurm Prolog: Oct-07-2022 16:10:09
Job ID:    1473
User ID:   gburdell3
Job name:  SlurmBatchMPIExample
Partition: ice-cpu
---------------------------------------
Hello world from processor atl0, rank 0 out of 8 processors
Hello world from processor atl0, rank 2 out of 8 processors
Hello world from processor atl0, rank 3 out of 8 processors
Hello world from processor atl1, rank 4 out of 8 processors
Hello world from processor atl1, rank 7 out of 8 processors
Hello world from processor atl0, rank 1 out of 8 processors
Hello world from processor atl1, rank 5 out of 8 processors
Hello world from processor atl1, rank 6 out of 8 processors
---------------------------------------
Begin Slurm Epilog: Oct-07-2022 16:10:11
Job ID:        1473
Array Job ID:  _4294967294
User ID:       gburdell3
Job name:      SlurmBatchMPIExample
Resources:     cpu=8,mem=8G,node=2
Rsrc Used:     cput=00:00:16,vmem=1104K,walltime=00:00:02,mem=0,energy_used=0
Partition:     ice-cpu
Nodes:         atl0,atl1
---------------------------------------

GPU Jobs

Note

By default, your job will be assigned to the first available Nvidia GPU. If you want to use a specific Nvidia architecture, or if you wish to use an AMD GPU, you must specify the type.

Requesting GPUs

  • Note that GPU resources can be requested in 2 different ways. For both approaches, <gpu type> is optional and only needed when a specific architecture is required.
    • --gres=gpu:<gpu type>:<number of gpus per node>. This specifies the number of GPUs per node.
    • -G, --gpus=<gpu type>:<total number of gpus>. This specifies the total number of GPUs for the job. Slurm requires a minimum of 1 GPU per node, so the total number of GPUs requested must be greater than or equal to the number of nodes requested.

Examples for requesting 1 GPU:

  • Nvidia Tesla V100
    • --gres=gpu:V100:1 or -G V100:1 for any V100
    • --gres=gpu:1 -C V100-16GB or -G1 -C V100-16GB for a V100 with 16 GB of memory
    • --gres=gpu:1 -C V100-32GB or -G1 -C V100-32GB for a V100 with 32 GB of memory
    • maximum 4 V100 per node
  • Nvidia Quadro Pro RTX6000 (note underscore in some syntax)
    • --gres=gpu:RTX_6000:1 or -G RTX_6000:1 or --gres=gpu:1 -C RTX6000 or -G 1 -C RTX6000
    • maximum 4 RTX6000 per node
  • Nvidia A40
    • --gres=gpu:A40:1 or -G A40:1 or --gres=gpu:1 -C A40 or -G 1 -C A40
    • maximum 2 A40 per node with AMD CPUs
  • Nvidia A100
    • --gres=gpu:A100:1 or -G A100:1 for any A100
    • --gres=gpu:1 -C A100-40GB or -G 1 -C A100-40GB for an A100 with 40 GB of memory
    • --gres=gpu:1 -C A100-80GB or -G 1 -C A100-80GB for an A100 with 80 GB of memory
    • maximum 2 A100 per node with AMD CPUs
  • AMD MI210
    • --gres=gpu:MI210:1 or -G MI210:1 or --gres=gpu:1 -C MI210 or -G 1 -C MI210
    • maximum 2 MI210 per node with AMD CPUs
    • For details, see Using AMD GPUs

Memory can be requested with --mem-per-cpu or --mem-per-gpu.

With Slurm, users can also use the following variations of --gpus* for greater control over how GPUs are allocated:

  • --gpus-per-node=<gpu type>:<number of gpus> - Specify the number of GPUs required for the job on each node in the job resource allocation. More information for this option can be found for salloc or sbatch.
  • --gpus-per-socket=<gpu type>:<number of gpus> - Specify the number of GPUs required for the job on each socket in the job resource allocation. More information for this option can be found for salloc or sbatch.
  • --gpus-per-task=<gpu type>:<number of gpus> - Specify the number of GPUs required for the job on each task in the job resource allocation. More information for this option can be found for salloc or sbatch.
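
For example, a sketch of a multi-node GPU request using one of these options, combined with directives described earlier (assuming V100 GPUs as in the examples above):

#SBATCH -N2 --gpus-per-node=V100:1 --ntasks-per-node=4   # 2 nodes, 1 V100 and 4 cores on each
#SBATCH --mem-per-gpu=12G                                # Memory per GPU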

Let's take a look at running a TensorFlow example on a GPU. A test example is provided in the $TENSORFLOWGPUROOT directory.

Interactive GPU Example

For running GPUs in Slurm using an interactive job, follow the steps for Interactive Jobs to enter an interactive session:

  • First, start a Slurm interactive session with GPUs using the following command, allocating 1 node with an Nvidia Tesla V100 GPU:
$ salloc -N1 --mem-per-gpu=12G -t0:15:00 --gres=gpu:V100:1 --ntasks-per-node=6
salloc: Pending job allocation 1484
salloc: job 1484 queued and waiting for resources
  • Next, after a period in pending status, your job will start once resources are granted, with a prompt like the following:
salloc: Granted job allocation 1484
salloc: Waiting for resource configuration
salloc: Nodes atl0 are ready for job
---------------------------------------
Begin Slurm Prolog: Oct-07-2022 16:10:57
Job ID:    1484
User ID:   gburdell3
Job name:  interactive
Partition: ice-gpu
---------------------------------------
[gburdell3@atl0 ~]$
  • Next, within your interactive session, load the tensorflow-gpu module and run the testgpu.py example:
$ cd slurm_gpu_example
$ module load tensorflow-gpu/2.9.0
$ srun python $TENSORFLOWGPUROOT/testgpu.py gpu 1000
  • Finally, the sample output from the interactive session should be:
$ srun python $TENSORFLOWGPUROOT/testgpu.py gpu 1000
2022-10-07 16:34:20.000892: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2022-10-07 16:34:29.749228: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-10-07 16:34:30.358799: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 30987 MB memory:  -> device: 0, name: Tesla V100-PCIE-32GB, pci bus id: 0000:3b:00.0, compute capability: 7.0
Num GPUs Available:  1


tf.Tensor(250106050.0, shape=(), dtype=float32)
Shape: (1000, 1000) Device: /gpu:0
Time taken: 0:00:01.312392


$

Batch GPU Example

For running GPUs in Slurm using a batch job, follow the steps in Batch Jobs and Basic Python Example to set up and run a batch job:

  • First, create a directory named slurm_gpu_example:
$ mkdir slurm_gpu_example
  • Next, create a batch script named SlurmBatchGPUExample.sbatch with the following content:
#!/bin/bash
#SBATCH -JGPUExample                                # Job name
#SBATCH -N1 --gres=gpu:V100:1 --ntasks-per-node=6   # Number of nodes, GPUs, and cores required
#SBATCH --mem-per-gpu=12G                           # Memory per gpu
#SBATCH -t15                                        # Duration of the job (Ex: 15 mins)
#SBATCH -oReport-%j.out                             # Combined output and error messages file
#SBATCH --mail-type=BEGIN,END,FAIL                  # Mail preferences
#SBATCH --mail-user=gburdell3@gatech.edu            # e-mail address for notifications

cd $HOME/slurm_gpu_example                          # Change to working directory created in $HOME

module load tensorflow-gpu/2.9.0                    # Load module dependencies
srun python $TENSORFLOWGPUROOT/testgpu.py gpu 1000  # Run test example
  • Note that the GPU resource can be requested in 2 different ways with sbatch in batch mode. The details for requesting GPU resources in batch jobs are the same as for interactive jobs, described above.

  • Next, run the GPU batch job using sbatch in the slurm_gpu_example directory:

$ cd slurm_gpu_example
$ sbatch SlurmBatchGPUExample.sbatch
Submitted batch job 1491
  • Finally, after the batch GPU job example has run, the following output should appear in a file named Report-<job id>.out created in the same directory:
---------------------------------------
Begin Slurm Prolog: Oct-07-2022 16:10:07
Job ID:    1491
User ID:   gburdell3
Job name:  GPUExample
Partition: gpu-v100
---------------------------------------
2022-10-07 16:37:14.541726: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2022-10-07 16:37:24.629080: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-10-07 16:37:25.227169: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 30987 MB memory:  -> device: 0, name: Tesla V100-PCIE-32GB, pci bus id: 0000:3b:00.0, compute capability: 7.0
Num GPUs Available:  1



tf.Tensor(249757090.0, shape=(), dtype=float32)
Shape: (1000, 1000) Device: /gpu:0
Time taken: 0:00:01.302872



---------------------------------------
Begin Slurm Epilog: Oct-07-2022 16:10:26
Job ID:        1491
Array Job ID:  _4294967294
User ID:       gburdell3
Job name:      GPUExample
Resources:     cpu=12,gres/gpu:v100=1,mem=12G,node=1
Rsrc Used:     cput=00:03:48,vmem=120K,walltime=00:00:19,mem=0,energy_used=0
Partition:     ice-gpu
Nodes:         atl0
---------------------------------------

Using AMD GPUs

  • The AMD GPUs can be monitored with the rocm-smi command.
  • When compiling for these GPUs, it is essential to specify the architecture, or an error will occur. With the hipcc compiler, use hipcc --offload-arch=gfx90a.
  • An example vectoradd_hip.cpp code can be found on AMD's site.
  • make or CMake can also be used to build, if preferred.
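
As a sketch, a build-and-run session for the vectoradd_hip.cpp example on an MI210 node might look like the following within an interactive job (the resource amounts are only illustrative, and any required modules are not shown):

$ salloc -N1 --ntasks-per-node=4 -G MI210:1 --mem-per-gpu=12G -t0:30:00
$ hipcc --offload-arch=gfx90a vectoradd_hip.cpp -o vectoradd_hip
$ ./vectoradd_hip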

Local Disk Jobs

Every ICE compute node has local disk storage available for temporary use in a job, which is automatically cleared upon job completion. Some applications can benefit from this storage for faster I/O than network storage (home and scratch). Most ICE CPU nodes and some GPU nodes have large NVMe local disks, while a few have SAS storage. See ICE resources for details.

  • Use the ${TMPDIR} variable in your Slurm script or interactive session to access the temporary directory for your job on local disk, which is automatically created for every job.
  • When requesting a partial node, guarantee availability of local disk space with #SBATCH --tmp=<size>[units, default MB].
  • To request a node with SAS storage, add #SBATCH -C localSAS.
  • To request a node with NVMe storage, add #SBATCH -C localNVMe.
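
A sketch of a batch script that stages data on local disk via ${TMPDIR} (the input file, program name, and sizes are placeholders):

#!/bin/bash
#SBATCH -JLocalDiskExample                      # Job name (placeholder)
#SBATCH -N1 --ntasks-per-node=4                 # Number of nodes and cores per node required
#SBATCH --mem-per-cpu=1G                        # Memory per core
#SBATCH -t30                                    # Duration of the job (Ex: 30 mins)
#SBATCH --tmp=20G                               # Guarantee at least 20 GB of local disk space
#SBATCH -C localNVMe                            # Request a node with NVMe local storage
#SBATCH -oReport-%j.out                         # Combined output and error messages file

cp $SLURM_SUBMIT_DIR/input.dat ${TMPDIR}/       # Stage input data onto the node's local disk
cd ${TMPDIR}
srun $SLURM_SUBMIT_DIR/my_program input.dat     # Placeholder program that reads and writes in ${TMPDIR}
cp ${TMPDIR}/output.dat $SLURM_SUBMIT_DIR/      # Copy results back before the job ends and ${TMPDIR} is cleared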