Updated 2023-03-31
Run CUDA on the Cluster
Overview
- CUDA is a parallel computing platform and programming model developed by NVIDIA for computational tasks on GPUs. With CUDA, programmers can significantly speed up computations by making use of GPUs.
- This guide covers how to run CUDA on the cluster.
Walkthrough: Run CUDA on the Cluster
- This walkthrough covers a simple example that shows how to add two arrays using CUDA (a sketch of the program's structure is shown after this list).
- `add.cu` can be found here, and the SBATCH script can be found here.
- You can transfer the files to your account on the cluster to follow along. The file transfer guide may be helpful.
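The real `add.cu` is the linked file above; as a rough reference, a minimal CUDA program with the same overall structure might look like the sketch below. The input arrays and the way the command-line argument is used are illustrative assumptions, not the contents of the actual file:

```c
#include <stdio.h>
#include <stdlib.h>

#define N 5

// Kernel: each of N threads adds one pair of elements.
__global__ void add_arrays(const int *a, const int *b, int *c) {
    int i = threadIdx.x;
    if (i < N) c[i] = a[i] + b[i];
}

int main(int argc, char **argv) {
    // The SBATCH script passes an argument (1 or 2); here it simply scales
    // the second input array. This is an assumption -- the real add.cu may
    // use the argument differently.
    int k = (argc > 1) ? atoi(argv[1]) : 1;

    int a[N] = {1, 2, 3, 4, 5};
    int b[N], c[N];
    for (int i = 0; i < N; i++) b[i] = 10 * (i + 1) * k;

    // Allocate device buffers and copy the inputs to the GPU.
    int *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, N * sizeof(int));
    cudaMalloc(&d_b, N * sizeof(int));
    cudaMalloc(&d_c, N * sizeof(int));
    cudaMemcpy(d_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, N * sizeof(int), cudaMemcpyHostToDevice);

    // One block of N threads: one thread per array element.
    add_arrays<<<1, N>>>(d_a, d_b, d_c);
    cudaMemcpy(c, d_c, N * sizeof(int), cudaMemcpyDeviceToHost);

    printf("Addition of two arrays = {");
    for (int i = 0; i < N; i++) printf(i ? ", %d" : "%d", c[i]);
    printf("}\n");

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
```

The kernel launch `add_arrays<<<1, N>>>` starts one block of N threads, so each thread handles one element; the `cudaMemcpy` back to the host implicitly waits for the kernel to finish on the default stream.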
Part 1: The SBATCH Script
```bash
#!/bin/bash
#SBATCH -JcudaTest
#SBATCH -A [Account]
#SBATCH -N1 --gres=gpu:1
#SBATCH -t10
#SBATCH -qinferno
#SBATCH -oReport-%j.out

cd $SLURM_SUBMIT_DIR            # run from the directory the job was submitted from
module load cuda/11.7.0-7sdye3  # load the CUDA toolkit
nvcc -o add add.cu              # compile the CUDA source
srun -n 1 ./add 1 &             # run the program twice with different arguments
srun -n 1 ./add 2 &
wait                            # wait for both background runs to finish
```
- The `#SBATCH` directives are standard, requesting 10 minutes of walltime and 1 node with 1 GPU. More on `#SBATCH` directives can be found in the Using Slurm on Phoenix Guide.
- `$SLURM_SUBMIT_DIR` is a variable that holds the directory you submitted the SBATCH script from. Make sure the files you want to use are in the same directory as the SBATCH script. Output files will also show up in this directory.
- The `nvcc` compiler is used to compile the CUDA source file; the same build steps can be run by hand, as shown after this list.
- To see which CUDA versions are available, run `module spider cuda`, and load the one you want.
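For reference, these are the same build steps the script performs, run by hand (the module version matches the one in the script above):

```bash
module spider cuda              # list the CUDA versions installed on the cluster
module load cuda/11.7.0-7sdye3  # load the version used in this walkthrough
nvcc -o add add.cu              # compile add.cu into an executable named add
```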
Part 2: Submit Job and Check Status
- Make sure you're in the directory that contains the SBATCH script as well as `add.cu`.
- Submit as normal, with `sbatch <script name>`; in this case, `sbatch cuda.sbatch` (a sample session is shown after this list).
- Check job status with `squeue --job <jobID>`, replacing `<jobID>` with the job ID returned after running `sbatch`.
- You can delete the job with `scancel <jobID>`, replacing `<jobID>` with the job ID returned after running `sbatch`.
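Putting these commands together, a typical session might look like this (the job ID is taken from the sample output below; yours will differ):

```bash
sbatch cuda.sbatch   # prints: Submitted batch job 714060
squeue --job 714060  # check the job's status while it is queued or running
scancel 714060       # only if you need to cancel the job
```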
Part 3: Collecting Results
- In the directory where you submitted the SBATCH script, you should see a `Report-<jobID>.out` file, which contains the results of the job, and an `add` executable.
- `Report-<jobID>.out` should look like this:
```
---------------------------------------
Begin Slurm Prolog: Feb-15-2023 20:09:17
Job ID:    714060
User ID:   svangala3
Account:   phx-pace-staff
Job name:  cudaTest
Partition: gpu-v100
QOS:       inferno
---------------------------------------
Addition of two arrays = {111, 212, 313, 414, 515}
Addition of two arrays = {11, 22, 33, 44, 55}
---------------------------------------
Begin Slurm Epilog: Feb-15-2023 20:09:25
Job ID:       714060
Array Job ID: _4294967294
User ID:      svangala3
Account:      phx-pace-staff
Job name:     cudaTest
Resources:    cpu=12,gres/gpu:v100=1,mem=12G,node=1
Rsrc Used:    cput=00:02:00,vmem=1176K,walltime=00:00:10,mem=0,energy_used=0
Partition:    gpu-v100
QOS:          inferno
Nodes:        atl1-1-01-004-36-0
---------------------------------------
```
- After the result files are produced, you can move them off the cluster; refer to the file transfer guide for help.
- Congratulations! You successfully ran CUDA on the cluster.