Updated 2023-03-31

Run CUDA on the Cluster

Overview

  • CUDA is a parallel computing platform and programming model developed by NVIDIA for computational tasks on GPUs. With CUDA, programmers can significantly speed up computations by offloading them to the GPU.
  • This guide covers how to run CUDA on the cluster.

Walkthrough: Run CUDA on the Cluster

  • This walkthrough covers a simple example that performs addition of two arrays using CUDA.
  • add.cu can be found here; a sketch of what such a file might look like follows this list.
  • The SBATCH script can be found here.
  • You can transfer the files to your account on the cluster to follow along. The file transfer guide may be helpful.
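
The linked add.cu is not reproduced in this guide, so for orientation, here is a minimal sketch of what a CUDA vector-addition program of this shape might look like. The kernel body, the array contents, and the use of the command-line argument are illustrative assumptions; the actual file (and the exact output shown in Part 3) may differ.

#include <cstdio>
#include <cstdlib>

// Hypothetical kernel: each thread adds one pair of elements.
__global__ void add(const int *a, const int *b, int *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

int main(int argc, char **argv)
{
    const int n = 5;
    // Assumed inputs: scale one array by the command-line argument so that
    // "add 1" and "add 2" (as in the SBATCH script) print different sums.
    int scale = (argc > 1) ? atoi(argv[1]) : 1;
    int a[n], b[n], c[n];
    for (int i = 0; i < n; i++) {
        a[i] = i + 1;
        b[i] = scale * 10 * (i + 1);
    }

    // Allocate device buffers and copy the inputs over.
    int *da, *db, *dc;
    cudaMalloc(&da, n * sizeof(int));
    cudaMalloc(&db, n * sizeof(int));
    cudaMalloc(&dc, n * sizeof(int));
    cudaMemcpy(da, a, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(db, b, n * sizeof(int), cudaMemcpyHostToDevice);

    // One block of n threads is plenty for 5 elements.
    add<<<1, n>>>(da, db, dc, n);
    cudaMemcpy(c, dc, n * sizeof(int), cudaMemcpyDeviceToHost);

    printf("Addition of two arrays = {");
    for (int i = 0; i < n; i++)
        printf("%d%s", c[i], i < n - 1 ? ", " : "}\n");

    cudaFree(da);
    cudaFree(db);
    cudaFree(dc);
    return 0;
}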

Part 1: The SBATCH Script

#!/bin/bash
#SBATCH -JcudaTest
#SBATCH -A [Account]
#SBATCH -N1 --gres=gpu:1
#SBATCH -t10
#SBATCH -qinferno
#SBATCH -oReport-%j.out

cd $SLURM_SUBMIT_DIR            # run from the directory the job was submitted from
module load cuda/11.7.0-7sdye3  # load the CUDA toolchain (provides nvcc)

nvcc -o add add.cu              # compile the CUDA source into an executable named add
srun -n 1 add 1 &               # launch two runs in the background, each with a
srun -n 1 add 2 &               # different command-line argument
wait                            # block until both background runs finish

  • The #SBATCH directives are standard, requesting 10 minutes of walltime and 1 node with 1 GPU. More on #SBATCH directives can be found in the Using Slurm on Phoenix Guide.
  • $SLURM_SUBMIT_DIR is a variable that holds the directory you submitted the SBATCH script from. Make sure the files you want to use are in the same directory as the SBATCH script.
  • The nvcc compiler is used to compile the CUDA source file.
  • Output files will show up in this directory as well.
  • To see which CUDA versions are available, run module spider cuda, and load the one you want. A quick way to verify that a job can see its GPU is sketched after this list.
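
As a quick sanity check that a job can actually see the GPU requested with --gres=gpu:1, a small program along these lines can help. This is a hypothetical helper, not one of the linked files; the file name gpucheck.cu is made up for illustration.

#include <cstdio>
#include <cuda_runtime.h>

// Minimal sketch: ask the CUDA runtime how many GPUs are visible to this job.
int main()
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("Visible GPUs: %d\n", count);
    return 0;
}

Compile and run it the same way as add.cu, e.g. nvcc -o gpucheck gpucheck.cu followed by srun -n 1 gpucheck; a job submitted with --gres=gpu:1 should report one visible GPU.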

Part 2: Submit Job and Check Status

  • Make sure you're in the directory that contains both the SBATCH script and add.cu.
  • Submit as normal with sbatch <script name>, in this case sbatch cuda.sbatch.
  • Check job status with squeue --job <jobID>, replacing <jobID> with the job ID returned by sbatch.
  • You can cancel the job with scancel <jobID>, again using the job ID returned by sbatch. A sample session is sketched after this list.
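
A typical session might look like the following; the job ID 714060 is borrowed from the sample output in Part 3, and yours will differ.

$ sbatch cuda.sbatch
Submitted batch job 714060
$ squeue --job 714060
$ scancel 714060    # only if you need to stop the job early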

Part 3: Collecting Results

  • In the directory where you submitted the SBATCH script, you should see a Report-<jobID>.out file containing the results of the job, along with the compiled add executable.
  • Report-<jobID>.out should look like this:
---------------------------------------
Begin Slurm Prolog: Feb-15-2023 20:09:17
Job ID:    714060
User ID:   svangala3
Account:   phx-pace-staff
Job name:  cudaTest
Partition: gpu-v100
QOS:       inferno
---------------------------------------
Addition of two arrays = {111, 212, 313, 414, 515}
Addition of two arrays = {11, 22, 33, 44, 55}
---------------------------------------
Begin Slurm Epilog: Feb-15-2023 20:09:25
Job ID:        714060
Array Job ID:  _4294967294
User ID:       svangala3
Account:       phx-pace-staff
Job name:      cudaTest
Resources:     cpu=12,gres/gpu:v100=1,mem=12G,node=1
Rsrc Used:     cput=00:02:00,vmem=1176K,walltime=00:00:10,mem=0,energy_used=0
Partition:     gpu-v100
QOS:           inferno
Nodes:         atl1-1-01-004-36-0
---------------------------------------
  • After the result files are produced, you can move them off the cluster; refer to the file transfer guide for help.
  • Congratulations! You successfully ran CUDA on the cluster.