Updated 2023-03-31
Run PyTorch on the Cluster (RHEL Only)¶
Overview¶
- PyTorch is an open source machine learning library based on the Torch library, used for applications such as computer vision and natural language processing.
- This guide will cover how to run PyTorch on RHEL7 on the Cluster.
- You can find more information about PyTorch on their homepage.
Things to Note¶
- This guide only works with the pytorch module on RHEL7.
- You must submit the job to a queue such as testflight-gpu or ece-gpu that has access to GPUs and the pytorch module to run this example.
- More information about the example below can be found by opening the README located on the Cluster at
$PYTORCHROOT/examples/word_language_model/README.md
Walkthrough: Run PyTorch on the Cluster¶
- This example trains a multi-layer RNN (Elman, GRU, or LSTM) on a language modeling task. The files used in this example can be found on the Cluster at
$PYTORCHROOT/examples/word_language_model
- The SBATCH script is shown in Part 1 below.
- You can transfer the files to your account on the cluster to follow along. The file transfer guide may be helpful.
Part 1: The SBATCH Script¶
#!/bin/bash
#SBATCH -JpytorchTest
#SBATCH --account=[Account]
#SBATCH -N1 --ntasks-per-node=2 --gres=gpu:1
#SBATCH --mem-per-cpu=2G
#SBATCH -t10
#SBATCH -qtestflight-gpu
#SBATCH -oReport-%j.out
cd $SLURM_SUBMIT_DIR
module load pytorch/1.12.0
source activate $PYTORCHROOT
cp -r $PYTORCHROOT/examples/word_language_model .
cd word_language_model
python main.py --cuda --epochs 6
- The #SBATCH directives request 10 minutes of walltime, 1 node with 2 cores, and 1 GPU. More on #SBATCH directives can be found in the Using Slurm on Phoenix Guide.
- $SLURM_SUBMIT_DIR is a variable that holds the directory you submitted the SBATCH script from. Input and output files for the script should be placed in the same directory as the SBATCH script.
- module load pytorch/1.12.0 loads version 1.12.0 of PyTorch. To see which PyTorch versions are available, run module spider pytorch, and load the one you want.
- source activate $PYTORCHROOT activates the Anaconda environment provided at $PYTORCHROOT.
- The next two commands copy the word_language_model example directory into your working directory and cd into it.
- python main.py --cuda --epochs 6 trains an LSTM on Wikitext-2 with CUDA, with an upper limit of 6 epochs.
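Before spending walltime on training, it can help to confirm that the job actually received a GPU. A minimal sketch you could run inside the job script after module load (the graceful fallback when torch is absent is an illustration added here, not part of the example):

```python
import importlib.util

def describe_torch_device() -> str:
    """Report whether PyTorch can see a CUDA device, without crashing if torch is missing."""
    if importlib.util.find_spec("torch") is None:
        return "torch not found: load the pytorch module first"
    import torch
    device = "cuda" if torch.cuda.is_available() else "cpu"
    return f"torch {torch.__version__}: training on {device}"

print(describe_torch_device())
```

If this reports cpu on a GPU queue, check that your #SBATCH directives include --gres=gpu:1 before launching the real training run.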
Part 2: Submit Job and Check Status¶
- Make sure you're in the directory that contains the SBATCH script as well as the pytorch program.
- Submit as normal with sbatch <sbatch script name>, in this case sbatch pytorch.sbatch.
- Check job status with squeue --job <jobID>, replacing <jobID> with the job ID returned after running sbatch.
- You can delete the job with scancel <jobID>, replacing <jobID> with the job ID returned after running sbatch.
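On success, sbatch prints a line of the form "Submitted batch job <ID>", so the job ID can be captured for later squeue or scancel calls. A sketch of the parsing, with the actual sbatch invocation shown only as a comment since it exists only on the cluster:

```python
# On the cluster you would capture real output with subprocess, e.g.:
#   out = subprocess.run(["sbatch", "pytorch.sbatch"], capture_output=True, text=True).stdout
def parse_job_id(sbatch_output: str) -> str:
    """Extract the job ID from sbatch's 'Submitted batch job <ID>' line."""
    return sbatch_output.strip().split()[-1]

# example output line from the walkthrough's own run
print(parse_job_id("Submitted batch job 68504"))
```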
Part 3: Collecting Results¶
- In the directory where you submitted the SBATCH script, you should see a Report-<jobID>.out file, which contains the results of the job. Use cat or open the file in a text editor to take a look.
Report-<jobID>.out
---------------------------------------
Begin Slurm Prolog: Fri Nov-11-2022 10:45:24
Job ID: 68504
User ID: svangala3
Account: phx-pace-staff
Job name: pytorchTest
Partition: gpu-v100
QOS: testflight-gpu
---------------------------------------
| epoch 1 | 200/ 2983 batches | lr 20.00 | ms/batch 12.62 | loss 7.63 | ppl 2048.93
| epoch 1 | 400/ 2983 batches | lr 20.00 | ms/batch 11.57 | loss 6.86 | ppl 949.01
| epoch 1 | 600/ 2983 batches | lr 20.00 | ms/batch 11.52 | loss 6.48 | ppl 651.67
| epoch 1 | 800/ 2983 batches | lr 20.00 | ms/batch 11.46 | loss 6.30 | ppl 545.05
| epoch 1 | 1000/ 2983 batches | lr 20.00 | ms/batch 11.46 | loss 6.15 | ppl 469.78
| epoch 1 | 1200/ 2983 batches | lr 20.00 | ms/batch 11.63 | loss 6.07 | ppl 430.61
| epoch 1 | 1400/ 2983 batches | lr 20.00 | ms/batch 11.53 | loss 5.96 | ppl 385.86
| epoch 1 | 1600/ 2983 batches | lr 20.00 | ms/batch 11.63 | loss 5.95 | ppl 383.91
| epoch 1 | 1800/ 2983 batches | lr 20.00 | ms/batch 11.73 | loss 5.80 | ppl 331.57
| epoch 1 | 2000/ 2983 batches | lr 20.00 | ms/batch 11.60 | loss 5.78 | ppl 324.11
| epoch 1 | 2200/ 2983 batches | lr 20.00 | ms/batch 11.61 | loss 5.68 | ppl 292.09
| epoch 1 | 2400/ 2983 batches | lr 20.00 | ms/batch 11.69 | loss 5.68 | ppl 294.27
| epoch 1 | 2600/ 2983 batches | lr 20.00 | ms/batch 11.66 | loss 5.66 | ppl 287.08
| epoch 1 | 2800/ 2983 batches | lr 20.00 | ms/batch 11.62 | loss 5.55 | ppl 257.19
-----------------------------------------------------------------------------------------
| end of epoch 1 | time: 36.23s | valid loss 5.53 | valid ppl 251.96
-----------------------------------------------------------------------------------------
| epoch 2 | 200/ 2983 batches | lr 20.00 | ms/batch 11.80 | loss 5.55 | ppl 256.59
| epoch 2 | 400/ 2983 batches | lr 20.00 | ms/batch 11.72 | loss 5.53 | ppl 253.35
| epoch 2 | 600/ 2983 batches | lr 20.00 | ms/batch 11.85 | loss 5.36 | ppl 213.31
| epoch 2 | 800/ 2983 batches | lr 20.00 | ms/batch 11.80 | loss 5.38 | ppl 217.89
| epoch 2 | 1000/ 2983 batches | lr 20.00 | ms/batch 11.82 | loss 5.36 | ppl 213.11
| epoch 2 | 1200/ 2983 batches | lr 20.00 | ms/batch 11.81 | loss 5.34 | ppl 208.17
| epoch 2 | 1400/ 2983 batches | lr 20.00 | ms/batch 11.81 | loss 5.34 | ppl 207.94
| epoch 2 | 1600/ 2983 batches | lr 20.00 | ms/batch 11.84 | loss 5.39 | ppl 220.03
| epoch 2 | 1800/ 2983 batches | lr 20.00 | ms/batch 11.84 | loss 5.26 | ppl 193.22
| epoch 2 | 2000/ 2983 batches | lr 20.00 | ms/batch 11.80 | loss 5.28 | ppl 195.50
| epoch 2 | 2200/ 2983 batches | lr 20.00 | ms/batch 11.84 | loss 5.18 | ppl 177.08
| epoch 2 | 2400/ 2983 batches | lr 20.00 | ms/batch 11.86 | loss 5.21 | ppl 183.81
| epoch 2 | 2600/ 2983 batches | lr 20.00 | ms/batch 11.78 | loss 5.23 | ppl 186.74
| epoch 2 | 2800/ 2983 batches | lr 20.00 | ms/batch 11.82 | loss 5.14 | ppl 170.28
-----------------------------------------------------------------------------------------
| end of epoch 2 | time: 36.56s | valid loss 5.27 | valid ppl 195.11
-----------------------------------------------------------------------------------------
| epoch 3 | 200/ 2983 batches | lr 20.00 | ms/batch 12.32 | loss 5.19 | ppl 179.34
| epoch 3 | 400/ 2983 batches | lr 20.00 | ms/batch 12.26 | loss 5.20 | ppl 181.92
| epoch 3 | 600/ 2983 batches | lr 20.00 | ms/batch 12.12 | loss 5.02 | ppl 152.10
| epoch 3 | 800/ 2983 batches | lr 20.00 | ms/batch 11.87 | loss 5.07 | ppl 159.61
| epoch 3 | 1000/ 2983 batches | lr 20.00 | ms/batch 11.99 | loss 5.06 | ppl 157.82
| epoch 3 | 1200/ 2983 batches | lr 20.00 | ms/batch 11.96 | loss 5.05 | ppl 156.11
| epoch 3 | 1400/ 2983 batches | lr 20.00 | ms/batch 11.95 | loss 5.08 | ppl 161.37
| epoch 3 | 1600/ 2983 batches | lr 20.00 | ms/batch 11.95 | loss 5.15 | ppl 172.53
| epoch 3 | 1800/ 2983 batches | lr 20.00 | ms/batch 11.95 | loss 5.02 | ppl 150.77
| epoch 3 | 2000/ 2983 batches | lr 20.00 | ms/batch 11.82 | loss 5.04 | ppl 154.34
| epoch 3 | 2200/ 2983 batches | lr 20.00 | ms/batch 11.91 | loss 4.94 | ppl 140.38
| epoch 3 | 2400/ 2983 batches | lr 20.00 | ms/batch 11.85 | loss 4.99 | ppl 147.31
| epoch 3 | 2600/ 2983 batches | lr 20.00 | ms/batch 11.88 | loss 5.01 | ppl 149.24
| epoch 3 | 2800/ 2983 batches | lr 20.00 | ms/batch 12.10 | loss 4.93 | ppl 138.46
-----------------------------------------------------------------------------------------
| end of epoch 3 | time: 37.08s | valid loss 5.16 | valid ppl 175.00
-----------------------------------------------------------------------------------------
| epoch 4 | 200/ 2983 batches | lr 20.00 | ms/batch 12.32 | loss 5.00 | ppl 148.84
| epoch 4 | 400/ 2983 batches | lr 20.00 | ms/batch 11.94 | loss 5.02 | ppl 151.21
| epoch 4 | 600/ 2983 batches | lr 20.00 | ms/batch 11.93 | loss 4.84 | ppl 126.34
| epoch 4 | 800/ 2983 batches | lr 20.00 | ms/batch 11.97 | loss 4.89 | ppl 132.94
| epoch 4 | 1000/ 2983 batches | lr 20.00 | ms/batch 11.92 | loss 4.89 | ppl 132.35
| epoch 4 | 1200/ 2983 batches | lr 20.00 | ms/batch 11.89 | loss 4.89 | ppl 132.45
| epoch 4 | 1400/ 2983 batches | lr 20.00 | ms/batch 11.93 | loss 4.93 | ppl 137.78
| epoch 4 | 1600/ 2983 batches | lr 20.00 | ms/batch 11.98 | loss 5.00 | ppl 148.85
| epoch 4 | 1800/ 2983 batches | lr 20.00 | ms/batch 11.94 | loss 4.87 | ppl 130.03
| epoch 4 | 2000/ 2983 batches | lr 20.00 | ms/batch 11.95 | loss 4.90 | ppl 133.86
| epoch 4 | 2200/ 2983 batches | lr 20.00 | ms/batch 12.09 | loss 4.81 | ppl 122.17
| epoch 4 | 2400/ 2983 batches | lr 20.00 | ms/batch 12.17 | loss 4.85 | ppl 127.52
| epoch 4 | 2600/ 2983 batches | lr 20.00 | ms/batch 11.99 | loss 4.87 | ppl 130.16
| epoch 4 | 2800/ 2983 batches | lr 20.00 | ms/batch 12.05 | loss 4.80 | ppl 121.41
-----------------------------------------------------------------------------------------
| end of epoch 4 | time: 37.12s | valid loss 5.07 | valid ppl 159.65
-----------------------------------------------------------------------------------------
| epoch 5 | 200/ 2983 batches | lr 20.00 | ms/batch 12.38 | loss 4.86 | ppl 129.45
| epoch 5 | 400/ 2983 batches | lr 20.00 | ms/batch 12.02 | loss 4.89 | ppl 133.48
| epoch 5 | 600/ 2983 batches | lr 20.00 | ms/batch 12.01 | loss 4.71 | ppl 110.85
| epoch 5 | 800/ 2983 batches | lr 20.00 | ms/batch 12.07 | loss 4.78 | ppl 118.55
| epoch 5 | 1000/ 2983 batches | lr 20.00 | ms/batch 12.10 | loss 4.77 | ppl 117.38
| epoch 5 | 1200/ 2983 batches | lr 20.00 | ms/batch 12.01 | loss 4.77 | ppl 117.71
| epoch 5 | 1400/ 2983 batches | lr 20.00 | ms/batch 12.14 | loss 4.82 | ppl 123.91
| epoch 5 | 1600/ 2983 batches | lr 20.00 | ms/batch 12.09 | loss 4.89 | ppl 132.68
| epoch 5 | 1800/ 2983 batches | lr 20.00 | ms/batch 12.01 | loss 4.76 | ppl 117.30
| epoch 5 | 2000/ 2983 batches | lr 20.00 | ms/batch 12.15 | loss 4.80 | ppl 120.99
| epoch 5 | 2200/ 2983 batches | lr 20.00 | ms/batch 12.12 | loss 4.69 | ppl 109.35
| epoch 5 | 2400/ 2983 batches | lr 20.00 | ms/batch 12.16 | loss 4.74 | ppl 114.76
| epoch 5 | 2600/ 2983 batches | lr 20.00 | ms/batch 12.16 | loss 4.77 | ppl 117.37
| epoch 5 | 2800/ 2983 batches | lr 20.00 | ms/batch 12.03 | loss 4.70 | ppl 109.65
-----------------------------------------------------------------------------------------
| end of epoch 5 | time: 37.41s | valid loss 5.04 | valid ppl 153.98
-----------------------------------------------------------------------------------------
| epoch 6 | 200/ 2983 batches | lr 20.00 | ms/batch 12.41 | loss 4.76 | ppl 117.17
| epoch 6 | 400/ 2983 batches | lr 20.00 | ms/batch 12.16 | loss 4.80 | ppl 121.60
| epoch 6 | 600/ 2983 batches | lr 20.00 | ms/batch 12.22 | loss 4.61 | ppl 100.98
| epoch 6 | 800/ 2983 batches | lr 20.00 | ms/batch 12.22 | loss 4.68 | ppl 107.31
| epoch 6 | 1000/ 2983 batches | lr 20.00 | ms/batch 12.19 | loss 4.68 | ppl 107.50
| epoch 6 | 1200/ 2983 batches | lr 20.00 | ms/batch 12.13 | loss 4.68 | ppl 107.84
| epoch 6 | 1400/ 2983 batches | lr 20.00 | ms/batch 12.09 | loss 4.73 | ppl 113.68
| epoch 6 | 1600/ 2983 batches | lr 20.00 | ms/batch 12.14 | loss 4.80 | ppl 121.91
| epoch 6 | 1800/ 2983 batches | lr 20.00 | ms/batch 12.15 | loss 4.68 | ppl 108.10
| epoch 6 | 2000/ 2983 batches | lr 20.00 | ms/batch 12.15 | loss 4.71 | ppl 111.39
| epoch 6 | 2200/ 2983 batches | lr 20.00 | ms/batch 12.11 | loss 4.62 | ppl 101.04
| epoch 6 | 2400/ 2983 batches | lr 20.00 | ms/batch 12.18 | loss 4.67 | ppl 106.34
| epoch 6 | 2600/ 2983 batches | lr 20.00 | ms/batch 12.19 | loss 4.69 | ppl 108.52
| epoch 6 | 2800/ 2983 batches | lr 20.00 | ms/batch 12.15 | loss 4.62 | ppl 101.45
-----------------------------------------------------------------------------------------
| end of epoch 6 | time: 37.66s | valid loss 5.00 | valid ppl 148.12
-----------------------------------------------------------------------------------------
=========================================================================================
| End of training | test loss 4.93 | test ppl 138.48
=========================================================================================
---------------------------------------
Begin Slurm Epilog: Fri Nov-11-2022 10:45:24
Job ID: 68504
User ID: svangala3
Account: phx-pace-staff
Job name: pytorchTest
Resources: cpu=12,gres/gpu:v100=1,mem=24G,node=1
Rsrc Used: cput=00:01:24,vmem=3460K,walltime=00:00:07,mem=0,energy_used=0
Partition: gpu-v100
QOS: testflight-gpu
Nodes: atl1-1-03-002-23-0
---------------------------------------
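The ppl column in the report is simply the exponential of the cross-entropy loss, so the logged numbers can be sanity-checked by hand: the final test loss of 4.93 gives a perplexity near the logged 138.48 (the small gap comes from the log rounding the loss to two decimals). A quick check:

```python
import math

def perplexity(loss: float) -> float:
    # perplexity = exp(cross-entropy loss)
    return math.exp(loss)

# test loss 4.93 from the report above; prints about 138.38
print(round(perplexity(4.93), 2))
```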
- After the result files are produced, you can move the files off the cluster; refer to the file transfer guide for help.
- Congratulations! You successfully ran PyTorch on the cluster.