Updated 2021-05-17
Run PyTorch on the Cluster (RHEL7 Only)¶
Overview¶
- PyTorch is an open source machine learning library based on the Torch library, used for applications such as computer vision and natural language processing.
- This guide will cover how to run PyTorch on RHEL7 on the Cluster.
- You can find more information about PyTorch on their homepage.
Things to Note¶
- This guide only works with the pytorch module on RHEL7.
- You must submit the job to a queue that has access to GPUs and the pytorch module (e.g., testflight-gpu or ece-gpu) to run this example.
- More information about the example below can be found in the README located at $PYTORCHROOT/examples/word_language_model/README.md on the Cluster.
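If you'd like to read that README before submitting anything, you can load the module and print the file from a login session. A minimal sketch, assuming the pytorch module sets $PYTORCHROOT as described in the walkthrough below:
module load pytorch/1.2    # makes $PYTORCHROOT available
cat $PYTORCHROOT/examples/word_language_model/README.md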
Walkthrough: Run PyTorch on the Cluster¶
- This example trains a multi-layer RNN (Elman, GRU, or LSTM) on a language modeling task.
- The files used in this example can be found on the Cluster at $PYTORCHROOT/examples/word_language_model.
- The PBS script used in this walkthrough is shown in Part 1 below.
- You can transfer files to your account on the cluster to follow along (a transfer sketch follows this list); the file transfer guide may be helpful.
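For example, to copy a PBS script from your local machine to the cluster, scp works from a local terminal. A minimal sketch; the username and hostname below are placeholders, so substitute your own account and the login node named in the file transfer guide:
# run from your local machine, not the cluster; hostname is a placeholder
scp pytorch.pbs your_username@login.pace.gatech.edu:~/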
Part 1: The PBS Script¶
#PBS -N pytorchTest
#PBS -A [Account]
#PBS -l nodes=1:ppn=2:gpus=1
#PBS -l pmem=2gb
#PBS -l walltime=10:00
#PBS -q testflight-gpu
#PBS -j oe
#PBS -o pytorchTest.out
cd $PBS_O_WORKDIR
module load pytorch/1.2
source activate $PYTORCHROOT
cp -r $PYTORCHROOT/examples/word_language_model .
cd word_language_model
python main.py --cuda --epochs 6
- The #PBS directives request 10 minutes of walltime, 1 node with 2 cores, and 1 GPU. More on #PBS directives can be found in the PBS guide.
- $PBS_O_WORKDIR is a variable that represents the directory you submitted the PBS script from. Input and output files for the script should be placed in the same directory as the PBS script.
- module load pytorch/1.2 loads version 1.2 of PyTorch. To see which PyTorch versions are available, run module avail pytorch and load the one you want (see the interactive sketch after this list).
- source activate $PYTORCHROOT activates the anaconda environment located at $PYTORCHROOT.
- The next two commands copy the word_language_model example directory into your working directory and cd into it.
- python main.py --cuda --epochs 6 trains an LSTM on Wikitext-2 with CUDA, with an upper epoch limit of 6.
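Before submitting, you can sanity-check the module and environment in an interactive session. A minimal sketch using only the commands explained above (a CUDA check would additionally require a GPU node):
module avail pytorch          # list available PyTorch modules
module load pytorch/1.2       # sets $PYTORCHROOT
echo $PYTORCHROOT             # path to the bundled anaconda environment
source activate $PYTORCHROOT  # activate the environment
python -c "import torch; print(torch.__version__)"    # confirm PyTorch imports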
Part 2: Submit Job and Check Status¶
- Make sure you're in the directory that contains the PBS script as well as the pytorch program.
- Submit as normal with qsub <pbs script name>; in this case, qsub pytorch.pbs (the full cycle is sketched after this list).
- Check job status with qstat -t 22182721, replacing the number with the job ID returned after running qsub.
- You can delete the job with qdel 22182721, again replacing the number with the job ID returned after running qsub.
Part 3: Collecting Results¶
- In the directory where you submitted the PBS script, you should see a pytorchTest.out file which contains the results of the job. Use cat or open the file in a text editor to take a look.
- pytorchTest.out should look like this:
---------------------------------------
Begin PBS Prologue Fri Oct 18 17:07:00 EDT 2019
Job ID: 12419.testflight-sched.pace.gatech.edu
User ID: svemuri8
Job name: pytorchTest
Queue: testflight-gpu
End PBS Prologue Fri Oct 18 17:07:00 EDT 2019
---------------------------------------
| epoch 1 | 200/ 2983 batches | lr 20.00 | ms/batch 12.62 | loss 7.63 | ppl 2048.93
| epoch 1 | 400/ 2983 batches | lr 20.00 | ms/batch 11.57 | loss 6.86 | ppl 949.01
| epoch 1 | 600/ 2983 batches | lr 20.00 | ms/batch 11.52 | loss 6.48 | ppl 651.67
| epoch 1 | 800/ 2983 batches | lr 20.00 | ms/batch 11.46 | loss 6.30 | ppl 545.05
| epoch 1 | 1000/ 2983 batches | lr 20.00 | ms/batch 11.46 | loss 6.15 | ppl 469.78
| epoch 1 | 1200/ 2983 batches | lr 20.00 | ms/batch 11.63 | loss 6.07 | ppl 430.61
| epoch 1 | 1400/ 2983 batches | lr 20.00 | ms/batch 11.53 | loss 5.96 | ppl 385.86
| epoch 1 | 1600/ 2983 batches | lr 20.00 | ms/batch 11.63 | loss 5.95 | ppl 383.91
| epoch 1 | 1800/ 2983 batches | lr 20.00 | ms/batch 11.73 | loss 5.80 | ppl 331.57
| epoch 1 | 2000/ 2983 batches | lr 20.00 | ms/batch 11.60 | loss 5.78 | ppl 324.11
| epoch 1 | 2200/ 2983 batches | lr 20.00 | ms/batch 11.61 | loss 5.68 | ppl 292.09
| epoch 1 | 2400/ 2983 batches | lr 20.00 | ms/batch 11.69 | loss 5.68 | ppl 294.27
| epoch 1 | 2600/ 2983 batches | lr 20.00 | ms/batch 11.66 | loss 5.66 | ppl 287.08
| epoch 1 | 2800/ 2983 batches | lr 20.00 | ms/batch 11.62 | loss 5.55 | ppl 257.19
-----------------------------------------------------------------------------------------
| end of epoch 1 | time: 36.23s | valid loss 5.53 | valid ppl 251.96
-----------------------------------------------------------------------------------------
| epoch 2 | 200/ 2983 batches | lr 20.00 | ms/batch 11.80 | loss 5.55 | ppl 256.59
| epoch 2 | 400/ 2983 batches | lr 20.00 | ms/batch 11.72 | loss 5.53 | ppl 253.35
| epoch 2 | 600/ 2983 batches | lr 20.00 | ms/batch 11.85 | loss 5.36 | ppl 213.31
| epoch 2 | 800/ 2983 batches | lr 20.00 | ms/batch 11.80 | loss 5.38 | ppl 217.89
| epoch 2 | 1000/ 2983 batches | lr 20.00 | ms/batch 11.82 | loss 5.36 | ppl 213.11
| epoch 2 | 1200/ 2983 batches | lr 20.00 | ms/batch 11.81 | loss 5.34 | ppl 208.17
| epoch 2 | 1400/ 2983 batches | lr 20.00 | ms/batch 11.81 | loss 5.34 | ppl 207.94
| epoch 2 | 1600/ 2983 batches | lr 20.00 | ms/batch 11.84 | loss 5.39 | ppl 220.03
| epoch 2 | 1800/ 2983 batches | lr 20.00 | ms/batch 11.84 | loss 5.26 | ppl 193.22
| epoch 2 | 2000/ 2983 batches | lr 20.00 | ms/batch 11.80 | loss 5.28 | ppl 195.50
| epoch 2 | 2200/ 2983 batches | lr 20.00 | ms/batch 11.84 | loss 5.18 | ppl 177.08
| epoch 2 | 2400/ 2983 batches | lr 20.00 | ms/batch 11.86 | loss 5.21 | ppl 183.81
| epoch 2 | 2600/ 2983 batches | lr 20.00 | ms/batch 11.78 | loss 5.23 | ppl 186.74
| epoch 2 | 2800/ 2983 batches | lr 20.00 | ms/batch 11.82 | loss 5.14 | ppl 170.28
-----------------------------------------------------------------------------------------
| end of epoch 2 | time: 36.56s | valid loss 5.27 | valid ppl 195.11
-----------------------------------------------------------------------------------------
| epoch 3 | 200/ 2983 batches | lr 20.00 | ms/batch 12.32 | loss 5.19 | ppl 179.34
| epoch 3 | 400/ 2983 batches | lr 20.00 | ms/batch 12.26 | loss 5.20 | ppl 181.92
| epoch 3 | 600/ 2983 batches | lr 20.00 | ms/batch 12.12 | loss 5.02 | ppl 152.10
| epoch 3 | 800/ 2983 batches | lr 20.00 | ms/batch 11.87 | loss 5.07 | ppl 159.61
| epoch 3 | 1000/ 2983 batches | lr 20.00 | ms/batch 11.99 | loss 5.06 | ppl 157.82
| epoch 3 | 1200/ 2983 batches | lr 20.00 | ms/batch 11.96 | loss 5.05 | ppl 156.11
| epoch 3 | 1400/ 2983 batches | lr 20.00 | ms/batch 11.95 | loss 5.08 | ppl 161.37
| epoch 3 | 1600/ 2983 batches | lr 20.00 | ms/batch 11.95 | loss 5.15 | ppl 172.53
| epoch 3 | 1800/ 2983 batches | lr 20.00 | ms/batch 11.95 | loss 5.02 | ppl 150.77
| epoch 3 | 2000/ 2983 batches | lr 20.00 | ms/batch 11.82 | loss 5.04 | ppl 154.34
| epoch 3 | 2200/ 2983 batches | lr 20.00 | ms/batch 11.91 | loss 4.94 | ppl 140.38
| epoch 3 | 2400/ 2983 batches | lr 20.00 | ms/batch 11.85 | loss 4.99 | ppl 147.31
| epoch 3 | 2600/ 2983 batches | lr 20.00 | ms/batch 11.88 | loss 5.01 | ppl 149.24
| epoch 3 | 2800/ 2983 batches | lr 20.00 | ms/batch 12.10 | loss 4.93 | ppl 138.46
-----------------------------------------------------------------------------------------
| end of epoch 3 | time: 37.08s | valid loss 5.16 | valid ppl 175.00
-----------------------------------------------------------------------------------------
| epoch 4 | 200/ 2983 batches | lr 20.00 | ms/batch 12.32 | loss 5.00 | ppl 148.84
| epoch 4 | 400/ 2983 batches | lr 20.00 | ms/batch 11.94 | loss 5.02 | ppl 151.21
| epoch 4 | 600/ 2983 batches | lr 20.00 | ms/batch 11.93 | loss 4.84 | ppl 126.34
| epoch 4 | 800/ 2983 batches | lr 20.00 | ms/batch 11.97 | loss 4.89 | ppl 132.94
| epoch 4 | 1000/ 2983 batches | lr 20.00 | ms/batch 11.92 | loss 4.89 | ppl 132.35
| epoch 4 | 1200/ 2983 batches | lr 20.00 | ms/batch 11.89 | loss 4.89 | ppl 132.45
| epoch 4 | 1400/ 2983 batches | lr 20.00 | ms/batch 11.93 | loss 4.93 | ppl 137.78
| epoch 4 | 1600/ 2983 batches | lr 20.00 | ms/batch 11.98 | loss 5.00 | ppl 148.85
| epoch 4 | 1800/ 2983 batches | lr 20.00 | ms/batch 11.94 | loss 4.87 | ppl 130.03
| epoch 4 | 2000/ 2983 batches | lr 20.00 | ms/batch 11.95 | loss 4.90 | ppl 133.86
| epoch 4 | 2200/ 2983 batches | lr 20.00 | ms/batch 12.09 | loss 4.81 | ppl 122.17
| epoch 4 | 2400/ 2983 batches | lr 20.00 | ms/batch 12.17 | loss 4.85 | ppl 127.52
| epoch 4 | 2600/ 2983 batches | lr 20.00 | ms/batch 11.99 | loss 4.87 | ppl 130.16
| epoch 4 | 2800/ 2983 batches | lr 20.00 | ms/batch 12.05 | loss 4.80 | ppl 121.41
-----------------------------------------------------------------------------------------
| end of epoch 4 | time: 37.12s | valid loss 5.07 | valid ppl 159.65
-----------------------------------------------------------------------------------------
| epoch 5 | 200/ 2983 batches | lr 20.00 | ms/batch 12.38 | loss 4.86 | ppl 129.45
| epoch 5 | 400/ 2983 batches | lr 20.00 | ms/batch 12.02 | loss 4.89 | ppl 133.48
| epoch 5 | 600/ 2983 batches | lr 20.00 | ms/batch 12.01 | loss 4.71 | ppl 110.85
| epoch 5 | 800/ 2983 batches | lr 20.00 | ms/batch 12.07 | loss 4.78 | ppl 118.55
| epoch 5 | 1000/ 2983 batches | lr 20.00 | ms/batch 12.10 | loss 4.77 | ppl 117.38
| epoch 5 | 1200/ 2983 batches | lr 20.00 | ms/batch 12.01 | loss 4.77 | ppl 117.71
| epoch 5 | 1400/ 2983 batches | lr 20.00 | ms/batch 12.14 | loss 4.82 | ppl 123.91
| epoch 5 | 1600/ 2983 batches | lr 20.00 | ms/batch 12.09 | loss 4.89 | ppl 132.68
| epoch 5 | 1800/ 2983 batches | lr 20.00 | ms/batch 12.01 | loss 4.76 | ppl 117.30
| epoch 5 | 2000/ 2983 batches | lr 20.00 | ms/batch 12.15 | loss 4.80 | ppl 120.99
| epoch 5 | 2200/ 2983 batches | lr 20.00 | ms/batch 12.12 | loss 4.69 | ppl 109.35
| epoch 5 | 2400/ 2983 batches | lr 20.00 | ms/batch 12.16 | loss 4.74 | ppl 114.76
| epoch 5 | 2600/ 2983 batches | lr 20.00 | ms/batch 12.16 | loss 4.77 | ppl 117.37
| epoch 5 | 2800/ 2983 batches | lr 20.00 | ms/batch 12.03 | loss 4.70 | ppl 109.65
-----------------------------------------------------------------------------------------
| end of epoch 5 | time: 37.41s | valid loss 5.04 | valid ppl 153.98
-----------------------------------------------------------------------------------------
| epoch 6 | 200/ 2983 batches | lr 20.00 | ms/batch 12.41 | loss 4.76 | ppl 117.17
| epoch 6 | 400/ 2983 batches | lr 20.00 | ms/batch 12.16 | loss 4.80 | ppl 121.60
| epoch 6 | 600/ 2983 batches | lr 20.00 | ms/batch 12.22 | loss 4.61 | ppl 100.98
| epoch 6 | 800/ 2983 batches | lr 20.00 | ms/batch 12.22 | loss 4.68 | ppl 107.31
| epoch 6 | 1000/ 2983 batches | lr 20.00 | ms/batch 12.19 | loss 4.68 | ppl 107.50
| epoch 6 | 1200/ 2983 batches | lr 20.00 | ms/batch 12.13 | loss 4.68 | ppl 107.84
| epoch 6 | 1400/ 2983 batches | lr 20.00 | ms/batch 12.09 | loss 4.73 | ppl 113.68
| epoch 6 | 1600/ 2983 batches | lr 20.00 | ms/batch 12.14 | loss 4.80 | ppl 121.91
| epoch 6 | 1800/ 2983 batches | lr 20.00 | ms/batch 12.15 | loss 4.68 | ppl 108.10
| epoch 6 | 2000/ 2983 batches | lr 20.00 | ms/batch 12.15 | loss 4.71 | ppl 111.39
| epoch 6 | 2200/ 2983 batches | lr 20.00 | ms/batch 12.11 | loss 4.62 | ppl 101.04
| epoch 6 | 2400/ 2983 batches | lr 20.00 | ms/batch 12.18 | loss 4.67 | ppl 106.34
| epoch 6 | 2600/ 2983 batches | lr 20.00 | ms/batch 12.19 | loss 4.69 | ppl 108.52
| epoch 6 | 2800/ 2983 batches | lr 20.00 | ms/batch 12.15 | loss 4.62 | ppl 101.45
-----------------------------------------------------------------------------------------
| end of epoch 6 | time: 37.66s | valid loss 5.00 | valid ppl 148.12
-----------------------------------------------------------------------------------------
=========================================================================================
| End of training | test loss 4.93 | test ppl 138.48
=========================================================================================
---------------------------------------
Begin PBS Epilogue Fri Oct 18 17:11:15 EDT 2019
Job ID: 12419.testflight-sched.pace.gatech.edu
User ID: svemuri8
Job name: pytorchTest
Resources: nodes=1:ppn=2:gpus=1,pmem=2gb,walltime=00:10:00,neednodes=1:ppn=2:gpus=1
Rsrc Used: cput=00:03:43,vmem=17700628kb,walltime=00:04:15,mem=1776028kb,energy_used=0
Queue: testflight-gpu
Nodes:
rich133-k33-14.pace.gatech.edu
End PBS Epilogue Fri Oct 18 17:11:15 EDT 2019
---------------------------------------
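The log is long, so standard text tools can pull out just the parts you care about. A minimal sketch:
grep "valid ppl" pytorchTest.out    # one validation summary line per epoch
tail -n 20 pytorchTest.out          # end of training and the PBS epilogue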
- After the result files are produced, you can move them off the cluster; refer to the file transfer guide for help.
- Congratulations! You successfully ran PyTorch on the cluster.