Updated 2019-10-25

Run PyTorch on the Cluster (RHEL7 Only)

Overview

  • PyTorch is an open source machine learning library based on the Torch library, used for applications such as computer vision and natural language processing.
  • This guide will cover how to run PyTorch on RHEL7 on the Cluster.
  • You can find more information about PyTorch on the PyTorch homepage.

Things to Note

  • This guide only works with the pytorch module on RHEL7.
  • To run this example, you must submit the job to a queue that has access to GPUs and the pytorch module, such as testflight-gpu or ece-gpu.
  • More information about the example below can be found by opening the README located at $PYTORCHROOT/examples/word_language_model/README.md on the Cluster or here.

Walkthrough: Run PyTorch on the Cluster

  • This example trains a multi-layer RNN (Elman, GRU, or LSTM) on a language modeling task.
  • The files used in this example can be found on the Cluster at $PYTORCHROOT/examples/word_language_model.
  • The PBS script can be found here.
  • You can transfer the files to your account on the Cluster to follow along. The file transfer guide may be helpful.

Part 1: The PBS Script

#PBS -N pytorchTest
#PBS -l nodes=1:ppn=2:gpus=1
#PBS -l pmem=2gb
#PBS -l walltime=10:00
#PBS -q testflight-gpu
#PBS -j oe
#PBS -o pytorchTest.out

cd $PBS_O_WORKDIR
module load pytorch/1.2
source activate $PYTORCHROOT

cp -r $PYTORCHROOT/examples/word_language_model .
cd word_language_model
python main.py --cuda --epochs 6

  • The #PBS directives request 10 minutes of walltime, 1 node with 2 cores, 1 GPU, and 2 GB of memory per process (pmem). More on #PBS directives can be found in the PBS guide.
  • $PBS_O_WORKDIR is a variable that represents the directory you submit the PBS script from. Input and output files for the script should be in the same directory as the PBS script.
  • module load pytorch/1.2 loads version 1.2 of PyTorch. To see which PyTorch versions are available, run module avail pytorch and load the one you want.
  • source activate $PYTORCHROOT activates the Anaconda environment located at $PYTORCHROOT.
  • The next two instructions will copy the word_language_model example directory to your working directory and cd into it.
  • python main.py --cuda --epochs 6 will train an LSTM on Wikitext-2 with CUDA, with an upper epoch limit of 6.
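Because main.py is run with --cuda, it can help to confirm that PyTorch actually sees a GPU before training starts. The following is a minimal sketch, not part of the example directory: the function name describe_gpu and the idea of saving it as check_gpu.py (a hypothetical file name) and calling it in the PBS script before main.py are assumptions made for illustration.

```python
import importlib.util


def describe_gpu():
    """Return a one-line summary of whether PyTorch can see a CUDA device."""
    # If torch cannot be found, the pytorch module or environment is probably not loaded.
    if importlib.util.find_spec("torch") is None:
        return "torch is not importable; was the pytorch module loaded?"

    import torch

    # No visible device usually means the job did not land on a GPU node.
    if not torch.cuda.is_available():
        return "torch %s loaded, but no CUDA device is visible" % torch.__version__

    return "torch %s sees %d CUDA device(s); first is %s" % (
        torch.__version__,
        torch.cuda.device_count(),
        torch.cuda.get_device_name(0),
    )


if __name__ == "__main__":
    print(describe_gpu())
```

If the summary reports that no CUDA device is visible, the job was most likely submitted to a queue without GPUs or without gpus=1 in the resource request.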

Part 2: Submit Job and Check Status

  • Make sure you're in the directory that contains the PBS script as well as the PyTorch program.
  • Submit as normal with qsub <pbs script name>. In this case, run qsub pytorch.pbs.
  • Check job status with qstat -t 22182721, replacing the number with the job ID returned by qsub.
  • You can delete the job with qdel 22182721, again replacing the number with the job ID returned by qsub.
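The submit-and-check steps can also be scripted. The sketch below assumes that qsub prints a job ID of the same form shown in the job output in this guide (for example 12419.testflight-sched.pace.gatech.edu) and that the script is named pytorch.pbs. The helper names numeric_job_id and submit_and_check are hypothetical and exist only for this illustration.

```python
import shutil
import subprocess


def numeric_job_id(qsub_output):
    """Return the leading numeric part of a job ID such as '12419.some-scheduler-host'."""
    job_id = qsub_output.strip().split(".", 1)[0]
    if not job_id.isdigit():
        # Assumption: plain (non-array) jobs start with a number; anything else is rejected.
        raise ValueError("unexpected qsub output: %r" % qsub_output)
    return job_id


def submit_and_check(script="pytorch.pbs"):
    """Submit the PBS script with qsub, then show its status with qstat -t."""
    result = subprocess.run(
        ["qsub", script],
        check=True,
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    job_id = numeric_job_id(result.stdout)
    subprocess.run(["qstat", "-t", job_id], check=True)
    return job_id


if __name__ == "__main__":
    # Only attempt a submission where the scheduler commands exist.
    if shutil.which("qsub") is None:
        print("qsub not found; run this on the Cluster")
    else:
        print("submitted job", submit_and_check())
```

The returned job ID is the number to pass to qdel if the job needs to be deleted.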

Part 3: Collecting Results

  • In the directory where you submitted the PBS script, you should see a pytorchTest.out file which contains the results of the job. Use cat or open the file in a text editor to take a look.
  • pytorchTest.out should look like this:
---------------------------------------
Begin PBS Prologue Fri Oct 18 17:07:00 EDT 2019
Job ID:     12419.testflight-sched.pace.gatech.edu
User ID:    svemuri8
Job name:   pytorchTest
Queue:      testflight-gpu
End PBS Prologue Fri Oct 18 17:07:00 EDT 2019
---------------------------------------
| epoch   1 |   200/ 2983 batches | lr 20.00 | ms/batch 12.62 | loss  7.63 | ppl  2048.93
| epoch   1 |   400/ 2983 batches | lr 20.00 | ms/batch 11.57 | loss  6.86 | ppl   949.01
| epoch   1 |   600/ 2983 batches | lr 20.00 | ms/batch 11.52 | loss  6.48 | ppl   651.67
| epoch   1 |   800/ 2983 batches | lr 20.00 | ms/batch 11.46 | loss  6.30 | ppl   545.05
| epoch   1 |  1000/ 2983 batches | lr 20.00 | ms/batch 11.46 | loss  6.15 | ppl   469.78
| epoch   1 |  1200/ 2983 batches | lr 20.00 | ms/batch 11.63 | loss  6.07 | ppl   430.61
| epoch   1 |  1400/ 2983 batches | lr 20.00 | ms/batch 11.53 | loss  5.96 | ppl   385.86
| epoch   1 |  1600/ 2983 batches | lr 20.00 | ms/batch 11.63 | loss  5.95 | ppl   383.91
| epoch   1 |  1800/ 2983 batches | lr 20.00 | ms/batch 11.73 | loss  5.80 | ppl   331.57
| epoch   1 |  2000/ 2983 batches | lr 20.00 | ms/batch 11.60 | loss  5.78 | ppl   324.11
| epoch   1 |  2200/ 2983 batches | lr 20.00 | ms/batch 11.61 | loss  5.68 | ppl   292.09
| epoch   1 |  2400/ 2983 batches | lr 20.00 | ms/batch 11.69 | loss  5.68 | ppl   294.27
| epoch   1 |  2600/ 2983 batches | lr 20.00 | ms/batch 11.66 | loss  5.66 | ppl   287.08
| epoch   1 |  2800/ 2983 batches | lr 20.00 | ms/batch 11.62 | loss  5.55 | ppl   257.19
-----------------------------------------------------------------------------------------
| end of epoch   1 | time: 36.23s | valid loss  5.53 | valid ppl   251.96
-----------------------------------------------------------------------------------------
| epoch   2 |   200/ 2983 batches | lr 20.00 | ms/batch 11.80 | loss  5.55 | ppl   256.59
| epoch   2 |   400/ 2983 batches | lr 20.00 | ms/batch 11.72 | loss  5.53 | ppl   253.35
| epoch   2 |   600/ 2983 batches | lr 20.00 | ms/batch 11.85 | loss  5.36 | ppl   213.31
| epoch   2 |   800/ 2983 batches | lr 20.00 | ms/batch 11.80 | loss  5.38 | ppl   217.89
| epoch   2 |  1000/ 2983 batches | lr 20.00 | ms/batch 11.82 | loss  5.36 | ppl   213.11
| epoch   2 |  1200/ 2983 batches | lr 20.00 | ms/batch 11.81 | loss  5.34 | ppl   208.17
| epoch   2 |  1400/ 2983 batches | lr 20.00 | ms/batch 11.81 | loss  5.34 | ppl   207.94
| epoch   2 |  1600/ 2983 batches | lr 20.00 | ms/batch 11.84 | loss  5.39 | ppl   220.03
| epoch   2 |  1800/ 2983 batches | lr 20.00 | ms/batch 11.84 | loss  5.26 | ppl   193.22
| epoch   2 |  2000/ 2983 batches | lr 20.00 | ms/batch 11.80 | loss  5.28 | ppl   195.50
| epoch   2 |  2200/ 2983 batches | lr 20.00 | ms/batch 11.84 | loss  5.18 | ppl   177.08
| epoch   2 |  2400/ 2983 batches | lr 20.00 | ms/batch 11.86 | loss  5.21 | ppl   183.81
| epoch   2 |  2600/ 2983 batches | lr 20.00 | ms/batch 11.78 | loss  5.23 | ppl   186.74
| epoch   2 |  2800/ 2983 batches | lr 20.00 | ms/batch 11.82 | loss  5.14 | ppl   170.28
-----------------------------------------------------------------------------------------
| end of epoch   2 | time: 36.56s | valid loss  5.27 | valid ppl   195.11
-----------------------------------------------------------------------------------------
| epoch   3 |   200/ 2983 batches | lr 20.00 | ms/batch 12.32 | loss  5.19 | ppl   179.34
| epoch   3 |   400/ 2983 batches | lr 20.00 | ms/batch 12.26 | loss  5.20 | ppl   181.92
| epoch   3 |   600/ 2983 batches | lr 20.00 | ms/batch 12.12 | loss  5.02 | ppl   152.10
| epoch   3 |   800/ 2983 batches | lr 20.00 | ms/batch 11.87 | loss  5.07 | ppl   159.61
| epoch   3 |  1000/ 2983 batches | lr 20.00 | ms/batch 11.99 | loss  5.06 | ppl   157.82
| epoch   3 |  1200/ 2983 batches | lr 20.00 | ms/batch 11.96 | loss  5.05 | ppl   156.11
| epoch   3 |  1400/ 2983 batches | lr 20.00 | ms/batch 11.95 | loss  5.08 | ppl   161.37
| epoch   3 |  1600/ 2983 batches | lr 20.00 | ms/batch 11.95 | loss  5.15 | ppl   172.53
| epoch   3 |  1800/ 2983 batches | lr 20.00 | ms/batch 11.95 | loss  5.02 | ppl   150.77
| epoch   3 |  2000/ 2983 batches | lr 20.00 | ms/batch 11.82 | loss  5.04 | ppl   154.34
| epoch   3 |  2200/ 2983 batches | lr 20.00 | ms/batch 11.91 | loss  4.94 | ppl   140.38
| epoch   3 |  2400/ 2983 batches | lr 20.00 | ms/batch 11.85 | loss  4.99 | ppl   147.31
| epoch   3 |  2600/ 2983 batches | lr 20.00 | ms/batch 11.88 | loss  5.01 | ppl   149.24
| epoch   3 |  2800/ 2983 batches | lr 20.00 | ms/batch 12.10 | loss  4.93 | ppl   138.46
-----------------------------------------------------------------------------------------
| end of epoch   3 | time: 37.08s | valid loss  5.16 | valid ppl   175.00
-----------------------------------------------------------------------------------------
| epoch   4 |   200/ 2983 batches | lr 20.00 | ms/batch 12.32 | loss  5.00 | ppl   148.84
| epoch   4 |   400/ 2983 batches | lr 20.00 | ms/batch 11.94 | loss  5.02 | ppl   151.21
| epoch   4 |   600/ 2983 batches | lr 20.00 | ms/batch 11.93 | loss  4.84 | ppl   126.34
| epoch   4 |   800/ 2983 batches | lr 20.00 | ms/batch 11.97 | loss  4.89 | ppl   132.94
| epoch   4 |  1000/ 2983 batches | lr 20.00 | ms/batch 11.92 | loss  4.89 | ppl   132.35
| epoch   4 |  1200/ 2983 batches | lr 20.00 | ms/batch 11.89 | loss  4.89 | ppl   132.45
| epoch   4 |  1400/ 2983 batches | lr 20.00 | ms/batch 11.93 | loss  4.93 | ppl   137.78
| epoch   4 |  1600/ 2983 batches | lr 20.00 | ms/batch 11.98 | loss  5.00 | ppl   148.85
| epoch   4 |  1800/ 2983 batches | lr 20.00 | ms/batch 11.94 | loss  4.87 | ppl   130.03
| epoch   4 |  2000/ 2983 batches | lr 20.00 | ms/batch 11.95 | loss  4.90 | ppl   133.86
| epoch   4 |  2200/ 2983 batches | lr 20.00 | ms/batch 12.09 | loss  4.81 | ppl   122.17
| epoch   4 |  2400/ 2983 batches | lr 20.00 | ms/batch 12.17 | loss  4.85 | ppl   127.52
| epoch   4 |  2600/ 2983 batches | lr 20.00 | ms/batch 11.99 | loss  4.87 | ppl   130.16
| epoch   4 |  2800/ 2983 batches | lr 20.00 | ms/batch 12.05 | loss  4.80 | ppl   121.41
-----------------------------------------------------------------------------------------
| end of epoch   4 | time: 37.12s | valid loss  5.07 | valid ppl   159.65
-----------------------------------------------------------------------------------------
| epoch   5 |   200/ 2983 batches | lr 20.00 | ms/batch 12.38 | loss  4.86 | ppl   129.45
| epoch   5 |   400/ 2983 batches | lr 20.00 | ms/batch 12.02 | loss  4.89 | ppl   133.48
| epoch   5 |   600/ 2983 batches | lr 20.00 | ms/batch 12.01 | loss  4.71 | ppl   110.85
| epoch   5 |   800/ 2983 batches | lr 20.00 | ms/batch 12.07 | loss  4.78 | ppl   118.55
| epoch   5 |  1000/ 2983 batches | lr 20.00 | ms/batch 12.10 | loss  4.77 | ppl   117.38
| epoch   5 |  1200/ 2983 batches | lr 20.00 | ms/batch 12.01 | loss  4.77 | ppl   117.71
| epoch   5 |  1400/ 2983 batches | lr 20.00 | ms/batch 12.14 | loss  4.82 | ppl   123.91
| epoch   5 |  1600/ 2983 batches | lr 20.00 | ms/batch 12.09 | loss  4.89 | ppl   132.68
| epoch   5 |  1800/ 2983 batches | lr 20.00 | ms/batch 12.01 | loss  4.76 | ppl   117.30
| epoch   5 |  2000/ 2983 batches | lr 20.00 | ms/batch 12.15 | loss  4.80 | ppl   120.99
| epoch   5 |  2200/ 2983 batches | lr 20.00 | ms/batch 12.12 | loss  4.69 | ppl   109.35
| epoch   5 |  2400/ 2983 batches | lr 20.00 | ms/batch 12.16 | loss  4.74 | ppl   114.76
| epoch   5 |  2600/ 2983 batches | lr 20.00 | ms/batch 12.16 | loss  4.77 | ppl   117.37
| epoch   5 |  2800/ 2983 batches | lr 20.00 | ms/batch 12.03 | loss  4.70 | ppl   109.65
-----------------------------------------------------------------------------------------
| end of epoch   5 | time: 37.41s | valid loss  5.04 | valid ppl   153.98
-----------------------------------------------------------------------------------------
| epoch   6 |   200/ 2983 batches | lr 20.00 | ms/batch 12.41 | loss  4.76 | ppl   117.17
| epoch   6 |   400/ 2983 batches | lr 20.00 | ms/batch 12.16 | loss  4.80 | ppl   121.60
| epoch   6 |   600/ 2983 batches | lr 20.00 | ms/batch 12.22 | loss  4.61 | ppl   100.98
| epoch   6 |   800/ 2983 batches | lr 20.00 | ms/batch 12.22 | loss  4.68 | ppl   107.31
| epoch   6 |  1000/ 2983 batches | lr 20.00 | ms/batch 12.19 | loss  4.68 | ppl   107.50
| epoch   6 |  1200/ 2983 batches | lr 20.00 | ms/batch 12.13 | loss  4.68 | ppl   107.84
| epoch   6 |  1400/ 2983 batches | lr 20.00 | ms/batch 12.09 | loss  4.73 | ppl   113.68
| epoch   6 |  1600/ 2983 batches | lr 20.00 | ms/batch 12.14 | loss  4.80 | ppl   121.91
| epoch   6 |  1800/ 2983 batches | lr 20.00 | ms/batch 12.15 | loss  4.68 | ppl   108.10
| epoch   6 |  2000/ 2983 batches | lr 20.00 | ms/batch 12.15 | loss  4.71 | ppl   111.39
| epoch   6 |  2200/ 2983 batches | lr 20.00 | ms/batch 12.11 | loss  4.62 | ppl   101.04
| epoch   6 |  2400/ 2983 batches | lr 20.00 | ms/batch 12.18 | loss  4.67 | ppl   106.34
| epoch   6 |  2600/ 2983 batches | lr 20.00 | ms/batch 12.19 | loss  4.69 | ppl   108.52
| epoch   6 |  2800/ 2983 batches | lr 20.00 | ms/batch 12.15 | loss  4.62 | ppl   101.45
-----------------------------------------------------------------------------------------
| end of epoch   6 | time: 37.66s | valid loss  5.00 | valid ppl   148.12
-----------------------------------------------------------------------------------------
=========================================================================================
| End of training | test loss  4.93 | test ppl   138.48
=========================================================================================
---------------------------------------
Begin PBS Epilogue Fri Oct 18 17:11:15 EDT 2019
Job ID:     12419.testflight-sched.pace.gatech.edu
User ID:    svemuri8
Job name:   pytorchTest
Resources:  nodes=1:ppn=2:gpus=1,pmem=2gb,walltime=00:10:00,neednodes=1:ppn=2:gpus=1
Rsrc Used:  cput=00:03:43,vmem=17700628kb,walltime=00:04:15,mem=1776028kb,energy_used=0
Queue:      testflight-gpu
Nodes:     
rich133-k33-14.pace.gatech.edu
End PBS Epilogue Fri Oct 18 17:11:15 EDT 2019
---------------------------------------
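To compare epochs without reading the whole log, the end-of-epoch lines can be pulled out of pytorchTest.out. The following is a minimal sketch that assumes the log format shown above; the function name parse_epochs is hypothetical. It also prints exp(valid loss) next to each logged perplexity, since the logged ppl values are consistent with perplexity being the exponential of the loss (small differences come from the loss being rounded to two decimals in the log).

```python
import math
import re

# Matches lines such as:
# | end of epoch   1 | time: 36.23s | valid loss  5.53 | valid ppl   251.96
EPOCH_RE = re.compile(
    r"\| end of epoch\s+(\d+) \| time:\s*([\d.]+)s "
    r"\| valid loss\s+([\d.]+) \| valid ppl\s+([\d.]+)"
)


def parse_epochs(text):
    """Return one dict per end-of-epoch summary line found in the log text."""
    rows = []
    for match in EPOCH_RE.finditer(text):
        rows.append({
            "epoch": int(match.group(1)),
            "seconds": float(match.group(2)),
            "valid_loss": float(match.group(3)),
            "valid_ppl": float(match.group(4)),
        })
    return rows


if __name__ == "__main__":
    # Two summary lines copied from the sample output; on the Cluster,
    # read the text from pytorchTest.out instead.
    sample = (
        "| end of epoch   1 | time: 36.23s | valid loss  5.53 | valid ppl   251.96\n"
        "| end of epoch   2 | time: 36.56s | valid loss  5.27 | valid ppl   195.11\n"
    )
    for row in parse_epochs(sample):
        approx_ppl = math.exp(row["valid_loss"])
        print(row["epoch"], row["seconds"], row["valid_ppl"], round(approx_ppl, 2))
```

In the sample run, validation perplexity falls every epoch (251.96 after epoch 1 down to 148.12 after epoch 6), which is the trend to look for when checking that training behaved as expected.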
  • After the result files are produced, you can move the files off the Cluster. Refer to the file transfer guide for help.
  • Congratulations! You successfully ran PyTorch on the cluster.