Updated 2019-10-22

Use TensorFlow on Cluster

Overview: TensorFlow on the cluster

  • GPUs can greatly speed up TensorFlow and the training of neural networks in general. In addition, parallelism across multiple GPUs can be achieved using two main techniques:
    • data parallelism
    • model parallelism
  • However, this guide focuses on using 1 GPU. Model or data parallelism is coded by the user into their TensorFlow script. On the cluster side, you can set the number of GPUs to 2 in the PBS script to achieve parallelism.
  • User Side: The user programs whatever parallelism they want into their TensorFlow models.
  • Cluster Side: We provide multiple versions of TensorFlow along with GPU nodes. This guide will show you how to write a PBS script to submit your TensorFlow job on the cluster.
  • Make sure you submit to a queue with GPU nodes, such as force-gpu (check your available queues with pace-whoami).
  • If you need the name of the GPUs, system info is printed to the .out file whenever a TensorFlow script is run. Submit this small TensorFlow script (Python) with a PBS script similar to the one in this guide to see device info, or just look at the top of an .out file after you run any TensorFlow script.
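import tensorflow as tf  # TensorFlow 1.x API, matching the module loaded in this guide
hello = tf.constant('hello tensorflow')
sess = tf.Session()  # creating the session logs the available devices
print(sess.run(hello))  # run the constant so the job produces visible output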
  • It will print the names of the available devices to the .out file. Observe that the GPU is shown as GPU:0:
TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:81:00.0, compute capability: 6.0)
  • PACE provides powerful Nvidia P100 GPUs, available through the force-gpu queue. Each node on this cluster contains 2 Nvidia GPUs.
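If you would rather pull the GPU name out of an .out file than read it by eye, the device line shown above can be parsed with a regular expression. The following is a minimal sketch, assuming the log line has exactly the format printed in this guide. parse_device_line is a hypothetical helper name, not part of TensorFlow, and other TensorFlow versions may word this line differently.

```python
import re

# Pattern for the device line format shown in this guide
# (assumption: other TensorFlow versions may format this line differently).
DEVICE_LINE = re.compile(
    r"TensorFlow device \((?P<device>/device:GPU:\d+)\) -> "
    r"\(device: (?P<index>\d+), name: (?P<name>[^,]+), "
    r"pci bus id: (?P<pci_bus_id>[^,]+), "
    r"compute capability: (?P<compute_capability>[\d.]+)\)"
)


def parse_device_line(line):
    """Return the GPU fields found in one log line, or None if there are none."""
    match = DEVICE_LINE.search(line)
    return match.groupdict() if match else None


# The sample line is copied from this guide.
sample = (
    "TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla P100-PCIE-16GB, "
    "pci bus id: 0000:81:00.0, compute capability: 6.0)"
)
info = parse_device_line(sample)
print(info["device"], info["name"])
```

All fields are returned as strings, so convert index or compute_capability yourself if you need numbers.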

Walkthrough: Running an Example TensorFlow Script

  • After logging in to your account on the cluster, you can follow along with this guide. The TensorFlow script is a slightly modified version of Google's text classification with TensorFlow and Keras Guide. The neural net is trained on IMDB movie reviews and is designed to predict whether a given movie review is positive or negative. The files used in this guide are the PBS script (tf_imdb.pbs), the TensorFlow script (imdb_tf.py), and the dataset (imdb_dataset.pickle).
  • The imdb_tf.py script above should take only about a minute to train the model. The output file should contain the training information and print the loss and accuracy of the model against the testing set. In addition, the script should save a graph of the accuracy of the model on the training set vs the validation set, as shown below. This chart illustrates the concept of overfitting: the model starts to recognize patterns that are apparent in the training set but not generalizable to data it hasn't seen before.

Screenshot
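The overfitting pattern in the chart can also be checked numerically. The sketch below assumes you have the per-epoch training and validation accuracies as two Python lists. best_epoch and generalization_gap are hypothetical helper names, and the numbers in the usage lines are made up for illustration; they are not results from imdb_tf.py.

```python
def best_epoch(val_accuracy):
    """Return the 1-based epoch with the highest validation accuracy."""
    if not val_accuracy:
        raise ValueError("val_accuracy must not be empty")
    # max() returns the first index on ties, i.e. the earliest best epoch.
    return max(range(len(val_accuracy)), key=lambda i: val_accuracy[i]) + 1


def generalization_gap(train_accuracy, val_accuracy):
    """Per-epoch difference between training and validation accuracy."""
    return [t - v for t, v in zip(train_accuracy, val_accuracy)]


# Made-up illustrative values, NOT output from imdb_tf.py.
train_acc = [0.70, 0.85, 0.92, 0.96, 0.99]
val_acc = [0.68, 0.82, 0.87, 0.86, 0.85]

print(best_epoch(val_acc))
gaps = generalization_gap(train_acc, val_acc)
print(gaps[-1] > gaps[0])
```

Training accuracy that keeps rising after validation accuracy has peaked, so that the gap widens, is the signature of overfitting.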

  • Transfer the files to the cluster using any of the file transfer techniques to follow along.

Part 1: Breaking down the PBS Script

#PBS -N tensorflow_test
#PBS -l nodes=1:ppn=8:teslap100:gpus=1:exclusive_process
#PBS -l walltime=5:00
#PBS -q force-gpu
#PBS -j oe
#PBS -o tf_imdb_results.out

cd ~/data/tensorflow
module purge
module load tensorflow/1.5.0-cuda8.0.44
python imdb_tf.py
  • After you have logged on to PACE, examine the PBS script shown above, which you have transferred to the cluster, using cat tf_imdb.pbs. It should look the same.
  • More info on the #PBS lines (PBS directives) can be found in the PBS Scripting Guide.
  • Second line (requesting a GPU): This directive looks different from most because it requests a GPU. In this case, 1 node is requested, with 8 cores, along with a single Nvidia Tesla P100 GPU. exclusive_process is added to make sure that the GPU is used for your job only and is not shared with other jobs.
  • The first step is to tell the cluster to enter the directory where the TensorFlow script is located (in this case ~/data/tensorflow, a directory I made in my data directory). Use whatever directory holds the TensorFlow script to be run (and the imdb_dataset.pickle data). Any files besides the .out file will show up in this directory, including the matplotlib graph.
  • Make sure any dependencies (like data) are in the same directory you told the cluster to enter.
  • Then, the preferred version of TensorFlow is loaded. Many versions of TensorFlow are available on the cluster. To find the available versions of a piece of software, run
module avail
module avail <software name> #ex: module avail tensorflow would show available versions of tensorflow
  • To load a piece of software, include the module load <module name> line in the PBS script. You must load the software's module before you run the software in the PBS script.
  • The final line of the computation block runs the Python script. Generated files (such as the chart) will appear in the directory specified (~/data/tensorflow), and the output is automatically recorded in the .out file (which will show up in the directory you submitted the script from).
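To make the structure of the GPU resource request concrete, the sketch below splits the nodes directive from this guide into its colon-separated parts. parse_resource_request is a hypothetical helper written for illustration, not a PBS tool. It assumes the nodes=... form used here and is not meant for the walltime line, whose value itself contains a colon.

```python
def parse_resource_request(directive):
    """Split a '#PBS -l nodes=...' line into key=value pairs and bare properties."""
    prefix = "#PBS -l "
    if not directive.startswith(prefix):
        raise ValueError("not a '#PBS -l' directive: %r" % directive)
    request = {"properties": []}
    for field in directive[len(prefix):].strip().split(":"):
        if "=" in field:
            # key=value fields such as nodes=1, ppn=8, gpus=1
            key, value = field.split("=", 1)
            request[key] = int(value) if value.isdigit() else value
        else:
            # bare fields such as teslap100 or exclusive_process
            request["properties"].append(field)
    return request


# The directive is copied from the PBS script in this guide.
request = parse_resource_request(
    "#PBS -l nodes=1:ppn=8:teslap100:gpus=1:exclusive_process"
)
print(request["nodes"], request["ppn"], request["gpus"], request["properties"])
```

Read this way, the line asks for 1 node, 8 cores on that node, and 1 GPU, with teslap100 and exclusive_process as the two bare fields.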

Part 2: Submit the PBS Script and Collect Results

  • Make sure you're in the folder where the TensorFlow PBS script is located, and run
qsub <scriptName.pbs>  #ex: qsub tf_imdb.pbs for this guide
  • If successful, this will print something like 2180446.shared-sched.pace.gatech.edu
  • The number at the beginning is the job ID, which is useful for checking the predicted wait time in the queue or the job status.
  • After a couple of seconds, find the estimated wait time in the queue with
showstart <jobID>
  • Check job status with
qstat <jobID>  # or: qstat -u someuser3 -n
  • For more ways to check status, how to cancel a job, and more useful commands, check out the command cheatsheet.
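Because showstart and qstat both take the job ID, it can help to extract it from the line that qsub prints. The sketch below is a minimal example, assuming the output has the <number>.<server name> shape seen in this guide. job_id_from_qsub is a hypothetical helper name, and the sample string is the job ID line from the example output in this guide.

```python
def job_id_from_qsub(output):
    """Return the numeric job ID at the start of a qsub output line."""
    job_id = output.strip().split(".", 1)[0]
    if not job_id.isdigit():
        raise ValueError("unexpected qsub output: %r" % output)
    return job_id


print(job_id_from_qsub("21872475.shared-sched.pace.gatech.edu\n"))
```

The helper returns the ID as a string so it can be passed directly to showstart or qstat.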
  • To find the output file, type ls and look for the output file you named in the PBS script, in this case tf_imdb_results.out
  • To see the contents of the out file, you can open it up in an editor, or run
cat <output file name>  #ex: cat tf_imdb_results.out
  • The output for this example should contain the training info, along with the overall loss and accuracy at the end.
  • The end of tf_imdb_results.out will look like this:
25000/25000 [==============================]25000/25000 [==============================] - 1s 29us/step

Loss, accuracy: [0.31646855477809904, 0.8754]
---------------------------------------
Begin PBS Epilogue Thu Aug  2 14:45:27 EDT 2018
Job ID:     21872475.shared-sched.pace.gatech.edu
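If you want the final loss and accuracy as numbers rather than text, the "Loss, accuracy:" line can be read back with a few lines of Python. This is a minimal sketch, assuming the line has exactly the format shown above. read_loss_and_accuracy is a hypothetical helper name, and the sample lines are copied from the output in this guide.

```python
import ast

PREFIX = "Loss, accuracy:"


def read_loss_and_accuracy(lines):
    """Return (loss, accuracy) from the last 'Loss, accuracy:' line, or None."""
    result = None
    for line in lines:
        if line.startswith(PREFIX):
            # The text after the prefix is a Python list literal such as [0.31, 0.87].
            loss, accuracy = ast.literal_eval(line[len(PREFIX):].strip())
            result = (loss, accuracy)
    return result


sample_lines = [
    "Loss, accuracy: [0.31646855477809904, 0.8754]",
    "---------------------------------------",
    "Begin PBS Epilogue Thu Aug  2 14:45:27 EDT 2018",
]
loss, accuracy = read_loss_and_accuracy(sample_lines)
print(loss, accuracy)
```

Since the function accepts any iterable of lines, an open file object for tf_imdb_results.out can be passed in place of sample_lines.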
  • The TensorFlow script should also create this training vs validation accuracy chart, which will appear as training_accuracy.png in the directory where the TensorFlow script and the data are stored (in my case, ~/data/tensorflow):

Screenshot

  • To move output files off the cluster, see the storage and moving files guide.
  • Congratulations! You have successfully run a TensorFlow script on the cluster.