Updated 2019-10-28

TensorFlow on RHe7

Overview

  • TensorFlow is a popular open source library for machine learning
  • Running it works the same way as on RHe6, except that you now load a different module

Things to Note

  • To use tensorflow-gpu/2.0.0, as this guide does, you must submit the job to a GPU-enabled queue such as testflight-gpu or ece-gpu. force-gpu has recently been updated to RHe7, so it should work as well.
  • Run pace-whoami to see what queues you have access to

Walkthrough

Part 1: PBS Script

Important

  • Make sure you load tensorflow-gpu/2.0.0. This is the difference between running tensorflow on RHe6 and RHe7. tensorflow-gpu/2.0.0 isn't available on RHe6.
#PBS -N tensorflow_test
#PBS -l nodes=1:ppn=4:gpus=1:exclusive_process
#PBS -l walltime=5:00
#PBS -q force-gpu
#PBS -j oe
#PBS -o tf_imdb_results.out

cd $PBS_O_WORKDIR
module purge
module load tensorflow-gpu/2.0.0
python imdb_tf.py
  • The above script can be found here: tensorflow_rhe7.pbs
  • The #PBS directives request 5 minutes of walltime, 1 node with 4 cores, and 1 gpu. More on #PBS directives can be found in the PBS guide
  • $PBS_O_WORKDIR is a variable that holds the directory you submit the PBS script from. The script's input and output files are expected in that same directory, so make sure the data file and the Python script are in the folder where you submit the PBS script.
  • module load tensorflow-gpu/2.0.0 loads version 2.0.0 of TensorFlow. To see which TensorFlow versions are available, run module avail tensorflow, and load the one you want.
  • python imdb_tf.py runs the TensorFlow script
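
Before running a longer job, it can be worth confirming that the loaded module actually sees a GPU. The following is a minimal sketch and not part of the original walkthrough: gpu_report is a hypothetical helper name, and the code assumes the TensorFlow 2.0 API, where the device query lives under tf.config.experimental. It could be saved as its own script and run with python from the PBS script in place of imdb_tf.py.

```python
def gpu_report():
    """Return a small dict describing the TensorFlow install and visible GPUs."""
    try:
        import tensorflow as tf
    except ImportError:
        # The tensorflow-gpu module was probably not loaded in this shell or job.
        return {"tensorflow": None, "gpus": []}
    # In TensorFlow 2.0 the device query lives under tf.config.experimental.
    gpus = tf.config.experimental.list_physical_devices("GPU")
    return {"tensorflow": tf.__version__, "gpus": [g.name for g in gpus]}


if __name__ == "__main__":
    report = gpu_report()
    print("TensorFlow version:", report["tensorflow"])
    print("Visible GPUs:", report["gpus"] or "none")
```

An empty GPU list on a GPU queue would suggest checking the gpus=1 request in the #PBS directives and the module that was loaded.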

Part 2: Submit Job and Check Status

  • Make sure you're in the directory that contains both the PBS script and the TensorFlow script
  • Submit as normal, with qsub <pbs script name>. In this case qsub tensorflow_rhe7.pbs
  • Check job status with qstat -t 22182721, replacing the number with the job id returned after running qsub
  • You can delete the job with qdel 22182721, again replacing the number with the job id returned after running qsub
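
If you script your submissions, the numeric job id used with qstat and qdel can be pulled out of the full identifier. The sketch below is an illustration only: job_number is a hypothetical helper, and it assumes the identifier has the number.server form, such as 21872475.shared-sched.pace.gatech.edu.

```python
import re


def job_number(qsub_output):
    """Return the leading numeric job id from a job identifier, or None."""
    match = re.match(r"\s*(\d+)", qsub_output)
    return match.group(1) if match else None


if __name__ == "__main__":
    # Example identifier in the assumed number.server form.
    print(job_number("21872475.shared-sched.pace.gatech.edu"))
```

The function returns None when the text does not start with a number, which makes a failed submission easy to detect before calling qstat.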

Part 3: Collecting Results

  • In the directory where you submitted the PBS script, you should see a tf_imdb_results.out file which contains the results of the job. Use cat or open the file in a text editor to take a look.
  • tf_imdb_results.out should look like this:
25000/25000 [==============================]25000/25000 [==============================] - 1s 29us/step

Loss, accuracy: [0.31646855477809904, 0.8754]
---------------------------------------
Begin PBS Epilogue Thu Aug  2 14:45:27 EDT 2018
Job ID:     21872475.shared-sched.pace.gatech.edu
  • The TensorFlow script should also create a training vs accuracy chart, which will appear as training_accuracy.png in the directory you submitted the PBS script from
  • After the result files are produced, you can move them off the cluster. Refer to the file transfer guide for help.
  • Congratulations! You successfully ran Tensorflow on RHe7.
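
As an optional follow-up, the Loss, accuracy line in tf_imdb_results.out can be read programmatically, for example to compare several runs. This is a minimal sketch under the assumption that the line keeps the format shown in the sample output; parse_loss_accuracy is a hypothetical helper name, not something the tutorial script provides.

```python
import ast
import re


def parse_loss_accuracy(text):
    """Return (loss, accuracy) from a 'Loss, accuracy: [...]' line, or None."""
    match = re.search(r"Loss, accuracy:\s*(\[[^\]]*\])", text)
    if match is None:
        return None
    # The bracketed part is a plain Python list literal of two numbers.
    loss, accuracy = ast.literal_eval(match.group(1))
    return float(loss), float(accuracy)


if __name__ == "__main__":
    # Sample line copied from the results shown in this guide.
    sample = "Loss, accuracy: [0.31646855477809904, 0.8754]"
    print(parse_loss_accuracy(sample))
```

To use it on a real run, pass the contents of tf_imdb_results.out to the function instead of the sample string.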