Updated 2021-10-20

Use TensorFlow on Cluster

Overview: TensorFlow on the cluster

  • GPUs can greatly speed up TensorFlow and the training of neural networks in general. In addition, parallelism across multiple GPUs can be achieved using two main techniques:
    • data parallelism
    • model parallelism
  • However, this guide focuses on using 1 GPU. Model or data parallelism is coded by the user into their TensorFlow script. On the cluster side, you can set the number of GPUs to 2 in the PBS script to achieve parallelism.
  • User Side: The user programs whatever parallelism they want into their TensorFlow models.
  • Cluster Side: We provide multiple versions of TensorFlow along with GPU nodes. This guide will show you how to write a PBS script to submit your TensorFlow job on the cluster.
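
To make the idea of data parallelism more concrete, here is a minimal pure-Python sketch that uses neither TensorFlow nor a GPU. It assumes a hypothetical one-parameter linear model y = w * x with mean squared error, and the function names mse_gradient and data_parallel_gradient are made up for this illustration. Each "replica" computes the gradient on its own equal-sized shard of the batch, and the shard gradients are then averaged, which gives the same value as the gradient on the full batch.

```python
def mse_gradient(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def data_parallel_gradient(w, xs, ys, num_replicas):
    """Split the batch into equal shards, one per replica, and average the shard gradients."""
    if len(xs) % num_replicas != 0:
        raise ValueError("batch size must be divisible by num_replicas")
    shard = len(xs) // num_replicas
    grads = []
    for r in range(num_replicas):
        lo, hi = r * shard, (r + 1) * shard
        grads.append(mse_gradient(w, xs[lo:hi], ys[lo:hi]))
    return sum(grads) / num_replicas


# Made-up batch of 4 examples, split across 2 hypothetical GPUs.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
full = mse_gradient(0.5, xs, ys)
parallel = data_parallel_gradient(0.5, xs, ys, num_replicas=2)
print(full, parallel)
```

In a real TensorFlow script this pattern is usually delegated to the tf.distribute API rather than written by hand; the sketch only shows why splitting a batch across devices can leave the training update unchanged.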

Walkthrough: Running an Example TensorFlow Script

  • After logging in to your account on the cluster, you can follow along with this guide. The TensorFlow script is a slightly modified version of Google's text classification with TensorFlow and Keras guide. The neural net is trained on IMDB movie reviews and is designed to predict whether a given movie review is positive or negative.
  • The imdb_tf.py script (listed in Part 2) should take only about a minute to train the model. The output file should contain the training information as well as the loss and accuracy of the model on the test set. In addition, the script saves a graph of the model's accuracy on the training set vs. the validation set, as shown below. This chart illustrates the concept of overfitting: the model starts to recognize patterns that are apparent in the training set but do not generalize to data it hasn't seen before.

Screenshot
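
One way to read the overfitting point off a chart like this is to find the epoch where validation accuracy peaks while training accuracy keeps rising. The following is a minimal sketch, assuming two lists shaped like the acc and val_acc entries of a Keras history. The helper best_epoch is hypothetical, and the numbers are made up for illustration; they are not results from this guide.

```python
def best_epoch(val_acc):
    """Return the 1-based epoch with the highest validation accuracy (first one on ties)."""
    best_index = max(range(len(val_acc)), key=lambda i: val_acc[i])
    return best_index + 1


# Hypothetical per-epoch accuracies (not output from imdb_tf.py).
train_acc = [0.70, 0.80, 0.88, 0.93, 0.97, 0.99]
val_acc = [0.68, 0.78, 0.85, 0.87, 0.86, 0.85]

epoch = best_epoch(val_acc)
# Gap between training and validation accuracy at the final epoch.
final_gap = train_acc[-1] - val_acc[-1]
print("validation accuracy peaks at epoch", epoch)
print("final train/validation gap: {:.2f}".format(final_gap))
```

A widening gap after the peak epoch is the usual sign that further training is fitting the training set rather than improving generalization.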

  • Transfer the files to the cluster using any of the file transfer techniques to follow along.

Optional Part: Creating User Virtual Environment

  • This step is not necessary to use TensorFlow, as PACE provides an environment with it already installed via the tensorflow-gpu module. You have the option to create your own virtual environment using Conda if you wish to install additional packages alongside the ones we include.
  • For more information, refer to our Anaconda guide. Here are the general steps to create and activate your own environment and install your own packages or libraries:
module load tensorflow-gpu/2.6.0
conda create --name <your-name>
conda activate <your-name>
conda install <your-library>
  • Once the library you want is installed in your environment, it's time to check it. The following steps should print out the version of your library:
python
>>> import tensorflow as tf
>>> tf.__version__
'2.6.0'
>>> import <your-library name>
>>> <your-library name>.__version__
'<your-library version>'
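
The same check can be scripted, which may be convenient inside a job. This is a minimal sketch, assuming the library exposes a __version__ attribute (many do, but not all); the helper name library_version is made up for this example.

```python
import importlib


def library_version(module_name):
    """Import a module by name and return its __version__, or None if unavailable."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, "__version__", None)


# "json" is in the standard library and is used here only as a stand-in;
# the second name is deliberately one that should not be installed.
for name in ["json", "a_library_that_is_not_installed"]:
    print(name, library_version(name))
```

On the cluster you would pass "tensorflow" or the name of your own library instead of the stand-ins.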

Part 1: Breaking down the PBS Script

#PBS -N tensorflow_test
#PBS -A [Account]
#PBS -l nodes=1:ppn=6:gpus=1:RTX6000
#PBS -l walltime=5:00
#PBS -q inferno
#PBS -j oe
#PBS -o tf_imdb_results.out

cd $PBS_O_WORKDIR
module load tensorflow-gpu/2.6.0
python imdb_tf.py
  • After you have logged on to PACE, examine the PBS script above, which you have transferred to the cluster, using cat tf_imdb.pbs. It should look the same.
  • More information on the #PBS lines (PBS directives) can be found in the PBS Scripting Guide.
  • Third line (request a GPU): This directive looks different from most because you are requesting a GPU. In this case, 1 node is requested, with 6 cores and a single Nvidia RTX 6000 GPU.
  • $PBS_O_WORKDIR is a variable that holds the directory you submit the PBS script from. Input and output files for the script should be found in that same directory.
  • Make sure any dependencies (such as the data file and the Python script) are in that same directory, since it is the one the script tells the cluster to enter.
  • Then, TensorFlow is loaded. To find available versions of TensorFlow, run
module spider tensorflow
  • To load a piece of software, include the module load <module name> line in the PBS script. You must load the software's module before you run the software in the PBS script.
  • The final line of the computation block runs the Python script. Any files the script generates (here, the chart) are written to the directory the script runs in, and the output is automatically recorded in the out file, which will show up in the directory you submitted the script from.
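
Since the job depends on several files sitting in the submission directory, a quick check before submitting can save a failed run. Below is a minimal sketch using only the Python standard library; the helper missing_files is hypothetical, and the demonstration runs in a temporary directory rather than on the cluster.

```python
import tempfile
from pathlib import Path


def missing_files(directory, required):
    """Return the names in `required` that are not files inside `directory`."""
    base = Path(directory)
    return [name for name in required if not (base / name).is_file()]


# File names used in this guide.
required = ["tf_imdb.pbs", "imdb_tf.py", "imdb_data.pickle"]

# Demonstration in a throwaway directory that holds only two of the three files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "tf_imdb.pbs").touch()
    (Path(tmp) / "imdb_tf.py").touch()
    missing = missing_files(tmp, required)
    print("missing:", missing)
```

On the cluster, calling missing_files(".", required) from the directory you plan to submit from should return an empty list.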

Part 2: Prepare Input Files

In the same directory as your PBS script, include the Python script it calls as well as your input data. Download the data used to train the model (a .pickle file) and make sure you put it in the same directory as the PBS script above. For the Python script, save the script below as imdb_tf.py.

# Slightly modified version of Google's Text Classification with Movie Reviews Guide
# Original Guide can be found here: https://www.tensorflow.org/tutorials/keras/basic_text_classification

import matplotlib
matplotlib.use("Agg")
import tensorflow as tf
from tensorflow import keras
import matplotlib.pyplot as plt
import numpy as np
import pickle

# Imports the data set, which has been limited to the 10,000 most frequent words in order to save space
# Each review is a list of numbers; each number represents a word in a dictionary
with open("imdb_data.pickle", "rb") as f:
    dataset = pickle.load(f)

train_data = dataset[0]
train_labels = dataset[1]
test_data = dataset[2]
test_labels = dataset[3]
word_index = dataset[4]

# Each example is an array of integers; each integer represents a word in a dictionary
# Each label is an integer, 0 or 1, where 1 is a positive review and 0 is a negative review
print("Training entries: {}, labels {}".format(len(train_data), len(train_labels)))

word_index = {k:(v+3) for k,v in word_index.items()}
word_index["<PAD>"] = 0
word_index["<START>"] = 1
word_index["<UNK>"] = 2  # unknown
word_index["<UNUSED>"] = 3

reverse_word_index = dict([(value, key) for (key,value) in word_index.items()])

def decode_review(text):
    return ' '.join([reverse_word_index.get(i,'?') for i in text])

# Convert the reviews (arrays of integers) into tensors by padding the arrays so they all have the same length,
# which gives an integer tensor of shape (num_examples, max_length)

train_data = keras.preprocessing.sequence.pad_sequences(train_data,
                                                        value=word_index["<PAD>"],
                                                        padding='post',
                                                        maxlen=256)
test_data = keras.preprocessing.sequence.pad_sequences(test_data,
                                                        value=word_index["<PAD>"],
                                                        padding='post',
                                                        maxlen=256)

#build model
vocab_size = 10000

model = keras.Sequential()
model.add(keras.layers.Embedding(vocab_size, 16)) # embedding layer, output dimensions are (batch, sequence, embedding)
model.add(keras.layers.GlobalAveragePooling1D()) # averages over the sequence dimension, returning a fixed-length vector per example
model.add(keras.layers.Dense(16, activation=tf.nn.relu)) # fully connected (dense) layer with 16 hidden units
model.add(keras.layers.Dense(1, activation=tf.nn.sigmoid)) # single output node with sigmoid activation, giving a value between 0 and 1

model.compile(optimizer=tf.optimizers.Adam(),
              loss='binary_crossentropy',
              metrics=['acc'])  # 'acc' sets the history keys to 'acc' and 'val_acc'

#create a validation set
x_val = train_data[:10000]
partial_x_train = train_data[10000:]

y_val = train_labels[:10000]
partial_y_train = train_labels[10000:]

history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=40,
                    batch_size=256,
                    validation_data=(x_val, y_val),
                    verbose=1)

results = model.evaluate(test_data, test_labels)
print("Loss, accuracy: {} ".format(results))

# Create a matplotlib chart that illustrates overfitting by comparing training and validation accuracy
history_dict = history.history

acc = history_dict['acc']
val_acc = history_dict['val_acc']
loss = history_dict['loss']
val_loss = history_dict['val_loss']

epochs = range(1, len(acc) + 1)

plt.clf()

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()

# Save the chart to a file; the Agg backend has no display, so plt.show() is not needed on the cluster
plt.savefig("training_accuracy.png")
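
The padding step in the script can be hard to picture, so here is a minimal pure-Python sketch of what pad_sequences does with the settings used above (padding='post', a fixed maxlen, and a pad value of 0). The helper pad_post is hypothetical and is not part of Keras. It assumes the Keras default truncating='pre', under which a sequence longer than maxlen keeps its last maxlen entries.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad each list at the end with `value`; keep only the last `maxlen` items if too long."""
    padded = []
    for seq in sequences:
        kept = list(seq[-maxlen:])  # mirrors Keras's default truncating='pre'
        padded.append(kept + [value] * (maxlen - len(kept)))
    return padded


# Two made-up "reviews" encoded as word indices; 1 stands for <START>.
reviews = [[1, 14, 22], [1, 5, 9, 30, 7, 12]]
padded_reviews = pad_post(reviews, maxlen=5)
print(padded_reviews)
```

Every row ends up with the same length, which is what lets the reviews be stacked into a single integer tensor for the embedding layer.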

Part 3: Submit the PBS Script and Collect Results

  • Make sure you're in the folder where the TensorFlow PBS script is located, and run
qsub <scriptName.pbs>  #ex: qsub tf_imdb.pbs for this guide
  • If successful, this will print something like 2180446.sched-torque.gatech.edu
  • The number at the beginning is the job ID, which is useful for checking the predicted wait time in the queue or the job status.
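
If you script your submissions, the job ID can be pulled out of that line. This is a minimal sketch, assuming the output has the plain form shown above (digits, a dot, then the scheduler host name); the helper job_id_from_qsub is made up for this example.

```python
def job_id_from_qsub(output):
    """Return the numeric job ID from a line such as '2180446.sched-torque.gatech.edu'."""
    job_id = output.strip().split(".", 1)[0]
    if not job_id.isdigit():
        raise ValueError("unexpected qsub output: {!r}".format(output))
    return job_id


print(job_id_from_qsub("2180446.sched-torque.gatech.edu\n"))
```

The returned string can then be passed to qstat as the job ID.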

  • Check job status with

qstat <jobID>  #or: qstat -u someuser3 -n
  • For more ways to check status, how to cancel a job, and other useful commands, check out the command cheatsheet.
  • The output file can be found by typing ls and looking for the output file you named in the PBS script, in this case tf_imdb_results.out.
  • To see the contents of the out file, you can open it up in an editor, or run
cat <output file name>  #ex: cat tf_imdb_results.out
  • The output for the example should contain the training info, along with the overall loss and accuracy at the end.
  • The end of tf_imdb_results.out will look like this:
782/782 [==============================] - 1s 1ms/step - loss: 0.4633 - acc: 0.8586
Loss, accuracy: [0.46331024169921875, 0.8586400151252747]
---------------------------------------
Begin PBS Epilogue Wed Oct 20 17:31:33 EDT 2021
Job ID:     3284390.sched-torque.pace.gatech.edu
  • The TensorFlow script should also create the training vs. validation accuracy chart, which will appear as training_accuracy.png in the directory where the TensorFlow script and the data are stored. The chart may vary slightly from run to run, but the general curve should be the same:

Screenshot
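
To collect the final numbers from the output file without reading it by eye, the "Loss, accuracy:" line shown earlier can be parsed. The sketch below assumes the line has exactly the format printed by imdb_tf.py; the helper parse_results is hypothetical, and the sample text is copied from the example output in this guide.

```python
import ast


def parse_results(text):
    """Find the 'Loss, accuracy: [...]' line and return (loss, accuracy) as floats."""
    prefix = "Loss, accuracy:"
    for line in text.splitlines():
        if line.startswith(prefix):
            loss, accuracy = ast.literal_eval(line[len(prefix):].strip())
            return float(loss), float(accuracy)
    return None


sample = """782/782 [==============================] - 1s 1ms/step - loss: 0.4633 - acc: 0.8586
Loss, accuracy: [0.46331024169921875, 0.8586400151252747]
---------------------------------------
"""
loss, accuracy = parse_results(sample)
print("loss={:.4f} accuracy={:.4f}".format(loss, accuracy))
```

For a real run you would read tf_imdb_results.out into a string and pass it to the function; it returns None if the line is not present.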

  • To move output files off the cluster, see the storage and moving files guide.
  • Congratulations! You have successfully run a TensorFlow script on the cluster.