Updated 2022-11-23

Use TensorFlow on Cluster

Overview: TensorFlow on the cluster

  • GPUs can greatly speed up TensorFlow and the training of neural networks in general. In addition, parallelism across multiple GPUs can be achieved using two main techniques:
    • data parallelism
    • model parallelism
  • However, this guide focuses on using 1 GPU. Model and data parallelism are coded by the user into their TensorFlow script. On the cluster side, you can set the number of GPUs to 2 in the SBATCH script so that a script written for parallelism has two GPUs to use.
  • User Side: The user programs whatever parallelism they want into their TensorFlow models.
  • Cluster Side: We provide multiple versions of TensorFlow along with GPU nodes. This guide will show you how to write an SBATCH script to submit your TensorFlow job on the cluster.
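
To make the idea of data parallelism concrete, here is a minimal pure-Python sketch that does not need TensorFlow. It assumes a toy one-parameter linear model and two equal-sized shards standing in for two GPUs. The function names gradient and data_parallel_gradient and the sample values are hypothetical, chosen only for illustration; in a real TensorFlow script this bookkeeping is normally handled by the framework rather than written by hand.

```python
# Toy illustration of data parallelism: each "replica" (a stand-in for a GPU)
# computes the gradient on its own shard of the batch, and the per-shard
# gradients are averaged. All names and values here are hypothetical.

def gradient(w, xs, ys):
    """Gradient of the mean squared error for the model y_hat = w * x."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def data_parallel_gradient(w, xs, ys, num_replicas):
    """Split the batch into equal shards, then average the shard gradients."""
    if len(xs) % num_replicas != 0:
        raise ValueError("batch size must be divisible by num_replicas")
    shard = len(xs) // num_replicas
    grads = [
        gradient(w, xs[i * shard:(i + 1) * shard], ys[i * shard:(i + 1) * shard])
        for i in range(num_replicas)
    ]
    return sum(grads) / num_replicas


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated by y = 2 * x
w = 0.5

full = gradient(w, xs, ys)
split = data_parallel_gradient(w, xs, ys, num_replicas=2)
print("full-batch gradient:   ", full)
print("data-parallel gradient:", split)
```

With equal-sized shards, the averaged gradient matches the full-batch gradient, which is why splitting a batch across GPUs can speed up training without changing the update.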

Walkthrough: Running an Example TensorFlow Script

  • After logging in to your account on the cluster, you can follow along with this guide. The TensorFlow script is a slightly modified version of Google's text classification with TensorFlow and Keras guide. The neural net is trained on IMDB movie reviews and is designed to predict whether a given movie review is positive or negative.
  • The imdb_tf.py script (listed in Part 2) should take only about a minute to train the model. The output file should contain the training information as well as the loss and accuracy of the model against the testing set. In addition, the script should save a graph (training_accuracy.png) of the model's accuracy on the training set vs. the validation set. This chart illustrates the concept of overfitting, which occurs when the model starts to recognize patterns that are apparent in the training set but not generalizable to data it hasn't seen before. [Screenshot]
  • Transfer the files to the cluster using any of the file transfer techniques to follow along.
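
Overfitting can be spotted numerically as well as visually. The sketch below assumes a dictionary shaped like the history.history object that Keras returns from model.fit, with lists keyed by acc and val_acc (the metric name used in this guide's script). The function name best_epoch and the accuracy values are made up for illustration and are not results from the cluster.

```python
def best_epoch(history, key="val_acc"):
    """Return the 1-based epoch at which history[key] is highest."""
    values = history[key]
    best_index = max(range(len(values)), key=lambda i: values[i])
    return best_index + 1


# Hypothetical per-epoch accuracies: training accuracy keeps rising while
# validation accuracy peaks and then declines, the signature of overfitting.
example_history = {
    "acc":     [0.70, 0.82, 0.88, 0.92, 0.95, 0.97],
    "val_acc": [0.69, 0.80, 0.85, 0.87, 0.86, 0.85],
}

peak = best_epoch(example_history)
gap = example_history["acc"][-1] - example_history["val_acc"][-1]
print(f"validation accuracy peaks at epoch {peak}")
print(f"final train/validation gap: {gap:.2f}")
```

A growing gap between training and validation accuracy after the peak epoch suggests that further epochs mostly memorize the training set.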

Optional Part: Creating User Virtual Environment

  • This step is not necessary to use TensorFlow, as PACE provides an environment with it already installed via the tensorflow-gpu module. If you wish to install additional packages alongside the ones we include, you have the option to create your own virtual environment using Conda.
  • For more information, refer to our [Anaconda guide]. Here are the general steps to create and activate your own environment and install your own packages/libraries:
module load tensorflow-gpu/2.9.0
conda create --name <your-env-name>
conda activate <your-env-name>
conda install <your-library>
  • Once the library you want is installed in your environment, it's time to check it. The following steps should print out the version of your library:
python
>>> import tensorflow as tf
>>> tf.__version__
'2.9.0'
>>> import <your-library name>
>>> <your-library name>.__version__
'<your-library version>'
  • Further, in the job submission script, the user needs to include the following lines:
module load tensorflow-gpu/2.9.0
conda activate <your-env-name>
  • To work with packages in addition to TensorFlow in a Jupyter notebook, the user needs to create their own virtual environment and install TensorFlow along with any other libraries they need.
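
As an alternative to checking versions one import at a time in the interpreter, the following sketch uses the standard library's importlib.metadata (Python 3.8 or newer) to report installed versions without importing the packages. The helper name installed_version and the list of package names are assumptions for illustration; note that a package's distribution name can differ from its import name.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Hypothetical list of packages to check; replace with your own libraries.
for name in ["tensorflow", "numpy", "matplotlib"]:
    version = installed_version(name)
    print(f"{name}: {version if version is not None else 'not installed'}")
```

Run it after activating your environment so that the report reflects the packages the job will actually see.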

Part 1: Breaking down the SBATCH Script

#!/bin/bash
#SBATCH -Jtensorflow_test
#SBATCH -A [Account]
#SBATCH -N1 --gres=gpu:RTX6000:1
#SBATCH -t5
#SBATCH -qinferno
#SBATCH -oReport-%j.out

cd $SLURM_SUBMIT_DIR
module load tensorflow-gpu/2.9.0
python imdb_tf.py
  • After you have logged on to Phoenix-Slurm, examine the SBATCH script above, which you have transferred to the cluster, using cat tf_imdb.sbatch. It should look the same.
  • More on #SBATCH lines (SBATCH directives) can be found in the Using Slurm on Phoenix guide.
  • Third directive (request GPU): Note that this directive looks different from most, since you are requesting a GPU. In this case, 1 node is requested, with 6 cores, along with a single Nvidia RTX 6000 GPU.
  • $SLURM_SUBMIT_DIR is a variable that represents the directory you submit the SBATCH script from. Input and output files for the script should be found in the same directory as the SBATCH script. Make sure the data file and the Python script are in the folder from which you submit the SBATCH script.
  • Make sure any dependencies (like data) are in that same directory you told the cluster to enter.
  • Then, TensorFlow is loaded. To find available versions of TensorFlow, run
module spider tensorflow
  • To load a piece of software, include the module load <module name> line in the SBATCH script. You must load the software's module before you run the software in the SBATCH script.
  • The final line of the computation block runs the Python script. The generated files (here, the chart) are written to the directory specified, and the output is automatically recorded in the out file (which will show up in the directory you submitted the script from).
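
Since the job depends on files being present in the submit directory, a small pre-flight check can catch a missing file before the job sits in the queue. This is a sketch under stated assumptions: the helper name missing_files is hypothetical, the file names are the ones used in this guide, and the fallback to the current directory is there so the check can also be run by hand outside a job, where SLURM_SUBMIT_DIR is not set.

```python
import os
from pathlib import Path


def missing_files(directory, required):
    """Return the names in `required` that are not regular files in `directory`."""
    base = Path(directory)
    return [name for name in required if not (base / name).is_file()]


# Inside a job, Slurm sets SLURM_SUBMIT_DIR; outside a job, use the current directory.
submit_dir = os.environ.get("SLURM_SUBMIT_DIR", os.getcwd())
required = ["imdb_tf.py", "imdb_data.pickle"]

missing = missing_files(submit_dir, required)
if missing:
    print("Missing from", submit_dir + ":", ", ".join(missing))
else:
    print("All required files found in", submit_dir)
```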

Part 2: Prepare Input Files

In the same directory as your SBATCH script, include the Python script it calls as well as your input data. Download the data used to train the model (in the form of a .pickle file; make sure you put it in the same directory as the SBATCH script above). For the Python script, save the script below as imdb_tf.py.

# Slightly modified version of Google's Text Classification with Movie Reviews Guide
# Original Guide can be found here: https://www.tensorflow.org/tutorials/keras/basic_text_classification

import matplotlib
matplotlib.use("Agg")
import tensorflow as tf
from tensorflow import keras
import matplotlib.pyplot as plt
import numpy as np
import pickle

# Imports the data set, which has been limited to only use the top 10,000 most frequent words in order to save space
# each review is in the form of a list of numbers, each number represents a word in a dictionary
with open("imdb_data.pickle", "rb") as f:
    dataset = pickle.load(f)

train_data = dataset[0]
train_labels = dataset[1]
test_data = dataset[2]
test_labels = dataset[3]
word_index = dataset[4]

#each example is an array of integers, each integer represents a word in a dictionary
#Label is an integer of 0 or 1, where 1 is positive and 0 is a negative review
print("Training entries: {}, labels {}".format(len(train_data), len(train_labels)))

word_index = {k:(v+3) for k,v in word_index.items()}
word_index["<PAD>"] = 0
word_index["<START>"] = 1
word_index["<UNK>"] = 2 # unknown
word_index["<UNUSED>"] = 3

reverse_word_index = dict([(value, key) for (key,value) in word_index.items()])

def decode_review(text):
    return ' '.join([reverse_word_index.get(i,'?') for i in text])

#convert the reviews (arrays of integers) into tensors by padding the arrays so they all have the same length
#then create an integer tensor of shape num_example * max_length

train_data = keras.preprocessing.sequence.pad_sequences(train_data,
                                                        value=word_index["<PAD>"],
                                                        padding='post',
                                                        maxlen=256)
test_data = keras.preprocessing.sequence.pad_sequences(test_data,
                                                       value=word_index["<PAD>"],
                                                       padding='post',
                                                       maxlen=256)

#build model
vocab_size = 10000

model = keras.Sequential()
model.add(keras.layers.Embedding(vocab_size, 16)) #embedding layer, dimensions are (batch, sequence, embedding)
model.add(keras.layers.GlobalAveragePooling1D()) #returns a fixed length output vector so model can handle input of variable length
model.add(keras.layers.Dense(16, activation=tf.nn.relu)) #piped through a fully-connected (dense) layer with 16 hidden units
model.add(keras.layers.Dense(1, activation=tf.nn.sigmoid)) #single output node with sigmoid activation function; output is a probability between 0 and 1

# The metric is named 'acc' so that the history keys used below are 'acc' and 'val_acc'
model.compile(optimizer=tf.optimizers.Adam(),
              loss='binary_crossentropy',
              metrics=['acc'])

#create a validation set
x_val = train_data[:10000]
partial_x_train = train_data[10000:]

y_val = train_labels[:10000]
partial_y_train = train_labels[10000:]

history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=40,
                    batch_size=256,
                    validation_data=(x_val, y_val),
                    verbose=1)

results = model.evaluate(test_data, test_labels)
print("Loss, accuracy: {} ".format(results))

# Create a matplotlib chart that illustrates the problem of overfitting by comparing training and validation accuracy
history_dict = history.history

acc = history_dict['acc']
val_acc = history_dict['val_acc']
loss = history_dict['loss']
val_loss = history_dict['val_loss']

epochs = range(1, len(acc) + 1)

plt.clf()

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()

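# Save the chart before calling show(); with the non-interactive "Agg" backend, show() does not open a window
plt.savefig("training_accuracy.png")
plt.show()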
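
To clarify what the padding step in the script does, the sketch below mimics it in plain Python for a single review. It assumes the behavior of pad_sequences when called as in the script (padding='post' with the default truncating='pre'): short sequences get the pad value appended, and sequences longer than maxlen keep only their last maxlen entries. The helper name pad_post is hypothetical and is not part of Keras.

```python
def pad_post(sequence, maxlen, value=0):
    """Pad at the end up to maxlen, or keep only the last maxlen entries."""
    if len(sequence) >= maxlen:
        # Truncate from the front, keeping the final maxlen items.
        return list(sequence[len(sequence) - maxlen:])
    return list(sequence) + [value] * (maxlen - len(sequence))


# Hypothetical encoded reviews (each integer stands for a word).
short_review = [1, 14, 22]
long_review = [1, 5, 6, 7, 8, 9]

print(pad_post(short_review, maxlen=5))
print(pad_post(long_review, maxlen=5))
```

After this step every review has the same length, which is what allows the reviews to be stacked into a single integer tensor of shape (number of examples, maxlen).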

Part 3: Submit the SBATCH Script and Collect Results

  • Make sure you're in the folder where the TensorFlow SBATCH script is located, and run
sbatch <scriptName.sbatch>  #ex: sbatch tf_imdb.sbatch for this guide
  • If successful, this will print something like Submitted batch job <jobID>
  • The number at the end is the job ID, useful for checking predicted wait time in the queue or job status
  • Check job status with
squeue  --job <jobID>

  • For more ways to check status, how to cancel a job, and more useful commands, check out the command cheatsheet.
  • The output file can be found by typing ls and looking for the output file you named in the SBATCH script, in this case Report-<jobID>.out.
  • To see the contents of the out file, you can open it in an editor, or run
cat <output file name>  #ex: cat Report-<jobID>.out
  • Output for the example should include the training info, along with the overall loss and accuracy at the end. The end of the output file (Report-<jobID>.out in this guide) will look like this
782/782 [==============================] - 1s 1ms/step - loss: 0.4633 - acc: 0.8586
Loss, accuracy: [0.46331024169921875, 0.8586400151252747]
---------------------------------------
Begin Slurm Prolog: Fri Nov-11-2022 11:15:15
Job ID:     68541
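
If you want to collect the final numbers programmatically, for example to compare several runs, the line printed by the script can be parsed with the standard library. This is a sketch that assumes the exact "Loss, accuracy: [...]" format printed by imdb_tf.py; the function name parse_results is hypothetical, and the sample lines are copied from the output shown above.

```python
import ast
import re

RESULT_PATTERN = re.compile(r"^Loss, accuracy: (\[.*\])$")


def parse_results(lines):
    """Return (loss, accuracy) from the last matching line, or None if absent."""
    found = None
    for line in lines:
        match = RESULT_PATTERN.match(line.strip())
        if match:
            values = ast.literal_eval(match.group(1))
            if isinstance(values, list) and len(values) == 2:
                found = (float(values[0]), float(values[1]))
    return found


sample = [
    "782/782 [==============================] - 1s 1ms/step - loss: 0.4633 - acc: 0.8586",
    "Loss, accuracy: [0.46331024169921875, 0.8586400151252747] ",
]

loss, accuracy = parse_results(sample)
print(f"loss={loss:.4f} accuracy={accuracy:.4f}")
```

To use it on a real job, pass an open file object instead of the sample list, since iterating over a file yields its lines.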

[Screenshot]
  • To move output files off the cluster, see the storage and moving files guide.
  • Congratulations! You have successfully run a TensorFlow script on the cluster.