Updated 2021-12-06

Run Horovod on the Cluster

Overview

  • Horovod is a distributed deep learning training framework for TensorFlow, Keras, PyTorch, and Apache MXNet. The goal of Horovod is to make distributed deep learning fast and easy to use.
  • Horovod can run on multiple nodes with multiple GPUs.
  • You can find more information about Horovod on their overview page.
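  • Outside of a batch system, Horovod jobs are usually launched with the horovodrun wrapper (or plain mpirun). As a point of reference, a standalone launch looks something like the sketch below; train.py is a placeholder for any Horovod-enabled training script. This walkthrough instead launches through mpirun inside a PBS job.

# Hypothetical standalone launch: 4 Horovod processes on the local machine
horovodrun -np 4 python train.py
# Multi-node form: 2 processes on each of two hosts (hostnames are placeholders)
horovodrun -np 4 -H host1:2,host2:2 python train.py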

Walkthrough: Run Horovod on the Cluster

  • This walkthrough will show how to run distributed training with Horovod using the TensorFlow CNN benchmarks (tf_cnn_benchmarks).
  • The Python scripts come from the TensorFlow benchmarks repository, which you will clone below. Make sure all the scripts and directories from the repository end up in the same directory as the PBS script.
  • Follow these steps to set up the Python and PBS scripts:
mkdir test-Horovod
cd test-Horovod
git clone https://github.com/tensorflow/benchmarks.git
cd benchmarks/scripts/tf_cnn_benchmarks
  • Transfer the PBS script (horovod.pbs) into this directory (.../test-Horovod/benchmarks/scripts/tf_cnn_benchmarks) so that it sits alongside the benchmark scripts and directories from the repository.
  • You can transfer the files to your account on the cluster to follow along. The file transfer guide may be helpful.
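  • For example, from your local machine you could upload the PBS script with scp; the login hostname below is a placeholder for your cluster's address:

# Hypothetical upload from your local machine (hostname is a placeholder)
scp horovod.pbs <userid>@login.cluster.example.edu:~/test-Horovod/benchmarks/scripts/tf_cnn_benchmarks/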

Part 1: The PBS Script

#PBS -N HorovodTest
#PBS -A [Account]
#PBS -l nodes=1:ppn=4:gpus=1:RTX6000
#PBS -l walltime=30:00
#PBS -q inferno
#PBS -j oe
#PBS -o test-cnn.out

cd $PBS_O_WORKDIR            # run from the directory the job was submitted from
module load pace-community   # must be loaded before horovod-gpu
module load horovod-gpu

# Count the CPU cores and GPUs allocated by PBS
NPROCS=`wc -l < ${PBS_NODEFILE}`
NGPUS=`wc -l < ${PBS_GPUFILE}`

# Launch one MPI rank per allocated GPU and run the benchmark inside the Horovod Singularity container
mpirun -N $NGPUS singularity exec --nv $HOROVOD_SIF python tf_cnn_benchmarks.py --variable_update=horovod --num_gpus=1 --model resnet50 --batch_size 64 --num_batches 100 --allow_growth=True

  • The #PBS directives request 30 minutes of walltime and 1 node with 4 cores and 1 RTX6000 GPU. More on #PBS directives can be found in the PBS guide.
  • $PBS_O_WORKDIR is a variable that holds the directory you submit the PBS script from; input and output files for the job will appear in that same directory.
  • The pace-community module needs to be loaded before horovod-gpu can be loaded.
  • Be sure #PBS -A [Account] is filled in with your account before proceeding to the next step.
  • For more information on running Horovod, try module help horovod-gpu.
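  • If you want to confirm what the scheduler actually allocated, you could add a couple of echo lines to the script before the mpirun call, as in this small optional sketch:

# Optional: record the allocation in the job output for debugging
echo "Cores allocated: $NPROCS"
echo "GPUs allocated:  $NGPUS"
cat $PBS_GPUFILE             # lists one line per allocated GPU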

Part 2: Submit Job and Check Status

  • Change to the directory that contains the PBS script, then submit the job:
  • qsub horovod.pbs
  • Check job status with qstat -u <userid>, replacing <userid> with your username.
  • You can delete the job with qdel <jobid>, replacing <jobid> with the job ID returned after running qsub.
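  • A typical submit-and-monitor session looks something like this (the job ID is illustrative, taken from the sample output below):

cd ~/test-Horovod/benchmarks/scripts/tf_cnn_benchmarks   # directory containing horovod.pbs
qsub horovod.pbs      # prints a job ID, e.g. 2831141.sched-torque.pace.gatech.edu
qstat -u <userid>     # list your jobs and their states
qdel 2831141          # cancel the job if needed, using the numeric job ID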

Part 3: Collecting Results

  • In the directory where you submitted the PBS script, you should see a test-cnn.out file, which contains the results of the job. Use cat test-cnn.out or open the file in a text editor to take a look.
  • test-cnn.out should look like this:
---------------------------------------
Begin PBS Prologue Mon Sep 13 18:10:11 EDT 2021
Job ID:     2831141.sched-torque.pace.gatech.edu
User ID:    emin7
Job name:   HorovodTest
Queue:      inferno
End PBS Prologue Mon Sep 13 18:10:11 EDT 2021
---------------------------------------

Lmod is automatically replacing "intel/19.0.5" with "gcc/7.4.0".

2021-09-13 18:10:15.871524: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
WARNING:tensorflow:From /usr/local/lib/python3.7/dist-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
2021-09-13 18:10:22.577608: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-09-13 18:10:22.590079: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2700000000 Hz
2021-09-13 18:10:22.590409: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x3d46b30 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-09-13 18:10:22.590419: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2021-09-13 18:10:22.593045: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2021-09-13 18:10:22.839035: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x3d3f660 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2021-09-13 18:10:22.839059: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): NVIDIA Quadro RTX 6000, Compute Capability 7.5
2021-09-13 18:10:22.853971: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:d8:00.0 name: NVIDIA Quadro RTX 6000 computeCapability: 7.5
coreClock: 1.77GHz coreCount: 72 deviceMemorySize: 23.65GiB deviceMemoryBandwidth: 625.94GiB/s
2021-09-13 18:10:22.854011: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2021-09-13 18:10:23.121267: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2021-09-13 18:10:23.255971: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2021-09-13 18:10:23.379506: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2021-09-13 18:10:23.592344: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2021-09-13 18:10:23.696039: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2021-09-13 18:10:24.051124: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2021-09-13 18:10:24.056412: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
2021-09-13 18:10:24.056454: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2021-09-13 18:10:29.357479: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-09-13 18:10:29.357513: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263]      0
2021-09-13 18:10:29.357519: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 0:   N
2021-09-13 18:10:29.364017: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 22477 MB memory) -> physical GPU (device: 0, name: NVIDIA Quadro RTX 6000, pci bus id: 0000:d8:00.0, compute capability: 7.5)
TensorFlow:  2.3
Model:       resnet50
Dataset:     imagenet (synthetic)
Mode:        training
SingleSess:  False
Batch size:  64 global
             64 per device
Num batches: 100
Num epochs:  0.00
Devices:     ['horovod/gpu:0']
NUMA bind:   False
Data format: NCHW
Optimizer:   sgd
Variables:   horovod
==========
Generating training model
WARNING:tensorflow:From /storage/scratch1/5/emin7/horovod-test/anotherOne/tf_cnn_benchmarks/convnet_builder.py:134: conv2d (from tensorflow.python.keras.legacy_tf_layers.convolutional) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.keras.layers.Conv2D` instead.
W0913 18:10:29.430423 46912496342336 deprecation.py:323] From /storage/scratch1/5/emin7/horovod-test/anotherOne/tf_cnn_benchmarks/convnet_builder.py:134: conv2d (from tensorflow.python.keras.legacy_tf_layers.convolutional) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.keras.layers.Conv2D` instead.
WARNING:tensorflow:From /usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/legacy_tf_layers/convolutional.py:424: Layer.apply (from tensorflow.python.keras.engine.base_layer_v1) is deprecated and will be removed in a future version.
Instructions for updating:
Please use `layer.__call__` method instead.
W0913 18:10:29.441053 46912496342336 deprecation.py:323] From /usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/legacy_tf_layers/convolutional.py:424: Layer.apply (from tensorflow.python.keras.engine.base_layer_v1) is deprecated and will be removed in a future version.
Instructions for updating:
Please use `layer.__call__` method instead.
WARNING:tensorflow:From /storage/scratch1/5/emin7/horovod-test/anotherOne/tf_cnn_benchmarks/convnet_builder.py:266: max_pooling2d (from tensorflow.python.keras.legacy_tf_layers.pooling) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.MaxPooling2D instead.
W0913 18:10:29.496724 46912496342336 deprecation.py:323] From /storage/scratch1/5/emin7/horovod-test/anotherOne/tf_cnn_benchmarks/convnet_builder.py:266: max_pooling2d (from tensorflow.python.keras.legacy_tf_layers.pooling) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.MaxPooling2D instead.
Initializing graph
WARNING:tensorflow:From /storage/scratch1/5/emin7/horovod-test/anotherOne/tf_cnn_benchmarks/benchmark_cnn.py:2267: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
W0913 18:10:35.032272 46912496342336 deprecation.py:323] From /storage/scratch1/5/emin7/horovod-test/anotherOne/tf_cnn_benchmarks/benchmark_cnn.py:2267: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2021-09-13 18:10:35.805860: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:d8:00.0 name: NVIDIA Quadro RTX 6000 computeCapability: 7.5
coreClock: 1.77GHz coreCount: 72 deviceMemorySize: 23.65GiB deviceMemoryBandwidth: 625.94GiB/s
2021-09-13 18:10:35.805907: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2021-09-13 18:10:35.805947: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2021-09-13 18:10:35.805959: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2021-09-13 18:10:35.805969: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2021-09-13 18:10:35.805978: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2021-09-13 18:10:35.805988: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2021-09-13 18:10:35.805998: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2021-09-13 18:10:35.811220: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
2021-09-13 18:10:35.811262: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-09-13 18:10:35.811269: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263]      0
2021-09-13 18:10:35.811274: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 0:   N
2021-09-13 18:10:35.816169: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 22477 MB memory) -> physical GPU (device: 0, name: NVIDIA Quadro RTX 6000, pci bus id: 0000:d8:00.0, compute capability: 7.5)
INFO:tensorflow:Running local_init_op.
I0913 18:10:36.984717 46912496342336 session_manager.py:505] Running local_init_op.
INFO:tensorflow:Done running local_init_op.
I0913 18:10:37.048840 46912496342336 session_manager.py:508] Done running local_init_op.
Running warm up
2021-09-13 18:10:40.291528: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2021-09-13 18:10:42.124966: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
Done warm up
Step    Img/sec total_loss
1       images/sec: 301.9 +/- 0.0 (jitter = 0.0)        7.608
10      images/sec: 307.8 +/- 0.7 (jitter = 0.9)        7.849
20      images/sec: 305.5 +/- 0.8 (jitter = 4.7)        8.013
30      images/sec: 304.9 +/- 0.6 (jitter = 4.6)        7.940
40      images/sec: 305.3 +/- 0.6 (jitter = 4.8)        8.136
50      images/sec: 304.4 +/- 0.5 (jitter = 2.3)        8.052
60      images/sec: 303.8 +/- 0.5 (jitter = 1.6)        7.783
70      images/sec: 303.3 +/- 0.4 (jitter = 1.3)        7.854
80      images/sec: 302.9 +/- 0.4 (jitter = 1.3)        8.011
90      images/sec: 302.6 +/- 0.4 (jitter = 1.5)        7.843
100     images/sec: 302.4 +/- 0.3 (jitter = 1.6)        8.095
----------------------------------------------------------------
total images/sec: 302.35
----------------------------------------------------------------
---------------------------------------
Begin PBS Epilogue Mon Sep 13 18:11:17 EDT 2021
Job ID:     2831141.sched-torque.pace.gatech.edu
User ID:    emin7
Job name:   HorovodTest
Resources:  nodes=1:ppn=4:gpus=1:RTX6000,walltime=00:20:00,neednodes=1:ppn=4:gpus=1:RTX6000
Rsrc Used:  cput=00:01:00,vmem=22614868kb,walltime=00:01:06,mem=1847168kb,energy_used=0
Queue:      inferno
Nodes:
atl1-1-03-006-11.pace.gatech.edu
End PBS Epilogue Mon Sep 13 18:11:17 EDT 2021
---------------------------------------
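  • The key result is the throughput line near the end of the output; you can pull it out directly, for example:

grep 'total images/sec' test-cnn.out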
  • After the result files are produced, you can move them off the cluster; refer to the file transfer guide for help.
  • Congratulations! You successfully ran a Python script using Horovod on the cluster.