Updated 2021-12-06
Run Horovod on the Cluster¶
Overview¶
- Horovod is a distributed deep learning training framework for TensorFlow, Keras, PyTorch, and Apache MXNet. The goal of Horovod is to make distributed deep learning fast and easy to use.
- Horovod can run on multiple nodes with multiple GPUs.
- You can find more information about Horovod on their overview page.
Walkthrough: Run Horovod on the Cluster¶
- This walkthrough shows how to run Horovod with the TensorFlow Convolutional Neural Network benchmark (tf_cnn_benchmarks).
- The Python scripts come from the TensorFlow benchmarks repository cloned below. Make sure all of the scripts and directories from that repository end up in the same directory as the PBS script.
- Follow these steps to set up the Python and PBS scripts:
mkdir test-Horovod
cd test-Horovod
git clone https://github.com/tensorflow/benchmarks.git
cd benchmarks/scripts/tf_cnn_benchmarks
- Transfer the PBS script (horovod.pbs) to this directory (.../test-Horovod/benchmarks/scripts/tf_cnn_benchmarks) so that all of the scripts and directories from the repository are in the same directory as the PBS script.
- You can transfer the files to your account on the cluster to follow along. The file transfer guide may be helpful; a hedged example is sketched below.
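For illustration, copying horovod.pbs from your local machine with scp might look like the sketch below. The login hostname and <userid> are placeholders rather than values from this guide; substitute your own, or use any method from the file transfer guide.
# Hypothetical example: run from your local machine, replacing <userid> and the login hostname
scp horovod.pbs <userid>@login.pace.gatech.edu:~/test-Horovod/benchmarks/scripts/tf_cnn_benchmarks/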
Part 1: The PBS Script¶
#PBS -N HorovodTest
#PBS -A [Account]
#PBS -l nodes=1:ppn=4:gpus=1:RTX6000
#PBS -l walltime=30:00
#PBS -q inferno
#PBS -j oe
#PBS -o test-cnn.out
cd $PBS_O_WORKDIR
module load pace-community
module load horovod-gpu
# Count the allocated processes and GPUs from the files PBS provides
NPROCS=`wc -l < ${PBS_NODEFILE}`
NGPUS=`wc -l < ${PBS_GPUFILE}`
# Launch $NGPUS MPI processes per node; each runs the benchmark inside the Horovod Singularity container (--nv enables GPU access)
mpirun -N $NGPUS singularity exec --nv $HOROVOD_SIF python tf_cnn_benchmarks.py --variable_update=horovod --num_gpus=1 --model resnet50 --batch_size 64 --num_batches 100 --allow_growth=True
- The #PBS directives request 30 minutes of walltime, 1 node with 4 cores, and 1 RTX6000 GPU. More on #PBS directives can be found in the PBS guide.
- $PBS_O_WORKDIR is a variable that holds the directory you submit the PBS script from. Input and output files for the script should be located in the same directory as the PBS script.
- The pace-community module needs to be loaded before horovod-gpu can be loaded.
- Be sure #PBS -A [Account] is filled in with your account before proceeding to the next step.
- For more information on running Horovod, try module help horovod.
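Because the launch line counts GPUs from ${PBS_GPUFILE} and starts one MPI process per GPU, scaling up mostly means changing the resource request. The lines below are a sketch of a hypothetical 2-GPU variant, not a tested configuration; the ppn value is a guess, and you should check the queue and GPU-type limits before using it.
#PBS -l nodes=1:ppn=8:gpus=2:RTX6000
# (remaining directives, cd, and module loads unchanged)
# The launch line stays the same: NGPUS is now 2, so mpirun starts 2 processes, one per GPU,
# and Horovod coordinates them; --num_gpus=1 remains per-process.
mpirun -N $NGPUS singularity exec --nv $HOROVOD_SIF python tf_cnn_benchmarks.py --variable_update=horovod --num_gpus=1 --model resnet50 --batch_size 64 --num_batches 100 --allow_growth=True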
Part 2: Submit Job and Check Status¶
- Be sure to change to the directory that contains the PBS script, then submit the job:
qsub horovod.pbs
- qsub prints the job ID when the job is submitted. Check job status with qstat -u <userid>, replacing <userid> with your user ID.
- You can delete the job with qdel <jobid>, replacing <jobid> with the job ID returned by qsub.
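Put together, a typical submit-and-check session might look like the sketch below; the path assumes test-Horovod was created in your home directory as in the setup above, and <userid> and <jobid> are placeholders.
cd ~/test-Horovod/benchmarks/scripts/tf_cnn_benchmarks   # directory containing horovod.pbs
qsub horovod.pbs          # prints the job ID, e.g. <jobid>.sched-torque.pace.gatech.edu
qstat -u <userid>         # check the status of your jobs
qdel <jobid>              # only if you need to cancel the job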
Part 3: Collecting Results¶
- In the directory where you submitted the PBS script, you should see a test-cnn.out file, which contains the results of the job. Use cat test-cnn.out or open the file in a text editor to take a look.
- test-cnn.out should look like this:
---------------------------------------
Begin PBS Prologue Mon Sep 13 18:10:11 EDT 2021
Job ID: 2831141.sched-torque.pace.gatech.edu
User ID: emin7
Job name: HorovodTest
Queue: inferno
End PBS Prologue Mon Sep 13 18:10:11 EDT 2021
---------------------------------------
Lmod is automatically replacing "intel/19.0.5" with "gcc/7.4.0".
2021-09-13 18:10:15.871524: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
WARNING:tensorflow:From /usr/local/lib/python3.7/dist-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
2021-09-13 18:10:22.577608: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-09-13 18:10:22.590079: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2700000000 Hz
2021-09-13 18:10:22.590409: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x3d46b30 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-09-13 18:10:22.590419: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2021-09-13 18:10:22.593045: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2021-09-13 18:10:22.839035: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x3d3f660 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2021-09-13 18:10:22.839059: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): NVIDIA Quadro RTX 6000, Compute Capability 7.5
2021-09-13 18:10:22.853971: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:d8:00.0 name: NVIDIA Quadro RTX 6000 computeCapability: 7.5
coreClock: 1.77GHz coreCount: 72 deviceMemorySize: 23.65GiB deviceMemoryBandwidth: 625.94GiB/s
2021-09-13 18:10:22.854011: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2021-09-13 18:10:23.121267: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2021-09-13 18:10:23.255971: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2021-09-13 18:10:23.379506: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2021-09-13 18:10:23.592344: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2021-09-13 18:10:23.696039: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2021-09-13 18:10:24.051124: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2021-09-13 18:10:24.056412: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
2021-09-13 18:10:24.056454: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2021-09-13 18:10:29.357479: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-09-13 18:10:29.357513: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263] 0
2021-09-13 18:10:29.357519: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 0: N
2021-09-13 18:10:29.364017: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 22477 MB memory) -> physical GPU (device: 0, name: NVIDIA Quadro RTX 6000, pci bus id: 0000:d8:00.0, compute capability: 7.5)
TensorFlow: 2.3
Model: resnet50
Dataset: imagenet (synthetic)
Mode: training
SingleSess: False
Batch size: 64 global
64 per device
Num batches: 100
Num epochs: 0.00
Devices: ['horovod/gpu:0']
NUMA bind: False
Data format: NCHW
Optimizer: sgd
Variables: horovod
==========
Generating training model
WARNING:tensorflow:From /storage/scratch1/5/emin7/horovod-test/anotherOne/tf_cnn_benchmarks/convnet_builder.py:134: conv2d (from tensorflow.python.keras.legacy_tf_layers.convolutional) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.keras.layers.Conv2D` instead.
W0913 18:10:29.430423 46912496342336 deprecation.py:323] From /storage/scratch1/5/emin7/horovod-test/anotherOne/tf_cnn_benchmarks/convnet_builder.py:134: conv2d (from tensorflow.python.keras.legacy_tf_layers.convolutional) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.keras.layers.Conv2D` instead.
WARNING:tensorflow:From /usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/legacy_tf_layers/convolutional.py:424: Layer.apply (from tensorflow.python.keras.engine.base_layer_v1) is deprecated and will be removed in a future version.
Instructions for updating:
Please use `layer.__call__` method instead.
W0913 18:10:29.441053 46912496342336 deprecation.py:323] From /usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/legacy_tf_layers/convolutional.py:424: Layer.apply (from tensorflow.python.keras.engine.base_layer_v1) is deprecated and will be removed in a future version.
Instructions for updating:
Please use `layer.__call__` method instead.
WARNING:tensorflow:From /storage/scratch1/5/emin7/horovod-test/anotherOne/tf_cnn_benchmarks/convnet_builder.py:266: max_pooling2d (from tensorflow.python.keras.legacy_tf_layers.pooling) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.MaxPooling2D instead.
W0913 18:10:29.496724 46912496342336 deprecation.py:323] From /storage/scratch1/5/emin7/horovod-test/anotherOne/tf_cnn_benchmarks/convnet_builder.py:266: max_pooling2d (from tensorflow.python.keras.legacy_tf_layers.pooling) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.MaxPooling2D instead.
Initializing graph
WARNING:tensorflow:From /storage/scratch1/5/emin7/horovod-test/anotherOne/tf_cnn_benchmarks/benchmark_cnn.py:2267: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
W0913 18:10:35.032272 46912496342336 deprecation.py:323] From /storage/scratch1/5/emin7/horovod-test/anotherOne/tf_cnn_benchmarks/benchmark_cnn.py:2267: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2021-09-13 18:10:35.805860: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:d8:00.0 name: NVIDIA Quadro RTX 6000 computeCapability: 7.5
coreClock: 1.77GHz coreCount: 72 deviceMemorySize: 23.65GiB deviceMemoryBandwidth: 625.94GiB/s
2021-09-13 18:10:35.805907: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2021-09-13 18:10:35.805947: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2021-09-13 18:10:35.805959: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2021-09-13 18:10:35.805969: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2021-09-13 18:10:35.805978: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2021-09-13 18:10:35.805988: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2021-09-13 18:10:35.805998: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2021-09-13 18:10:35.811220: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
2021-09-13 18:10:35.811262: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-09-13 18:10:35.811269: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263] 0
2021-09-13 18:10:35.811274: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 0: N
2021-09-13 18:10:35.816169: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 22477 MB memory) -> physical GPU (device: 0, name: NVIDIA Quadro RTX 6000, pci bus id: 0000:d8:00.0, compute capability: 7.5)
INFO:tensorflow:Running local_init_op.
I0913 18:10:36.984717 46912496342336 session_manager.py:505] Running local_init_op.
INFO:tensorflow:Done running local_init_op.
I0913 18:10:37.048840 46912496342336 session_manager.py:508] Done running local_init_op.
Running warm up
2021-09-13 18:10:40.291528: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2021-09-13 18:10:42.124966: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
Done warm up
Step Img/sec total_loss
1 images/sec: 301.9 +/- 0.0 (jitter = 0.0) 7.608
10 images/sec: 307.8 +/- 0.7 (jitter = 0.9) 7.849
20 images/sec: 305.5 +/- 0.8 (jitter = 4.7) 8.013
30 images/sec: 304.9 +/- 0.6 (jitter = 4.6) 7.940
40 images/sec: 305.3 +/- 0.6 (jitter = 4.8) 8.136
50 images/sec: 304.4 +/- 0.5 (jitter = 2.3) 8.052
60 images/sec: 303.8 +/- 0.5 (jitter = 1.6) 7.783
70 images/sec: 303.3 +/- 0.4 (jitter = 1.3) 7.854
80 images/sec: 302.9 +/- 0.4 (jitter = 1.3) 8.011
90 images/sec: 302.6 +/- 0.4 (jitter = 1.5) 7.843
100 images/sec: 302.4 +/- 0.3 (jitter = 1.6) 8.095
----------------------------------------------------------------
total images/sec: 302.35
----------------------------------------------------------------
---------------------------------------
Begin PBS Epilogue Mon Sep 13 18:11:17 EDT 2021
Job ID: 2831141.sched-torque.pace.gatech.edu
User ID: emin7
Job name: HorovodTest
Resources: nodes=1:ppn=4:gpus=1:RTX6000,walltime=00:20:00,neednodes=1:ppn=4:gpus=1:RTX6000
Rsrc Used: cput=00:01:00,vmem=22614868kb,walltime=00:01:06,mem=1847168kb,energy_used=0
Queue: inferno
Nodes:
atl1-1-03-006-11.pace.gatech.edu
End PBS Epilogue Mon Sep 13 18:11:17 EDT 2021
---------------------------------------
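The key result is the total images/sec line near the end, which reports the overall training throughput (about 302 images/sec in this run). As a convenience, you can print just that line from the output file:
grep "total images/sec" test-cnn.out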
- After the result files are produced, you can move them off the cluster; refer to the file transfer guide for help.
- Congratulations! You successfully ran a Python script using Horovod on the cluster.