Install Horovod for PyTorch to work with the job scheduler on ABCI
04 Oct 2019

Horovod is a great tool for multi-node, multi-GPU gradient synchronization. The official ABCI documentation provides a guide on running TensorFlow with multiple nodes, which can be found at https://docs.abci.ai/en/apps/tensorflow/. However, getting Horovod to run with PyTorch across multiple nodes is not as straightforward.
First, it turns out that Anaconda and Miniconda are not well supported in this case. Horovod's PyTorch plugin requires a GCC version higher than 4.9, so in the end the only workable solution may be to stick with the Python provided in the module list and use venv for package management.
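Before going further, it is worth confirming which compilers are available on the system. A minimal check, assuming the gcc/7.3.0 install path used later in this post:

module avail gcc
/apps/gcc/7.3.0/bin/g++ --version   # should report 7.3.0, well above the 4.9 requirement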
Load the modules and create a Python virtual environment:
module load cuda/10.0/10.0.130 cudnn/7.6/7.6.4 nccl/2.4/2.4.8-1 python/3.6/3.6.5 openmpi/2.1.6
python3 -m venv $HOME/base
source $HOME/base/bin/activate
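To confirm the virtual environment is active, check that python3 now resolves inside it:

which python3       # should print $HOME/base/bin/python3
python3 --version   # should print Python 3.6.5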
Install PyTorch:
pip install numpy
pip install torch
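A quick sanity check that PyTorch imports cleanly (note that torch.cuda.is_available() may return False on the login node, since GPUs are only visible on compute nodes):

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"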
Install Horovod with NCCL support, pointing the build at the newer g++ (HOROVOD_NCCL_HOME should point to the NCCL install root that contains include/ and lib/):
CC=/apps/gcc/7.3.0/bin/gcc CXX=/apps/gcc/7.3.0/bin/g++ HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_NCCL_HOME=/apps/nccl/2.4.8-1/cuda10.0 pip install horovod
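A minimal import test confirms that the build succeeded:

python -c "import horovod.torch as hvd; print('horovod OK')"

If the build picked up the wrong compiler or could not find NCCL, this import fails with a load error; in that case, fix the variables and rebuild with pip install --no-cache-dir horovod.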
To test whether Horovod works, make a simple script named hvd_test.py:
import horovod.torch as hvd
import torch

# Initialize Horovod; this must be called before any other hvd call.
hvd.init()

# Every process prints the world size, its global rank, its rank within
# the node, and the name of the GPU it maps to.
print(
    "size=", hvd.size(),
    "global_rank=", hvd.rank(),
    "local_rank=", hvd.local_rank(),
    "device=", torch.cuda.get_device_name(hvd.local_rank()),
)
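Before submitting a batch job, the script can be tried on a single interactive node. A sketch, assuming your ABCI group is group_name (qrsh is ABCI's interactive job command; rt_F=1 requests one full node with four GPUs):

qrsh -g group_name -l rt_F=1 -l h_rt=1:00:00
# then load the same modules, activate the venv, and run:
mpirun -np 4 python hvd_test.py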
To launch it as a batch job across multiple nodes, make a script named hvd_job.sh like this:
#!/bin/bash
#$ -cwd

source /etc/profile.d/modules.sh
module load cuda/10.0/10.0.130 cudnn/7.6/7.6.4 nccl/2.4/2.4.8-1 python/3.6/3.6.5 openmpi/2.1.6
source $HOME/base/bin/activate

# One process per GPU: ABCI full nodes (rt_F) have 4 GPUs across 2 sockets.
NUM_NODES=${NHOSTS}
NUM_GPUS_PER_NODE=4
NUM_GPUS_PER_SOCKET=$(expr ${NUM_GPUS_PER_NODE} / 2)
NUM_PROCS=$(expr ${NUM_NODES} \* ${NUM_GPUS_PER_NODE})

# Make sure the newer libstdc++ used to build Horovod is found at run time.
export LD_LIBRARY_PATH=/apps/gcc/7.3.0/lib64:$LD_LIBRARY_PATH

mpirun -np ${NUM_PROCS} --map-by ppr:${NUM_GPUS_PER_SOCKET}:socket \
    --mca mpi_warn_on_fork 0 -x PATH -x LD_LIBRARY_PATH \
    python hvd_test.py > hvd_test.stdout
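Note that the shell redirection above interleaves the output of all ranks into one file. If per-rank files are preferred, Open MPI's --output-filename option can be used instead of the redirection (a sketch; the exact naming scheme of the generated files varies between Open MPI versions):

mpirun -np ${NUM_PROCS} --map-by ppr:${NUM_GPUS_PER_SOCKET}:socket \
    --mca mpi_warn_on_fork 0 -x PATH -x LD_LIBRARY_PATH \
    --output-filename hvd_test python hvd_test.py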
Submit the job with the following command, where rt_F=2 requests two full nodes (eight GPUs in total):
qsub -g group_name -l rt_F=2 hvd_job.sh
After the job has finished, you can check the output with
cat hvd_test.stdout
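With two full nodes you should see eight lines, one per process, along the lines of (device names and ordering will vary):

size= 8 global_rank= 3 local_rank= 3 device= Tesla V100-SXM2-16GB

From here, turning the test into real distributed training follows the standard Horovod PyTorch pattern: pin each process to its local GPU, wrap the optimizer in hvd.DistributedOptimizer, and broadcast the initial state from rank 0. A minimal sketch; the model, data, and hyperparameters below are placeholders, not part of the original setup:

import torch
import torch.nn.functional as F
import horovod.torch as hvd

hvd.init()
# Pin this process to its local GPU.
torch.cuda.set_device(hvd.local_rank())

# Placeholder model and optimizer; replace with your own.
model = torch.nn.Linear(10, 1).cuda()
# Horovod convention: scale the learning rate by the number of workers.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Average gradients across all processes on each step.
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())

# Make sure every process starts from the same state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

for step in range(100):
    # Placeholder random batch; in practice, use a DataLoader with a
    # DistributedSampler so each rank sees a different shard of the data.
    x = torch.randn(32, 10).cuda()
    y = torch.randn(32, 1).cuda()
    optimizer.zero_grad()
    loss = F.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()

if hvd.rank() == 0:
    print("final loss:", loss.item())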