Install horovod for PyTorch to work with slurm on ABCI

Horovod is a great tool for multi-node, multi-gpu gradient synchronization. The official ABCI document does give a guide on running Tensorflow with multiple nodes, which can be found here However, to make horovod run with PyTorch and work with multiple nodes is not that straight-forward.

First, it turns out that Anaconda and Miniconda is not well supported in this case. The PyTorch plugin requires GCC version higher than 4.9. So finally, may be the only solution is to stick with the Python provided in module list and use venv for package control.

Load modules and create a python environment:

module load cuda/10.0/10.0.130 cudnn/7.6/7.6.4 nccl/2.4/2.4.8-1 python/3.6/3.6.5 openmpi/2.1.6
python3 -m venv $HOME/base
source $HOME/base/bin/activate

Install PyTorch

pip install numpy
pip install torch

Install horovod with NCCL support and a higher version of g++

CC=/apps/gcc/7.3.0/bin/gcc CXX=/apps/gcc/7.3.0/bin/g++ HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_NCCL_HOME=/apps/nccl/2.4.8-1/cuda10.0/lib pip install horovod

To test whether the horovod works or not, let’s make a simple script named .

import horovod.torch as hvd
import torch
	"size=", hvd.size(),
	"global_rank=", hvd.rank(),
	"local_rank=", hvd.local_rank(),
	"device=", torch.cuda.get_device_name(hvd.local_rank())

The launch the job, make a script like this:


source /etc/profile.d/
module load cuda/10.0/10.0.130 cudnn/7.6/7.6.4 nccl/2.4/2.4.8-1 python/3.6/3.6.5 openmpi/2.1.6
source $HOME/base/bin/activate



mpirun -np $NUM_PROCS --map-by ppr:${NUM_GPUS_PER_SOCKET}:socket \
--mca mpi_warn_on_fork 0 -x PATH -x LD_LIBRARY_PATH \
python > hvd_test.stdout

Launch the job with

qsub -g group_name -l rt_F=2

After the job is executed, you can check the output by

cat hvd_test.stdout