Install torch with Intel MKL and NCCL

Intel MKL speeds up math computation on the CPU, and NCCL enables fast multi-GPU communication. Both are desirable for running experiments, so let's install them.

Install Intel MKL

I prefer to install all packages in $HOME/apps, so first create the directory.

mkdir -p ~/apps

Get the download link from Intel MKL's page and download the archive on the server.

cd ~/apps
# Put your real link here.
tar xzvf l_mkl*.tgz
rm l_mkl_*.tgz
cd l_mkl_*

Install Torch

Clone the Torch repository and run the dependency installation script.

git clone ~/apps/torch --recursive
cd ~/apps/torch
bash install-deps

Error: no default constructor exists for class ….

If you get this error, try installing the latest CUDA driver, and make sure it is not a pre-release version.

Install NCCL

Clone the NCCL repository and compile the project.

cd ~/apps
git clone
cd nccl
make CUDA_HOME=/usr/local/cuda test
sudo make install
# Copy libnccl files to cuda's folder,
# so you don't have to modify the environment paths
sudo cp /usr/local/lib/libnccl* /usr/local/cuda/lib64/
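To confirm that the copy worked, a small check like the one below can list the NCCL libraries found in each directory. This is only a sketch: find_nccl_libs is a hypothetical helper name, and the two directories are simply the ones used in the commands above.

```python
import glob
import os


def find_nccl_libs(directories):
    """Return {directory: sorted list of libnccl* files found there}."""
    return {
        d: sorted(glob.glob(os.path.join(d, "libnccl*")))
        for d in directories
    }


if __name__ == "__main__":
    # Directories taken from the install and copy commands above.
    found = find_nccl_libs(["/usr/local/lib", "/usr/local/cuda/lib64"])
    for directory, libs in found.items():
        print(directory, "->", libs if libs else "no libnccl files found")
```

An empty list for /usr/local/cuda/lib64 would suggest that the copy step did not run or that NCCL was installed somewhere else.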

Restart the terminal to make Torch available

If you are running tmux, try source ~/.zshrc or source ~/.bashrc.


It turns out that Torch does not yet support cuDNN 6.0.

Getting mad at Theano

Theano is a fantastic computational graph builder and optimizer. However, the graph optimization can drive you mad when it gives these two errors:

  • Out of Memory
  • Index out of bounds

Now I'm encountering both of these issues. For the out-of-memory error, the “omnipotent” solution recommended in the Theano user group is to reduce the batch size. Reducing the batch size is a workaround, but it considerably slows down training. A smarter way to debug is to turn on the exception_verbosity=high option, which prints the storage map, so you can see which operations occupy most of the GPU memory. Another fix that works for me is to use Theano's built-in APIs whenever possible, for example computing the loss with a built-in such as T.nnet.categorical_crossentropy rather than a hand-written expression.
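One way to switch the option on without editing .theanorc is the THEANO_FLAGS environment variable, which Theano reads when it is first imported. The sketch below assumes that route; merge_theano_flags is a hypothetical helper, not part of Theano.

```python
import os


def merge_theano_flags(existing, **flags):
    """Merge key=value settings into a comma-separated THEANO_FLAGS string."""
    pairs = [p.split("=", 1) for p in existing.split(",") if "=" in p]
    current = {key.strip(): value.strip() for key, value in pairs}
    current.update(flags)
    return ",".join(f"{key}={value}" for key, value in current.items())


# This must run before the first `import theano` in the process,
# because the flags are read at import time.
os.environ["THEANO_FLAGS"] = merge_theano_flags(
    os.environ.get("THEANO_FLAGS", ""),
    exception_verbosity="high",
)
```

If Theano is already imported, the same setting can be changed at runtime with theano.config.exception_verbosity = "high".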

Now, let's talk about Index out of bounds. Frankly speaking, I don't have a good solution for this one. If the error happens in the forward graph, using test values (tensor.tag.test_value) makes it easy to locate. The tough case is when the error happens during backpropagation. I ran into this when implementing a Neural Machine Translation model. It is very hard to solve because the backward graph is basically undebuggable for normal users: a debug print of the graph is unreadable unless you have a good understanding of what the graph optimization engine is doing. In the end, the only solution I found is to run the whole graph in NumPy and hope the same error shows up there.
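As an illustration of the NumPy approach, the sketch below re-implements a single indexing step with an explicit bounds check. The embedding lookup is only an assumed example of where such an error might come from in a translation model; embed and its arguments are hypothetical names, not code from the actual model.

```python
import numpy as np


def embed(token_ids, embedding_matrix):
    """NumPy stand-in for an embedding lookup, with an explicit bounds check."""
    token_ids = np.asarray(token_ids)
    vocab_size = embedding_matrix.shape[0]
    # Negative ids are flagged too, since NumPy would silently wrap them.
    bad = (token_ids < 0) | (token_ids >= vocab_size)
    if bad.any():
        raise IndexError(
            "token ids %s at positions %s fall outside [0, %d)"
            % (token_ids[bad].tolist(), np.argwhere(bad).tolist(), vocab_size)
        )
    return embedding_matrix[token_ids]


if __name__ == "__main__":
    embeddings = np.zeros((5, 3))              # vocabulary of 5, dimension 3
    batch = np.array([[0, 1, 4], [2, 5, 3]])   # the id 5 is out of range
    try:
        embed(batch, embeddings)
    except IndexError as err:
        print("caught:", err)
```

Unlike the compiled graph, the NumPy version reports which ids are invalid and where they sit in the batch, which is usually enough to trace the problem back to the data or the vocabulary size.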