Sunday, 6 May 2018

Installing Intel Python 3, tensorflow-gpu, and multiple versions of CUDA and cuDNN on CentOS 7

Content:

1. Introduction

2. Enabling the use of EPEL repository

3. Installing multiple CUDA versions on CentOS 7

4. Monitoring the NVidia GPU device by nvidia-smi

5. Making the software state in the NVIDIA driver persistent

6. Installing cuDNN for multiple versions of CUDA

7. Installing Intel Python 3 and tensorflow-gpu

8. Testing the CUDA and cuDNN installation

8.1. Testing if cuDNN library is loadable

8.2. Testing the CUDA Python 3 integration by using Numba

8.3. Testing the CUDA Python 3 integration by using tensorflow-gpu


This publication describes how to install multiple versions of CUDA and cuDNN on the same system running CentOS 7 to support various applications, and tensorflow in particular (via tensorflow-gpu). The recipes provided below can be followed when adding GPU computing support to compute nodes that are part of an HPC cluster.


This is an optional step, applicable only if the configuration for the EPEL repository is not present in /etc/yum.repos.d. EPEL is required here because installing the NVIDIA graphics driver, which is part of the CUDA packages, requires DKMS to be available in the system in advance, and that package is provided by EPEL. To use EPEL, first install its repository package:

# yum install epel-release
# yum update

The dkms RPM package will be installed later, as a dependency required by the CUDA packages (see next section).
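Whether this optional step is needed can also be checked programmatically. The following is a minimal Python sketch, assuming that the repository definitions live in /etc/yum.repos.d and that the EPEL section is named [epel], as in the stock epel-release package. The function name has_repo_section is hypothetical and used only for illustration.

```python
import glob
import os
import re


def has_repo_section(section, repo_dir="/etc/yum.repos.d"):
    """Return True if any *.repo file in repo_dir defines [section]."""
    pattern = re.compile(r"^\[" + re.escape(section) + r"\][ \t]*$", re.MULTILINE)
    for path in sorted(glob.glob(os.path.join(repo_dir, "*.repo"))):
        try:
            with open(path, encoding="utf-8", errors="replace") as handle:
                if pattern.search(handle.read()):
                    return True
        except OSError:
            # Unreadable file: skip it rather than abort the whole check.
            continue
    return False
```

Calling has_repo_section("epel") returns False when no repository file defines the section, in which case the epel-release package still needs to be installed.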


The most reasonable question here is why we need multiple versions of CUDA installed and supported locally on the system. The answer is straightforward: it is all about the specific requirements of the application software. Some software products are very strict about the version of CUDA they work with.

The most rational way to install the CUDA packages on CentOS 7 is through yum. NVidia provides the configuration files for using their yum repositories as a separate RPM package, which might be downloaded here:

https://developer.nvidia.com/cuda-downloads

To initiate the download, select in sequence Linux > x86_64 > CentOS > 7 > rpm (network) > Download, and then install the package by following the instructions given below the "Download" button.

From time to time some inconsistencies appear in the CUDA yum repository. To prevent any problems they might cause, edit the file /etc/yum.repos.d/cuda.repo and change the line:

enabled=1

into

enabled=0

From now on, whenever access to the RPM packages in the CUDA repository is required, supply the command line option --enablerepo=cuda to yum.
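The change to cuda.repo can also be scripted, which is convenient when many compute nodes have to be configured. The sketch below is only an illustration, assuming the file uses the plain "enabled=1" form quoted above; the function name disable_repo is hypothetical. Note that it switches off every section found in the text it is given.

```python
import re


def disable_repo(repo_text):
    """Return repo_text with every 'enabled=1' line changed to 'enabled=0'."""
    # Only spaces and tabs are allowed around the tokens, so that
    # neighbouring lines (including blank ones) are left untouched.
    pattern = r"^([ \t]*enabled[ \t]*=[ \t]*)1[ \t]*$"
    return re.sub(pattern, r"\g<1>0", repo_text, flags=re.MULTILINE)
```

The function works on text only. Reading /etc/yum.repos.d/cuda.repo and writing the result back (as root) is left to the caller, so that the outcome can be inspected before the file is overwritten.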

After finishing with the yum configuration, install the RPM packages containing the versions of CUDA currently supported by the vendor:

# yum --enablerepo=cuda install cuda-8-0 cuda-9-0 cuda-9-1

That will install plenty of packages. Take into account their installation size and prepare to meet that demand for disk space.

If, by any chance, the installer fails to install the packages nvidia-kmod, xorg-x11-drv-nvidia, xorg-x11-drv-nvidia-libs, and xorg-x11-drv-nvidia-gl, install them separately:

# yum --enablerepo=cuda install nvidia-kmod xorg-x11-drv-nvidia xorg-x11-drv-nvidia-libs xorg-x11-drv-nvidia-gl

The tool nvidia-smi is part of the package xorg-x11-drv-nvidia. It shows the current status of the NVidia GPU device:

$ nvidia-smi

Sun May  6 17:15:10 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.30                 Driver Version: 390.30                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro K620         On   | 00000000:02:00.0 Off |                  N/A |
| 34%   36C    P8     1W /  30W |      1MiB /  2000MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

This tool is useful for checking how many applications are currently running on the GPU device, the device temperature, power consumption, and utilization rate, and how much memory the applications occupy.
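For monitoring scripts, the same information can be requested in CSV form through the --query-gpu and --format options of nvidia-smi. What follows is a minimal Python sketch of that approach. The selection of fields and the function names are assumptions made for illustration, and the values are kept as strings because some devices report a field as unsupported.

```python
import subprocess

# Fields requested from nvidia-smi (an illustrative selection).
FIELDS = ("name", "temperature.gpu", "power.draw",
          "utilization.gpu", "memory.used", "memory.total")


def parse_gpu_status(csv_text):
    """Parse 'nvidia-smi --format=csv,noheader,nounits' output into dicts."""
    gpus = []
    for line in csv_text.strip().splitlines():
        values = [item.strip() for item in line.split(",")]
        gpus.append(dict(zip(FIELDS, values)))
    return gpus


def query_gpu_status():
    """Run nvidia-smi and return one dict per GPU (needs the NVIDIA driver)."""
    command = ["nvidia-smi",
               "--query-gpu=" + ",".join(FIELDS),
               "--format=csv,noheader,nounits"]
    output = subprocess.check_output(command, universal_newlines=True)
    return parse_gpu_status(output)
```

On a node with a working driver, query_gpu_status() returns a list with one dictionary per GPU device, keyed by the field names above.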


To prevent the driver from releasing the NVidia GPU device when that device is not in use by any process, the daemon nvidia-persistenced (part of the package xorg-x11-drv-nvidia) needs to be enabled and started:

# systemctl enable nvidia-persistenced
# systemctl start nvidia-persistenced

The cuDNN library and header files can be downloaded from the web page of the vendor at:

https://developer.nvidia.com/cudnn

Note that user registration is required to obtain the cuDNN files. Also, you need to download the archive with the cuDNN library and header files for each CUDA version that is locally installed and supported. In this case, that brings the following files into the download directory:

cudnn-8.0-linux-x64-v6.0.tgz
cudnn-9.0-linux-x64-v7.tgz
cudnn-9.1-linux-x64-v7.tgz

To proceed with the installation, unpack the content of the archives into the respective CUDA installation folders and recreate the database with the dynamic linker run time bindings, by executing (as root or super user) the command lines:

# tar --strip-components 1 -xf cudnn-8.0-linux-x64-v6.0.tgz -C /usr/local/cuda-8.0
# tar --strip-components 1 -xf cudnn-9.0-linux-x64-v7.tgz -C /usr/local/cuda-9.0
# tar --strip-components 1 -xf cudnn-9.1-linux-x64-v7.tgz -C /usr/local/cuda-9.1
# ldconfig /

It is recommended to verify that the archives were unpacked successfully and that the dynamic linker cache was properly rebuilt, by listing the cache and searching the output for the string "cudnn":

$ ldconfig -p | grep cudnn

A grep result indicating a successful cuDNN installation looks like:

libcudnn.so.7 (libc6,x86-64) => /usr/local/cuda-9.0/targets/x86_64-linux/lib/libcudnn.so.7
libcudnn.so.7 (libc6,x86-64) => /usr/local/cuda-9.1/targets/x86_64-linux/lib/libcudnn.so.7
libcudnn.so.6 (libc6,x86-64) => /usr/local/cuda-8.0/targets/x86_64-linux/lib/libcudnn.so.6
libcudnn.so (libc6,x86-64) => /usr/local/cuda-8.0/targets/x86_64-linux/lib/libcudnn.so
libcudnn.so (libc6,x86-64) => /usr/local/cuda-9.0/targets/x86_64-linux/lib/libcudnn.so
libcudnn.so (libc6,x86-64) => /usr/local/cuda-9.1/targets/x86_64-linux/lib/libcudnn.so

Do not be confused by the multiple entries for libcudnn.so in the cache (as seen in the output above). They may look like a collision, but note that each of the libcudnn.so files is a symlink that resolves to a library file carrying its own version number. That number is used by the tensorflow libraries to find which of the files best matches their version requirements.
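To see at a glance which versioned library names are registered and where they point, the output of "ldconfig -p" can be grouped by library name. The sketch below is a hedged illustration of that idea; the function name group_by_soname and the regular expression are assumptions based on the output format shown above.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   libcudnn.so.7 (libc6,x86-64) => /usr/local/cuda-9.0/.../libcudnn.so.7
ENTRY = re.compile(r"^\s*(\S+)\s+\(([^)]*)\)\s+=>\s+(\S+)\s*$")


def group_by_soname(ldconfig_output, needle="cudnn"):
    """Map each library name containing needle to the list of its paths."""
    libs = defaultdict(list)
    for line in ldconfig_output.splitlines():
        match = ENTRY.match(line)
        if match and needle in match.group(1):
            libs[match.group(1)].append(match.group(3))
    return dict(libs)
```

The text to parse can be captured with subprocess.check_output(["ldconfig", "-p"], universal_newlines=True). For the listing above, the result contains two paths under libcudnn.so.7, one under libcudnn.so.6, and three under libcudnn.so.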


If Intel Python 3 is not available in the system, follow the instructions given here:

https://software.intel.com/en-us/articles/installing-intel-free-libs-and-python-yum-repo

on how to install it. It is a single RPM package (mind its large installation size of several gigabytes) which contains tensorflow but (currently) does not include the tensorflow-gpu module. Once Intel Python 3 is available, the tensorflow-gpu module can be installed by invoking pip (the one provided by Intel Python 3).

Do not install tensorflow-gpu or any other module for Intel Python 3 as root or super user. Avoid any module installations inside the /opt/intel/intelpython3/ folder. Instead, perform the installation as an unprivileged user and append the --user option to pip:

$ /opt/intel/intelpython3/bin/pip install --user tensorflow-gpu

The output information generated during the installation process should look like:

Collecting tensorflow-gpu
  Downloading https://files.pythonhosted.org/packages/59/41/ba6ac9b63c5bfb90377784e29c4f4c478c74f53e020fa56237c939674f2d/tensorflow_gpu-1.8.0-cp36-cp36m-manylinux1_x86_64.whl (216.2MB)
    100% |████████████████████████████████| 216.3MB 7.8kB/s 
Collecting protobuf>=3.4.0 (from tensorflow-gpu)
  Downloading https://files.pythonhosted.org/packages/74/ad/ecd865eb1ba1ff7f6bd6bcb731a89d55bc0450ced8d457ed2d167c7b8d5f/protobuf-3.5.2.post1-cp36-cp36m-manylinux1_x86_64.whl (6.4MB)
    100% |████████████████████████████████| 6.4MB 266kB/s 
Collecting gast>=0.2.0 (from tensorflow-gpu)
  Downloading https://files.pythonhosted.org/packages/5c/78/ff794fcae2ce8aa6323e789d1f8b3b7765f601e7702726f430e814822b96/gast-0.2.0.tar.gz
Collecting termcolor>=1.1.0 (from tensorflow-gpu)
  Downloading https://files.pythonhosted.org/packages/8a/48/a76be51647d0eb9f10e2a4511bf3ffb8cc1e6b14e9e4fab46173aa79f981/termcolor-1.1.0.tar.gz
Requirement already satisfied: wheel>=0.26 in /opt/intel/intelpython3/lib/python3.6/site-packages (from tensorflow-gpu)
Collecting tensorboard<1.9.0,>=1.8.0 (from tensorflow-gpu)
  Downloading https://files.pythonhosted.org/packages/59/a6/0ae6092b7542cfedba6b2a1c9b8dceaf278238c39484f3ba03b03f07803c/tensorboard-1.8.0-py3-none-any.whl (3.1MB)
    100% |████████████████████████████████| 3.1MB 545kB/s 
Collecting grpcio>=1.8.6 (from tensorflow-gpu)
  Downloading https://files.pythonhosted.org/packages/c8/b8/00e703183b7ae5e02f161dafacdfa8edbd7234cb7434aef00f126a3a511e/grpcio-1.11.0-cp36-cp36m-manylinux1_x86_64.whl (8.8MB)
    100% |████████████████████████████████| 8.8MB 195kB/s 
Collecting astor>=0.6.0 (from tensorflow-gpu)
  Downloading https://files.pythonhosted.org/packages/b2/91/cc9805f1ff7b49f620136b3a7ca26f6a1be2ed424606804b0fbcf499f712/astor-0.6.2-py2.py3-none-any.whl
Requirement already satisfied: numpy>=1.13.3 in /opt/intel/intelpython3/lib/python3.6/site-packages (from tensorflow-gpu)
Requirement already satisfied: six>=1.10.0 in /opt/intel/intelpython3/lib/python3.6/site-packages (from tensorflow-gpu)
Collecting absl-py>=0.1.6 (from tensorflow-gpu)
  Downloading https://files.pythonhosted.org/packages/90/6b/ba04a9fe6aefa56adafa6b9e0557b959e423c49950527139cb8651b0480b/absl-py-0.2.0.tar.gz (82kB)
    100% |████████████████████████████████| 92kB 8.8MB/s 
Requirement already satisfied: setuptools in /opt/intel/intelpython3/lib/python3.6/site-packages (from protobuf>=3.4.0->tensorflow-gpu)
Requirement already satisfied: werkzeug>=0.11.10 in /opt/intel/intelpython3/lib/python3.6/site-packages (from tensorboard<1.9.0,>=1.8.0->tensorflow-gpu)
Collecting bleach==1.5.0 (from tensorboard<1.9.0,>=1.8.0->tensorflow-gpu)
  Downloading https://files.pythonhosted.org/packages/33/70/86c5fec937ea4964184d4d6c4f0b9551564f821e1c3575907639036d9b90/bleach-1.5.0-py2.py3-none-any.whl
Collecting markdown>=2.6.8 (from tensorboard<1.9.0,>=1.8.0->tensorflow-gpu)
  Downloading https://files.pythonhosted.org/packages/6d/7d/488b90f470b96531a3f5788cf12a93332f543dbab13c423a5e7ce96a0493/Markdown-2.6.11-py2.py3-none-any.whl (78kB)
    100% |████████████████████████████████| 81kB 8.9MB/s 
Collecting html5lib==0.9999999 (from tensorboard<1.9.0,>=1.8.0->tensorflow-gpu)
  Downloading https://files.pythonhosted.org/packages/ae/ae/bcb60402c60932b32dfaf19bb53870b29eda2cd17551ba5639219fb5ebf9/html5lib-0.9999999.tar.gz (889kB)
    100% |████████████████████████████████| 890kB 1.7MB/s 
Building wheels for collected packages: gast, termcolor, absl-py, html5lib
  Running setup.py bdist_wheel for gast ... done
  Stored in directory: /home/vesso/.cache/pip/wheels/9a/1f/0e/3cde98113222b853e98fc0a8e9924480a3e25f1b4008cedb4f
  Running setup.py bdist_wheel for termcolor ... done
  Stored in directory: /home/vesso/.cache/pip/wheels/7c/06/54/bc84598ba1daf8f970247f550b175aaaee85f68b4b0c5ab2c6
  Running setup.py bdist_wheel for absl-py ... done
  Stored in directory: /home/vesso/.cache/pip/wheels/23/35/1d/48c0a173ca38690dd8dfccfa47ffc750db48f8989ed898455c
  Running setup.py bdist_wheel for html5lib ... done
  Stored in directory: /home/vesso/.cache/pip/wheels/50/ae/f9/d2b189788efcf61d1ee0e36045476735c838898eef1cad6e29
Successfully built gast termcolor absl-py html5lib
Installing collected packages: protobuf, gast, termcolor, html5lib, bleach, markdown, tensorboard, grpcio, astor, absl-py, tensorflow-gpu
Successfully installed absl-py-0.2.0 astor-0.6.2 bleach-1.5.0 gast-0.2.0 grpcio-1.11.0 html5lib-0.9999999 markdown-2.6.11 protobuf-3.5.2.post1 tensorboard-1.8.0 tensorflow-gpu-1.8.0 termcolor-1.1.0

NOTE: The files brought by the tensorflow-gpu installation to the local file system will be located under ${HOME}/.local/lib/python3.6/site-packages/ directory!
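The location of the per-user module directory, and the place a module would be imported from, can be confirmed from within the interpreter. The following is a small sketch using only the standard library; the helper name module_origin is hypothetical.

```python
import importlib.util
import site


def module_origin(name):
    """Return the file a top-level module would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    return spec.origin if spec is not None else None


# Directory used by "pip install --user" for the running interpreter.
print("user site-packages:", site.getusersitepackages())
# Reports None if tensorflow is not installed for this interpreter.
print("tensorflow found at:", module_origin("tensorflow"))
```

When run with /opt/intel/intelpython3/bin/python3 after the installation above, the second line is expected to report a path below the per-user directory. The lookup does not import tensorflow itself, so it returns quickly.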

Testing whether the cuDNN library is loadable is very easy. If the test returns no error, the dynamic linker is able to locate libcudnn.so and load it into the Python 3 interpreter.

To perform the test create the Python 3 script:

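import ctypes

# Ask the dynamic linker to locate and load libcudnn.so (raises OSError on failure)
t = ctypes.cdll.LoadLibrary("libcudnn.so")

# Print the name under which the library was loaded
print(t._name)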

save it as a file under the name cudnn_loading_cheker.py and then execute the script:

$ /opt/intel/intelpython3/bin/python3 cudnn_loading_cheker.py

If the libcudnn.so is successfully loaded the script will return the name of the library file:

libcudnn.so

and raise an OSError exception otherwise.
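Since several cuDNN versions are installed side by side, it may also help to know which one the plain name libcudnn.so resolves to. The sketch below calls the cuDNN function cudnnGetVersion() through ctypes. It assumes the version number is encoded as major*1000 + minor*100 + patch level, which is the scheme used in the cuDNN 6 and 7 headers; the helper names are hypothetical.

```python
import ctypes


def cudnn_version(libname="libcudnn.so"):
    """Return the integer reported by cudnnGetVersion(), or None if the
    library cannot be loaded."""
    try:
        lib = ctypes.CDLL(libname)
    except OSError:
        return None
    # cudnnGetVersion() takes no arguments and returns a size_t.
    lib.cudnnGetVersion.restype = ctypes.c_size_t
    return int(lib.cudnnGetVersion())


def split_version(number):
    """Split e.g. 7005 into (7, 0, 5); assumes the cuDNN 6/7 encoding."""
    major, rest = divmod(number, 1000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch
```

Passing a versioned name such as "libcudnn.so.6" to cudnn_version() checks one specific installation instead of whatever the unversioned symlink resolves to.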

Along with the other modules for scientific computing and data analysis, the Intel Python 3 package supplies Numba. To perform CUDA-based GPU computing, the Numba JIT compiler requires the environment variables NUMBAPRO_NVVM and NUMBAPRO_LIBDEVICE to be properly declared before compiling any Python code that contains GPU instructions. Those variables should point to the installation tree of the latest version of CUDA:

$ export NUMBAPRO_NVVM=/usr/local/cuda-9.1/nvvm/lib64/libnvvm.so.3.2.0
$ export NUMBAPRO_LIBDEVICE=/usr/local/cuda-9.1/nvvm/libdevice

It is highly recommended to declare these variables in the ${HOME}/.bashrc file.
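A wrong path in either variable typically shows up only later, as a compilation failure inside Numba, so it can be worth validating them up front. The following is a minimal sketch of such a check, assuming NUMBAPRO_NVVM must name an existing file and NUMBAPRO_LIBDEVICE an existing directory, as in the export lines above. The function name check_numba_cuda_env is hypothetical.

```python
import os


def check_numba_cuda_env(environ=None):
    """Return a list of problems found in the two Numba CUDA variables."""
    environ = os.environ if environ is None else environ
    # NUMBAPRO_NVVM names a library file, NUMBAPRO_LIBDEVICE a directory.
    checks = {"NUMBAPRO_NVVM": os.path.isfile,
              "NUMBAPRO_LIBDEVICE": os.path.isdir}
    problems = []
    for name, exists in checks.items():
        value = environ.get(name)
        if not value:
            problems.append(name + " is not set")
        elif not exists(value):
            problems.append(name + " points to a missing path: " + value)
    return problems
```

An empty list means that both variables are set and point to existing paths. It does not prove that the paths belong to a CUDA version Numba can work with.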

Once the variables are declared and loaded, execute the test script /opt/intel/intelpython3/lib/python3.6/site-packages/numba/cuda/tests/cudapy/test_matmul.py:

$ /opt/intel/intelpython3/bin/python3 /opt/intel/intelpython3/lib/python3.6/site-packages/numba/cuda/tests/cudapy/test_matmul.py

In case of successful execution the script will exit by displaying the message:

.
----------------------------------------------------------------------
Ran 1 test in 0.093s

OK

A simple script for testing tensorflow-gpu can be found here:

https://github.com/yaroslavvb/stuff/blob/master/matmul_benchmark.py

It should be downloaded and then executed by using the Intel Python 3 interpreter:

$ /opt/intel/intelpython3/bin/python3 matmul_benchmark.py

and in case of successful execution the following result will appear on the screen:

/opt/intel/intelpython3/lib/python3.6/site-packages/h5py/__init__.py:34: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
2018-05-06 16:21:22.591713: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-05-06 16:21:22.684411: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-05-06 16:21:22.684824: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties: 
name: Quadro K620 major: 5 minor: 0 memoryClockRate(GHz): 1.124
pciBusID: 0000:01:00.0
totalMemory: 1.95GiB freeMemory: 1.92GiB
2018-05-06 16:21:22.684855: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-05-06 16:21:23.151861: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-05-06 16:21:23.151903: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929]      0 
2018-05-06 16:21:23.151916: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0:   N 
2018-05-06 16:21:23.152061: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 1692 MB memory) -> physical GPU (device: 0, name: Quadro K620, pci bus id: 0000:01:00.0, compute capability: 5.0)

 8192 x 8192 matmul took: 1.34 sec, 817.99 G ops/sec
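The throughput figure in the last line can be cross-checked by hand. The sketch below assumes the benchmark counts n^3 multiplications plus n^2(n-1) additions for a dense n x n matrix product, which is the usual operation count; this is an assumption about how the figure is derived, and the function names are hypothetical.

```python
def matmul_ops(n):
    """Multiplications plus additions in a dense (n x n) by (n x n) product."""
    return n ** 3 + (n - 1) * n ** 2


def gops_per_sec(n, seconds):
    """Throughput in units of 10**9 operations per second."""
    return matmul_ops(n) / seconds / 1e9


# Values taken from the benchmark output: n = 8192, 1.34 seconds.
print("%.1f G ops/sec" % gops_per_sec(8192, 1.34))
```

Under that assumption the result comes out at roughly 820 G ops/sec, close to the reported 817.99. The small difference is consistent with the elapsed time being rounded to two decimals in the printed line.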


Tuesday, 7 March 2017

Compiling xdrfile library by using PGI C compiler (fast and dirty way)

Content:

1. Introduction.

2. Downloading the code of xdrfile library.

3. Compiling and installing the code.

4. Testing the compilation.


The xdrfile library is a side project of GROMACS and provides an easy-to-use universal interface for reading XTC and TRR trajectories from within applications written in Python, C/C++, and Fortran.

The library code can easily be compiled with the GNU C and Intel C compilers, but an attempt to use the PGI compiler collection through the configure script will fail. One of the reasons for that failure is that the PGI compilers do not support invocation with GNU-style parameters. Of course, that might be solved by changing the configure script to match the set of input parameters of the PGI C compiler, but unless you want to pack the compilation as an RPM, DEB, or other package, it is not worth doing. The code can be compiled directly from the command line, without any automation, in seconds, and the goal of this document is to show how.

 

The source code tarball of the xdrfile library can be downloaded from the download page of the GROMACS project (scroll down the page; the xdrfile links are at the bottom):

http://www.gromacs.org/Downloads

The current version is 1.1.4. Create a separate folder, download and unzip the tarball into it, and enter the folder:

$ mkdir -p ~/tmp/xdrfile
$ cd ~/tmp/xdrfile
$ wget ftp://ftp.gromacs.org/contrib/xdrfile-1.1.4.tar.gz
$ tar zxvf xdrfile-1.1.4.tar.gz
$ cd xdrfile-1.1.4/src

 

Before compiling the code, note that both the shared and the static library might need to be built, because the applications that will use the library might follow different compilation models: some of them need the shared version of the library, while others need the static one. The example below shows how to compile both:

The compilation of shared library (using position independent code (PIC) model):

$ cd ~/tmp/xdrfile/xdrfile-1.1.4/src
$ pgcc -fastsse -fPIC -c xdrfile_xtc.c xdrfile_trr.c xdrfile.c -I../include
$ pgcc -fastsse -fPIC -shared -o libxdrfile.so xdrfile.o xdrfile_trr.o xdrfile_xtc.o
$ sudo cp ~/tmp/xdrfile/xdrfile-1.1.4/src/libxdrfile.so /usr/local/lib

To compile the static version of the library, create first the object files (as shown above) and use the GNU ar tool to pack them:

$ cd ~/tmp/xdrfile/xdrfile-1.1.4/src
$ pgcc -fastsse -fPIC -c xdrfile_xtc.c xdrfile_trr.c xdrfile.c -I../include
$ ar rcs libxdrfile.a xdrfile_xtc.o xdrfile_trr.o xdrfile.o
$ sudo cp ~/tmp/xdrfile/xdrfile-1.1.4/src/libxdrfile.a /usr/local/lib

Note that using the static version of the library is not recommended, unless you are very certain about what you want to achieve by compiling your code statically.

 

The xdrfile library is supplied with two test tools, one written in C and another one in Python. Because the Python test requires some TRR and XTC trajectories to exist, the C tool should be run first: it generates them when it executes successfully.

To execute xdrfile_c_test, it first needs to be compiled from its C source. The compilation against the shared library libxdrfile.so (in this example it is installed locally in /usr/local/lib) follows the recipe:

$ cd ~/tmp/xdrfile/xdrfile-1.1.4/src
$ pgcc -fastsse -o xdrfile_c_test xdrfile_c_test.c -L/usr/local/lib -lxdrfile -I../include

To check whether the shared library compiled before works as expected, and to generate sample TRR and XTC trajectories to test the tools on, execute xdrfile_c_test (presuming libxdrfile.so is in /usr/local/lib):

$ cd ~/tmp/xdrfile/xdrfile-1.1.4/src
$ export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH
$ ./xdrfile_c_test

If the library libxdrfile.so is successfully compiled, loaded, and works as expected the following output will appear on the display:

Testing basic xdrfile library: PASSED
Testing xtc functionality: PASSED
Testing trr functionality: PASSED

and these new trajectory files will be created:

~/tmp/xdrfile/xdrfile-1.1.4/src/test.trr
~/tmp/xdrfile/xdrfile-1.1.4/src/test.xtc

Execute the tool xdrfile_test.py (it requires Python 2.7; presuming libxdrfile.so is in /usr/local/lib):

$ cd ~/tmp/xdrfile/xdrfile-1.1.4/src/python
$ export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH
$ ./xdrfile_test.py

If the execution is successful the displayed result will look like:

../test.trr OK
../test.xtc OK
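The shared library can also be exercised directly from Python through ctypes, without the bundled test script. The sketch below calls read_xtc_natoms(), which is declared in xdrfile_xtc.h and returns 0 (exdrOK) on success. The wrapper name xtc_natoms and its error handling are assumptions made for illustration.

```python
import ctypes


def xtc_natoms(xtc_path, libname="libxdrfile.so"):
    """Return the number of atoms stored in an XTC file, or None if the
    xdrfile shared library cannot be loaded."""
    try:
        lib = ctypes.CDLL(libname)
    except OSError:
        return None
    # C prototype: int read_xtc_natoms(char *fn, int *natoms);
    lib.read_xtc_natoms.argtypes = [ctypes.c_char_p,
                                    ctypes.POINTER(ctypes.c_int)]
    lib.read_xtc_natoms.restype = ctypes.c_int
    natoms = ctypes.c_int(0)
    status = lib.read_xtc_natoms(xtc_path.encode(), ctypes.byref(natoms))
    if status != 0:  # any non-zero code signals a read error
        raise RuntimeError("read_xtc_natoms failed with code %d" % status)
    return natoms.value
```

As with the other tools, LD_LIBRARY_PATH has to include /usr/local/lib for the library to be found by its plain name; alternatively, the full path to libxdrfile.so can be passed as libname.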

Compile trr2xtc from its C source (in this example the shared library libxdrfile.so is installed locally in /usr/local/lib). The include path ../include and the sample trajectories generated above are both relative to the src folder, so the commands below are run from there:

$ cd ~/tmp/xdrfile/xdrfile-1.1.4/src
$ pgcc -fastsse -o trr2xtc trr2xtc.c -L/usr/local/lib -lxdrfile -I../include

Execute it to convert the sample TRR trajectory into an XTC one (if libxdrfile.so is in /usr/local/lib):

$ cd ~/tmp/xdrfile/xdrfile-1.1.4/src
$ export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH
$ ./trr2xtc -i test.trr -o converted.xtc

To be sure that the conversion is successful, compare the SHA256 checksums of the produced converted.xtc and the sample test.xtc (that will work only if test.xtc has not been modified after its creation):

$ cd ~/tmp/xdrfile/xdrfile-1.1.4/src
$ sha256sum converted.xtc
$ sha256sum test.xtc

They must match!
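Comparing two long checksums by eye is error-prone, so the comparison can be left to a few lines of Python. This is a minimal sketch using the standard hashlib module; the function names are hypothetical.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hexadecimal SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """True if both files have the same SHA256 digest."""
    return sha256_of(path_a) == sha256_of(path_b)
```

Calling same_content("converted.xtc", "test.xtc") from the folder that holds both files returns True when the conversion reproduced the sample trajectory byte for byte.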