Running GPU Batch Jobs on Rivanna
We explain how to run GPU batch jobs using different GPU cards on Rivanna, a supercomputer at the University of Virginia. This tutorial is only useful if you can get an account on Rivanna. The official documentation is available from UVA Research Computing. However, it contains some issues and does not explain certain aspects that are important for using GPUs. Therefore, this guide has been created.
PLEASE HELP US IMPROVE THIS GUIDE
Requirements
We require that you have
- A valid account on Rivanna
- A valid accounting group allowing you to run GPU jobs on Rivanna
Introduction
Rivanna is the High-Performance Computing (HPC) cluster managed by the University of Virginia's Research Computing group. Rivanna is composed of 575 nodes with a total of 20,476 cores and 8PB of storage of various types. Table 1 shows an overview of the compute nodes. Some of the compute nodes also include GPUs:
Table 1: GPUs on Rivanna*

| Cores/Node | Memory/Node | Specialty Hardware | GPU memory/Device | GPU devices/Node | # of Nodes |
|---|---|---|---|---|---|
| 40 | 354GB | - | - | - | 1 |
| 20 | 127GB | - | - | - | 115 |
| 28 | 255GB | - | - | - | 25 |
| 40 | 768GB | - | - | - | 34 |
| 40 | 384GB | - | - | - | 348 |
| 24 | 550GB | - | - | - | 4 |
| 16 | 1000GB | - | - | - | 5 |
| 48 | 1500GB | - | - | - | 6 |
| 64 | 180GB | KNL | - | - | 8 |
| 128 | 1000GB | GPU: A100 | 40GB | 8 | 2 |
| 28 | 255GB | GPU: K80 | 11GB | 8 | 9 |
| 28 | 255GB | GPU: P100 | 12GB | 4 | 3 |
| 40 | 383GB | GPU: RTX 2080 Ti | 11GB | 10 | 2 |
| 28 | 188GB | GPU: V100 | 16GB | 4 | 1 |
| 40 | 384GB | GPU: V100 | 32GB | 4 | 12 |

*) This information may be outdated.
Access to Rivanna
Access to Rivanna is secured by the University of Virginia's VPN. UVA offers two different VPNs. We recommend that you install the UVA Anywhere VPN, which is available for Linux, macOS, and Windows.
After installation, you have to start the VPN. After that, you can use a terminal to access Rivanna via ssh. If you have not used ssh before, we encourage you to read about it and explore commands such as `ssh`, `ssh-keygen`, `ssh-copy-id`, `ssh-agent`, and `ssh-add`.
Note: gitbash on Windows
Please note that on Windows you are expected to install gitbash so that you can use the same commands and ssh logic as on Linux and macOS. For this reason, we do not recommend `putty`, PowerShell, or `cmd.exe`. Using gitbash allows everyone to script the same way, even on Windows, which significantly simplifies this guide.
We will not provide an extensive tutorial on how to use ssh (although you are welcome to contribute one). Instead, we summarize the most important steps:
- Create an ssh key if you have not done so before:

  ```bash
  $ ssh-keygen
  ```

  It is VERY important that you create the key with a strong passphrase.
- Add an abbreviation for Rivanna to your `~/.ssh/config` file. Use your favorite editor; mine is emacs:

  ```bash
  $ emacs ~/.ssh/config
  ```

  Copy and paste the following into that file, where `abc1de` is to be substituted by your UVA compute id:

  ```
  Host rivanna
      User abc1de
      HostName rivanna.hpc.virginia.edu
      IdentityFile ~/.ssh/id_rsa
  ```

  This will allow you to use `rivanna` instead of `abc1de@rivanna.hpc.virginia.edu`. The next steps assume you have done this and can use just `rivanna`.
- Copy your public key to rivanna:

  ```bash
  $ ssh-copy-id rivanna
  ```

  This will copy your public key into the file `~/.ssh/authorized_keys` on rivanna.
- After this step, you can use your keys to authenticate. You still need to be using the VPN, though.

  This is most convenient on Mac and Ubuntu, which already come with the tools ssh-agent and keychain. On Windows under gitbash, you need to start the agent yourself with

  ```bash
  $ eval `ssh-agent`
  ```

  First, add your key to the session so you do not have to constantly type in the passphrase:

  ```bash
  $ ssh-add
  ```

  To test if it works, just say

  ```bash
  $ ssh rivanna hostname
  ```

  which will print the hostname of Rivanna. In case your machine does not run ssh-agent, start it (as shown above) before you type the ssh-add command. If everything is set up correctly, the test will return a hostname such as

  ```
  udc-ba35-36
  ```
- To log in to Rivanna, simply say

  ```bash
  $ ssh rivanna
  ```

  If this does not work, you have made a mistake. Please review the previous steps carefully.
Running Jobs on Rivanna
Jobs on Rivanna can be scheduled through SLURM either as batch jobs or as interactive jobs. To achieve this, one first loads the required software and then creates special scripts that submit the job to nodes containing the GPUs you specify.
The user documentation about this is provided by UVA Research Computing. However, at the time we reviewed it, it had some mistakes and limitations that we hope to overcome here.
Modules
Rivanna's default mechanism for software configuration management is modules. The UVA modules documentation is available from UVA Research Computing. Modules provide the ability to load a particular software stack and configuration into your shell as well as into your batch jobs. You can load multiple modules into your environment; they are loaded in order.
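As a minimal sketch of a typical workflow (the specific versions shown here are ones referenced later in this guide and may differ on your system):

```bash
# start from a clean environment
$ module purge

# load modules in order; these versions are examples taken from this guide
$ module load gcc/9.2.0
$ module load openmpi/3.1.6

# show which modules are currently loaded
$ module list
```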
To list the available modules, log into Rivanna and use the command

```bash
$ module avail
```

To search approximately, for example for python modules, use

```bash
$ module avail py
```

This will return all modules that contain `py` in their name. Please choose those that look like python modules.

To probe for deep learning modules, use something similar to

```bash
$ module avail cuda tensorflow pytorch mxnet nvidia cudnn
```
Python
Different versions of python are available.

To load Python 3.8, we can say

```bash
$ module load anaconda/2020.11-py3.8
```

To use Python 3.10.0, we can create and activate a conda environment:

```bash
$ module load anaconda
$ conda create -n py3.10 python=3.10
$ source activate py3.10
$ python -V
Python 3.10.0
```
Please note that at the time of writing, anaconda did not support Python 3.10.2, the version I personally run on my computer using the distribution from python.org.
Adding Modules with Spider
Details about modules can be identified with the `module spider` command. If you type it in without arguments, you get a list of many available configurations. Spider can also take a keyword and lists all available versions that the keyword matches. Let us demonstrate this for python:

```bash
$ module spider python
```
```
----------------------------------------------------------------------------
  python:
----------------------------------------------------------------------------
     Description:
       Python is a programming language that lets you work more effectively.

     Versions:
        python/2.7.16
        python/3.6.6
        python/3.6.8
        python/3.7.7
        python/3.8.8
     Other possible modules matches:
        biopython  openslide-python  wxpython
----------------------------------------------------------------------------
...
```
For detailed information about a specific python package, use the module's full name:

```bash
$ module spider python/3.8.8
```
This will return a page with lots of information. The most important part for us is

```
You will need to load all module(s) on any one of the lines below before the
"python/3.8.8" module is available to load.

  gcc/11.2.0  openmpi/3.1.6
  gcc/9.2.0  cuda/11.0.228  openmpi/3.1.6
  gcc/9.2.0  mvapich2/2.3.3
  gcc/9.2.0  openmpi/3.1.6
  gcccuda/9.2.0_11.0.228  openmpi/3.1.6
  goolfc/9.2.0_3.1.6_11.0.228
```
Here you see the various module combinations that need to be loaded BEFORE you load python. Thus, to properly load python 3.8.8, you need to say (if this is the combination you chose):

```bash
$ module load gcc/11.2.0
$ module load openmpi/3.1.6
$ module load python/3.8.8
```
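To verify that the expected interpreter is now active, you can check its version:

```bash
$ python -V
Python 3.8.8
```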
Modules for tensorflow

```bash
$ module load singularity/3.7.1
$ module load tensorflow/2.7.0
```
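To quickly confirm that tensorflow can see a GPU, a one-line check such as the following can be used. This is a sketch that assumes you run it on a GPU node (on a login node the list will be empty); since the tensorflow module on Rivanna is container-based, you may need to run it through singularity as shown in the Containers section below:

```bash
# list the GPUs visible to tensorflow; an empty list means no GPU is available
$ python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
```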
Modules for pytorch

```bash
$ module load singularity/3.7.1
$ module load pytorch/1.10.0
```
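Similarly, a minimal sketch to check whether pytorch can reach a GPU (again assuming you are on a GPU node and, if the module is container-based, running inside the container):

```bash
# prints True if CUDA and a GPU are available to pytorch
$ python -c "import torch; print(torch.cuda.is_available())"
```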
Containers
Rivanna uses singularity as its container technology. The documentation specific to singularity on Rivanna is available from UVA Research Computing. Singularity also needs to be loaded as a module before it can be used. Singularity containers have the ability to access GPUs via a passthrough using the NVIDIA drivers. Once you load singularity, you can use it as follows:
```bash
$ singularity <cmd> --nv <imagefile> <args>
```
The container will be used inside a job.
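As a concrete sketch (the image path `$CONTAINERDIR/tensorflow-2.7.0.sif` and the script name `mnist.py` are assumptions for illustration; substitute the image belonging to the module you loaded):

```bash
# run a python script inside the containerized tensorflow with GPU passthrough
$ singularity run --nv $CONTAINERDIR/tensorflow-2.7.0.sif mnist.py
```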
Jobs
More details specific to jobs on Rivanna are provided in the official documentation.

Before we start with an example, we explain how to first describe a job in a job description file and then submit it to Rivanna. We use a simple MNIST example that showcases the aspects of successfully running a job on the machine. We will therefore focus on creating jobs that use GPUs.
Eight new A100 GPUs to be added

Rivanna will have eight additional nodes available to us, but they are not yet in service. Instead, we will be using the two existing nodes, which are shared with other users.
Rivanna uses the SLURM job scheduler for allocating submitted jobs. Jobs are charged Service Units (SUs) against a compute allocation. Please contact your supervisor for the name of your allocation. Gregor's allocation is named `bii_dsc` and currently contains 100k SUs. Students from the UVA capstone class will use the allocation `ds6011-sp22-002`.
To see the available SUs for your project, please use the command

```bash
$ allocations
$ allocations -a <allocation_name>
```
SUs can be requested via the Standard Allocation Renewal form. Because SUs are limited, we encourage you to plan ahead and try to avoid unnecessary runs. General instructions for submitting SLURM jobs are provided in the UVA Research Computing documentation.
To request that the job be submitted to the GPU partition, use the option `-p gpu`. The A100 GPUs are a requestable resource. To request them, add the `gres` option with the number of A100 GPUs requested (1 through 8). For example, to request 2 A100 GPUs, use `--gres=gpu:a100:2`.
If you are using a SLURM script to submit the job, the options would appear as follows. Your script will also need to specify other options, such as the allocation to charge, as seen in the sample scripts in the documentation referenced above:

```bash
#SBATCH -p gpu
#SBATCH --gres=gpu:a100:2
#SBATCH -A bii_dsc
```
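Putting this together, a minimal batch script could look like the following sketch. The job name, resource sizes, time limit, and the command at the end are assumptions for illustration; adapt them to your own code and allocation:

```bash
#!/usr/bin/env bash
#SBATCH --job-name=mnist          # hypothetical job name
#SBATCH -p gpu                    # submit to the GPU partition
#SBATCH --gres=gpu:a100:2         # request 2 A100 GPUs
#SBATCH --ntasks=1                # a single task is enough for this example
#SBATCH --time=00:30:00           # adjust the walltime to your workload
#SBATCH -A bii_dsc                # allocation to charge; substitute your own

# load the container-based tensorflow stack as described above
module purge
module load singularity/3.7.1
module load tensorflow/2.7.0

# run the (hypothetical) training script inside the container with GPU passthrough
singularity run --nv $CONTAINERDIR/tensorflow-2.7.0.sif mnist.py
```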
Interactive Jobs
Please avoid running interactive jobs, as they may waste SUs: we are charged even while you keep the A100 idle.
Research Computing also offers some interactive apps such as JupyterLab, RStudio, CodeServer, Blender, and Mathematica via its Open OnDemand portal. However, we ask you to avoid using them for benchmarks.
To request the use of the A100s via Open OnDemand, first log in to the Open OnDemand portal and select the desired interactive app. You will be presented with a form to complete. Currently, you would

- select `gpu` for the Rivanna partition,
- select `NVIDIA A100` from the `Optional: GPU type for GPU partition` pulldown menu, and
- enter the number of desired GPUs in the `Optional: Number of GPUs` field.

Once you have completed the form, click the `Launch` button and your session will be launched. The session will start once the resources are available.
Using the MNIST example
For now, the code, together with a sample slurm job specification, is located in our repository. To run it, use the command

```bash
$ sbatch mnist-rivanna-a100.slurm
```
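After submission, the job can be monitored with the standard SLURM commands, for example:

```bash
# show your queued and running jobs
$ squeue -u $USER

# inspect accounting information for a job (hypothetical job id)
$ sacct -j 123456
```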
NOTE: We want to improve the script to make sure it is running on a GPU and to add GPU placement commands to the code.
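A hedged sketch of such a check, which could be added to the batch script (assuming the container-based tensorflow module from above):

```bash
# show the GPUs that SLURM assigned to this job
nvidia-smi

# ask tensorflow which GPUs it can see; an empty list means the job is not using a GPU
singularity exec --nv $CONTAINERDIR/tensorflow-2.7.0.sif \
    python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
```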
Custom Version of TensorFlow
https://www.rc.virginia.edu/userinfo/rivanna/software/tensorflow/
Keras on Rivanna
Building a Python Version from Source
Requirements
This section is under development.

Why do you want to do this? How has it been done?

We have developed the following script to create the environment on rivanna: https://example.com

You can download the script from git with wget

```bash
$ wget ....
```

and place it in a directory. Running it with

```bash
$ python-install.py --version="3.10.2" --host=rivanna
```

will create an optimized version for rivanna. Other options can be found with

```bash
$ python-install.py help
```
- Where do you want to place it? (scratch vs. home directory)
- How do you access it? (deployment into your own environment)
- What is the performance gain? (benchmarks vs. the various versions of python here; this needs to be reproducible when we have a new version of python)
How to cite if you use this
This work was conducted as part of the MLCommons science benchmark earthquake project. If you would like to reuse it, we would appreciate it if you cite the following paper:
```bibtex
@TechReport{mlcommons-earthquake,
  author      = {Thomas Butler and Robert Knuuti and
                 Jake Kolessar and Geoffrey C. Fox and
                 Gregor von Laszewski and Judy Fox},
  title       = {MLCommons Earthquake Science Benchmark},
  institution = {MLCommons Science Working Group},
  year        = 2022,
  type        = {Report by University of Virginia},
  address     = {Charlottesville, VA},
  month       = may,
  note        = {The order of the authors and url location may change},
  annote      = {Version: draft},
  url         = {https://github.com/cyberaide/paper-capstone-mlcommons}
}
```