
Documentation

Here we present a number of documents that we will evolve and submit to MLCommons once they have been improved.

Participants can obtain direct write access to this web site, so that the overhead of creating initial drafts is minimized.

Our present focus will be the development of the policy document.

1 - Overview and Objective

We summarize the overview and objective of the Working Group.

Encourage and support the curation of large-scale experimental and scientific datasets and the engineering of ML benchmarks operating on those datasets.

The WG will engage with scientists, academics, and national laboratories and facilities, such as synchrotrons, in securing, engineering, curating, and publishing datasets and machine learning benchmarks that operate on experimental scientific datasets. This will entail working across different domains of science, including material, life, environmental, and earth sciences, particle physics, and astronomy, to mention a few. We will include both traditional observational and computer-generated data.

Although scientific data is widespread, curating, maintaining, and distributing large-scale, useful datasets for public consumption is a challenging process, covering various aspects of data (from FAIR principles to distribution to versioning). With large data products, various ML techniques have to be evaluated against different architectures and different datasets. Without these benchmarking efforts, the community has no clear pathway for utilizing these advanced models. We expect that the collection will have significant tutorial value, as examples from one field and one observational or computational experiment can be modified to advance other fields and experiments.

The working group’s goal is to assemble and distribute scientific data sets relevant to a scientific campaign in a systematic manner, and pose quantifiable targets (“science benchmark”). A benchmark involves

  • (i) a data set,
  • (ii) objective criteria to meet, and
  • (iii) an example implementation.

The objective criteria depend on the scientific problem at hand. The metric should be well defined on the data but could come from a diverse set of measures (one or more of: accuracy targets, top-1 or top-5 error, time to convergence, cross-validation rates, confusion matrices, type-1/type-2 error rates, inference times, surrogate accuracy, control stability measure, etc.).
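
As a minimal illustration (and not part of any benchmark definition), the sketch below computes two of the listed measures, top-1 error and a confusion matrix, for a made-up set of labels and predictions:

# Minimal sketch, not part of any benchmark definition: two of the measures
# listed above, computed for made-up labels and predictions.
import numpy as np

labels      = np.array([0, 1, 2, 2, 1, 0, 2, 1])   # hypothetical ground truth
predictions = np.array([0, 2, 2, 2, 1, 0, 1, 1])   # hypothetical model output

# Top-1 error: fraction of samples whose highest-scoring class is wrong.
top1_error = np.mean(predictions != labels)

# Confusion matrix: rows are true classes, columns are predicted classes.
num_classes = 3
confusion = np.zeros((num_classes, num_classes), dtype=int)
for t, p in zip(labels, predictions):
    confusion[t, p] += 1

print(f"top-1 error: {top1_error:.3f}")
print(confusion)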

2 - Benchmarks

A list of benchmarks we are currently working on.

Currently, we are working on a number of scientific applications.

Benchmark | Science | Task | Owner Institute | Specific Benchmark Targets
CloudMask | Climate | Segmentation | RAL | link, cloudmask specifics
STEMDL | Material | Classification | ORNL | link, stemdl specifics
CANDLE-UNO | Medicine | Classification | ANL | link, candle-uno specifics
TEvolOp Forecasting | Earthquake | Regression | University of Virginia | link, tevolop specifics

2.1 - CANDLE-UNO

The CANDLE (Exascale Deep Learning and Simulation Enabled Precision Medicine for Cancer) project aims to implement deep learning architectures that are relevant to problems in cancer.


The CANDLE (Exascale Deep Learning and Simulation Enabled Precision Medicine for Cancer) project aims to implement deep learning architectures that are relevant to problems in cancer. These architectures address problems at three biological scales: cellular (Pilot1, P1), molecular (Pilot2, P2), and population (Pilot3, P3).

  • Pilot1 (P1) benchmarks are formed from problems and data at the cellular level. The high-level goal behind the P1 benchmarks is to predict drug response based on molecular features of tumor cells and drug descriptors.
  • Pilot2 (P2) benchmarks are formed from problems and data at the molecular level. The high-level goal behind the P2 benchmarks is molecular dynamics simulation of proteins involved in cancer, specifically the RAS protein.
  • Pilot3 (P3) benchmarks are formed from problems and data at the population level. The high-level goal behind the P3 benchmarks is to predict cancer recurrence in patients based on patient-related data.

Uno application from Pilot1 (P1): The goal of Uno is to predict tumor response to single and paired drugs, based on molecular features of tumor cells across multiple data sources. The combined dose response data contains the sources [‘CCLE’ ‘CTRP’ ‘gCSI’ ‘GDSC’ ‘NCI60’ ‘SCL’ ‘SCLC’ ‘ALMANAC.FG’ ‘ALMANAC.FF’ ‘ALMANAC.1A’]. Uno implements a deep learning architecture with 21M parameters in the TensorFlow framework in Python. The code is publicly available on GitHub. The script in this repository downloads all required datasets. The primary metric to evaluate this application is throughput (samples per second). More details on running Uno can be found here.
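
Since the primary metric for Uno is throughput, the sketch below shows how a samples-per-second figure can be obtained for any Keras-style training run by timing the run and dividing the number of processed samples by the elapsed wall-clock time. The tiny model and random data are placeholders, not the actual Uno network or drug-response data:

# Sketch only: throughput (samples/second) measured on a stand-in Keras model.
# The real benchmark trains the Uno network on drug-response data instead.
import time
import numpy as np
import tensorflow as tf

x = np.random.rand(4096, 32).astype("float32")   # placeholder features
y = np.random.rand(4096, 1).astype("float32")    # placeholder targets

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

epochs = 3
start = time.time()
model.fit(x, y, batch_size=256, epochs=epochs, verbose=0)
elapsed = time.time() - start

throughput = epochs * len(x) / elapsed
print(f"throughput: {throughput:.1f} samples/second")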

CANDLE-UNO Specific Benchmark Targets

  1. Scientific objective(s):
    • Objective: Predictions of tumor response to drug treatments, based on molecular features of tumor cells and drug descriptors
    • Formula: Validation loss
    • Score: 0.0054
  2. Data
  3. Example implementation

2.2 - CloudMask (Segmentation)

Estimation of sea surface temperature (SST) from space-borne sensors.


Estimation of sea surface temperature (SST) from space-borne sensors, such as satellites, is crucial for a number of applications in environmental sciences. One of the aspects that underpins the derivation of SST is cloud screening, a step that marks each pixel of thousands of satellite images as containing either cloud or clear sky, historically performed using thresholding or Bayesian methods.

This benchmark focuses on using a machine learning-based model for masking clouds in imagery from the Sentinel-3 satellite, which carries the Sea and Land Surface Temperature Radiometer (SLSTR) instrument. More specifically, the benchmark operates on multispectral image data. The example implementation is a variation of the U-Net deep neural network. The benchmark includes two datasets, DS1-Cloud and DS2-Cloud, with sizes of 180GB and 4.9TB, respectively. Each dataset is made up of two parts: reflectance and brightness temperature. The reflectance is captured across six channels at a resolution of 2400 x 3000 pixels, and the brightness temperature is captured across three channels at a resolution of 1200 x 1500 pixels.

CloudMask Specific Benchmark Targets

  1. Scientific objective(s):
    • Objective: Compare the accuracy produced by the Neural Network with the accuracy of a Bayesian method
    • Formula: Weighted binary cross entropy on validation data (see the sketch after this list)
    • Score: 0.9 for convergence
  2. Data
    • Download: aws s3 --no-sign-request --endpoint-url https://s3.echo.stfc.ac.uk sync s3://sciml-datasets/en/cloud_slstr_ds1 .
    • Data Size: 180GB
    • Training samples: 15488
    • Validation samples: 3840
  3. Example implementation
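
A minimal sketch of a weighted binary cross entropy, as referenced in the formula above; the class weights and the small mask patch are illustrative and not the reference implementation's values:

# Sketch of a weighted binary cross entropy for pixel-wise cloud masks.
# The class weights below are illustrative; the reference implementation
# defines its own weighting.
import numpy as np

def weighted_binary_cross_entropy(y_true, y_pred, w_cloud=1.0, w_clear=1.0, eps=1e-7):
    """y_true, y_pred: arrays of per-pixel cloud probabilities in [0, 1]."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    loss = -(w_cloud * y_true * np.log(y_pred)
             + w_clear * (1.0 - y_true) * np.log(1.0 - y_pred))
    return loss.mean()

# Hypothetical 4 x 4 mask patch and prediction
y_true = np.random.randint(0, 2, size=(4, 4)).astype(float)
y_pred = np.random.rand(4, 4)
print(weighted_binary_cross_entropy(y_true, y_pred))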

2.3 - TEvolOp Earthquake Forecasting

Forecasting Earthquakes


Time series appear in many scientific problems, and many of them are geospatial, i.e., functions of space and time; this benchmark illustrates that type. Some time series have a clear spatial structure that, for example, strongly relates nearby spatial points. The problem chosen here is termed a spatial bag, where there is spatial variation but it is not clearly linked to the geometric distance between spatial regions. In contrast, traffic-related time series have a strong spatial structure. We intend the benchmarks to cover a broad range of problem types.

The earthquake data comes from the USGS, and we have chosen a region covering Southern California spanning 4 degrees of latitude (32 to 36 N) and 6 degrees of longitude (-120 to -114). The data runs from 1950 to the present day and is presented as events: magnitude, ground location, depth, and time. We have divided the data into time and space bins. The time interval is daily, but in our reference models we accumulate this into fortnightly data. Southern California is divided into a 40 by 60 grid of 0.1 by 0.1-degree “pixels”, which corresponds roughly to squares with an 11 km side. The dataset also includes an assignment of pixels to known faults and a list of the largest earthquakes in that region from 1950 until today. We have chosen various samplings of the dataset to provide both input and predicted values. These include time ranges from a fortnight up to 4 years. Further, we calculate summed magnitudes and depths and counts of significant quakes (magnitude > 3.29). Other easily available quantities are powers of the quake energy (using Energy ~ 10^(1.5 m), where m is the magnitude). Quantities are “energy averaged” when there are multiple events in a single space-time bin, except for simple event counts.
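
The sketch below illustrates the space-time binning and “energy averaging” described above; the event list, helper names, and bin origin are hypothetical, and the real pipeline operates on the full USGS catalog:

# Sketch of the space-time binning and "energy averaging" described above.
# The events and helper names are illustrative only.
import math
from collections import defaultdict

# (latitude, longitude, magnitude) of hypothetical events in one fortnight
events = [(33.21, -116.45, 3.4), (33.24, -116.48, 2.1), (35.90, -117.70, 4.0)]

def pixel(lat, lon, lat0=32.0, lon0=-120.0, size=0.1):
    """Map an event to a 0.1 x 0.1 degree pixel on the 40 by 60 grid."""
    return int((lat - lat0) / size), int((lon - lon0) / size)

bins = defaultdict(list)
for lat, lon, mag in events:
    bins[pixel(lat, lon)].append(mag)

# "Energy averaged" magnitude per bin: convert magnitudes to energies
# (Energy ~ 10^(1.5 m)), sum them, and convert back to a magnitude.
for key, mags in bins.items():
    energy = sum(10 ** (1.5 * m) for m in mags)
    print(key, round(math.log10(energy) / 1.5, 2), "count:", len(mags))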

Current reference models are a basic LSTM recurrent neural network and a modification of the original science transformer. Details can be found here, and here.

TEvolOp Specific Benchmark Targets

  1. Scientific objective(s):
    • Objective: Improve the quality of Earthquake forecasting
    • Formula: Normalized Nash–Sutcliffe model efficiency coefficient (NNSE); see the sketch after this list
    • Score: The NNSE lies between 0.8 and 0.99 depending on model and predicted time series
  2. Data
    • Download: https://drive.google.com/drive/folders/1wz7K2R4gc78fXLNZMHcaSVfQvIpIhNPi?usp=sharing
    • Data Size: 5GB from USGS
    • Training samples: The data is divided spatially in an 80%-20% fashion between training and validation. The full dataset covers 6 degrees of longitude (-120 to -114) and 4 degrees of latitude (32 to 36) in Southern California. This is divided into 2400 spatial bins of 0.1 degree (~11 km) on a side.
    • Validation samples: Most analyses use the 500 most active bins, of which 400 are used for training and 100 for validation.
  3. Example implementation
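
A minimal sketch of the NNSE referenced in the formula above, applied to a hypothetical observed/predicted series:

# Sketch of the Normalized Nash-Sutcliffe Efficiency (NNSE) target metric;
# the observed/predicted series below are hypothetical placeholders.
import numpy as np

def nnse(observed, predicted):
    """NSE = 1 - sum((o - p)^2) / sum((o - mean(o))^2);  NNSE = 1 / (2 - NSE)."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    nse = 1.0 - np.sum((observed - predicted) ** 2) / np.sum(
        (observed - observed.mean()) ** 2)
    return 1.0 / (2.0 - nse)

obs = [0.2, 0.5, 0.1, 0.9, 0.4]
pred = [0.25, 0.45, 0.15, 0.80, 0.42]
print(f"NNSE = {nnse(obs, pred):.3f}")   # 1.0 corresponds to a perfect prediction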

Example Implementation:

The example implementation is primarily to demonstrate feasibility, show how the data is represented, help address any interpretation considerations, and potentially trigger initial ideas on how the benchmark can be improved.

2.4 - STEMDL (Classification)

State-of-the-art scanning transmission electron microscopes (STEM) produce focused electron beams with atomic dimensions and allow the capture of diffraction patterns arising from the interaction of incident electrons with nanoscale material volumes.


State-of-the-art scanning transmission electron microscopes (STEM) produce focused electron beams with atomic dimensions and allow the capture of diffraction patterns arising from the interaction of incident electrons with nanoscale material volumes. Backing out the local atomic structure of said materials requires compute- and time-intensive analyses of these diffraction patterns (known as convergent beam electron diffraction, CBED). Traditional analysis of CBED requires iterative numerical solutions of partial differential equations and comparison with experimental data to refine the starting material configuration. This process is repeated anew for every newly acquired experimental CBED pattern and/or probed material.

In this benchmark, we used newly developed multi-GPU and multi-node electron scattering simulation codes [1] on the Summit supercomputer to generate CBED patterns from over 60,000 solid-state materials, representing nearly every known crystal structure. A scaled-down version of this data [2] is used for one of the data challenges [3] at the SMC 2020 conference, and the overarching goals are to: (1) explore the suitability of machine learning algorithms in the advanced analysis of CBED and (2) produce a machine learning algorithm capable of overcoming intrinsic difficulties posed by scientific datasets.

A data sample from this dataset is given by a 3-d array formed by stacking various CBED patterns simulated from the same material at different distinct material projections (i.e., crystallographic orientations). Each CBED pattern is a 2-d array of 32-bit float image intensities. Associated with each data sample in the dataset is a host of material attributes or properties which are, in principle, retrievable via analysis of this CBED stack. Of note are (1) 200 crystal space groups out of the 230 unique mathematical discrete space groups and (2) the local electron density, which governs the material's properties.

This benchmark consists of two tasks: classification for crystal space groups and reconstruction of the local electron density, the example implementations of which are provided in [4] and [5].

STEMDL Specific Benchmark Targets

  1. Scientific objective(s):
    • Objective: Classification for crystal space groups
    • Formula: F1 score on validation data (see the sketch after this list)
    • Score: 0.9 is considered converged
  2. Data
  3. Example implementation
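
A minimal sketch of an F1 computation as referenced in the formula above; it uses macro averaging over a toy set of labels, while the exact averaging over the 200 space groups is defined by the example implementation:

# Sketch of a macro-averaged F1 score for a multi-class classification task;
# the labels below are a hypothetical toy example, not CBED data.
import numpy as np

def macro_f1(y_true, y_pred, num_classes):
    scores = []
    for c in range(num_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        scores.append(2 * precision * recall / (precision + recall)
                      if (precision + recall) else 0.0)
    return float(np.mean(scores))

y_true = np.array([0, 1, 2, 1, 0, 2])
y_pred = np.array([0, 1, 1, 1, 0, 2])
print(f"macro F1 = {macro_f1(y_true, y_pred, 3):.3f}")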

3 - Science Benchmark Policy Draft (Training)

The policy draft document

The document is under development.

MLCommons Science Benchmark Suite Training Rules

Version 0.1 January 31, 2021

Points of contact: Gregor von Laszewski(laszewski@gmail.com), Juri Papay (juripapay@hotmail.com)

Supporting documents: We include here a list of supporting documents that will be removed in the final version but could help in shaping this draft:

1. Overview

All rules are taken from the MLPerf Training Rules except for those that are overridden here.

The MLPerf and MLCommons name and logo are trademarks. In order to refer to a result using the MLPerf and MLCommons name, the result must conform to the letter and spirit of the rules specified in this document. The MLCommons organization reserves the right to solely determine if a use of its name or logo is acceptable.

2. Benchmarks

The benchmark suite consists of the benchmarks shown in the following table.

Warning
change the table

Problem | Dataset | Quality Target
Earthquake Prediction | TBD | TBD (some error minimization)

3. Divisions

There are two divisions of the Science Benchmark Suite, the Closed division and the Open division.

3.1. Closed Division

The Closed division requires using the same preprocessing, model, and training method as the reference implementation.

The closed division models are:

Problem | Model
REPLACE: Climate segmentation | https://github.com/azrael417/mlperf-deepcam
REPLACE: Cosmological parameter prediction | https://github.com/sparticlesteve/cosmoflow-benchmark
REPLACE: Modeling catalysts | https://github.com/sparticlesteve/ocp/tree/mlperf-hpc-reference

4. Data Set

4.1. Data State at Start of Run

Each reference implementation includes a download script or broadly available method to acquire and verify the dataset.

The data at the start of the benchmark run should reside on a parallel file system that is persistent (>= 1 month, not subject to eviction by other users), can be downloaded to / accessed by the user, and can be shared among users at the facility. Any staging to node-local disk or memory or system burst buffer should be included in the benchmark time measurement.

Note
Discuss the parallel file system requirement; some science benchmarks may not be parallel.

You must flush/reset the on-node caches prior to running each instance of the benchmark. Due to practicality issues, you are not required to reset off-node system-level caches.

Note
Discuss what exactly an on-node cache is: is this an application-level on-node cache or something else?

We otherwise follow the training rule Data State at Start of Run on consistency with the reference implementation preprocessing and allowance for reformatting.

5. Training Loop

5.1. Hyperparameters and Optimizer

CLOSED:

Allowed hyperparameter and optimizer settings are specified here. For anything not explicitly mentioned here, submissions must match the behavior and settings of the reference implementations.

5.2. Hyperparameters and Optimizer Earthquake Prediction

Warning
TBD. Next values will all be replaced with application specific values

Model | Name | Constraint | Definition | Reference Code
CosmoFlow | global_batch_size | unconstrained | the global batch size for training | local batch_size (--batch-size) times number of workers. Baseline config is 64
CosmoFlow | opt_name | "sgd" | the optimizer name | --optimizer or config
CosmoFlow | sgd_opt_momentum | 0.9 | SGD momentum | config
CosmoFlow | opt_base_learning_rate | unconstrained | The base learning rate | base_lr times scaling factor, e.g. global_batch_size/base_batch_size if scaling="linear". Config
CosmoFlow | opt_learning_rate_warmup_epochs | unconstrained | the number of epochs for learning rate to warm up to base value | config
CosmoFlow | opt_learning_rate_warmup_factor | unconstrained | the constant factor applied at learning rate warm up | scaled learning rate / base_lr
CosmoFlow | opt_learning_rate_decay_boundary_epochs | list of positive integers | Epochs at which learning rate decays | config
CosmoFlow | opt_learning_rate_decay_factor | 0 < value < 1, and you may use a different value for each decay | the learning rate decay factor(s) at the decay boundary epochs | config
CosmoFlow | dropout | 0 <= value < 1 | Dropout regularization probability for the dense layers | dropout setting in config
CosmoFlow | opt_weight_decay | value >= 0 | L2 regularization parameter for the dense layers | l2 setting in config
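
The "base_lr times scaling factor" entry above follows the common linear learning-rate scaling convention; the sketch below illustrates it with an assumed base batch size of 64 (the function and parameter names are illustrative, not taken from the reference configuration):

# Sketch of the linear learning-rate scaling convention referenced in the
# table above; names and baseline values are illustrative only.
def scaled_learning_rate(base_lr, global_batch_size, base_batch_size=64,
                         scaling="linear"):
    if scaling == "linear":
        return base_lr * global_batch_size / base_batch_size
    return base_lr  # "none": keep the base learning rate unchanged

print(scaled_learning_rate(base_lr=0.001, global_batch_size=512))  # 0.008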

5.3. Hyperparameters and Optimizer Other App

Warning
TBD. Next values will all be replaced with application specific values

Model | Name | Constraint | Definition | Reference Code
DeepCAM | global_batch_size | unconstrained | the global batch size for training | --local_batch_size times number of workers
DeepCAM | batchnorm_group_size | value >= 1 | Determines how many ranks participate in the batchnorm | --batchnorm_group_size
DeepCAM | opt_name | Adam, AdamW, or LAMB | the optimizer name | --optimizer
DeepCAM | opt_eps | 1e-6 | epsilon for Adam | --adam_eps
DeepCAM | opt_betas | unconstrained | Momentum terms for Adam-type optimizers | --optimizer_betas
DeepCAM | opt_weight_decay | value >= 0 | L2 weight regularization | --weight_decay
DeepCAM | opt_lr | unconstrained | the base learning rate | --start_lr times warmup factor
DeepCAM | scheduler_lr_warmup_steps | value >= 0 | the number of epochs for learning rate to warm up to base value | --lr_warmup_steps
DeepCAM | scheduler_lr_warmup_factor | value >= 1 | When warmup is used, the target learning_rate will be lr_warmup_factor * start_lr | --lr_warmup_factor
DeepCAM | scheduler_type | multistep or cosine_annealing | Specifies the learning rate schedule | --lr_schedule
DeepCAM | scheduler_milestones | unconstrained | If multistep, the steps at which learning rate is decayed | milestones in --lr_schedule type="multistep",milestones="3000 10000",decay_rate="0.1"
DeepCAM | scheduler_decay_rate | unconstrained | If multistep, the learning rate decay factor | decay_rate in --lr_schedule type="multistep",milestones="15000 25000",decay_rate="0.1"
DeepCAM | scheduler_t_max | value >= 0 | For cosine_annealing, period length in steps | --lr_schedule
DeepCAM | scheduler_eta_min | value >= 0 | For cosine_annealing, sets the minimal LR | --lr_schedule
DeepCAM | gradient_accumulation_frequency | value >= 1 | Specifies the number of gradient accumulation steps before a weight update is performed | --gradient_accumulation_frequency

5.4. Hyperparameters and Optimizer Other App

Warning
TBD. Next values will all be replaced with application specific values

Model | Name | Constraint | Definition | Reference Code
OpenCatalyst | global_batch_size | value >= 1 | the global batch size | batch_size times number of GPUs
OpenCatalyst | opt_name | AdamW | the optimizer name | config setting optim name
OpenCatalyst | opt_base_learning_rate | value > 0 | the base learning rate | config setting lr_initial
OpenCatalyst | opt_learning_rate_warmup_steps | value >= 0 | the number of steps for learning rate to warm up to base value | warmup_steps
OpenCatalyst | opt_learning_rate_warmup_factor | 0 <= value <= 1 | the factor applied to the learning rate at the start of warmup | warmup_factor
OpenCatalyst | opt_learning_rate_decay_boundary_steps | list of positive integers | the steps at which the learning rate decays | lr_milestones

OPEN: Hyperparameters and optimizer may be freely changed.

6. Run Results

MLCommons Science Benchmark Suite submissions consist of the following two metrics. Metric 1 is considered mandatory for a complete submission, whereas metric 2 is considered optional:

6.1. Strong Scaling (Time to Convergence)

This is a mandatory metric: see MLPerf Training Run Results for reference. The same rules apply here.

6.2. Weak Scaling (Throughput)

TODO

This is an optional metric. It was designed to test the training capacity of a system.

Measurement: we will define 3 important parameters first.

  • number of models M: number of model instances which are going to be trained in this benchmark.

  • instance scale S: each individual model instance will be trained at this scale.

  • total utilized scale T: the total scale used for running this benchmark. For example, if all M models are trained concurrently, then T=M*S. More generally, we can write that S <= T <= M*S if (some of) the models are trained sequentially.

Notes:

  • All three numbers M, S, T are chosen by the submitter. This allows the submitter to accommodate their submission to the available machine resources, i.e., compute capacity and compute time.

  • S and T should be in units of compute resources, e.g., nodes, GPUs, or other accelerators. This choice should be aligned with the HPC system description. For example, if the system description table lists the number of GPUs to define the scale of the system, then S should be specified in numbers of GPUs.

  • S and T can be chosen independently of the submission for metric 1 (strong scaling). We encourage submitters to choose T as large as possible, ideally full system scale, but this is not required.

The submitter then trains M models on the resource partitioning (S,T) as defined above to convergence.

We define a Time-To-Train-all (TTTa) number by computing the difference between the end time of the instance which needs the longest time to converge and the start time of the instance which starts first. Mathematically, this can be expressed as

TTTa = max(run_stop) - min(run_start) where the max/min are taken over all instances M.

Note: the submitter is allowed to prune this number by removing results from individual training instances. As long as the minimum number of models rule is satisfied (see the section Benchmark Results below), the submission is valid. The submitter then uses a modified number of models M' <= M and computes TTTa over the reduced set. This allows the submitter to remove occasional outliers or stragglers which would otherwise reduce the score disproportionately.
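
A small sketch of the TTTa computation, including the optional pruning to M' instances; the (run_start, run_stop) pairs are hypothetical and would in practice be parsed from the per-instance MLLOG files:

# Sketch of the Time-To-Train-all (TTTa) computation with optional pruning
# to M' instances; the timestamps below are hypothetical.
def ttta(instances, keep=None):
    """instances: list of (run_start, run_stop) in seconds; keep: M' <= M."""
    if keep is not None:
        # Drop the slowest stragglers, keeping the M' earliest-finishing runs.
        instances = sorted(instances, key=lambda t: t[1])[:keep]
    run_starts = [start for start, _ in instances]
    run_stops = [stop for _, stop in instances]
    return max(run_stops) - min(run_starts)

runs = [(0.0, 3600.0), (5.0, 3650.0), (2.0, 5400.0)]   # M = 3 instances
print(ttta(runs))            # all instances:    5400.0
print(ttta(runs, keep=2))    # pruned to M' = 2: 3650.0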

Reporting: the submitter reports the tuple (T, S, M', TTTa). It is required to submit a separate MLLOG file for each of the training instances, so that reviewers can verify the quoted numbers. It is not allowed to merge logging files for individual instances.

Restrictions:

  • The submitter must not report this score on its own. It has to be reported in conjunction with at least one score from Strong Scaling (Time to Convergence) from the same benchmark.

  • This score does not allow for extrapolation. All reported M' training instances must have converged, and it is not allowed to extrapolate results in S or T.

7. Benchmark Results

We follow the MLPerf Training Benchmark Results rule along with the following required number of runs per benchmark. Note that since run-to-run variability is already captured by spatial multiplexing in the case of metric 3, we use the adjusted requirement that the number of trained instances has to be at least equal to the number of runs for metrics 1 and 2.

Warning
TBD. Next values will all be replaced with application specific values

Benchmark | Number of Runs (Metric 1, 2) | M' (Metric 3)
DeepCAM | 5 | >= 5
CosmoFlow | 10 | >= 10
OpenCatalyst | 5 | >= 5

4 - Respondents

Submitting the benchmark

At the present time, we expect respondents to submit the results of their run and to provide justification in the form of documentation (e.g., a technical manuscript or source code with run instructions). We are exploring setting this up as a “pull request” based contribution mechanism.

Benchmark Views and Criteria

5 - Tutorials


5.1 - Setting up Environment from Scratch

A procedure to build an optimized Python from source and set up a development environment to run benchmarks.

Introduction

Most modern Linux systems come prepackaged with a version of Python 3. However, this version is typically deeply integrated into the operating system's ecosystem of tools, so it may be a significantly older version of Python, and it may lack some optimizations in order to maximize compatibility.

For benchmarking, it is desirable to have control over the Python build itself, so that benchmark runs are both consistent and repeatable. Below are the steps to build Python 3.10.2 on a variety of hosts.

Setup

Configurations

This procedure assumes the following:

  1. You are building using bash
  2. You have the curl, make, gcc, openssl, bzip2, libffi, zlib, readline, sqlite3, llvm, ncurses, and xz C header files installed.
  3. You have set the following environment variables
    1. BASE - Specifies the working directory for all operations. This procedure assumes ~/.local
    2. PREFIX - Where you want the final Python installation to be placed. This procedure assumes ${BASE}/python/3.10.2.

Build OpenSSL

# Fetch source code
mkdir -p ${BASE}/src
curl -OL https://www.openssl.org/source/openssl-1.1.1m.tar.gz
tar -zxvf openssl-1.1.1m.tar.gz -C ${BASE}/src/
cd ${BASE}/src/openssl-1.1.1m/
./config --prefix=${BASE}/ssl --openssldir=${BASE}/ssl shared zlib
make
#make test
make install
make clean

Build Python

curl -OL https://www.python.org/ftp/python/3.10.2/Python-3.10.2.tar.xz
tar Jxvf Python-3.10.2.tar.xz -C ${BASE}/src/
cd ${BASE}/src/Python-3.10.2
export CPPFLAGS=" -I${BASE}/ssl/include "
export LDFLAGS=" -L${BASE}/ssl/lib "
export LD_LIBRARY_PATH=${BASE}/ssl/lib:$LD_LIBRARY_PATH
./configure --prefix=${PREFIX} --enable-optimizations --with-lto --with-computed-gotos --with-system-ffi

make -j "$(nproc)"
make test
make altinstall
make clean

mkdir -p ${PREFIX}/bin
(cd ${PREFIX}/bin ; ln -s python3.10 python)

cat <<EOF > ${BASE}/setup.source
#!/bin/bash

BASE=$BASE
PREFIX=$PREFIX

export LD_LIBRARY_PATH=\$BASE/ssl/lib:\$PREFIX/lib:\$LD_LIBRARY_PATH
export PATH=\$PREFIX/bin:\$PATH
EOF

Archive Build

tar -cJvf python-3.10.2.tar.xz ${BASE}

Common Setup Procedures

To bootstrap your new environment with all the tools frequently leveraged during development, see the below procedures.

Assumption: The variable BASE is your user home directory, and python3.10 is on the path.

mkdir -p ${BASE}/ENV3
python3.10 -m venv --prompt ENV3 ~/ENV3

source ${BASE}/ENV3/bin/activate
pip install -U pip
pip install cloudmesh-installer

mkdir -p ~/git/cm
(cd ~/git/cm && cloudmesh-installer get cms)

echo "alias ENV3=\"source $BASE/ENV3/bin/activate\"" >> ~/.bash_profile
echo "alias EQ=\"cd $BASE/git\"" >> ~/.bash_profile
source ~/.bash_profile

EQ

git clone git@github.com:laszewsk/mlcommons.git
git clone git@github.com:laszewsk/mlcommons-data-earthquake.git

pip install -r mlcommons/examples/mnist-tensorflow/requirements.txt
pip install -r mlcommons/benchmarks/earthquake/new/requirements.txt

5.2 - Running MLCube on Rivanna

A gentle introduction to running MLCube on Rivanna

In this guide, we introduce MLCube and demonstrate how to run workloads on Rivanna using the Singularity backend.

Running models consistently across platforms requires users to have commanding knowledge of the configuration of not only the source code, but also of the hardware ecosystem. It’s not uncommon that you’ll encounter a project where configuring your system to get reproducible results is error prone and time consuming, and ultimately not productive to the analyst.

MLCube(tm) is a contract-driven approach to addressing system configuration details. It establishes a standard for generating consistent models and a mechanism for delivering these models to others, allowing them to benefit from an already-solved environment.

Getting Started

First, you need to install a runner for MLCube. MLCube supports many backend runners and should run equally well on each of them.

For this walkthrough, we will target the Rivanna HPC ecosystem, so we’ll leverage the lmod and singularity ecosystems.

Python install

We have two choices to install python. One is with pyenv, the other is with conda.

If you decide to install it with pyenv, use the following steps

pyenv install 3.9.7
pyenv global 3.9.7
python -m venv --prompt mlcube venv
source venv/bin/activate
python -m pip install mlcube-singularity

If you decide to install it with conda, use the following steps

conda create -n mlcube -c conda-forge python=3.9.7
conda activate mlcube
# We use pip as conda does not have an mlcube repository
python -m pip install mlcube-singularity

Note that the mlcube-singularity package can and should be installed within your target environment.

Using MLCube

Once you have run the above commands, the mlcube script is available on your path, and you can list the runners mlcube has registered:

$ mlcube config --get runners
# System settings file path = /home/<username>/mlcube.yaml
# singularity:
#   pkg: mlcube_singularity

At this point you can run through any of the example projects that the mlcube project hosts at https://github.com/mlcommons/mlcube_examples.git.

Below is a set of procedures to run their hello world project.

git clone https://github.com/mlcommons/mlcube_examples.git
cd ./mlcube_examples/hello_world

mlcube run --mlcube=. --task=hello --platform=singularity
# No output expected.

mlcube run --mlcube=. --task=bye --platform=singularity
# No output expected.

cat ./workspace/chats/chat_with_alice.txt
# You should see some log lines in this file.

Nontrivial example - Earthquake Data

5.3 - Singularity Collection

A collection of information about Singularity

User Guides

Add Gregor's info

Presentations

Organize

Containers

5.4 - Installing Singularity on Windows Workstations

A procedure to get singularity running on WSL2

Singularity is a container-based runtime engine designed to run in permission-constrained environments. Singularity provides functions similar to systems like Docker, Containerd, and Podman, offering an ecosystem that shares the computer's kernel and drivers and provides a filesystem based on overlaying files. These overlays create a type of partitioned software that allows isolated execution on the host as a type of “container”.

However, Singularity differs from typical container runtime engines, most notably:

  1. Singularity was designed to be run as a normal, non-root user and does not depend on a daemon.
  2. Singularity does not natively support OCI images (the typical container image format target), and uses its own SIF format; but OCI images can be imported.
  3. Singularity container images are distributed as files.
  4. Singularity was designed to create a container platform that works from laptops to HPC clusters.

(Windows Only) Setup on Windows Subsystem for Linux

While a Windows workstation is not the normal place to install Singularity, it is useful to be able to run commands from a local machine to validate command structure and workflows. Singularity does not run natively on Windows, but with Windows 10 Professional you can build Singularity inside a WSL2 distribution and thereby run the commands on your workstation.

Enabling WSL2

To enable WSL2, follow Microsoft's instructions.

Any version of Linux will work with Singularity, but we recommend using Ubuntu.

Building Singularity

This process has been automated in ./tools/install-singularity-wsl2.bash if you are running Ubuntu. However, the general flow of the instructions is:

  1. Install the singularity code dependencies (gcc, libssl, gpgme, squashfs, seccomp, wget, pkg-config, git, and cryptsetup)
  2. Install a modern version of golang.
  3. Download the Singularity source code from https://github.com/apptainer/singularity.git
  4. Run ./mconfig from the singularity codebase
  5. Run make && make install from the ./builddir directory.

These procedures are covered more thoroughly on the Apptainer website at: https://apptainer.org/docs/user/main/quick_start.html#quick-installation-steps

Run your first singularity container

Once the build has completed, you should be able to run the singularity command. Try to run

$ singularity run docker://godlovedc/lolcow

If this command was successful you should see something similar to the following:

 _____________________________________
/ You recoil from the crude; you tend \
\ naturally toward the exquisite.     /
 -------------------------------------
        \   ^__^
         \  (oo)\_______
            (__)\       )\/\
                ||----w |
                ||     ||

5.5 - Running GPU Batch jobs on Rivanna

A short introduction on how to run GPU Jobs on Rivanna

We explain how to run GPU batch jobs using different GPU cards on Rivanna. Rivanna is a supercomputer at the University of Virginia. This tutorial is only useful if you can get an account on it. The official documentation is available at

However, it contains some issues and does not explain certain important aspects of using GPUs on the system. Therefore, this guide has been created.

PLEASE HELP US IMPROVE THIS GUIDE

Introduction

Rivanna is the High-Performance Computing (HPC) cluster managed by the University of Virginia's Research Computing group. Rivanna is composed of 575 nodes with a total of 20,476 cores and 8 PB of different types of storage. Table 1 shows an overview of the compute nodes. Some of the compute nodes also include GPUs:

Table 1: GPUs on Rivanna

Cores/Node | Memory/Node | Specialty Hardware | GPU memory/Device | GPU devices/Node | # of Nodes
40 | 354GB | - | - | - | 1
20 | 127GB | - | - | - | 115
28 | 255GB | - | - | - | 25
40 | 768GB | - | - | - | 34
40 | 384GB | - | - | - | 348
24 | 550GB | - | - | - | 4
16 | 1000GB | - | - | - | 5
48 | 1500GB | - | - | - | 6
64 | 180GB | KNL | - | - | 8
128 | 1000GB | GPU: A100 | 40GB | 8 | 2
28 | 255GB | GPU: K80 | 11GB | 8 | 9
28 | 255GB | GPU: P100 | 12GB | 4 | 3
40 | 383GB | GPU: RTX 2080 Ti | 11GB | 10 | 2
28 | 188GB | GPU: V100 | 16GB | 4 | 1
40 | 384GB | GPU: V100 | 32GB | 4 | 12

*) This information may be outdated

Access to Rivanna

Access to Rivanna is secured by the University of Virginia's VPN. UVA offers two different VPNs. We recommend that you install the UVA Anywhere VPN. This can be installed on Linux, macOS, and Windows.

After installation, you have to start the VPN. After that, you can use a terminal to access Rivanna via ssh. If you have not used ssh, we encourage you to read about it and explore commands such as ssh, ssh-keygen, ssh-copy-id, ssh-agent, and ssh-add.

We will not provide an extensive tutorial on how to use ssh, but you are welcome to contribute one. Instead, we will summarize the most important steps:

  1. Create an ssh key if you have not done that before

    $ ssh-keygen
    

    It is VERY important that you create the key with a strong passphrase.

  2. Add an abbreviation for Rivanna to your ~/.ssh/config file

    Use your favorite editor. Mine is emacs

    emacs ~/.ssh/config

    copy and paste the following into that file, where abc1de is to be substituted by your UVA compute id.

    Host rivanna
      User abc1de
      HostName rivanna.hpc.virginia.edu 
      IdentityFile ~/.ssh/id_rsa
    

    This will allow you to use rivanna instead of abc1de@rivanna.hpc.virginia.edu. The next steps assume you have done this and can use just rivanna

  3. Copy your public key to rivanna

    $ ssh-copy-id rivanna
    

    This will copy your public key into the rivanna:~/.ssh/authorized_keys file.

  4. After this step, you can use your keys to authenticate. You still need to be using the VPN, though.

    The most convenient systems for this are macOS and Ubuntu, which already ship with the ssh-agent and keychain tools. On Windows under Git Bash you need to start the agent with

    $ eval `ssh-agent`
    

    Next, add the key to your session so that you do not have to repeatedly type the passphrase. Use the command

    $ ssh-add
    

    To test whether it works, just say

    $ ssh rivanna hostname
    

    which will print the hostname of Rivanna

    In case your machine does not run ssh-agent by default, start it (as shown above) before you type in the ssh-add command.
    

    If everything is set up correctly, the hostname command returns a string such as

    udc-ba35-36
    
  5. To log in to Rivanna, simply say

    $ ssh rivanna

    If this does not work, you have made a mistake. Please review the
    previous steps carefully.

Running Jobs on Rivanna

Jobs on Rivanna can be scheduled through Slurm either as batch jobs or as interactive jobs. In order to achieve this, one first needs to load the software and create special job scripts that are then submitted to nodes containing the GPUs you specify.

The user documentation about this is provided here:

However, at the time when we looked at it, it had some mistakes and limitations that we hope to overcome here.

Modules

Rivanna’s default mechanism of software configuration management is done via modules. The UVA modules documentation is provided through this link.

Modules provide the ability to load a particular software stack and configuration into your shell as well as into your batch jobs. You can load multiple modules into your environment; they are loaded in order.

To list the available modules, log into Rivanna and use the command

$ module available

To list the Python-related modules, use

$ module available py

It will return all modules that have py in their name. Please choose those that look like Python modules.

To probe for deep learning modules, use something similar to

$ module available cuda tensorflow pytorch mxnet nvidia cudnn

Python

Different versions of python are available.

To load python 3.8 we can say

$ module load anaconda/2020.11-py3.8

To load Python 3.10.0 we can say

$ module load anaconda
$ conda create -n py3.10 python=3.10
$ source activate py3.10
$ python -V
Python 3.10.0

Please note that at the time of writing Anaconda did not provide Python 3.10.2, which I run personally on my computer; that version can be obtained from python.org instead.

Adding Modules with Spider

Details about modules can be identified with the module spider command. If you type it in, you get a list of many available configurations. Spider can take a keyword and lists all available versions that match the keyword. Let us demonstrate it on python:

$ module spider python
----------------------------------------------------------------------------
  python:
----------------------------------------------------------------------------
    Description:
      Python is a programming language that lets you work more effectively.

     Versions:
        python/2.7.16
        python/3.6.6
        python/3.6.8
        python/3.7.7
        python/3.8.8
     Other possible modules matches:
        biopython  openslide-python  wxpython
----------------------------------------------------------------------------
...

For detailed information about a specific “python” package use the module’s full name.

$ module spider python/3.8.8

This will return a page with lots of information. The most important part for us is:

 You will need to load all module(s) on any one of the lines below before the
 "python/3.8.8" module is available to load.

      gcc/11.2.0  openmpi/3.1.6
      gcc/9.2.0  cuda/11.0.228  openmpi/3.1.6
      gcc/9.2.0  mvapich2/2.3.3
      gcc/9.2.0  openmpi/3.1.6
      gcccuda/9.2.0_11.0.228  openmpi/3.1.6
      goolfc/9.2.0_3.1.6_11.0.228

Here you see various module combinations; one of these lines needs to be loaded BEFORE you load python.

Thus, to properly load python/3.8.8 you need to say (if this is the combination you chose):

module load gcc/11.2.0
module load openmpi/3.1.6
module load python/3.8.8

Modules for tensorflow

module load singularity/3.7.1
module load tensorflow/2.7.0

Modules for pytorch

module load singularity/3.7.1
module load pytorch/1.10.0

Containers

Rivanna uses Singularity as its container technology. The documentation specific to Singularity on Rivanna is available at this link.

Singularity also needs to be loaded as a module before it can be used.

Singularity containers have the ability to access GPUs via a passthrough using the NVIDIA drivers. Once you load Singularity, you can use it as follows:

singularity <cmd> --nv <imagefile> <args>

The container will be used inside a job.

Jobs

More detail specific to jobs for Rivanna is provided here.

Before we start an example, we explain how to first describe a job in a job description file and then submit it to Rivanna. We use a simple MNIST example that showcases the aspects of successfully running a job on the machine. We will therefore focus on creating jobs using GPUs.

Rivanna uses the SLURM job scheduler for allocating submitted jobs. Jobs are charged Service Units (SUs) against a Rivanna compute allocation. Please contact your supervisor for the name of the allocation. Gregor's allocation is named

  • bii_dsc

and it currently contains 100k SUs. Students from the UVA capstone class will have the following allocation:

  • ds6011-sp22-002

To see the available SUs for your project, please use the command

  • allocations
  • allocations -a <allocation_name>

SUs can be requested via the Standard Allocation Renewal form. Due to this limitation, we encourage you to plan ahead and try to avoid unnecessary runs. General instructions for submitting SLURM jobs are located at

To request the job be submitted to the GPU partition, you use the option

-p gpu

The A100 GPUs are a requestable resource. To request them, you add the gres option with the number of A100 GPUs requested (1 through 8 GPUs). For example, to request 2 A100 GPUs:

--gres=gpu:a100:2

If you are using a SLURM script to submit the job, the options would appear as follows. Your script will need to specify other options, such as the allocation to charge, as seen in the sample scripts at the above URL:

#SBATCH -p gpu
#SBATCH --gres=gpu:a100:2
#SBATCH -A bii_dsc

Interactive Jobs

Please avoid running interactive jobs, as they may waste SUs; we are charged even while you keep the A100 idle.

Although Research Computing also offers some interactive apps such as JupyterLab, RStudio, CodeServer, Blender, Mathematica via our Open OnDemand portal at:

we ask you to avoid using them for benchmarks.

To request the use of the A100s via Open OnDemand, first log in to the Open OnDemand portal and select the desired interactive app. You will be presented with a form to complete. Currently, you would:

  • select gpu for the Rivanna partition,
  • select NVIDIA A100 from the Optional: GPU type for GPU partition pulldown menu, and
  • enter the number of desired GPUs under Optional: Number of GPUs.

Once you have completed the form, click the Launch button and your session will be launched. The session will start once the resources are available.

Using the MNIST example

For now, the code is located at:

A sample slurm job specification is included at

To run it use the command

$ sbatch mnist-rivanna-a100.slurm

NOTE: We want to improve the script to make sure it is running on a GPU and add GPU placement commands into the code.

Custom Version of TensorFlow

https://www.rc.virginia.edu/userinfo/rivanna/software/tensorflow/

Keras on Rivanna

Building a Python version from Source

Why do you want to do this?

How is it done?

We have developed the following script to create the environment on Rivanna.

You can download the script from git with wget

wget ....

and place it in a directory. Running it with

$ python-install.py --version="3.10.2" --host=rivanna

will create an optimized version for Rivanna. Other options can be found with python-install.py help.

Where do you want to place it

scratch vs home dir

How do you access it?

deployment into your own environment

What is the performance gain?

benchmarks vs the various versions on python here. This needs to be reproducible when we have a new version of python

How to cite if you use this

This work was conducted as part of the MLCommons science benchmark earthquake project. If you would like to reuse it, we ask that you cite the following paper:

@TechReport{mlcommons-eartquake,
  author = 	 {Thomas Butler and Robert Knuuti and
              Jake Kolessar and Geoffrey C. Fox and
              Gregor von Laszewski and Judy Fox},
  title = 	 {MLCommons Earthquake Science Benchmark},
  institution =  {MLCommons Science Working Group},
  year = 	 2022,
  type = 	 {Report by University of Virginia},
  address = 	 {Charlottesville, VA},
  month = 	 may,
  note = 	 {The order of the authors and url location may change},
  annote = 	 {Version: draft},
  url = {https://github.com/cyberaide/paper-capstone-mlcommons}
}

5.6 - Installing nvcc on Ubuntu 20.04

A description of how to install nvcc and CUDA on Ubuntu 20.04.

Installation

$ sudo wget -O /etc/apt/preferences.d/cuda-repository-pin-600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
$ sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/7fa2af80.pub
$ sudo add-apt-repository "deb http://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/ /"
$ sudo apt update
$ sudo apt install cuda

Add it to your path

$ echo 'export PATH=/usr/local/cuda/bin${PATH:+:${PATH}}' >> ~/.bashrc

Check CUDA version:

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Wed_Oct_23_19:24:38_PDT_2019
Cuda compilation tools, release 10.2, V10.2.89

5.7 - Installing TensorFlow on Windows 10

A description of how to install TensorFlow on Windows 10.

Installation

$ TBD

Add it to your path

$ TBD

Check CUDA version: