Science Benchmark Policy Draft (Training)

This policy draft document is under development.

MLCommons Science Benchmark Suite Training Rules

Version 0.1 January 31, 2021

Points of contact: Gregor von Laszewski (laszewski@gmail.com), Juri Papay (juripapay@hotmail.com)

Supporting documents: we include here a list of supporting documents that will be removed in the final version but could be helpful in shaping this draft:

1. Overview

All rules are taken from the MLPerf Training Rules except for those that are overridden here.

The MLPerf and MLCommons name and logo are trademarks. In order to refer to a result using the MLPerf and MLCommons name, the result must conform to the letter and spirit of the rules specified in this document. The MLCommons organization reserves the right to solely determine if a use of its name or logo is acceptable.

2. Benchmarks

The benchmark suite consists of the benchmarks shown in the following table.

Warning
change the table

| Problem | Dataset | Quality Target |
| --- | --- | --- |
| Earthquake Prediction | TBD | TBD (some error minimization) |

3. Divisions

There are two divisions of the Science Benchmark Suite, the Closed division and the Open division.

3.1. Closed Division

The Closed division requires using the same preprocessing, model, and training method as the reference implementation.

The closed division models are:

| Problem | Model |
| --- | --- |
| REPLACE: Climate segmentation | https://github.com/azrael417/mlperf-deepcam |
| REPLACE: Cosmological parameter prediction | https://github.com/sparticlesteve/cosmoflow-benchmark |
| REPLACE: Modeling catalysts | https://github.com/sparticlesteve/ocp/tree/mlperf-hpc-reference |

4. Data Set

4.1. Data State at Start of Run

Each reference implementation includes a download script or broadly available method to acquire and verify the dataset.
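As an illustration of the acquire-and-verify pattern, here is a minimal sketch; the URL, filename, and checksum are purely hypothetical, since each reference implementation ships its own download script:

```python
import hashlib
import urllib.request

# Hypothetical URL and checksum: placeholders only, not from any
# reference implementation.
DATASET_URL = "https://example.org/science-bench/dataset.tar.gz"
EXPECTED_SHA256 = "0" * 64

def download_and_verify(url: str = DATASET_URL, expected: str = EXPECTED_SHA256) -> str:
    """Download the dataset archive and verify its SHA-256 checksum."""
    path, _ = urllib.request.urlretrieve(url, "dataset.tar.gz")
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Hash in 1 MiB chunks to avoid loading the archive into memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    if digest.hexdigest() != expected:
        raise ValueError(f"checksum mismatch: {digest.hexdigest()}")
    return path
```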

The data at the start of the benchmark run should reside on a parallel file system that is persistent (>= 1 month, not subject to eviction by other users), can be downloaded to / accessed by the user, and can be shared among users at the facility. Any staging to node-local disk or memory or system burst buffer should be included in the benchmark time measurement.
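A minimal sketch of how staging time might be folded into the measured time, assuming hypothetical file-system paths and a training entry point `train_fn` that are not part of any reference implementation:

```python
import shutil
import time
from pathlib import Path

# Hypothetical locations: the dataset lives on the persistent parallel
# file system and is staged to node-local storage before training.
PFS_DATA = Path("/pfs/project/science-bench/dataset")  # persistent (>= 1 month)
LOCAL_DATA = Path("/tmp/science-bench/dataset")        # node-local staging area

def run_benchmark(train_fn) -> float:
    """Time the run so that staging to node-local disk is included."""
    start = time.perf_counter()

    # Staging counts toward the benchmark time per the rule above.
    if not LOCAL_DATA.exists():
        shutil.copytree(PFS_DATA, LOCAL_DATA)

    train_fn(LOCAL_DATA)  # training loop supplied by the implementation

    return time.perf_counter() - start
```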

Note
discuss parallel; some science benchmarks may not be parallel.

You must flush/reset the on-node caches prior to running each instance of the benchmark. Due to practicality issues, you are not required to reset off-node system-level caches.
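One possible interpretation of this rule, pending the note below, is resetting the Linux kernel page cache. A minimal sketch under that assumption (Linux only, requires root privileges):

```python
import subprocess

def drop_linux_page_cache() -> None:
    """Flush the OS page cache on the local node (Linux only, needs root).

    This interprets "on-node cache" as the kernel page cache; whether
    application-level caches must also be reset is still under discussion.
    """
    # Ensure dirty pages are written back before dropping caches.
    subprocess.run(["sync"], check=True)
    # Writing 3 to drop_caches frees the page cache, dentries, and inodes.
    subprocess.run(["sh", "-c", "echo 3 > /proc/sys/vm/drop_caches"], check=True)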

Note
discuss what exactly an on-node cache is: is this an application-level on-node cache or something else?

We otherwise follow the MLPerf Training rule Data State at Start of Run regarding consistency with the reference implementation's preprocessing and the allowance for reformatting.

5. Training Loop

5.1. Hyperparameters and Optimizer

CLOSED:

Allowed hyperparameter and optimizer settings are specified here. For anything not explicitly mentioned here, submissions must match the behavior and settings of the reference implementations.

5.2. Hyperparameters and Optimizer: Earthquake Prediction

Warning
TBD. The following values will all be replaced with application-specific values.

| Model | Name | Constraint | Definition | Reference Code |
| --- | --- | --- | --- | --- |
| CosmoFlow | global_batch_size | unconstrained | the global batch size for training | local batch_size (--batch-size) times number of workers; baseline config is 64 |
| CosmoFlow | opt_name | "sgd" | the optimizer name | --optimizer or config |
| CosmoFlow | sgd_opt_momentum | 0.9 | SGD momentum | config |
| CosmoFlow | opt_base_learning_rate | unconstrained | the base learning rate | base_lr times scaling factor, e.g. global_batch_size/base_batch_size if scaling="linear"; config |
| CosmoFlow | opt_learning_rate_warmup_epochs | unconstrained | the number of epochs for the learning rate to warm up to its base value | config |
| CosmoFlow | opt_learning_rate_warmup_factor | unconstrained | the constant factor applied during learning rate warmup | scaled learning rate / base_lr |
| CosmoFlow | opt_learning_rate_decay_boundary_epochs | list of positive integers | epochs at which the learning rate decays | config |
| CosmoFlow | opt_learning_rate_decay_factor | 0 < value < 1; a different value may be used for each decay | the learning rate decay factor(s) at the decay boundary epochs | config |
| CosmoFlow | dropout | 0 <= value < 1 | dropout regularization probability for the dense layers | dropout setting in config |
| CosmoFlow | opt_weight_decay | value >= 0 | L2 regularization parameter for the dense layers | l2 setting in config |
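To illustrate how these closed-division constraints could be checked mechanically, here is a hedged sketch; the table above is authoritative, and the validator below is hypothetical (unconstrained parameters such as global_batch_size are deliberately skipped):

```python
def validate_cosmoflow(hp: dict) -> list[str]:
    """Return a list of constraint violations for a hyperparameter dict.

    Hypothetical helper: keys mirror the hyperparameter names in the
    table above; decay factors are given as a list, one per boundary.
    """
    errors = []
    if hp.get("opt_name") != "sgd":
        errors.append("opt_name must be 'sgd'")
    if hp.get("sgd_opt_momentum") != 0.9:
        errors.append("sgd_opt_momentum must be 0.9")
    if not all(isinstance(e, int) and e > 0
               for e in hp.get("opt_learning_rate_decay_boundary_epochs", [])):
        errors.append("decay boundary epochs must be positive integers")
    if not all(0 < f < 1
               for f in hp.get("opt_learning_rate_decay_factor", [])):
        errors.append("each decay factor must satisfy 0 < value < 1")
    if not 0 <= hp.get("dropout", 0) < 1:
        errors.append("dropout must satisfy 0 <= value < 1")
    if hp.get("opt_weight_decay", 0) < 0:
        errors.append("opt_weight_decay must satisfy value >= 0")
    return errors
```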

5.3. Hyperparameters and Optimizer: Other Application (DeepCAM)

Warning
TBD. The following values will all be replaced with application-specific values.

| Model | Name | Constraint | Definition | Reference Code |
| --- | --- | --- | --- | --- |
| DeepCAM | global_batch_size | unconstrained | the global batch size for training | --local_batch_size times number of workers |
| DeepCAM | batchnorm_group_size | value >= 1 | determines how many ranks participate in the batchnorm | --batchnorm_group_size |
| DeepCAM | opt_name | Adam, AdamW, or LAMB | the optimizer name | --optimizer |
| DeepCAM | opt_eps | 1e-6 | epsilon for Adam | --adam_eps |
| DeepCAM | opt_betas | unconstrained | momentum terms for Adam-type optimizers | --optimizer_betas |
| DeepCAM | opt_weight_decay | value >= 0 | L2 weight regularization | --weight_decay |
| DeepCAM | opt_lr | unconstrained | the base learning rate | --start_lr times warmup factor |
| DeepCAM | scheduler_lr_warmup_steps | value >= 0 | the number of epochs for the learning rate to warm up to its base value | --lr_warmup_steps |
| DeepCAM | scheduler_lr_warmup_factor | value >= 1 | when warmup is used, the target learning rate will be lr_warmup_factor * start_lr | --lr_warmup_factor |
| DeepCAM | scheduler_type | multistep or cosine_annealing | specifies the learning rate schedule | --lr_schedule |
| DeepCAM | scheduler_milestones | unconstrained | if multistep, the steps at which the learning rate is decayed | milestones in --lr_schedule type="multistep",milestones="3000 10000",decay_rate="0.1" |
| DeepCAM | scheduler_decay_rate | unconstrained | if multistep, the learning rate decay factor | decay_rate in --lr_schedule type="multistep",milestones="15000 25000",decay_rate="0.1" |
| DeepCAM | scheduler_t_max | value >= 0 | for cosine_annealing, period length in steps | --lr_schedule |
| DeepCAM | scheduler_eta_min | value >= 0 | for cosine_annealing, sets the minimal learning rate | --lr_schedule |
| DeepCAM | gradient_accumulation_frequency | value >= 1 | the number of gradient accumulation steps before a weight update is performed | --gradient_accumulation_frequency |
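The --lr_schedule entries above pack several settings into one flag. A small hypothetical parser shows how such a value decomposes; the format is inferred from the examples in the table, not taken from the reference code:

```python
def parse_lr_schedule(value: str) -> dict:
    """Split an --lr_schedule value such as
    'type="multistep",milestones="15000 25000",decay_rate="0.1"'
    into a plain dictionary (hypothetical helper)."""
    fields = {}
    for part in value.split(","):
        key, _, raw = part.partition("=")
        fields[key.strip()] = raw.strip().strip('"')
    if fields.get("type") == "multistep":
        fields["milestones"] = [int(s) for s in fields["milestones"].split()]
        fields["decay_rate"] = float(fields["decay_rate"])
    return fields

# Yields {'type': 'multistep', 'milestones': [15000, 25000], 'decay_rate': 0.1}
print(parse_lr_schedule('type="multistep",milestones="15000 25000",decay_rate="0.1"'))
```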

5.4. Hyperparameters and Optimizer: Other Application (OpenCatalyst)

Warning
TBD. The following values will all be replaced with application-specific values.

| Model | Name | Constraint | Definition | Reference Code |
| --- | --- | --- | --- | --- |
| OpenCatalyst | global_batch_size | value >= 1 | the global batch size | batch_size times number of GPUs |
| OpenCatalyst | opt_name | AdamW | the optimizer name | config setting optim name |
| OpenCatalyst | opt_base_learning_rate | value > 0 | the base learning rate | config setting lr_initial |
| OpenCatalyst | opt_learning_rate_warmup_steps | value >= 0 | the number of steps for the learning rate to warm up to its base value | warmup_steps |
| OpenCatalyst | opt_learning_rate_warmup_factor | 0 <= value <= 1 | the factor applied to the learning rate at the start of warmup | warmup_factor |
| OpenCatalyst | opt_learning_rate_decay_boundary_steps | list of positive integers | the steps at which the learning rate decays | lr_milestones |

OPEN: Hyperparameters and optimizer may be freely changed.

6. Run Results

MLCommons Science Benchmark Suite submissions consist of the following two metrics: metric 1 is mandatory for a complete submission, whereas metric 2 is optional:

6.1. Strong Scaling (Time to Convergence)

This is a mandatory metric: see MLPerf Training Run Results for reference. The same rules apply here.

6.2. Weak Scaling (Throughput)

TODO

This is an optional metric. It was designed to test the training capacity of a system.

Measurement: we first define three important parameters:

  • number of models M: the number of model instances to be trained in this benchmark.

  • instance scale S: the scale at which each individual model instance is trained.

  • total utilized scale T: the total scale used for running this benchmark. For example, if all M models are trained concurrently, then T = M*S. More generally, S <= T <= M*S if (some of) the models are trained sequentially; a worked example follows the notes below.

Notes:

  • All three numbers M, S, T are chosen by the submitter. This allows the submitter to adapt their submission to the available machine resources, i.e., compute capacity and compute time.

  • S and T should be in units of compute resources, e.g., nodes, GPUs, or other accelerators. This choice should be aligned with the HPC system description. For example, if the system description table lists the number of GPUs to define the scale of the system, then S should be specified in number of GPUs.

  • S and T can be chosen independently of the submission for metric 1 (strong scaling). We encourage choosing T as large as possible, ideally full system scale, but this is not required.
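To make the resource accounting concrete, here is a small worked example; the numbers are purely illustrative:

```python
# Illustrative numbers only: 8 model instances, each trained on 128 GPUs,
# run as two sequential waves of 4 concurrent instances.
M = 8                   # number of model instances
S = 128                 # scale of each instance (GPUs)
waves = 2               # instances are trained in two sequential batches
T = (M // waves) * S    # total utilized scale: 4 concurrent * 128 = 512 GPUs

# Any valid choice must satisfy S <= T <= M * S.
assert S <= T <= M * S
print(f"M={M}, S={S}, T={T}")   # M=8, S=128, T=512
```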

The submitter then trains M models on the resource partitioning (S,T) as defined above to convergence.

We define a Time-To-Train-all (TTTa) number by computing the difference between the end time of the instance that takes the longest to converge and the start time of the instance that starts up fastest. Mathematically this can be expressed as

TTTa = max(run_stop) - min(run_start), where the max/min are taken over all M instances.

Note: the submitter is allowed to prune this number by removing results from individual training instances. As long as the minimum-number-of-models rule is satisfied (see section Benchmark Results below), the submission is valid. The submitter then uses a modified number of models M' <= M and computes TTTa over the reduced set. This allows the submitter to remove occasional outliers or stragglers that would otherwise reduce the score disproportionately.
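A minimal sketch of the TTTa computation, including the optional pruning of stragglers; the timestamps are assumed to come from the per-instance MLLOG files:

```python
def ttta(run_starts, run_stops, m_prime=None):
    """Time-To-Train-all over M (or a pruned M') instances.

    run_starts/run_stops: per-instance run_start/run_stop timestamps,
    paired by index. If m_prime is given, the longest-running stragglers
    are dropped until only m_prime instances remain.
    """
    runs = sorted(zip(run_starts, run_stops), key=lambda r: r[1])
    if m_prime is not None:
        runs = runs[:m_prime]   # prune the instances that finish last
    starts, stops = zip(*runs)
    return max(stops) - min(starts)

# Example with illustrative timestamps (seconds): pruning to M'=3 drops
# the 400 s straggler, so TTTa = 160 - 0 = 160.
print(ttta([0, 5, 3, 9], [100, 160, 120, 400], m_prime=3))
```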

Reporting: the submitter reports the tuple (T, S, M', TTTa). A separate MLLOG file must be submitted for each training instance so that reviewers can verify the quoted numbers. Merging the logging files of individual instances is not allowed.

Restrictions:

  • The submitter must not report this score on its own. It must be reported in conjunction with at least one Strong Scaling (Time to Convergence) score from the same benchmark.

  • This score does not allow for extrapolation. All reported M' training instances must have converged; extrapolating results in S or T is not allowed.

7. Benchmark Results

We follow the MLPerf Training Benchmark Results rule along with the following required number of runs per benchmark. Note that since run-to-run variability is already captured by spatial multiplexing in the case of metric 2 (weak scaling), we use the adjusted requirement that the number of trained instances M' must be at least equal to the number of runs required for metric 1.

Warning
TBD. The following values will all be replaced with application-specific values.

| Benchmark | Number of Runs (Metric 1) | M' (Metric 2) |
| --- | --- | --- |
| DeepCAM | 5 | >= 5 |
| CosmoFlow | 10 | >= 10 |
| OpenCatalyst | 5 | >= 5 |