The MLPerf inference submission rules are spread across the MLCommons policies and the MLCommons Inference policies documents. In addition, the rules related to power submissions are given here. The points below summarize the official rules and serve as a checklist for submitters.
- MLCommons inference results can be submitted for any hardware; past results range from a Raspberry Pi to high-end inference servers.
- A closed-division submission in the datacenter category needs ECC RAM and must also meet the networking requirements detailed here.
- Power submissions need an approved power analyzer.
- A closed-division submission needs performance and accuracy runs for all the required scenarios (as per the edge/datacenter category), with accuracy meeting the 99% or 99.9% target given in the respective task READMEs (see the sketch after this list). The model weights must not be altered except for quantization. If any of these constraints is not met, the submission cannot go under the closed division but can still be submitted under the open division.
- Reference models are mostly fp32, and the reference implementations are provided only as a guide; they are not optimized for performance and are not meant to be used directly by submitters.
- The calibration document is due one week before the submission deadline.
- A power submission needs a power analyzer approved by SPEC Power and a signed EULA to get access to SPEC PTDaemon.
- To submit under the available category, your submission system must be available (in whole or in parts, and either publicly or to customers), and the software used must be either open source or an official or beta release as of the submission deadline. Submissions using a nightly release, for example, cannot be made under the available category.
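As a rough illustration of the closed-division accuracy constraint above, the sketch below checks a measured accuracy against the 99% (or 99.9%) target taken relative to the fp32 reference accuracy. The function name and numbers are illustrative only; the authoritative per-task targets are listed in the respective task READMEs.

```python
# Illustrative sketch (not an official tool): check a closed-division
# accuracy result against the 99% / 99.9% target, expressed as a fraction
# of the fp32 reference model's accuracy.

def meets_closed_target(measured_acc: float,
                        reference_fp32_acc: float,
                        target_fraction: float = 0.99) -> bool:
    """True if the measured accuracy reaches the required fraction
    (0.99 or 0.999) of the fp32 reference accuracy."""
    return measured_acc >= target_fraction * reference_fp32_acc

# Example with ResNet50-style numbers (fp32 reference ~76.46% Top-1,
# so the 99% target is ~75.70% Top-1):
print(meets_closed_target(75.9, 76.46, 0.99))    # True
print(meets_closed_target(75.9, 76.46, 0.999))   # False
```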
MLPerf inference submissions are expected across a wide range of hardware and software stacks. For this reason, MLCommons provides only reference implementations, which can guide submitters in building their own optimized implementations for their software/hardware stack. All previous implementations are also available in the MLCommons Inference results repositories and can likewise guide submitters in developing their own implementations.
The MLCommons taskforce on automation and reproducibility has automated all the MLCommons inference tasks using the MLCommons CM language, and this README can guide you in running the reference implementations with very minimal effort. Currently, this automation supports the MLCommons reference implementations, the Nvidia implementations, and the C++ implementations for ONNX Runtime and TFLite. Feel free to join the taskforce Discord channel if you have any questions.
The previous MLPerf inference results are aggregated in the Collective Knowledge platform (MLCommons CK playground) as reproducible experiments; submitters can use them to compare their results with previous ones while adding various derived metrics (such as performance/watt) and constraints.
- A closed-division submission in the datacenter category needs Offline and Server scenario runs, with a minimum of ten minutes for each.
- A closed-division submission in the edge category needs SingleStream, MultiStream (only for ResNet50 and RetinaNet), and Offline scenario runs, with a minimum of ten minutes for each scenario.
- In addition, two compliance runs (three for ResNet50) are needed for the closed division, each taking at least ten minutes per scenario.
- The SingleStream, MultiStream, and Server scenarios use early stopping and so can always finish in around ten minutes.
- The Offline scenario needs a minimum of 24,576 input queries to be processed, which can take hours for low-throughput models such as 3D-UNet and LLMs.
- The open division has no accuracy constraints, requires no compliance runs, and can be submitted for any single scenario. There is also no constraint on the model used, except that it must be trained on the dataset used in the corresponding MLPerf inference task.
- A power submission needs an extra ranging run to determine the peak current usage, and this often doubles the overall experiment run time (see the sketch below).
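To get a feel for these run times, the back-of-the-envelope sketch below estimates one Offline performance run from an assumed throughput: the run must process at least 24,576 queries and last at least ten minutes, and a power submission roughly doubles this because of the extra ranging run. The throughput figure is a made-up example, not a measured result.

```python
# Back-of-the-envelope estimate of one Offline performance run
# (illustrative only; actual run time depends on the measured throughput).
MIN_OFFLINE_QUERIES = 24576   # minimum Offline query count
MIN_DURATION_S = 600          # minimum run duration: ten minutes

def offline_run_seconds(throughput_qps: float, power_run: bool = False) -> float:
    """Estimated wall-clock time of a single Offline performance run."""
    seconds = max(MIN_OFFLINE_QUERIES / throughput_qps, MIN_DURATION_S)
    # A power submission adds a ranging run, roughly doubling the time.
    return 2 * seconds if power_run else seconds

# Example with an assumed low throughput of 2 queries/second:
print(offline_run_seconds(2.0) / 3600)                  # ~3.4 hours
print(offline_run_seconds(2.0, power_run=True) / 3600)  # ~6.8 hours
```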
- The MLCommons Inference submission checker is provided to ensure that all submissions pass the required checks (a minimal invocation sketch follows this list).
- In the unlikely event that the submission checker reports an error for your submission, please raise a GitHub issue here.
- Any submission that passes the submission checker is valid to go to the review discussions, but submitters are still required to answer any queries and fix any issues reported by other submitters.
- Ensure that the `system_desc_id.json` file has meaningful entries; the submission checker only checks for the existence of the fields.
- For power submissions, the power settings and analyzer table files must be submitted, and even though the submission checker checks for the existence of these files, their content must be checked manually for validity.
- README files in the submission directory must be checked to make sure that the instructions are reproducible.
- For closed datacenter submissions, the ECC RAM and networking requirements must be ensured.
- The submission checker might report warnings, and some of these warnings can warrant an answer from the submitter.
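As a convenience, the submission checker can be run locally before uploading. The sketch below shows one way to invoke it from Python; the script path inside the MLCommons inference repository and the `--input` flag are assumptions based on the upstream tool, so check `submission_checker.py --help` in your checkout for the actual options.

```python
# Minimal sketch: invoke the MLPerf inference submission checker on a local
# submission tree. Paths and flags are assumptions -- verify them against
# the tools/submission/submission_checker.py script in your inference checkout.
import subprocess
import sys

def run_submission_checker(inference_repo: str, submission_dir: str) -> int:
    """Run the submission checker and return its exit code (0 means it passed)."""
    cmd = [
        sys.executable,
        f"{inference_repo}/tools/submission/submission_checker.py",
        "--input", submission_dir,
    ]
    return subprocess.run(cmd).returncode

if __name__ == "__main__":
    # Hypothetical local paths to the inference repo and the submission folder.
    status = run_submission_checker("inference", "my_submission")
    print("submission checker passed" if status == 0 else "submission checker found issues")
```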
- Two new benchmarks, GPT-J and GPT-3, are added, and DLRMv2 replaces DLRM.
- The submission checker now checks for non-empty README files and for the mandatory system-description and power-related fields.
- A new script is provided which can be used to infer scenario results and low-accuracy results from a high-accuracy result.
- `min_query_count` is removed for all scenarios except Offline due to early stopping. SingleStream now needs a minimum of 64 queries and MultiStream needs 662 queries, as mandated by the early stopping criteria.