CosmoFlow TensorFlow Keras benchmark implementation

WARNING: this repo is old. For the latest MLPerf HPC reference implementation of cosmoflow, see https://github.com/mlcommons/hpc/tree/main/cosmoflow

This is an implementation of the CosmoFlow 3D convolutional neural network for benchmarking. It is written in TensorFlow with the Keras API and uses Horovod for distributed training.

You can find the previous TensorFlow implementation which accompanied the CosmoFlow paper at https://github.com/NERSC/CosmoFlow

Datasets

The dataset we use for this benchmark comes from simulations run by the ExaLearn group and hosted at NERSC. The following web portal describes the technical content of the dataset and provides links to the raw data.

https://portal.nersc.gov/project/m3363/

For this benchmark we currently use a preprocessed version of the dataset, which consists of crops of size (128, 128, 128, 4) stored in TFRecord format. This preprocessing is done using the prepare.py script included in this package. We describe here how to get access to this processed dataset, but please refer to the ExaLearn web portal for additional technical details.
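As a rough illustration of how a cubic simulation volume could be tiled into 128-voxel crops, here is a minimal sketch. It is not taken from prepare.py: the function name `crop_offsets`, the non-overlapping tiling, and the 512-voxel edge length in the usage line are all assumptions made for illustration.

```python
from itertools import product

CROP_SIZE = 128  # crop edge length, from the (128, 128, 128, 4) shape above


def crop_offsets(volume_size, crop_size=CROP_SIZE):
    """Return (x, y, z) start offsets of non-overlapping cubic crops.

    Hypothetical helper: assumes the volume is a cube whose edge length
    is an exact multiple of the crop size.
    """
    if volume_size % crop_size != 0:
        raise ValueError("volume_size must be a multiple of crop_size")
    starts = range(0, volume_size, crop_size)
    return list(product(starts, repeat=3))


# Assumed 512-voxel cube edge (illustrative value, not stated in this README).
offsets = crop_offsets(512)
print(len(offsets))  # number of crops per volume under this assumption
```

Each offset would then select one crop from an array of shape (N, N, N, 4) by slicing the three spatial axes from the offset to offset + 128 and keeping all 4 channels.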

Globus is currently the recommended way to transfer the dataset locally. There is a Globus endpoint at:

https://app.globus.org/file-manager?origin_id=d0b1b73a-efd3-11e9-993f-0a8c187e8c12&origin_path=%2F

The contents are also available via HTTPS at:

https://portal.nersc.gov/project/dasrepo/cosmoflow-benchmark/

MLPerf HPC v1.0 preliminary dataset

Preprocessed TFRecord files are available in a 1.7 TB tarball named cosmoUniverse_2019_05_4parE_tf_v2.tar. It contains subfolders for the train/val/test file splits.

In this preparation, there are 524288 samples for training and 65536 samples for validation. The TFRecord files are written with gzip compression to reduce total storage size.
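Because the files are gzip-compressed, any reader has to decompress before parsing the TFRecord framing. The sketch below is a standard-library illustration of that framing (an 8-byte little-endian length, a 4-byte checksum, the payload, and a second 4-byte checksum per record). It is not how this benchmark loads data, which presumably goes through TensorFlow's own input pipeline. The function names are hypothetical, and the checksums are skipped rather than verified.

```python
import gzip
import struct


def iter_tfrecord_payloads(path):
    """Yield raw record payloads from a gzip-compressed TFRecord file.

    Illustrative only: the two per-record checksums are read and
    discarded, not verified.
    """
    with gzip.open(path, "rb") as f:
        while True:
            header = f.read(8)
            if not header:
                break  # clean end of file
            if len(header) < 8:
                raise ValueError("truncated record header")
            (length,) = struct.unpack("<Q", header)  # little-endian uint64
            f.read(4)  # checksum of the length field (skipped)
            payload = f.read(length)
            if len(payload) < length:
                raise ValueError("truncated record payload")
            f.read(4)  # checksum of the payload (skipped)
            yield payload


def count_records(path):
    """Count the records in one gzip-compressed TFRecord file."""
    return sum(1 for _ in iter_tfrecord_payloads(path))
```

Each payload is a serialized record that still needs decoding. A helper like `count_records` can serve as a quick sanity check that a transferred file is complete and readable.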

MLPerf HPC v0.7 dataset

The preprocessed dataset in TFRecord format is in the cosmoUniverse_2019_05_4parE_tf folder, which contains training and validation subfolders. There are 262144 samples for training and 65536 samples for validation/testing. The combined size of the dataset is 5.1 TB.

For getting started, there is also a small tarball (179 MB) with 32 training samples and 32 validation samples, called cosmoUniverse_2019_05_4parE_tf_small.tgz.
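Once a tarball has been downloaded, it can be inspected and unpacked with Python's standard tarfile module. The following is a minimal sketch. The helper names are hypothetical, and the internal folder layout of the archive is not assumed here, so it is worth listing the top-level entries before extracting.

```python
import tarfile
from pathlib import Path


def top_level_entries(tar_path):
    """Return the sorted top-level names inside a tar archive."""
    names = set()
    # "r:*" lets tarfile detect the compression (.tar, .tgz, ...) itself.
    with tarfile.open(tar_path, "r:*") as tar:
        for member in tar.getmembers():
            parts = Path(member.name).parts
            if parts:
                names.add(parts[0])
    return sorted(names)


def extract_archive(tar_path, dest_dir):
    """Extract the whole archive into dest_dir and return that path."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(tar_path, "r:*") as tar:
        tar.extractall(dest)
    return dest
```

For example, `top_level_entries("cosmoUniverse_2019_05_4parE_tf_small.tgz")` would show which folders the small tarball creates, assuming the file is in the current directory.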

Running the benchmark

Submission scripts are in the scripts directory. YAML configuration files go in the configs directory.

Running at NERSC

sbatch -N 64 scripts/train_cori.sh
