[2022 SIGMOD] Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning Preprocessing Pipelines. [PDF] [Recording]
[2022 ATC] Cachew: Machine Learning Input Data Processing as a Service. [PDF]
[2021 ATC] Refurbish Your Training Data: Reusing Partially Augmented Samples for Faster Deep Neural Network Training. [PDF] [Slides]
Preprocessing Stall: cache partially augmented samples across all epochs within a job
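The sample-reuse idea above can be sketched in a few lines: compute the expensive, deterministic part of the augmentation pipeline once, cache it, and re-apply only the final randomized step each epoch. This is a minimal illustration of the concept, not the paper's implementation; all names are made up.

```python
import random

# Cache for the reusable (deterministic) portion of each sample's pipeline.
partial_cache = {}

def expensive_decode_and_resize(sample_id):
    # Stands in for decode + resize, the part worth caching across epochs.
    return f"decoded({sample_id})"

def cheap_random_flip(x):
    # Stands in for the final random augmentation, re-applied every epoch.
    return x[::-1] if random.random() < 0.5 else x

def load_sample(sample_id):
    if sample_id not in partial_cache:           # first epoch: run full pipeline
        partial_cache[sample_id] = expensive_decode_and_resize(sample_id)
    return cheap_random_flip(partial_cache[sample_id])

for epoch in range(3):
    batch = [load_sample(i) for i in range(4)]   # later epochs reuse the cache
```

The trade-off the paper explores is where to split the pipeline: caching too late loses augmentation diversity, caching too early saves little work.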
[2021 VLDB] Analyzing and Mitigating Data Stalls in DNN Training. [PDF]
Hyperparameter (HP) Search: stage preprocessed minibatches across all HP jobs within an epoch
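The staging idea can be sketched as a single preprocessing pass whose output is fanned out to every concurrent HP job, instead of each job decoding and augmenting the same data independently. A minimal illustration, with made-up names:

```python
def preprocess(batch_ids):
    # Stands in for the full decode/augment pipeline.
    return [f"tensor({i})" for i in batch_ids]

hp_jobs = ["lr=0.1", "lr=0.01", "lr=0.001"]      # concurrent HP-search trials
consumed = {job: [] for job in hp_jobs}

for batch_ids in ([0, 1], [2, 3]):
    minibatch = preprocess(batch_ids)            # preprocessed once per epoch...
    for job in hp_jobs:
        consumed[job].append(minibatch)          # ...and reused by every HP job
```

With N concurrent trials this cuts preprocessing work by roughly a factor of N, since the trials differ only in model hyperparameters, not in input data.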
[2020 FAST] Quiver: An Informed Storage Cache for Deep Learning. [PDF] [Slides]
Fetch Stall (Remote): share cached training data among multiple tasks
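A key observation behind this kind of shared cache is substitutability: since SGD tolerates reordering of samples within an epoch, a lookup can serve whichever of a job's still-needed samples are already cached and defer the misses. A minimal sketch of that lookup, with illustrative names rather than the system's actual API:

```python
class SharedSampleCache:
    """Cache of training samples shared across multiple training jobs."""

    def __init__(self):
        self.store = {}                  # sample_id -> data

    def put(self, sample_id, data):
        self.store[sample_id] = data

    def get_substitutable(self, wanted_ids):
        """Return (hits, misses): serve any cached samples first,
        letting the caller fetch the misses from remote storage later."""
        hits = {i: self.store[i] for i in wanted_ids if i in self.store}
        misses = [i for i in wanted_ids if i not in self.store]
        return hits, misses

cache = SharedSampleCache()
cache.put(1, "img1")                     # populated by another job's fetches
cache.put(3, "img3")
hits, misses = cache.get_substitutable([0, 1, 2, 3])
```

The caller trains on the hits immediately while the misses are fetched in the background, hiding remote-storage latency.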
[2022 NSDI] Check-N-Run: a Checkpointing System for Training Deep Learning Recommendation Models. [PDF] [Slides]
[2021 FAST] CheckFreq: Frequent, Fine-Grained DNN Checkpointing. [PDF] [Slides]
[2020 CCGRID] DeepFreeze: Towards Scalable Asynchronous Checkpointing of Deep Learning Models. [PDF]
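The common theme of the three checkpointing papers above is decoupling the in-memory snapshot (fast) from persistence to storage (slow), so training does not stall on I/O. A minimal stdlib sketch of that split, assuming a dict-shaped model state; the real systems add pipelining, consistency checks, and quantization on top:

```python
import os
import pickle
import tempfile
import threading

def checkpoint_async(model_state, path):
    """Snapshot state in memory, then persist it on a background thread."""
    snapshot = dict(model_state)             # cheap in-memory copy on the critical path
    def persist():
        tmp = path + ".tmp"
        with open(tmp, "wb") as f:
            pickle.dump(snapshot, f)
        os.replace(tmp, path)                # atomic rename avoids torn checkpoints
    t = threading.Thread(target=persist)
    t.start()
    return t                                 # join before taking the next snapshot

state = {"step": 100, "weights": [0.1, 0.2]}
ckpt = os.path.join(tempfile.gettempdir(), "ckpt.pkl")
writer = checkpoint_async(state, ckpt)
state["step"] = 101                          # training continues while I/O runs
writer.join()
```

Because the snapshot is taken before training resumes, the persisted checkpoint reflects step 100 even though the live state has already moved on.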
[2021 VLDB] tf.data: A Machine Learning Data Processing Framework. [PDF]
[2021 Ph.D. Dissertation] Accelerating Deep Learning Training: A Storage Perspective. [PDF]
[2020 MLSys] MLPerf Training Benchmark. [PDF]
[2021 Big Data Mining and Analytics] AIPerf: Automated Machine Learning as an AI-HPC Benchmark. [PDF]