diff --git a/contents/benchmarking/benchmarking.qmd b/contents/benchmarking/benchmarking.qmd index 3651ac5df..a7ddf90a4 100644 --- a/contents/benchmarking/benchmarking.qmd +++ b/contents/benchmarking/benchmarking.qmd @@ -199,17 +199,16 @@ Example: Tasks for natural language processing benchmarks might include sentimen Once a task is defined, benchmarks require metrics to quantify performance. These metrics offer objective measures to compare different models or systems. In classification tasks, metrics like accuracy, precision, recall, and [F1 score](https://en.wikipedia.org/wiki/F-score) are commonly used. Mean squared or absolute errors might be employed for regression tasks. -#### Baselines +#### Baselines and Baseline Models -Benchmarks often include baseline models or reference implementations. These serve as starting points or minimum performance standards against which new models or techniques can be compared. In many benchmark suites, simple models like linear regression or basic neural networks serve as baselines to provide context for more complex model evaluations. +Benchmarks often include baseline models or reference implementations. These usually serve as starting points or minimum performance standards for comparing new models or techniques. Baseline models help researchers measure the effectiveness of new algorithms. -In classification tasks, metrics like accuracy, precision, recall, and [F1 score](https://en.wikipedia.org/wiki/F-score) are commonly used. Mean squared or absolute errors might be employed for regression tasks. +In benchmark suites, simple models like linear regression or basic neural networks are often the common baselines. These provide context when evaluating more complex models. By comparing against these simpler models, researchers can quantify improvements from advanced approaches. -#### Baseline Models +Performance metrics vary by task, but here are some examples: -Benchmarks often include baseline models or reference implementations. These serve as starting points or minimum performance standards against which new models or techniques can be compared. - -Example: In many benchmark suites, simple models like linear regression or basic neural networks serve as baselines to provide context for more complex model evaluations. +* Classification tasks use metrics such as accuracy, precision, recall, and F1 score. +* Regression tasks often use mean squared error or mean absolute error. #### Hardware and Software Specifications @@ -237,11 +236,12 @@ Example: A benchmark might highlight that while Model A scored higher than Model ### Training vs. Inference -The development life cycle of a machine learning model involves two critical phases - training and inference. Training is the process of learning patterns from data to create the model. Inference refers to the model making predictions on new unlabeled data. Both phases play indispensable yet distinct roles. Consequently, each phase warrants rigorous benchmarking to evaluate performance metrics like speed, accuracy, and computational efficiency. +The development life cycle of a machine learning model involves two critical phases - training and inference. [Training](../training/training.qmd), as you may recall, is the process of learning patterns from data to create the model. Inference refers to the model making predictions on new unlabeled data. Both phases play indispensable yet distinct roles. Consequently, each phase warrants rigorous benchmarking to evaluate performance metrics like speed, accuracy, and computational efficiency. Benchmarking the training phase provides insights into how different model architectures, hyperparameter values, and optimization algorithms impact the time and resources needed to train the model. For instance, benchmarking shows how neural network depth affects training time on a given dataset. Benchmarking also reveals how hardware accelerators like GPUs and TPUs can speed up training. -On the other hand, benchmarking inference evaluates model performance in real-world conditions after deployment. Key metrics include latency, throughput, memory footprint, and power consumption. Inference benchmarking determines if a model meets the requirements of its target application regarding response time and device constraints, which is typically the focus of TinyML. However, we will discuss these broadly to ensure a general understanding. +On the other hand, benchmarking inference evaluates model performance in real-world conditions after deployment. Key metrics include latency, throughput, memory footprint, and power consumption. This type of benchmarking determines if a model meets the requirements of its target application regarding response time and device constraints. However, we will discuss these broadly to ensure a general understanding. + ### Training Benchmarks @@ -271,7 +271,7 @@ The following metrics are often considered important: 4. **Memory Consumption:** The amount of memory the training process uses. Memory consumption can be a limiting factor for training large models or datasets. For example, Google researchers faced significant memory consumption challenges when training BERT. The model has hundreds of millions of parameters, requiring large amounts of memory. The researchers had to develop techniques to reduce memory consumption, such as gradient checkpointing and model parallelism. -5. ** Energy Consumption: ** The energy consumed during training. As machine learning models become more complex, energy consumption has become an important consideration. Training large machine learning models can consume significant energy, leading to a large carbon footprint. For instance, the training of OpenAI's GPT-3 was estimated to have a carbon footprint equivalent to traveling by car for 700,000 kilometers. +5. **Energy Consumption:** The energy consumed during training. As machine learning models become more complex, energy consumption has become an important consideration. Training large machine learning models can consume significant energy, leading to a large carbon footprint. For instance, the training of OpenAI's GPT-3 was estimated to have a carbon footprint equivalent to traveling by car for 700,000 kilometers. 6. **Throughput:** The number of training samples processed per unit time. Higher throughput generally indicates a more efficient training process. The throughput is an important metric to consider when training a recommendation system for an e-commerce platform. A high throughput ensures that the model can process large volumes of user interaction data promptly, which is crucial for maintaining the relevance and accuracy of the recommendations. But it's also important to understand how to balance throughput with latency bounds. Therefore, a latency-bounded throughput constraint is often imposed on service-level agreements for data center application deployments. @@ -333,8 +333,8 @@ Metrics: * Time to completion for each workload * Memory bandwidth -**[MLPerf Training Benchmark](https://github.com/mlcommons/training):** MLPerf is a suite of benchmarks that grew out of DAWNBench and Fathom and other collective works such as [DeepBench](https://github.com/baidu-research/DeepBench) that was designed to measure the performance of machine learning hardware, software, and services. The MLPerf Training benchmark [@mattson2020mlperf] focuses on the time it takes to train models to a target quality metric. It includes diverse workloads, such as image classification, object detection, translation, and reinforcement learning. It's metrics include: - +#### Example Use Case + Consider a scenario where we want to benchmark the training of an image classification model on a specific hardware platform. 1. **Task:** The task is to train a convolutional neural network (CNN) for image classification on the CIFAR-10 dataset. @@ -355,7 +355,7 @@ Inference in machine learning refers to using a trained model to make prediction When we build machine learning models, our ultimate goal is to deploy them in real-world applications where they can provide accurate and reliable predictions on new, unseen data. This process of using a trained model to make predictions is known as inference. A machine learning model's real-world performance can differ significantly from its performance on training or validation datasets, which makes benchmarking inference a crucial step in the development and deployment of machine learning models. -Benchmarking inference allows us to evaluate how well a machine-learning model performs in real-world scenarios. This evaluation ensures that the model is practical and reliable when deployed in applications, providing a more comprehensive understanding of the model's behavior with real data. Additionally, benchmarking can help identify potential bottlenecks or limitations in the model's performance. For example, if a model takes less time to predict, it may be impractical for real-time applications such as autonomous driving or voice assistants. +Benchmarking inference allows us to evaluate how well a machine-learning model performs in real-world scenarios. This evaluation ensures that the model is practical and reliable when deployed in applications, providing a more comprehensive understanding of the model's behavior with real data. Additionally, benchmarking can help identify potential bottlenecks or limitations in the model's performance. For example, if a model takes too long to predict, it may be impractical for real-time applications such as autonomous driving or voice assistants. Resource efficiency is another critical aspect of inference, as it can be computationally intensive and require significant memory and processing power. Benchmarking helps ensure that the model is efficient regarding resource usage, which is particularly important for edge devices with limited computational capabilities, such as smartphones or IoT devices. Moreover, benchmarking allows us to compare the performance of our model with competing models or previous versions of the same model. This comparison is essential for making informed decisions about which model to deploy in a specific application. @@ -371,11 +371,9 @@ Finally, it is vital to ensure that the model's predictions are not only accurat 4. **Throughput:** Throughput assesses the system's capacity by measuring the number of inferences or predictions a machine learning model can handle within a specific unit of time. Consider a speech recognition system that employs a Recurrent Neural Network (RNN) as its underlying model; if this system can process and understand 50 different audio clips in a minute, then its throughput rate stands at 50 clips per minute. -3. **Energy Efficiency:** Energy efficiency is a metric that determines the amount of energy consumed by the machine learning model to perform a single inference. A prime example of this would be a natural language processing model built on a Transformer network architecture; if it utilizes 0.1 Joules of energy to translate a sentence from English to French, its energy efficiency is measured at 0.1 Joules per inference. +5. **Energy Efficiency:** Energy efficiency is a metric that determines the amount of energy consumed by the machine learning model to perform a single inference. A prime example of this would be a natural language processing model built on a Transformer network architecture; if it utilizes 0.1 Joules of energy to translate a sentence from English to French, its energy efficiency is measured at 0.1 Joules per inference. -6. **Energy Efficiency:** Energy efficiency is a metric that determines the amount of energy consumed by the machine learning model to perform a single inference. A prime example of this would be a natural language processing model built on a Transformer network architecture; if it utilizes 0.1 Joules of energy to translate a sentence from English to French, its energy efficiency is measured at 0.1 Joules per inference. - -7. **Memory Usage:** Memory usage quantifies the volume of RAM needed by a machine learning model to carry out inference tasks. A relevant example to illustrate this would be a face recognition system based on a CNN; if such a system requires 150 MB of RAM to process and recognize faces within an image, its memory usage is 150 MB. +6. **Memory Usage:** Memory usage quantifies the volume of RAM needed by a machine learning model to carry out inference tasks. A relevant example to illustrate this would be a face recognition system based on a CNN; if such a system requires 150 MB of RAM to process and recognize faces within an image, its memory usage is 150 MB. #### Tasks @@ -427,8 +425,6 @@ Metrics: **[OpenVINO toolkit](https://www.intel.com/content/www/us/en/developer/tools/openvino-toolkit/overview.html):** OpenVINO toolkit provides a benchmark tool to measure the performance of deep learning models for various tasks, such as image classification, object detection, and facial recognition, on Intel hardware. It offers detailed insights into the models' inference performance on different hardware configurations. It's metrics include: -OpenVINO toolkit provides a benchmark tool to measure the performance of deep learning models for various tasks, such as image classification, object detection, and facial recognition, on Intel hardware. It offers detailed insights into the models' inference performance on different hardware configurations. - Metrics: * Inference time @@ -436,7 +432,7 @@ Metrics: * Latency * CPU and GPU utilization -*Example Use Case* +#### Example Use Case Consider a scenario where we want to evaluate the inference performance of an object detection model on a specific edge device.