Pull request #370: Some typos and a suggestion to look for more

Merged: 1 commit, Aug 21, 2024
contents/data_engineering/data_engineering.qmd (4 changes: 2 additions & 2 deletions)

@@ -144,7 +144,7 @@ The quality assurance that comes with popular pre-existing datasets is important

While platforms like Kaggle and UCI Machine Learning Repository are invaluable resources, it's essential to understand the context in which the data was collected. Researchers should be wary of potential overfitting when using popular datasets, as multiple models might have been trained on them, leading to inflated performance metrics. Sometimes, these [datasets do not reflect the real-world data](https://venturebeat.com/uncategorized/3-big-problems-with-datasets-in-ai-and-machine-learning/).

- In addition, bias, validity, and reproducibility issues may exist in these datasets, and there has been a growing awareness of these issues in recent years. Furthermore, using the same dataset to train different models as shown in @fig-misalignment can sometimes create misalignment: training multiple models using the same dataset resultsi in a 'misalignment' between the models and the world, in which an entire ecosystem of models reflects only a narrow subset of the real-world data.
+ In addition, bias, validity, and reproducibility issues may exist in these datasets, and there has been a growing awareness of these issues in recent years. Furthermore, using the same dataset to train different models as shown in @fig-misalignment can sometimes create misalignment: training multiple models using the same dataset results in a 'misalignment' between the models and the world, in which an entire ecosystem of models reflects only a narrow subset of the real-world data.

![Training different models on the same dataset. Source: (icons from left to right: Becris; Freepik; Freepik; Paul J; SBTS2018).](images/png/dataset_myopia.png){#fig-misalignment}

@@ -300,7 +300,7 @@ Data often comes from diverse sources and can be unstructured or semi-structured
* Using techniques like dimensionality reduction

Data validation serves a broader role than ensuring adherence to certain standards, like preventing temperature values from falling below absolute zero. These issues arise in TinyML because sensors may malfunction or temporarily produce incorrect readings; such transients are not uncommon. Therefore, it is imperative to catch data errors early before propagating through the data pipeline. Rigorous validation processes, including verifying the initial annotation practices, detecting outliers, and handling missing values through techniques like mean imputation, contribute directly to the quality of datasets. This, in turn, impacts the performance, fairness, and safety of the models trained on them.
- Let’s take a look at @fig-data-engineering-kws2 for an example of a data processing pipeline. In the context of TinyML, the Multilingual Spoken Words Corpus (MSWC) is an example of data processing pipelines—systematic and automated workflows for data transformation, storage, and processing. The input data (which's a collection of short recordings) goes through sevreral phases of processing, such as audio-word alignemnt and keyword extraction. By streamlining the data flow, from raw data to usable datasets, data pipelines improve productivity and facilitate the rapid development of machine learning models. The MSWC is an expansive and expanding collection of audio recordings of spoken words in 50 different languages, which are collectively used by over 5 billion people. This dataset is intended for academic study and business uses in areas like keyword identification and speech-based search. It is openly licensed under Creative Commons Attribution 4.0 for broad usage.
+ Let’s take a look at @fig-data-engineering-kws2 for an example of a data processing pipeline. In the context of TinyML, the Multilingual Spoken Words Corpus (MSWC) is an example of data processing pipelines—systematic and automated workflows for data transformation, storage, and processing. The input data (which's a collection of short recordings) goes through several phases of processing, such as audio-word alignement and keyword extraction. By streamlining the data flow, from raw data to usable datasets, data pipelines improve productivity and facilitate the rapid development of machine learning models. The MSWC is an expansive and expanding collection of audio recordings of spoken words in 50 different languages, which are collectively used by over 5 billion people. This dataset is intended for academic study and business uses in areas like keyword identification and speech-based search. It is openly licensed under Creative Commons Attribution 4.0 for broad usage.

![An overview of the Multilingual Spoken Words Corpus (MSWC) data processing pipeline. Source: @mazumder2021multilingual.](images/png/data_engineering_kws2.png){#fig-data-engineering-kws2}
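
To make the validation steps quoted above concrete (a range check on sensor readings, then mean imputation for the gaps), here is a minimal sketch in Python; the readings and bounds are hypothetical, not taken from the chapter:

```python
import numpy as np
import pandas as pd

# Hypothetical temperature readings in degrees Celsius; -300.0 is
# physically impossible (below absolute zero) and NaN marks a dropped sample.
readings = pd.Series([21.4, 22.0, -300.0, np.nan, 21.8])

# Range validation: anything outside a plausible physical range is
# treated as a sensor transient and discarded.
readings[~readings.between(-273.15, 125.0)] = np.nan

# Mean imputation: fill missing/invalid values with the mean of the
# remaining valid readings.
readings = readings.fillna(readings.mean())
print(readings.tolist())  # [21.4, 22.0, 21.733..., 21.733..., 21.8]
```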

contents/frameworks/frameworks.qmd (2 changes: 1 addition & 1 deletion)

@@ -300,7 +300,7 @@ This automatic differentiation is a powerful feature of tensors in frameworks li

#### Graph Definition

- Computational graphs are a key component of deep learning frameworks like TensorFlow and PyTorch. They allow us to express complex neural network architectures efficiently and differentiatedly. A computational graph consists of a directed acyclic graph (DAG) where each node represents an operation or variable, and edges represent data dependencies between them.
+ Computational graphs are a key component of deep learning frameworks like TensorFlow and PyTorch. They allow us to express complex neural network architectures efficiently and differently. A computational graph consists of a directed acyclic graph (DAG) where each node represents an operation or variable, and edges represent data dependencies between them.

It's important to differentiate computational graphs from neural network diagrams, such as those for multilayer perceptrons (MLPs), which depict nodes and layers. Neural network diagrams, as depicted in [Chapter 3](../dl_primer/dl_primer.qmd), visualize the architecture and flow of data through nodes and layers, providing an intuitive understanding of the model's structure. In contrast, computational graphs provide a low-level representation of the underlying mathematical operations and data dependencies required to implement and train these networks.
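
To make the computational-graph idea concrete, here is a minimal sketch using PyTorch's autograd; the tensors and operations are illustrative, not an example from the chapter:

```python
import torch

# Each operation below adds a node to a directed acyclic graph (DAG);
# edges record which tensors each result depends on.
x = torch.tensor(2.0, requires_grad=True)
w = torch.tensor(3.0, requires_grad=True)
y = w * x       # multiplication node
z = y + 1.0     # addition node

# Walking the graph backward applies the chain rule along the edges.
z.backward()
print(x.grad)   # dz/dx = w = 3.0
print(w.grad)   # dz/dw = x = 2.0
```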

contents/hw_acceleration/hw_acceleration.qmd (6 changes: 3 additions & 3 deletions)

@@ -88,7 +88,7 @@ For example, GPUs achieve high throughput via massively parallel architectures.

#### Managing Silicon Area and Costs

- Chip area directly impacts manufacturing cost. Larger die sizes require more materials, lower yields, and higher defect rates. Mulit-die packages help scale designs but add packaging complexity. Silicon area depends on:
+ Chip area directly impacts manufacturing cost. Larger die sizes require more materials, lower yields, and higher defect rates. Multi-die packages help scale designs but add packaging complexity. Silicon area depends on:

* **Computational resources** - e.g., number of cores, memory, caches
* **Manufacturing process node** - smaller transistors enable higher density
@@ -132,7 +132,7 @@ We then progressively consider more programmable and adaptable architectures, di

By structuring the analysis along this spectrum, we aim to illustrate the fundamental tradeoffs between utilization, efficiency, programmability, and flexibility in accelerator design. The optimal balance point depends on the constraints and requirements of the target application. This spectrum perspective provides a framework for reasoning about hardware choices for machine learning and the capabilities required at each level of specialization.

- @fig-design-tradeoffs illustrates the complex interplay between flexibility, performance, functional diversity, and area of architecture design. Notice how the ASIC is on the bottom-right corner, with minimal area, flexibility, and power consumption and maximal performance, due to its highly specialized application-specific nature. A key tradeoff is functinoal diversity vs performance: general purpose architechtures can serve diverse applications but their application performance is degraded as compared to more customized architectures.
+ @fig-design-tradeoffs illustrates the complex interplay between flexibility, performance, functional diversity, and area of architecture design. Notice how the ASIC is on the bottom-right corner, with minimal area, flexibility, and power consumption and maximal performance, due to its highly specialized application-specific nature. A key tradeoff is functional diversity vs performance: general purpose architectures can serve diverse applications but their application performance is degraded as compared to more customized architectures.

The progression begins with the most specialized option, ASICs purpose-built for AI, to ground our understanding in the maximum possible optimizations before expanding to more generalizable architectures. This structured approach aims to elucidate the accelerator design space.

@@ -842,7 +842,7 @@ Intel and IBM are leading commercial efforts in neuromorphic hardware. Intel's L

Spiking neural networks (SNNs) [@maass1997networks] are computational models for neuromorphic hardware. Unlike deep neural networks communicating via continuous values, SNNs use discrete spikes that are more akin to biological neurons. This allows efficient event-based computation rather than constant processing. Additionally, SNNs consider the temporal and spatial characteristics of input data. This better mimics biological neural networks, where the timing of neuronal spikes plays an important role. However, training SNNs remains challenging due to the added temporal complexity. @fig-spiking provides an overview of the spiking methodology: (a) Diagram of a neuron; (b) Measuring an action potential propagated along the axon of a neuron. Only the action potential is detectable along the axon; (c) The neuron's spike is approximated with a binary representation; (d) Event-Driven Processing; (e) Active Pixel Sensor and Dynamic Vision Sensor.

- ![Neuromoprhic spiking. Source: @eshraghian2023training.](images/png/aimage4.png){#fig-spiking}
+ ![Neuromorphic spiking. Source: @eshraghian2023training.](images/png/aimage4.png){#fig-spiking}

You can also watch @vid-snn linked below for a more detailed explanation.
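
For a concrete sense of the discrete, event-based behavior described above, here is a minimal leaky integrate-and-fire sketch in Python; the decay factor, threshold, and constant input current are illustrative values, not parameters from @eshraghian2023training:

```python
import numpy as np

def lif_neuron(currents, beta=0.9, threshold=1.0):
    """Leaky integrate-and-fire: the membrane potential decays by beta
    each step, integrates the input current, and emits a binary spike
    (then resets) when it crosses the threshold."""
    potential, spikes = 0.0, []
    for current in currents:
        potential = beta * potential + current
        if potential >= threshold:
            spikes.append(1)
            potential = 0.0  # reset after spiking
        else:
            spikes.append(0)
    return spikes

# A constant input yields periodic binary spikes rather than a
# continuous activation value.
print(lif_neuron(np.full(10, 0.3)))  # [0, 0, 0, 1, 0, 0, 0, 1, 0, 0]
```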

contents/ondevice_learning/ondevice_learning.qmd (10 changes: 5 additions & 5 deletions)

@@ -195,7 +195,7 @@ A specific algorithmic technique is Quantization-Aware Scaling (QAS), which impr

As we discussed in the Model Optimizations chapter, quantization is the process of mapping a continuous range of values to a discrete set of values. In the context of neural networks, quantization often involves reducing the precision of the weights and activations from 32-bit floating point to lower-precision formats such as 8-bit integers. This reduction in precision can significantly reduce the computational cost and memory footprint of the model, making it suitable for deployment on low-precision hardware. @fig-float-int-quantization is an example of float-to-integer quantization.

- ![Float to integer qunatization. Source: [Nvidia.](https://developer-blogs.nvidia.com/wp-content/uploads/2021/07/qat-training-precision.png)](images/png/ondevice_quantization_matrix.png){#fig-float-int-quantization}
+ ![Float to integer quantization. Source: [Nvidia.](https://developer-blogs.nvidia.com/wp-content/uploads/2021/07/qat-training-precision.png)](images/png/ondevice_quantization_matrix.png){#fig-float-int-quantization}

However, the quantization process can also introduce quantization errors that can degrade the model's performance. Quantization-aware scaling is a technique that aims to minimize these errors by adjusting the scale factors used in the quantization process.
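
To make the scale-factor mapping concrete, here is a minimal float-to-int8 sketch in Python; the weight values are made up, and real schemes such as QAS tune the scales to shrink the rounding error noted above:

```python
import numpy as np

def quantize_int8(x):
    # One scale factor maps the float range onto the symmetric int8
    # range [-127, 127]; rounding is where quantization error enters.
    scale = np.max(np.abs(x)) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

x = np.array([0.02, -0.53, 1.10, 0.87], dtype=np.float32)
q, scale = quantize_int8(x)
print(q, scale)                      # int8 codes and their scale
print(q.astype(np.float32) * scale)  # dequantized approximation of x
```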

@@ -462,7 +462,7 @@ However, we cannot just reduce communication by sending pieces of those gradient

### Optimized Aggregation

- In addition to reducing the communication overhead, optimizing the aggregation function can improve model training speed and accuracy in certain federated learning use cases. While the standard for aggregation is just averaging, various other approaches can improve model efficiency, accuracy, and security. One alternative is clipped averaging, which clips the model updates within a specific range. Another strategy to preserve security is differential privacy average aggregation. This approach integrates differential privacy into the aggregations tep to protect client identities. Each client adds a layer of random noise to their updates before communicating to the server. The server then updates the server with the noisy updates, meaning that the amount of noise needs to be tuned carefully to balance privacy and accuracy.
+ In addition to reducing the communication overhead, optimizing the aggregation function can improve model training speed and accuracy in certain federated learning use cases. While the standard for aggregation is just averaging, various other approaches can improve model efficiency, accuracy, and security. One alternative is clipped averaging, which clips the model updates within a specific range. Another strategy to preserve security is differential privacy average aggregation. This approach integrates differential privacy into the aggregation step to protect client identities. Each client adds a layer of random noise to their updates before communicating to the server. The server then updates the server with the noisy updates, meaning that the amount of noise needs to be tuned carefully to balance privacy and accuracy.

In addition to security-enhancing aggregation methods, there are several modifications to the aggregation methods that can improve training speed and performance by adding client metadata along with the weight updates. Momentum aggregation is a technique that helps address the convergence problem. In federated learning, client data can be extremely heterogeneous depending on the different environments in which the devices are used. That means that many models with heterogeneous data may need help to converge. Each client stores a momentum term locally, which tracks the pace of change over several updates. With clients communicating this momentum, the server can factor in the rate of change of each update when changing the global model to accelerate convergence. Similarly, weighted aggregation can factor in the client performance or other parameters like device type or network connection strength to adjust the weight with which the server should incorporate the model updates. Further description of specific aggregation algorithms is described by @moshawrab2023reviewing.
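
To illustrate the clipped and noise-protected aggregation strategies described above, here is a minimal sketch in Python; the clip norm and noise scale are illustrative and, as the text notes, must be tuned to balance privacy and accuracy:

```python
import numpy as np

def dp_clipped_average(client_updates, clip_norm=1.0, noise_std=0.01):
    """Clip each client update to a maximum L2 norm, add the Gaussian
    noise a client would apply before sending, then average (a plain
    average of raw updates would be standard FedAvg)."""
    noised = []
    for update in client_updates:
        norm = np.linalg.norm(update)
        update = update * min(1.0, clip_norm / max(norm, 1e-12))
        update = update + np.random.normal(0.0, noise_std, update.shape)
        noised.append(update)
    return np.mean(noised, axis=0)

updates = [np.random.randn(4) for _ in range(3)]  # 3 clients' updates
print(dp_clipped_average(updates))
```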

@@ -713,16 +713,16 @@ By sparsely updating layers tailored to the device and task, TinyTrain significa
+:=======================+:=======================================================================+:==========================================================+
| Tiny Training Engine | - On-device training | - Traces forward & backward graphs |
| | - Optimize memory & computation | - Prunes frozen weights |
- | | - Leverage pruning, sparsity, etc | - Interleaves backprop & gradients |
+ | | - Leverage pruning, sparsity, etc. | - Interleaves backprop & gradients |
| | | - Code generation |
+------------------------+------------------------------------------------------------------------+-----------------------------------------------------------+
| TinyTL | - On-device training | - Freezes most weights |
| | - Optimize memory & computation | - Only adapts biases |
- | | - Leverage freezing, sparsity, etc | - Uses residual model |
+ | | - Leverage freezing, sparsity, etc. | - Uses residual model |
+------------------------+------------------------------------------------------------------------+-----------------------------------------------------------+
| TinyTrain | - On-device training | - Meta-training in pretraining |
| | - Optimize memory & computation | - Task-adaptive sparse updating |
- | | - Leverage sparsity, etc | - Selective layer updating |
+ | | - Leverage sparsity, etc. | - Selective layer updating |
+------------------------+------------------------------------------------------------------------+-----------------------------------------------------------+

: Comparison of frameworks for on-device training optimization. {#tbl-framework-comparison .striped .hover}
contents/ops/ops.qmd (2 changes: 1 addition & 1 deletion)

@@ -460,7 +460,7 @@ Project managers play a vital role in MLOps by coordinating the activities betwe
* Facilitating communication through status reports, meetings, workshops, and documentation and enabling seamless collaboration.
* Driving adherence to timelines and budget and escalating anticipated overruns or shortfalls for mitigation.

- For example, a project manager would create a project plan for developing and enhancing a customer churn prediction model. They coordinate between data engineers building data pipelines, data scientists experimenting with models, ML engineers productionalizing models, and DevOps setting up deployment infrastructure. The project manager tracks progress via milestones like dataset preparation, model prototyping, deployment, and monitoring. To enact preventive solutions, they surface any risks, delays, or budget issues.
+ For example, a project manager would create a project plan for developing and enhancing a customer churn prediction model. They coordinate between data engineers building data pipelines, data scientists experimenting with models, ML engineers productizing models, and DevOps setting up deployment infrastructure. The project manager tracks progress via milestones like dataset preparation, model prototyping, deployment, and monitoring. To enact preventive solutions, they surface any risks, delays, or budget issues.

Skilled project managers enable MLOps teams to work synergistically to rapidly deliver maximum business value from ML investments. Their leadership and organization align with diverse teams.
