diff --git a/contents/data_engineering/data_engineering.qmd b/contents/data_engineering/data_engineering.qmd index 6274c503..13732096 100644 --- a/contents/data_engineering/data_engineering.qmd +++ b/contents/data_engineering/data_engineering.qmd @@ -144,7 +144,7 @@ The quality assurance that comes with popular pre-existing datasets is important While platforms like Kaggle and UCI Machine Learning Repository are invaluable resources, it's essential to understand the context in which the data was collected. Researchers should be wary of potential overfitting when using popular datasets, as multiple models might have been trained on them, leading to inflated performance metrics. Sometimes, these [datasets do not reflect the real-world data](https://venturebeat.com/uncategorized/3-big-problems-with-datasets-in-ai-and-machine-learning/). -In addition, bias, validity, and reproducibility issues may exist in these datasets, and there has been a growing awareness of these issues in recent years. Furthermore, using the same dataset to train different models as shown in @fig-misalignment can sometimes create misalignment: training multiple models using the same dataset resultsi in a 'misalignment' between the models and the world, in which an entire ecosystem of models reflects only a narrow subset of the real-world data. +In addition, bias, validity, and reproducibility issues may exist in these datasets, and there has been a growing awareness of these issues in recent years. Furthermore, using the same dataset to train different models as shown in @fig-misalignment can sometimes create misalignment: training multiple models using the same dataset results in a 'misalignment' between the models and the world, in which an entire ecosystem of models reflects only a narrow subset of the real-world data. ![Training different models on the same dataset. Source: (icons from left to right: Becris; Freepik; Freepik; Paul J; SBTS2018).](images/png/dataset_myopia.png){#fig-misalignment} @@ -300,7 +300,7 @@ Data often comes from diverse sources and can be unstructured or semi-structured * Using techniques like dimensionality reduction Data validation serves a broader role than ensuring adherence to certain standards, like preventing temperature values from falling below absolute zero. These issues arise in TinyML because sensors may malfunction or temporarily produce incorrect readings; such transients are not uncommon. Therefore, it is imperative to catch data errors early before propagating through the data pipeline. Rigorous validation processes, including verifying the initial annotation practices, detecting outliers, and handling missing values through techniques like mean imputation, contribute directly to the quality of datasets. This, in turn, impacts the performance, fairness, and safety of the models trained on them. -Let’s take a look at @fig-data-engineering-kws2 for an example of a data processing pipeline. In the context of TinyML, the Multilingual Spoken Words Corpus (MSWC) is an example of data processing pipelines—systematic and automated workflows for data transformation, storage, and processing. The input data (which's a collection of short recordings) goes through sevreral phases of processing, such as audio-word alignemnt and keyword extraction. By streamlining the data flow, from raw data to usable datasets, data pipelines improve productivity and facilitate the rapid development of machine learning models. 
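To make the validation-and-imputation idea concrete, here is a minimal sketch of one pipeline stage, using a hypothetical temperature-sensor table and assuming pandas is available; the column name and readings are invented for illustration:

```python
# A minimal sketch of a data-pipeline stage (hypothetical sensor example):
# validate readings against a physical bound, then mean-impute the gaps --
# the kind of early check described above, before data flows downstream.
import pandas as pd
import numpy as np

raw = pd.DataFrame({"temp_c": [21.4, 22.0, -301.0, np.nan, 23.1]})  # -301 C is impossible

def validate(df):
    # Flag physically impossible values (below absolute zero) as missing.
    return df.assign(temp_c=df["temp_c"].where(df["temp_c"] >= -273.15))

def impute(df):
    # Mean imputation for missing readings.
    return df.fillna({"temp_c": df["temp_c"].mean()})

clean = impute(validate(raw))
print(clean)
```

Real pipelines chain many such stages between the raw inputs and the final dataset.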
The MSWC is an expansive and expanding collection of audio recordings of spoken words in 50 different languages, which are collectively used by over 5 billion people. This dataset is intended for academic study and business uses in areas like keyword identification and speech-based search. It is openly licensed under Creative Commons Attribution 4.0 for broad usage. +Let’s take a look at @fig-data-engineering-kws2 for an example of a data processing pipeline. In the context of TinyML, the Multilingual Spoken Words Corpus (MSWC) is an example of data processing pipelines—systematic and automated workflows for data transformation, storage, and processing. The input data (which is a collection of short recordings) goes through several phases of processing, such as audio-word alignment and keyword extraction. By streamlining the data flow, from raw data to usable datasets, data pipelines improve productivity and facilitate the rapid development of machine learning models. The MSWC is an expansive and expanding collection of audio recordings of spoken words in 50 different languages, which are collectively used by over 5 billion people. This dataset is intended for academic study and business uses in areas like keyword identification and speech-based search. It is openly licensed under Creative Commons Attribution 4.0 for broad usage. ![An overview of the Multilingual Spoken Words Corpus (MSWC) data processing pipeline. Source: @mazumder2021multilingual.](images/png/data_engineering_kws2.png){#fig-data-engineering-kws2} diff --git a/contents/frameworks/frameworks.qmd b/contents/frameworks/frameworks.qmd index 48e89f21..1476b39f 100644 --- a/contents/frameworks/frameworks.qmd +++ b/contents/frameworks/frameworks.qmd @@ -300,7 +300,7 @@ This automatic differentiation is a powerful feature of tensors in frameworks li #### Graph Definition -Computational graphs are a key component of deep learning frameworks like TensorFlow and PyTorch. They allow us to express complex neural network architectures efficiently and differentiatedly. A computational graph consists of a directed acyclic graph (DAG) where each node represents an operation or variable, and edges represent data dependencies between them. +Computational graphs are a key component of deep learning frameworks like TensorFlow and PyTorch. They allow us to express complex neural network architectures efficiently while supporting automatic differentiation. A computational graph consists of a directed acyclic graph (DAG) where each node represents an operation or variable, and edges represent data dependencies between them. It's important to differentiate computational graphs from neural network diagrams, such as those for multilayer perceptrons (MLPs), which depict nodes and layers. Neural network diagrams, as depicted in [Chapter 3](../dl_primer/dl_primer.qmd), visualize the architecture and flow of data through nodes and layers, providing an intuitive understanding of the model's structure. In contrast, computational graphs provide a low-level representation of the underlying mathematical operations and data dependencies required to implement and train these networks. diff --git a/contents/hw_acceleration/hw_acceleration.qmd b/contents/hw_acceleration/hw_acceleration.qmd index dd790bfd..3d9c8af5 100644 --- a/contents/hw_acceleration/hw_acceleration.qmd +++ b/contents/hw_acceleration/hw_acceleration.qmd @@ -88,7 +88,7 @@ For example, GPUs achieve high throughput via massively parallel architectures.
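To make the computational-graph idea above concrete, here is a minimal sketch assuming PyTorch is available; the tensors and values are arbitrary, and the point is only that each operation becomes a node whose `grad_fn` edges record the data dependencies used for automatic differentiation:

```python
# Minimal sketch: autograd records a computational graph as operations run;
# each result's grad_fn links back to the operations that produced it.
import torch

x = torch.tensor(2.0, requires_grad=True)
w = torch.tensor(3.0, requires_grad=True)
b = torch.tensor(1.0, requires_grad=True)

y = w * x + b            # nodes: Mul -> Add; edges: data dependencies
loss = (y - 10.0) ** 2

print(loss.grad_fn)                  # PowBackward0 node at the root of the DAG
print(loss.grad_fn.next_functions)   # edges to the upstream nodes

loss.backward()                      # traverse the DAG in reverse
print(x.grad, w.grad, b.grad)        # gradients accumulated at the leaf variables
```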
#### Managing Silicon Area and Costs -Chip area directly impacts manufacturing cost. Larger die sizes require more materials, lower yields, and higher defect rates. Mulit-die packages help scale designs but add packaging complexity. Silicon area depends on: +Chip area directly impacts manufacturing cost. Larger die sizes require more materials and suffer from lower yields and higher defect rates. Multi-die packages help scale designs but add packaging complexity. Silicon area depends on: * **Computational resources** - e.g., number of cores, memory, caches * **Manufacturing process node** - smaller transistors enable higher density @@ -132,7 +132,7 @@ We then progressively consider more programmable and adaptable architectures, di By structuring the analysis along this spectrum, we aim to illustrate the fundamental tradeoffs between utilization, efficiency, programmability, and flexibility in accelerator design. The optimal balance point depends on the constraints and requirements of the target application. This spectrum perspective provides a framework for reasoning about hardware choices for machine learning and the capabilities required at each level of specialization. -@fig-design-tradeoffs illustrates the complex interplay between flexibility, performance, functional diversity, and area of architecture design. Notice how the ASIC is on the bottom-right corner, with minimal area, flexibility, and power consumption and maximal performance, due to its highly specialized application-specific nature. A key tradeoff is functinoal diversity vs performance: general purpose architechtures can serve diverse applications but their application performance is degraded as compared to more customized architectures. +@fig-design-tradeoffs illustrates the complex interplay between flexibility, performance, functional diversity, and area of architecture design. Notice how the ASIC is on the bottom-right corner, with minimal area, flexibility, and power consumption and maximal performance, due to its highly specialized application-specific nature. A key tradeoff is functional diversity vs. performance: general-purpose architectures can serve diverse applications, but their application performance is degraded compared to more customized architectures. The progression begins with the most specialized option, ASICs purpose-built for AI, to ground our understanding in the maximum possible optimizations before expanding to more generalizable architectures. This structured approach aims to elucidate the accelerator design space. @@ -842,7 +842,7 @@ Intel and IBM are leading commercial efforts in neuromorphic hardware. Intel's L Spiking neural networks (SNNs) [@maass1997networks] are computational models for neuromorphic hardware. Unlike deep neural networks communicating via continuous values, SNNs use discrete spikes that are more akin to biological neurons. This allows efficient event-based computation rather than constant processing. Additionally, SNNs consider the temporal and spatial characteristics of input data. This better mimics biological neural networks, where the timing of neuronal spikes plays an important role. However, training SNNs remains challenging due to the added temporal complexity. @fig-spiking provides an overview of the spiking methodology: (a) Diagram of a neuron; (b) Measuring an action potential propagated along the axon of a neuron.
Only the action potential is detectable along the axon; (c) The neuron's spike is approximated with a binary representation; (d) Event-Driven Processing; (e) Active Pixel Sensor and Dynamic Vision Sensor. -![Neuromoprhic spiking. Source: @eshraghian2023training.](images/png/aimage4.png){#fig-spiking} +![Neuromorphic spiking. Source: @eshraghian2023training.](images/png/aimage4.png){#fig-spiking} You can also watch @vid-snn linked below for a more detailed explanation. diff --git a/contents/ondevice_learning/ondevice_learning.qmd b/contents/ondevice_learning/ondevice_learning.qmd index c6d7fc79..08e3a234 100644 --- a/contents/ondevice_learning/ondevice_learning.qmd +++ b/contents/ondevice_learning/ondevice_learning.qmd @@ -195,7 +195,7 @@ A specific algorithmic technique is Quantization-Aware Scaling (QAS), which impr As we discussed in the Model Optimizations chapter, quantization is the process of mapping a continuous range of values to a discrete set of values. In the context of neural networks, quantization often involves reducing the precision of the weights and activations from 32-bit floating point to lower-precision formats such as 8-bit integers. This reduction in precision can significantly reduce the computational cost and memory footprint of the model, making it suitable for deployment on low-precision hardware. @fig-float-int-quantization is an example of float-to-integer quantization. -![Float to integer qunatization. Source: [Nvidia.](https://developer-blogs.nvidia.com/wp-content/uploads/2021/07/qat-training-precision.png)](images/png/ondevice_quantization_matrix.png){#fig-float-int-quantization} +![Float to integer quantization. Source: [Nvidia.](https://developer-blogs.nvidia.com/wp-content/uploads/2021/07/qat-training-precision.png)](images/png/ondevice_quantization_matrix.png){#fig-float-int-quantization} However, the quantization process can also introduce quantization errors that can degrade the model's performance. Quantization-aware scaling is a technique that aims to minimize these errors by adjusting the scale factors used in the quantization process. @@ -462,7 +462,7 @@ However, we cannot just reduce communication by sending pieces of those gradient ### Optimized Aggregation -In addition to reducing the communication overhead, optimizing the aggregation function can improve model training speed and accuracy in certain federated learning use cases. While the standard for aggregation is just averaging, various other approaches can improve model efficiency, accuracy, and security. One alternative is clipped averaging, which clips the model updates within a specific range. Another strategy to preserve security is differential privacy average aggregation. This approach integrates differential privacy into the aggregations tep to protect client identities. Each client adds a layer of random noise to their updates before communicating to the server. The server then updates the server with the noisy updates, meaning that the amount of noise needs to be tuned carefully to balance privacy and accuracy. +In addition to reducing the communication overhead, optimizing the aggregation function can improve model training speed and accuracy in certain federated learning use cases. While the standard for aggregation is just averaging, various other approaches can improve model efficiency, accuracy, and security. One alternative is clipped averaging, which clips the model updates within a specific range. 
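As a rough sketch of clipped averaging (the clip bound and client updates below are invented for illustration, not taken from any of the cited systems):

```python
# Sketch of clipped federated averaging: each client update has its L2 norm
# clipped to a fixed bound before the server averages the updates.
import numpy as np

def clip_update(update, clip_norm=1.0):
    norm = np.linalg.norm(update)
    if norm > clip_norm:
        update = update * (clip_norm / norm)  # rescale so the norm equals the bound
    return update

def aggregate(client_updates, clip_norm=1.0):
    clipped = [clip_update(u, clip_norm) for u in client_updates]
    return np.mean(clipped, axis=0)           # plain averaging after clipping

rng = np.random.default_rng(0)
updates = [rng.normal(scale=s, size=4) for s in (0.1, 0.5, 5.0)]  # one outsized client update
print(aggregate(updates))
```

Adding calibrated noise to the clipped updates before averaging is essentially what the differential-privacy variant described next does.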
Another strategy to preserve security is differential privacy average aggregation. This approach integrates differential privacy into the aggregation step to protect client identities. Each client adds a layer of random noise to their updates before communicating to the server. The server then updates the server with the noisy updates, meaning that the amount of noise needs to be tuned carefully to balance privacy and accuracy. In addition to security-enhancing aggregation methods, there are several modifications to the aggregation methods that can improve training speed and performance by adding client metadata along with the weight updates. Momentum aggregation is a technique that helps address the convergence problem. In federated learning, client data can be extremely heterogeneous depending on the different environments in which the devices are used. That means that many models with heterogeneous data may need help to converge. Each client stores a momentum term locally, which tracks the pace of change over several updates. With clients communicating this momentum, the server can factor in the rate of change of each update when changing the global model to accelerate convergence. Similarly, weighted aggregation can factor in the client performance or other parameters like device type or network connection strength to adjust the weight with which the server should incorporate the model updates. Further description of specific aggregation algorithms is described by @moshawrab2023reviewing. @@ -713,16 +713,16 @@ By sparsely updating layers tailored to the device and task, TinyTrain significa +:=======================+:=======================================================================+:==========================================================+ | Tiny Training Engine | - On-device training | - Traces forward & backward graphs | | | - Optimize memory & computation | - Prunes frozen weights | -| | - Leverage pruning, sparsity, etc | - Interleaves backprop & gradients | +| | - Leverage pruning, sparsity, etc. | - Interleaves backprop & gradients | | | | - Code generation | +------------------------+------------------------------------------------------------------------+-----------------------------------------------------------+ | TinyTL | - On-device training | - Freezes most weights | | | - Optimize memory & computation | - Only adapts biases | -| | - Leverage freezing, sparsity, etc | - Uses residual model | +| | - Leverage freezing, sparsity, etc. | - Uses residual model | +------------------------+------------------------------------------------------------------------+-----------------------------------------------------------+ | TinyTrain | - On-device training | - Meta-training in pretraining | | | - Optimize memory & computation | - Task-adaptive sparse updating | -| | - Leverage sparsity, etc | - Selective layer updating | +| | - Leverage sparsity, etc. | - Selective layer updating | +------------------------+------------------------------------------------------------------------+-----------------------------------------------------------+ : Comparison of frameworks for on-device training optimization. 
{#tbl-framework-comparison .striped .hover} diff --git a/contents/ops/ops.qmd b/contents/ops/ops.qmd index 8231b570..def2055a 100644 --- a/contents/ops/ops.qmd +++ b/contents/ops/ops.qmd @@ -460,7 +460,7 @@ Project managers play a vital role in MLOps by coordinating the activities betwe * Facilitating communication through status reports, meetings, workshops, and documentation and enabling seamless collaboration. * Driving adherence to timelines and budget and escalating anticipated overruns or shortfalls for mitigation. -For example, a project manager would create a project plan for developing and enhancing a customer churn prediction model. They coordinate between data engineers building data pipelines, data scientists experimenting with models, ML engineers productionalizing models, and DevOps setting up deployment infrastructure. The project manager tracks progress via milestones like dataset preparation, model prototyping, deployment, and monitoring. To enact preventive solutions, they surface any risks, delays, or budget issues. +For example, a project manager would create a project plan for developing and enhancing a customer churn prediction model. They coordinate between data engineers building data pipelines, data scientists experimenting with models, ML engineers productizing models, and DevOps setting up deployment infrastructure. The project manager tracks progress via milestones like dataset preparation, model prototyping, deployment, and monitoring. To enact preventive solutions, they surface any risks, delays, or budget issues. Skilled project managers enable MLOps teams to work synergistically to rapidly deliver maximum business value from ML investments. Their leadership and organization align with diverse teams. diff --git a/contents/optimizations/optimizations.qmd b/contents/optimizations/optimizations.qmd index 27b53917..0ed26957 100644 --- a/contents/optimizations/optimizations.qmd +++ b/contents/optimizations/optimizations.qmd @@ -85,7 +85,7 @@ With **channel** pruning, which is predominantly applied in convolutional neural Finally, **layer** pruning takes a more aggressive approach by removing entire layers of the network. This significantly reduces the network's depth and thereby its capacity to model complex patterns and hierarchies in the data. This approach necessitates a careful balance to ensure that the model's predictive capability is not unduly compromised. -@fig-channel-layer-pruning demonstrates the difference between channel/filter wise pruning and layer pruning. When we prune a channel, we have to reconfigure the model's architecture in order to adapt to the structural changes. One adjustment is changing the number of input channels in the subsequent layer (here, the third and deepest layer): changing the depths of the filters that are applied to the layer with the pruned channel. On the other hand, pruning an entire layer (removing all the channels in the layer) requires more drastic adjustements. The main one involves modifying the connections between the remaining layers to replace or bypass the pruned layer. In our case, we reconfigure to connect the first and last layers. In all pruning cases, we have to fine-tune the new structure to adjust the weights. +@fig-channel-layer-pruning demonstrates the difference between channel/filter wise pruning and layer pruning. When we prune a channel, we have to reconfigure the model's architecture in order to adapt to the structural changes. 
One adjustment is changing the number of input channels in the subsequent layer (here, the third and deepest layer): changing the depths of the filters that are applied to the layer with the pruned channel. On the other hand, pruning an entire layer (removing all the channels in the layer) requires more drastic adjustments. The main one involves modifying the connections between the remaining layers to replace or bypass the pruned layer. In our case, we reconfigure to connect the first and last layers. In all pruning cases, we have to fine-tune the new structure to adjust the weights. ![Channel vs layer pruning.](images/jpg/modeloptimization_channel_layer_pruning.jpeg){#fig-channel-layer-pruning} @@ -110,7 +110,7 @@ The pruning strategy orchestrates how structures are removed and integrates with **Iterative pruning** gradually removes structures across multiple cycles of pruning followed by fine-tuning. In each cycle, a small set of structures are pruned based on importance criteria. The model is then fine-tuned, allowing it to adjust smoothly to the structural changes before the next pruning iteration. This gradual, cyclic approach prevents abrupt accuracy drops. It allows the model to slowly adapt as structures are reduced across iterations. -Consider a situation where we wish to prune the 6 least effective channels (based on some specific critera) from a convolutional neural network. In @fig-iterative-pruning, we show a simplified pruning process carried over 3 iterations. In every iteration, we only prune 2 channels. Removing the channels results in accuracy degradation. In the first iteration, the accuracy drops from 0.995 to 0.971. However, after we fine-tune the model on the new structure, we are able to recover from the performance loss, bringing the accuracy up to 0.992. Since the structural changes are minor and gradual, the network can more easily adapt to them. Running the same process 2 more times, we end up with a final accuracy of 0.991 (a loss of only 0.4% from the original) and 27% decrease in the number of channels. Thus, iterative pruning enables us to maintain performance while benefiting from increased computational efficiency due to the decreased model size. +Consider a situation where we wish to prune the 6 least effective channels (based on some specific criteria) from a convolutional neural network. In @fig-iterative-pruning, we show a simplified pruning process carried over 3 iterations. In every iteration, we only prune 2 channels. Removing the channels results in accuracy degradation. In the first iteration, the accuracy drops from 0.995 to 0.971. However, after we fine-tune the model on the new structure, we are able to recover from the performance loss, bringing the accuracy up to 0.992. Since the structural changes are minor and gradual, the network can more easily adapt to them. Running the same process 2 more times, we end up with a final accuracy of 0.991 (a loss of only 0.4% from the original) and 27% decrease in the number of channels. Thus, iterative pruning enables us to maintain performance while benefiting from increased computational efficiency due to the decreased model size. ![Iterative pruning.](images/jpg/modeloptimization_iterative_pruning.jpeg){#fig-iterative-pruning} @@ -169,7 +169,7 @@ Unstructured pruning, while offering the potential for significant model size re : Comparison of structured versus unstructured pruning. 
{#tbl-pruning_methods .striped .hover} -In @fig-structured-unstructured we have exapmles that illustrate the differences between unstructured and structured pruning. Observe that unstructured pruning can lead to models that no longer obey high-level structural guaruntees of their original unpruned counterparts: the left network is no longer a fully connected network after pruning. Structured pruning on the other hand maintains those invariants: in the middle, the fully connected network is pruned in a way that the pruned network is still fully connected; likewise, the CNN maintains its convolutional structure, albeit with fewer filters. +In @fig-structured-unstructured we have examples that illustrate the differences between unstructured and structured pruning. Observe that unstructured pruning can lead to models that no longer obey high-level structural guarantees of their original unpruned counterparts: the left network is no longer a fully connected network after pruning. Structured pruning, on the other hand, maintains those invariants: in the middle, the fully connected network is pruned in a way that the pruned network is still fully connected; likewise, the CNN maintains its convolutional structure, albeit with fewer filters. ![Unstructured vs structured pruning. Source: @qi2021efficient.](images/png/modeloptimization_pruning_comparison.png){#fig-structured-unstructured} @@ -181,7 +181,7 @@ A breakthrough finding that catalyzed this evolution was the [lottery ticket hyp The intuition behind this hypothesis is that, during the training process of a neural network, many neurons and connections become redundant or unimportant, particularly with the inclusion of training techniques encouraging redundancy like dropout. Identifying, pruning out, and initializing these "winning tickets'' allows for faster training and more efficient models, as they contain the essential model decision information for the task. Furthermore, as generally known with the bias-variance tradeoff theory, these tickets suffer less from overparameterization and thus generalize better rather than overfitting to the task. -In @fig-lottery-ticket-hypothesis we have an example experiment showing pruning and training experiments on a fully connected LeNet over a variety of pruning ratios. In the left plot, notice how heavy pruning reveals a more efifcient subnetwork (in green) that is 21.1% the size of the original network (in blue), The subnetwork achieves higher accuracy and in a faster manner than the unpruned version (green line is above the blue line). However, pruning has a limit (sweet spot), and further pruning will produce performance degredations and eventually drop below the unpruned version's performance (notice how the red, purple, and brown subnetworks gradually drop in accuracy performance) due to the significant loss in the number of parameters. +In @fig-lottery-ticket-hypothesis we have an example experiment showing pruning and training experiments on a fully connected LeNet over a variety of pruning ratios. In the left plot, notice how heavy pruning reveals a more efficient subnetwork (in green) that is 21.1% the size of the original network (in blue). The subnetwork achieves higher accuracy, and reaches it faster, than the unpruned version (green line is above the blue line).
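A schematic sketch of the iterative magnitude-pruning loop described earlier, in the spirit of how such sparse subnetworks are typically uncovered; the weight matrix, pruning fraction, and stand-in fine-tuning step are illustrative rather than the actual experiment in the figure:

```python
# Schematic iterative magnitude pruning: prune a small fraction of the smallest
# remaining weights, "fine-tune", and repeat, rather than pruning all at once.
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=(8, 8))
mask = np.ones_like(weights)

def prune_smallest(weights, mask, fraction=0.2):
    # Zero out the smallest surviving weights by magnitude.
    alive = np.abs(weights[mask == 1])
    threshold = np.quantile(alive, fraction)
    mask = np.where(np.abs(weights) < threshold, 0.0, mask)
    return weights * mask, mask

def fine_tune(weights, mask):
    # Placeholder for a few training steps on the surviving weights only.
    return weights + 0.01 * rng.normal(size=weights.shape) * mask

for step in range(3):                      # three prune / fine-tune cycles
    weights, mask = prune_smallest(weights, mask)
    weights = fine_tune(weights, mask)
    print(f"cycle {step}: sparsity = {1 - mask.mean():.2f}")
```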
However, pruning has a limit (sweet spot), and further pruning will produce performance degradations and eventually drop below the unpruned version's performance (notice how the red, purple, and brown subnetworks gradually drop in accuracy performance) due to the significant loss in the number of parameters. ![Lottery ticket hypothesis experiments.](images/png/modeloptimization_lottery_ticket_hypothesis.png){#fig-lottery-ticket-hypothesis} @@ -699,7 +699,7 @@ Activation Quantization: Involves quantizing the activation values (outputs of l Quantization invariably introduces a trade-off between model size/performance and accuracy. While it significantly reduces the memory footprint and can accelerate inference, especially on hardware optimized for low-precision arithmetic, the reduced precision can degrade model accuracy. -Model Size: A model with weights represented as Float32 being quantized to INT8 can theoretically reduce the model size by a factor of 4, enabling it to be deployed on devices with limited memory. The model size of large language models is developing at a faster pace than the GPU memory in recent years, leading to a big gap between the supply and demand for memory. @fig-model-size-pace illustrates the recent trend of the widening gap between model size (red line) and acceleartor memory (yellow line). Quantization and model compression techniques can help bridge the gap +Model Size: A model with weights represented as Float32 being quantized to INT8 can theoretically reduce the model size by a factor of 4, enabling it to be deployed on devices with limited memory. The model size of large language models has been growing at a faster pace than GPU memory in recent years, leading to a big gap between the supply and demand for memory. @fig-model-size-pace illustrates the recent trend of the widening gap between model size (red line) and accelerator memory (yellow line). Quantization and model compression techniques can help bridge the gap. ![Model size vs. accelerator memory. Source: @xiao2022smoothquant.](images/png/efficientnumerics_modelsizes.png){#fig-model-size-pace} @@ -744,7 +744,7 @@ Efficient hardware implementation transcends the selection of suitable component Focusing only on the accuracy when performing Neural Architecture Search leads to models that are exponentially complex and require increasing memory and compute. This has led to hardware constraints limiting the exploitation of deep learning models at their full potential. Manually designing the architecture of the model is even harder when considering the hardware variety and limitations. This has led to the creation of Hardware-aware Neural Architecture Search (HW-NAS), which incorporates the hardware constraints into the search and optimizes the search space for a specific hardware and accuracy target. HW-NAS can be categorized based on how it optimizes for hardware. We will briefly explore these categories and leave links to related papers for the interested reader. -#### Single Target, Fixed Platfrom Configuration +#### Single Target, Fixed Platform Configuration The goal here is to find the best architecture in terms of accuracy and hardware efficiency for one fixed target hardware. For a specific hardware, the Arduino Nicla Vision for example, this category of HW-NAS will look for the architecture that optimizes accuracy, latency, energy consumption, etc.
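As a toy illustration of this single-target setting, the sketch below frames HW-NAS as constrained random search; the architecture knobs, latency model, accuracy proxy, and budget are all made-up placeholders, since a real search would profile candidates on the actual device:

```python
# Toy single-target HW-NAS as constrained random search over a small design space.
import random

random.seed(0)
LATENCY_BUDGET_MS = 50.0   # assumed budget for the fixed target device

def sample_architecture():
    return {"width": random.choice([0.25, 0.5, 1.0]),
            "resolution": random.choice([96, 128, 160])}

def estimated_latency_ms(arch):
    # Stand-in cost model; in practice this comes from on-device measurement.
    return 40.0 * arch["width"] * (arch["resolution"] / 128) ** 2

def accuracy_proxy(arch):
    # Stand-in proxy; real HW-NAS trains candidates or uses learned predictors.
    return arch["width"] * arch["resolution"]

candidates = [sample_architecture() for _ in range(20)]
feasible = [a for a in candidates if estimated_latency_ms(a) <= LATENCY_BUDGET_MS]
best = max(feasible, key=accuracy_proxy)
print(best, estimated_latency_ms(best))
```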
@@ -758,11 +758,11 @@ Here, the search space is restricted to the architectures that perform well on t #### Single Target, Multiple Platform Configurations -Some hardwares may have different configurations. For example, FPGAs have Configurable Logic Blocks (CLBs) that can be configured by the firmware. This method allows for the HW-NAS to explore different configurations. [@jiang2019accuracy; @yang2020coexploration] +Some hardware may have different configurations. For example, FPGAs have Configurable Logic Blocks (CLBs) that can be configured by the firmware. This method allows the HW-NAS to explore different configurations. [@jiang2019accuracy; @yang2020coexploration] #### Multiple Targets -This category aims at optimizing a single model for multiple hardwares. This can be helpful for mobile devices development as it can optimize to different phones models. [@chu2021discovering; @jiang2019accuracy] +This category aims at optimizing a single model for multiple hardware targets. This can be helpful for mobile device development, as it can optimize a model for different phone models. [@chu2021discovering; @jiang2019accuracy] #### Examples of Hardware-Aware Neural Architecture Search TinyNAS adopts a two-stage approach to finding an optimal architecture for a model with the constraints of the specific microcontroller in mind. -First, TinyNAS generate multiple search spaces by varying the input resolution of the model, and the number of channels of the layers of the model. Then, TinyNAS chooses a search space based on the FLOPs (Floating Point Operations Per Second) of each search space. Spaces with a higher probability of containiung architectures with a large number of FLOPs yields models with higher accuracies - compare the red line vs. the black line in @fig-search-space-flops. Since a higher number FLOPs means the model has a higher computational capacity, the model is more likely to have a higher accuracy. +First, TinyNAS generates multiple search spaces by varying the input resolution of the model and the number of channels of the layers of the model. Then, TinyNAS chooses a search space based on the FLOPs (Floating Point Operations Per Second) of each search space. Spaces with a higher probability of containing architectures with a large number of FLOPs yield models with higher accuracies; compare the red line vs. the black line in @fig-search-space-flops. Since a higher number of FLOPs means the model has a higher computational capacity, the model is more likely to have a higher accuracy. Then, TinyNAS performs a search operation on the chosen space to find the optimal architecture for the specific constraints of the microcontroller. [@lin2020mcunet] @@ -782,11 +782,11 @@ Focuses on creating and optimizing a search space that aligns with the hardware ### Challenges of Hardware-Aware Neural Architecture Search -While HW-NAS carries high potential for finding optimal architectures for TinyML, it comes with some challenges. Hardware Metrics like latency, energy consumption and hardware utilization are harder to evaluate than the metrics of accuracy or loss. They often require specilized tools for precise measurements. Moreover, adding all these metrics leads to a much bigger search space. This leads to HW-NAS being time-consuming and expensive.
It has to be applied to every hardware for optimal results, moreover, meaning that if one needs to deploy the model on multiple devices, the search has to be conducted multiple times and will result in different models, unless optimizing for all of them which means less accuracy. Finally, hardware changes frequently, and HW-NAS may need to be conducted on each version. +While HW-NAS carries high potential for finding optimal architectures for TinyML, it comes with some challenges. Hardware metrics like latency, energy consumption and hardware utilization are harder to evaluate than the metrics of accuracy or loss. They often require specialized tools for precise measurements. Moreover, adding all these metrics leads to a much bigger search space. This leads to HW-NAS being time-consuming and expensive. In addition, it has to be applied to every target hardware for optimal results, meaning that if one needs to deploy the model on multiple devices, the search has to be conducted multiple times and will result in different models, unless one optimizes for all of them at once, which means less accuracy. Finally, hardware changes frequently, and HW-NAS may need to be conducted on each version. ### Kernel Optimizations -Kernel Optimizations are modifications made to the kernel to improve the performance of machine learning models onf resource-constrained devices. We will separate kernel optimizations into two types. +Kernel Optimizations are modifications made to the kernel to improve the performance of machine learning models on resource-constrained devices. We will separate kernel optimizations into two types. #### General Kernel Optimizations @@ -794,7 +794,7 @@ These are kernel optimizations that all devices can benefit from. They provide t ##### Loop unrolling -Instead of having a loop with loop control (incrementing the loop counter, checking the loop termination condition) the loop can be unrolled and the overhead of loop control can be omitted. This may also provide additional opportunities for parallelism that may not be possible with the loop structure. This can be particularly beneficial for tight loops, where the boy of the loop is a small number of instructions with a lot of iterations. +Instead of having a loop with loop control (incrementing the loop counter, checking the loop termination condition) the loop can be unrolled and the overhead of loop control can be omitted. This may also provide additional opportunities for parallelism that may not be possible with the loop structure. This can be particularly beneficial for tight loops, where the body of the loop is a small number of instructions with a lot of iterations. ##### Blocking @@ -802,11 +802,11 @@ Blocking is used to make memory access patterns more efficient. If we have three ##### Tiling -Similarly to blocking, tiling divides data and computation into chunks, but extends beyond cache improvements. Tiling creates independent partitions of computation that can be run in parallel, which can result in significant performance improvements.: +Similarly to blocking, tiling divides data and computation into chunks, but extends beyond cache improvements. Tiling creates independent partitions of computation that can be run in parallel, which can result in significant performance improvements. ##### Optimized Kernel Libraries -This comprises developing optimized kernels that take full advantage of a specific hardware.
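To make the blocking and tiling ideas above concrete, here is a minimal NumPy sketch of a tiled matrix multiply; the tile size and matrix shapes are arbitrary, and in pure Python this only illustrates the access pattern, since the real gains come when compiled kernels apply the same idea to keep tiles in fast memory:

```python
# Tiled (blocked) matrix multiply: operate on one small block at a time so each
# block of A, B, and C can stay resident in fast memory while it is reused.
import numpy as np

def tiled_matmul(A, B, tile=32):
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m), dtype=A.dtype)
    for i in range(0, n, tile):
        for j in range(0, m, tile):
            for p in range(0, k, tile):
                # Accumulate the contribution of one (tile x tile) block.
                C[i:i+tile, j:j+tile] += A[i:i+tile, p:p+tile] @ B[p:p+tile, j:j+tile]
    return C

A = np.random.rand(96, 64).astype(np.float32)
B = np.random.rand(64, 80).astype(np.float32)
assert np.allclose(tiled_matmul(A, B), A @ B, atol=1e-4)  # same result as the untiled product
```

Hand-tuned kernel libraries push this much further with processor-specific instructions.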
One example is the CMSIS-NN library, which is a collection of efficient neural network kernels developed to optimize the performance and minimize the memory footprint of models on Arm Cortex-M processors, which are common on IoT edge devices. The kernel leverage multiple hardware capabilities of Cortex-M processors like Single Instruction Multple Data (SIMD), Floating Point Units (FPUs) and M-Profile Vector Extensions (MVE). These optimization make common operations like matrix multiplications more efficient, boosting the performance of model operations on Cortex-M processors. [@lai2018cmsisnn] +This comprises developing optimized kernels that take full advantage of a specific hardware. One example is the CMSIS-NN library, which is a collection of efficient neural network kernels developed to optimize the performance and minimize the memory footprint of models on Arm Cortex-M processors, which are common on IoT edge devices. The kernels leverage multiple hardware capabilities of Cortex-M processors like Single Instruction Multiple Data (SIMD), Floating Point Units (FPUs) and M-Profile Vector Extensions (MVE). These optimizations make common operations like matrix multiplications more efficient, boosting the performance of model operations on Cortex-M processors. [@lai2018cmsisnn] ### Compute-in-Memory (CiM) @@ -818,11 +818,11 @@ Through algorithm-hardware co-design, the algorithms can be optimized to leverag ### Memory Access Optimization -Different devices may have different memory hierarchies. Optimizing for the specific memory hierarchy in the specific hardware can lead to great performance improvements by reducing the costly operations of reading and writing to memory. Dataflow optimization can be achieved by optimizing for reusing data within a single layer and across multiple layers. This dataflow optimization can be tailored to the specific memory hierarchy of the hardware, which can lead to greater benefits than general optimizations for different hardwares. +Different devices may have different memory hierarchies. Optimizing for the specific memory hierarchy in the specific hardware can lead to great performance improvements by reducing the costly operations of reading and writing to memory. Dataflow optimization can be achieved by optimizing for reusing data within a single layer and across multiple layers. This dataflow optimization can be tailored to the specific memory hierarchy of the hardware, which can lead to greater benefits than general optimizations for different hardware. #### Leveraging Sparsity -Pruning is a fundamental approach to compress models to make them compatible with resource constrained devices. This results in sparse models where a lot of weights are 0's. Therefore, leveraging this sparsity can lead to significant improvements in performance. Tools were created to achieve exactly this. RAMAN, is a sparseTinyML accelerator designed for inference on edge devices. RAMAN overlap input and output activations on the same memory space, reducing storage requirements by up to 50%. [@krishna2023raman] +Pruning is a fundamental approach to compress models to make them compatible with resource constrained devices. This results in sparse models where many of the weights are zero. Therefore, leveraging this sparsity can lead to significant improvements in performance. Tools were created to achieve exactly this. RAMAN is a sparse TinyML accelerator designed for inference on edge devices.
RAMAN overlaps input and output activations on the same memory space, reducing storage requirements by up to 50%. [@krishna2023raman] #### Optimization Frameworks @@ -937,7 +937,7 @@ TensorFlow Lite - TensorFlow's platform to convert models to a lightweight forma ONNX Runtime - Performs model conversion and inference for models in the open ONNX model format. Provides optimized kernels, supports hardware accelerators like GPUs, and cross-platform deployment from cloud to edge. Allows framework-agnostic deployment. @fig-interop is an ONNX interoperability map, including major popular frameworks. -![Interoperablily of ONNX. Source: [TowardsDataScience](https://towardsdatascience.com/onnx-preventing-framework-lock-in-9a798fb34c92).](https://miro.medium.com/v2/resize:fit:1400/1*3N6uPaLNEYDjtWBW1vdNoQ.jpeg){#fig-interop} +![Interoperability of ONNX. Source: [TowardsDataScience](https://towardsdatascience.com/onnx-preventing-framework-lock-in-9a798fb34c92).](https://miro.medium.com/v2/resize:fit:1400/1*3N6uPaLNEYDjtWBW1vdNoQ.jpeg){#fig-interop} PyTorch Mobile - Enables PyTorch models to be run on iOS and Android by converting to mobile-optimized representations. Provides efficient mobile implementations of ops like convolution and special functions optimized for mobile hardware. diff --git a/contents/privacy_security/privacy_security.qmd b/contents/privacy_security/privacy_security.qmd index 5da5d0a2..696e6046 100644 --- a/contents/privacy_security/privacy_security.qmd +++ b/contents/privacy_security/privacy_security.qmd @@ -494,7 +494,7 @@ The fundamentals of TEEs contain four main parts: * **Isolated Execution:** Code within a TEE runs in a separate environment from the device's main operating system. This isolation protects the code from unauthorized access by other applications. -* **Secure Storage:** TEEs can securely store cryptographic keys,authentication tokens, and sensitive data, preventing access by regular applications running outside the TEE. +* **Secure Storage:** TEEs can securely store cryptographic keys, authentication tokens, and sensitive data, preventing access by regular applications running outside the TEE. * **Integrity Protection:** TEEs can verify the integrity of code and data, ensuring that they have not been altered before execution or during storage. @@ -948,7 +948,7 @@ Homomorphic encryption enables machine learning model training and inference on Homomorphic encryption thwarts attacks like model extraction and membership inference that could expose private data used in ML workflows. It provides an alternative to TEEs using hardware enclaves for confidential computing. However, current schemes have high computational overheads and algorithmic limitations that constrain real-world applications. -Homomorphic encryption realizes the decades-old vision of secure multipartymultiparty computation by allowing computation on ciphertexts. Conceptualized in the 1970s, the first fully homomorphic cryptosystems emerged in 2009, enabling arbitrary computations. Ongoing research is making these techniques more efficient and practical. +Homomorphic encryption realizes the decades-old vision of secure multiparty computation by allowing computation on ciphertexts. The idea was conceptualized in the 1970s, but the first fully homomorphic cryptosystems emerged only in 2009, enabling arbitrary computations. Ongoing research is making these techniques more efficient and practical. Homomorphic encryption shows great promise in enabling privacy-preserving machine learning under emerging data regulations.
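As a toy illustration of computing on ciphertexts, the sketch below implements a Paillier-style additively homomorphic scheme with tiny, insecure parameters chosen only for readability; it is not the fully homomorphic machinery discussed above, and real systems rely on vetted libraries and much larger keys:

```python
# Toy additively homomorphic encryption (Paillier-style), for illustration only.
# Hard-coded tiny primes -- NOT secure; real deployments use vetted libraries.
from math import gcd

p, q = 293, 433                 # toy primes
n = p * q                       # public modulus
n2 = n * n
g = n + 1                       # standard generator choice
lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)   # lcm(p-1, q-1)

def L(u):
    return (u - 1) // n

mu = pow(L(pow(g, lam, n2)), -1, n)   # modular inverse (Python 3.8+)

def encrypt(m, r):
    # c = g^m * r^n mod n^2, with r coprime to n
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    return (L(pow(c, lam, n2)) * mu) % n

c1 = encrypt(42, r=5123)
c2 = encrypt(58, r=911)
c_sum = (c1 * c2) % n2          # multiplying ciphertexts adds the plaintexts
assert decrypt(c_sum) == 100    # 42 + 58, computed without decrypting the inputs
print(decrypt(c_sum))
```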
However, given constraints, one should carefully evaluate its applicability against other confidential computing approaches. Extensive resources exist to explore homomorphic encryption and track progress in easing adoption barriers. @@ -990,7 +990,7 @@ Ready to unlock the power of encrypted computation? Homomorphic encryption is li ::: -### Secure MultipartyMultiparty Communication +### Secure Multiparty Communication #### Core Idea