diff --git a/article_distributed_machine_learning/README.md b/article_distributed_machine_learning/README.md new file mode 100644 index 0000000..c92607d --- /dev/null +++ b/article_distributed_machine_learning/README.md @@ -0,0 +1,4 @@ +# Distributed Machine Learning + +## Abstract +Distributed Machine Learning (DML) has become a critical paradigm in addressing the computational and scalability challenges of modern AI systems. This article provides an in-depth exploration of the motivations, architectures, and techniques behind DML, emphasizing how distributing computation across multiple nodes enables training on large datasets and complex models efficiently. Key architectural designs are detailed, including data and model parallelism, federated learning, and parameter server frameworks, supported by empirical findings and case studies from recent literature. The article also highlights challenges such as communication overhead, data heterogeneity, and system fault tolerance, proposing engineering solutions and discussing experimental validations. Trends like edge computing, privacy-preserving techniques, and the use of reinforcement learning for optimization in DML systems are analyzed to project the field’s future trajectory. Visual diagrams and practical design considerations are incorporated to guide professionals and researchers in implementing robust, scalable DML systems. Through a combination of theoretical insights and applied knowledge, this article aims to serve as a foundational resource for advancing research and practice in distributed machine learning. \ No newline at end of file diff --git a/article_distributed_machine_learning/article.md b/article_distributed_machine_learning/article.md new file mode 100644 index 0000000..6d3be1a --- /dev/null +++ b/article_distributed_machine_learning/article.md @@ -0,0 +1,158 @@ +# Distributed Machine Learning + +## Abstract +Distributed Machine Learning (DML) has become a critical paradigm in addressing the computational and scalability challenges of modern AI systems. This article provides an in-depth exploration of the motivations, architectures, and techniques behind DML, emphasizing how distributing computation across multiple nodes enables training on large datasets and complex models efficiently. Key architectural designs are detailed, including data and model parallelism, federated learning, and parameter server frameworks, supported by empirical findings and case studies from recent literature. The article also highlights challenges such as communication overhead, data heterogeneity, and system fault tolerance, proposing engineering solutions and discussing experimental validations. Trends like edge computing, privacy-preserving techniques, and the use of reinforcement learning for optimization in DML systems are analyzed to project the field’s future trajectory. Visual diagrams and practical design considerations are incorporated to guide professionals and researchers in implementing robust, scalable DML systems. Through a combination of theoretical insights and applied knowledge, this article aims to serve as a foundational resource for advancing research and practice in distributed machine learning. + +## Introduction +Distributed machine learning (DML) refers to training ML models across multiple machines or processors to handle very large datasets and models. In DML, data and computation are split over a cluster rather than a single machine​. 
This approach overcomes single-node memory/compute limits: one cluster can train a model on terabytes of data that would not fit on a single machine. As one survey notes, “one machine’s storage and computation capabilities are insufficient” for very large-scale training, so DML divides the learning across several workstations to achieve scalability. In practice, distributed training drastically accelerates learning – large models that might otherwise take weeks on one machine can be trained in days or hours when scaled across many GPUs or CPUs. For example, Google’s BERT and OpenAI’s GPT-3 (175B parameters) were made possible only by distributing training (GPT-3 on ~1024 GPUs in ~1 month).
+
+## Key Components
+Distributed ML systems rely on three main ingredients: how data is partitioned, how the model is distributed, and how nodes communicate and synchronize.
+
+* **Data distribution.** In DML, the input dataset is split into shards that reside on different workers or machines. A good partition strategy is needed so each node has roughly equal work and the overall data is covered. However, splitting data raises challenges: ensuring data consistency and locality, moving data between nodes efficiently, and avoiding duplication or starvation. For example, one review notes that choosing an effective partition and keeping data consistent “may become challenging” as you scale DML systems.
+
+* **Model distribution.** Similarly, the model parameters or structure must be handled across nodes. In data-parallel setups, each worker has a full replica of the model; in model-parallel setups, a large model is split into parts on different devices. Managing the model distribution requires preserving model integrity and correctness when pieces are trained separately. For data parallelism, each node has a copy of the neural network and computes gradients locally. For model parallelism, parts of the network are “placed” on different GPUs, which requires carefully splitting layers or tensors. In either case, the system must synchronize parameter updates so all workers stay consistent. As one article observes: “the model must be spread among several computational nodes… preserving model integrity, synchronizing model changes, and reducing communication overhead may become challenging.”
+
+* **Communication and synchronization.** To combine the work from all nodes, gradients or parameters must be exchanged. A synchronization step typically aggregates updates from all workers and broadcasts the new model state. Two broad architectures exist: a parameter server or a decentralized (All-Reduce) approach. In a parameter server, a central node (or nodes) holds the global model and workers push gradients to it and pull updated weights. In a decentralized scheme (e.g. ring-AllReduce), workers communicate peer-to-peer to average gradients without a central server. Synchronization can be synchronous (all nodes wait and update together each step) or asynchronous (workers update the model on independent schedules). Both approaches have trade-offs in staleness and throughput. In all cases, network communication is a major concern: minimizing communication cost and latency is critical. As one summary states, communication overhead is often the bottleneck in large-scale training. Ensuring efficient collective communication (e.g. using optimized libraries like NCCL) and overlapping compute with communication are key design concerns. A minimal sketch of the synchronous aggregation pattern follows this list.
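
To make these pieces concrete, here is a minimal, framework-agnostic sketch (in plain NumPy) of one synchronous data-parallel step: the dataset is partitioned into shards, each simulated worker computes a gradient on its own shard, the gradients are averaged, and every replica applies the same update. The toy regression problem, worker count, and learning rate are illustrative choices, not taken from any particular system.

```
import numpy as np

# Toy regression problem standing in for a real training workload.
rng = np.random.default_rng(0)
X = rng.normal(size=(1024, 10))
true_w = rng.normal(size=10)
y = X @ true_w + 0.01 * rng.normal(size=1024)

num_workers = 4                                            # simulated cluster size
shards = np.array_split(np.arange(len(X)), num_workers)    # data partitioning
w = np.zeros(10)                                           # model replica (kept identical everywhere)
lr = 0.1

def local_gradient(w, idx):
    """Mean-squared-error gradient computed on one worker's shard."""
    Xi, yi = X[idx], y[idx]
    return (2.0 / len(idx)) * Xi.T @ (Xi @ w - yi)

for step in range(100):
    # Each worker computes a gradient on its own shard (concurrently in a real system).
    grads = [local_gradient(w, idx) for idx in shards]
    # Synchronization: average the gradients; this is the job of a parameter server
    # or an all-reduce collective in a real cluster.
    g = np.mean(grads, axis=0)
    # "Broadcast": every replica applies the same update, so all copies stay consistent.
    w -= lr * g

print("parameter error:", np.linalg.norm(w - true_w))
```

In a real deployment the averaging step is where the parameter-server or all-reduce machinery (and most of the communication cost) lives; everything else runs locally on each worker.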
+
+## Types of Distributed ML
+
+### Data Parallelism
+In data parallelism, the model is replicated on each worker, and the training data is split into different batches. Each node processes its batch through the model copy and computes gradients independently. Periodically, the gradients from all workers are aggregated (via sum or average) to update a shared model. This can be done with a central server or with All-Reduce operations. Data parallelism is straightforward and scales well for large datasets: it allows training on data that cannot fit on one machine and can increase overall throughput through parallelism. The trade-off is synchronization overhead – aggregating gradients requires communication. For example, Dong et al. showed that compressing gradients (a kind of communication optimization) can reduce training time substantially: using “natural compression” reduced ResNet-110 training time by 26% and AlexNet by 66% versus no compression.
+
+![Data parallelism](images/image.png)
+_Figure 1: Illustration of data parallelism_
+
+Figure 1 shows how the input data is split into multiple shards (Data 1, Data 2, ...). Each shard is processed by a separate worker (with a full copy of the model), and the resulting parameter updates (gradients) are combined to produce the trained model. This allows parallel processing of large datasets that cannot fit on a single device.
+
+### Model Parallelism
+Model parallelism tackles cases where the model itself is too large for one device. Here the neural network’s parameters are split across nodes, and each worker handles only part of the computation. For example, layers 1–5 might reside on GPU A and layers 6–10 on GPU B. An input is then pipelined through the segments: GPU A processes the first part, sends the intermediate activations to GPU B, and so on. This lets very large models (billions of parameters) be trained despite per-device memory limits. The major challenge is finding an optimal partition: each model architecture is different, so deciding how to slice a model to minimize inter-device communication is non-trivial.
+
+![Model parallelism](images/image-1.png)
+_Figure 2: Illustration of model parallelism_
+
+Figure 2 shows how a single input is sent through different parts of a large model residing on separate workers. Each worker computes its portion of the model on the input data, and the intermediate results are combined to produce the final output. This allows training models too large to fit on one device.
+
+### Hybrid Approaches
+
+Modern large-scale training often combines parallelism strategies. Hybrid parallelism mixes data and model parallelism (and sometimes pipelining) to exploit strengths of each. For example, one system might shard data across worker groups (data parallelism) and within each group split the model across GPUs (model parallelism). Such hybrids aim to minimize communication while still leveraging many nodes. Recent research demonstrates big gains from hybrids: a “Heterogeneous Parameter Server” framework that combined data-parallel with pipeline-parallel strategies achieved 57.9% higher throughput than a baseline system, and was 14.5× faster in throughput than standard TensorFlow. Similarly, Song et al.’s HYPAR algorithm optimally partitions tensors and reported a 3.39× performance gain (vs. pure data-parallel) while improving energy efficiency. Experiments on CNNs have shown hybrid schemes can more than double speed: one team saw a 2.2× speedup using hybrid parallelism on a large CNN.
Adding pipeline parallelism into the mix can further improve utilization. For example, PipeDream overlapped forward/backward passes across layers and achieved a 5.3× speedup, while adding weight-prediction techniques on 4 GPUs yielded an 8.9× speedup compared to vanilla data-parallel training. In general, hybrid methods are an active area: by carefully scheduling computation, overlapping communication, and combining data/model splits, they squeeze out maximum efficiency from modern GPU clusters.
+
+## Architectures and Frameworks
+Distributed ML relies on networked architectures and specialized software frameworks. Two common architectures are Parameter Server and All-Reduce:
+
+* **Parameter Server (PS).** In this design, one or more servers hold the global model parameters in memory, and worker nodes compute gradients on data shards. Workers push their gradients to the PS, which aggregates and updates the model. The PS then broadcasts updated parameters back to workers. The parameter-server approach was popularized by Google’s DistBelief system. A simple depiction (from a Cornell lecture) shows a PS at the top and many workers below: workers send gradients up and receive parameter updates down. This centralized model is easy to reason about, but the PS can become a bottleneck or single point of failure if not carefully scaled.
+
+* **All-Reduce (Decentralized).** In this approach (often used by Horovod, PyTorch DDP, etc.), there is no dedicated server. Instead, workers form a peer-to-peer communication pattern to average gradients. For example, ring-AllReduce circulates gradient chunks among workers so that each ultimately holds the sum (or average) of all gradients. This avoids a single bottleneck and can leverage high-speed interconnects with optimized libraries (e.g. NCCL). Uber’s Horovod uses exactly this strategy: replacing a central PS with an MPI/NCCL-based ring-AllReduce allowed them to fully utilize hundreds of GPUs.
+
+**Benchmarks & comparisons.** In practice, performance depends on network and hardware. Studies often find that decentralized all-reduce (e.g. Horovod) scales better than a naive parameter server at large GPU counts. For instance, Uber observed that standard TensorFlow with PS did not scale efficiently: on 128 GPUs it utilized only ~50% of capacity. By contrast, Horovod (all-reduce) can sustain near-linear scaling up to hundreds of GPUs on modern clusters. In one test (Inception V3 and ResNet-101 benchmarks), scaling out to 128 GPUs left roughly half the GPU resources idle under vanilla TF, prompting the shift to Horovod. In general, real systems use high-speed networks (InfiniBand), NCCL optimizations, and mixed parallelism (e.g. PyTorch DDP overlapping communication with backprop) to maximize throughput. Published benchmarks (from cloud and HPC studies) consistently show that well-tuned data-parallel training with all-reduce yields on the order of 70–90% parallel efficiency up to ~64–128 GPUs on large models.
+
+![Parameter Server vs. All-Reduce](images/image-2.png)
+_Figure 3: Distributed Training Architectures: Parameter Server vs. All-Reduce_
+
+Figure 3 illustrates the two common architectures for distributed training. In the Parameter Server architecture, a set of CPUs acts as centralized servers responsible for managing and storing the model parameters. GPUs, functioning as workers, compute gradients based on their local data and send these gradients to the parameter servers. The servers then aggregate the received gradients, update the global model parameters, and distribute the updated parameters back to the GPUs. This approach is relatively simple to implement and can scale effectively for moderately large models; however, it is often constrained by communication bottlenecks as the number of workers increases. In contrast, the All-Reduce architecture removes the need for a central server. Instead, all GPUs are interconnected in a peer-to-peer network where each device shares its locally computed gradients with the others. Through an all-reduce operation, the GPUs collectively update the model parameters in a decentralized manner. This method reduces centralized communication overhead and improves scalability and fault tolerance, particularly in large-scale deep learning systems. The choice between these architectures depends on factors such as model size, network infrastructure, hardware topology, and the specific demands of the application. A minimal sketch of the ring all-reduce pattern follows.
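
Since ring all-reduce comes up repeatedly here, the following self-contained NumPy sketch simulates its two phases (reduce-scatter, then all-gather) for a handful of workers. It models only the communication pattern; the worker count, chunk sizes, and gradients are illustrative, and real systems perform these exchanges inside libraries such as NCCL or MPI rather than in user code.

```
import numpy as np

N = 4                                                 # number of workers arranged in a ring
rng = np.random.default_rng(1)
grads = [rng.normal(size=8) for _ in range(N)]        # each worker's local gradient
chunks = [np.split(g.copy(), N) for g in grads]       # chunks[worker][chunk_id]

# Phase 1 (reduce-scatter): in step s, worker i sends chunk (i - s) mod N to its
# right neighbour, which adds it to its own copy. After N-1 steps, worker i holds
# the complete sum of chunk (i + 1) mod N.
for s in range(N - 1):
    sends = [(i, (i - s) % N, chunks[i][(i - s) % N].copy()) for i in range(N)]
    for i, c, data in sends:
        chunks[(i + 1) % N][c] += data

# Phase 2 (all-gather): circulate the completed chunks around the ring so that
# every worker ends up holding every summed chunk.
for s in range(N - 1):
    sends = [(i, (i + 1 - s) % N, chunks[i][(i + 1 - s) % N].copy()) for i in range(N)]
    for i, c, data in sends:
        chunks[(i + 1) % N][c] = data

result = [np.concatenate(c) for c in chunks]
expected = sum(grads)                                 # dividing by N would give the average
assert all(np.allclose(r, expected) for r in result)
print("every worker now holds the reduced gradient")
```

Each worker sends and receives only 1/N of the gradient per step, which is why the pattern keeps network links evenly loaded instead of funnelling all traffic through one server.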
+
+## Practical Example: Distributed Machine Learning with TensorFlow
+
+To better illustrate the concepts of distributed machine learning, I present a minimal working example using TensorFlow's `tf.distribute` strategy. This example demonstrates how to distribute training across multiple GPUs using `MirroredStrategy`, one of the simplest forms of synchronous training.
+
+### Setup and Environment
+
+You need to have TensorFlow installed and access to multiple GPUs. `MirroredStrategy` automatically copies all model variables to each processor and synchronizes updates.
+
+```
+pip install tensorflow
+```
+
+### Code Example
+
+```
+import tensorflow as tf
+
+# Initialize the MirroredStrategy
+strategy = tf.distribute.MirroredStrategy()
+
+print('Number of devices: {}'.format(strategy.num_replicas_in_sync))
+
+# Create a simple model inside the strategy scope
+with strategy.scope():
+    model = tf.keras.Sequential([
+        tf.keras.layers.Dense(128, activation='relu', input_shape=(784,)),
+        tf.keras.layers.Dense(10, activation='softmax')
+    ])
+
+    model.compile(loss='sparse_categorical_crossentropy',
+                  optimizer=tf.keras.optimizers.Adam(),
+                  metrics=['accuracy'])
+
+# Load and preprocess the dataset (e.g., MNIST)
+(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
+x_train = x_train.reshape(-1, 784).astype('float32') / 255
+x_test = x_test.reshape(-1, 784).astype('float32') / 255
+
+# Train the model
+model.fit(x_train, y_train, epochs=5, batch_size=256)
+
+# Evaluate the model
+loss, accuracy = model.evaluate(x_test, y_test)
+print(f"Test accuracy: {accuracy:.4f}")
+```
+
+### How It Works
+The MirroredStrategy replicates the model across all available GPUs. Each GPU computes the forward and backward passes independently on different slices of the input batch. Gradients are then aggregated across all GPUs, and the model parameters are updated synchronously, ensuring consistent training.
+
+Using this strategy, the workload is effectively parallelized across devices, achieving faster training without significantly modifying the original model architecture.
+
+### Design Considerations
+* **Batch Size:** When using multiple replicas, the effective batch size increases. It is recommended to scale the batch size proportionally to the number of devices to maintain convergence behavior.
+
+* **Synchronization Overhead:** Although MirroredStrategy synchronizes GPUs efficiently, scaling beyond a certain number of devices may introduce communication overhead.
+
+* **Hardware Requirements:** All GPUs should be on the same machine for MirroredStrategy. For multi-node distribution, strategies like MultiWorkerMirroredStrategy or ParameterServerStrategy are better suited; a minimal multi-worker sketch follows this list.
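
To extend the example beyond a single machine, the sketch below shows roughly how the same Keras model could be trained with `MultiWorkerMirroredStrategy`. The `TF_CONFIG` cluster description (host names, ports, and task index) consists of placeholder values: in practice each machine runs the same script with its own `index`, and the environment variable is usually set by the launcher rather than in code.

```
import json
import os
import tensorflow as tf

# Placeholder cluster description; every worker runs this same script with its own index.
# TF_CONFIG must be set before the strategy is created.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["host1:12345", "host2:12345"]},
    "task": {"type": "worker", "index": 0},
})

# Collective all-reduce across the workers (and across the GPUs inside each worker).
strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation='relu', input_shape=(784,)),
        tf.keras.layers.Dense(10, activation='softmax')
    ])
    model.compile(loss='sparse_categorical_crossentropy',
                  optimizer=tf.keras.optimizers.Adam(),
                  metrics=['accuracy'])

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype('float32') / 255

# The batch size here is a global batch split across all workers, so scale it with cluster size.
model.fit(x_train, y_train, epochs=5, batch_size=256)
```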
+
+## Challenges
+While powerful, distributed ML introduces significant challenges:
+
+* **Communication Overhead.** Frequent synchronization can dominate runtime. As one survey put it, “the bottleneck lies in communication overhead” in large clusters. Every gradient exchange or parameter update incurs network cost. Studies and practitioners consistently report that as the number of machines grows, the fraction of time spent communicating increases. For example, Uber’s tests showed that naïve TensorFlow training on 128 GPUs left ~50% of GPU time idle due to communication inefficiencies. Reducing this overhead is a major research focus: techniques like gradient compression (quantization/sparsification), asynchronous updates, and large-batch training (to amortize communication per sample) are used. One analysis recommends increasing mini-batch sizes to “diminish communication overhead and expedite training” in large clusters. In practice, engineers employ libraries (e.g. NVIDIA NCCL) and methods (overlapping compute with communication) to mitigate overhead, but it remains a primary limiter of scalability.
+
+* **Fault Tolerance.** With hundreds of nodes, hardware failures become likely. A crash of one GPU or network link should not wreck an entire training run. Ensuring resilience is hard: the system must detect failures, recover state, and possibly reassign work. Unlike static HPC jobs, DML often runs for days or weeks, so checkpointing is essential. For example, a Comet guide notes that periodic checkpoints allow training to continue even if a worker dies. Beyond simple restart, some frameworks support elastic scaling (adding/replacing nodes on the fly) or asynchronous algorithms that tolerate stale workers. The importance of fault tolerance is underscored in large-model regimes: a recent review notes that “fault tolerance has also become more critical for large-model training” because prolonged jobs amplify the chance of failure. In fact, studies on LLM training emphasize that frequent hardware failures in large clusters can block progress, making fault-tolerant scheduling crucial.
+
+* **Load Balancing & Stragglers.** Ensuring each worker has roughly equal work is tricky when clusters are heterogeneous or when data is uneven. If some nodes are slower (“stragglers”), synchronous training must wait for them, degrading efficiency. Heterogeneity in machine speeds or workload leads to idle time. One source points out that mixed hardware can “cause problems with load balancing, work scheduling, and ensuring the system runs smoothly on all nodes”. Research addresses this with straggler mitigation: for example, grouping workers by speed and using different sync strategies for fast vs. slow nodes. As an illustration, GSSP (Group Synchronous Parallel) partitions workers into fast/slow groups to reduce synchronization delays. In practice, dynamic scheduling or backup workers are also used. Achieving perfect balance is hard, especially when nodes have different numbers of GPUs or when layers in a model have different compute costs, so this remains an active challenge.
+
+* **Data Privacy and Security.** Distributed training often involves moving data or models across nodes, raising privacy and security concerns.
In scenarios like federated learning, data resides on user devices, so protocols must prevent leakage. Meanwhile, any distributed system is susceptible to adversarial attacks. For example, an attacker could insert poisoned data on one node to corrupt the global model (data poisoning), or perform inference attacks to extract information about training data. Security reviews warn that “one successful attack can lead to other attacks; for instance, poisoning attacks can lead to membership inference and backdoor attacks”. Protecting against these requires techniques like secure aggregation, differential privacy, and robust training. Modern surveys highlight the use of cryptographic and perturbation methods (homomorphic encryption, secure multi-party computation, differential privacy) to enhance privacy in ML. However, these protections often add overhead. In short, designing DML systems that are both secure and private is a key challenge, especially in cross-silo or cross-device settings.
+
+## Real-World Applications
+Distributed ML powers many large-scale applications in industry:
+
+* **Cloud AI and Virtual Assistants.** Companies like Amazon, Google, and Apple use distributed training for voice recognition and NLP models (Alexa, Google Assistant, Siri). These systems require enormous datasets (years of speech) and complex models, which are trained in parallel. For instance, a XenonStack industry blog notes that Amazon and Google train speech models across many servers concurrently to reduce training time. Similarly, Google trained its Inception image recognition network on a cluster of dozens of machines. Facebook uses distributed PyTorch to train translation and understanding models across user data.
+
+* **Search, Ads, and Recommendation.** Tech giants use distributed ML for ranking, recommendation, and personalization. Large recommender systems (e.g. YouTube, Amazon) often factorize huge user-item matrices using distributed algorithms (like Alternating Least Squares on Spark). Online ads and search ranking models are trained on terabytes of clickstream logs using distributed deep nets. These pipelines rely on frameworks like TensorFlow and Hive/Spark to scale. For example, Google’s ranking models (deep and shallow combined) are distributed over data centers, and Amazon’s SageMaker service enables multi-GPU training for massive personalization models.
+
+* **Training Large Language Models.** Distributed training is especially critical for cutting-edge NLP. BERT (Google, 2018) with ~340M parameters was trained on TPUs in parallel, setting a new benchmark for language understanding. GPT-3 (OpenAI, 2020) with 175B parameters required an enormous cluster: estimates suggest ~1024 GPUs over ~34 days (cost ~$4.6M) to train it. These case studies demonstrate the scale of computation: GPT-3 “would require 355 years” on one GPU, but distributed computing reduced it to about a month. Similar distributed feats include OpenAI’s DALL·E for image generation and large recommender models at Netflix or Alibaba. In sum, state-of-the-art AI (LLMs, vision transformers, etc.) all rely on distributed training.
+
+* **Other Domains.** Distributed ML is used wherever data is huge or naturally partitioned: genomics (DNA sequencing on clusters), autonomous driving (training vision and sensor models across fleets), finance (fraud detection on distributed transaction logs), healthcare (medical image analysis on cloud clusters), IoT analytics, and more.
For example, autonomous vehicle companies train perception models on petabytes of street video using large GPU farms. Financial firms run distributed anomaly detection on streaming data. In each case, the common theme is that parallelizing the learning yields faster models and the ability to handle data at scale.
+
+## Future Trends
+Looking ahead, several trends are shaping the evolution of distributed ML:
+
+* **Federated Learning (FL).** FL extends DML to highly decentralized settings (e.g. smartphones, edge devices) with strong privacy guarantees. Instead of centralizing data, models are trained locally on each client and only weight updates are aggregated at a server. This allows collaborative learning without sharing raw data. FL introduces new challenges (non-IID data, untrusted clients) and solutions (secure aggregation, differential privacy). It’s seeing rapid adoption in mobile and healthcare apps, and research into FL algorithms is booming.
+
+* **Edge and Heterogeneous Computing.** As edge devices (phones, IoT sensors, drones) become more powerful, on-device or edge-distributed training is growing. Distributed ML techniques are being adapted for clusters of heterogeneous devices (e.g. combining cloud GPUs with low-power edge TPUs). Concepts like split learning, peer-to-peer training, and co-training between cloud and edge are emerging. This convergence of edge computing with ML pushes new architectures: for instance, using 5G networks for model updates, or training sub-models on-device while others train in data centers.
+
+* **New Communication Protocols.** The communication fabric is evolving. Beyond TCP/IP and MPI, specialized interconnects (RoCE, GPUDirect), RDMA offloads, and even hardware accelerators for collective operations are appearing. Researchers are exploring gossip protocols and peer-to-peer averaging to relax synchrony. Network topologies (torus, fat-tree) are being leveraged more intelligently. On the software side, novel synchronization schemes (e.g. hierarchical All-Reduce, asynchronous pipelining) are maturing.
+
+* **Gradient Compression and Optimization.** To alleviate bandwidth bottlenecks, communication-efficient algorithms are advancing. Techniques like gradient quantization (1-bit SGD, QSGD), sparsification (only sending large updates), and low-rank compression (PowerSGD) are becoming practical. For example, Dong et al. showed that a custom gradient compressor cut training time by over 25%. More adaptive schemes (which learn when and how much to communicate) and hardware support for compressed communication are expected. Additionally, optimizers tailored for distributed settings (e.g. LARS, LAMB for large-batch training, Elastic Averaging) continue to emerge. A minimal sketch of gradient sparsification appears at the end of this section.
+
+* **Auto-Parallelism and Scheduling.** Future frameworks will likely automate more of the distribution logic. Ideas like compiler-driven model partitioning, automated batch-size scaling, or elasticity (auto-scaling cluster size) are on the horizon. Better load-balancing schedulers that adapt to stragglers and network conditions will improve efficiency. Tighter integration with hardware is also anticipated: for instance, training on chiplets or photonic networks, and co-design of models that are “distribution-friendly.”
+
+In summary, distributed ML remains an active field of innovation, fueled by both the practical needs of industry and theoretical advances in parallel algorithms.
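
As a concrete illustration of the compression idea mentioned above, here is a minimal NumPy sketch of top-k gradient sparsification: only the largest-magnitude entries are communicated, and the receiver rebuilds a dense (approximate) gradient. The tensor shape, the 1% budget, and the function names are illustrative; production schemes typically add error feedback and combine sparsification with quantization.

```
import numpy as np

def topk_sparsify(grad, k):
    """Keep only the k largest-magnitude entries; return (indices, values).
    Only this pair needs to be sent over the network instead of the dense tensor."""
    flat = grad.ravel()
    idx = np.argpartition(np.abs(flat), -k)[-k:]
    return idx, flat[idx]

def densify(idx, values, shape):
    """Rebuild a dense gradient from the communicated (indices, values) pair."""
    flat = np.zeros(int(np.prod(shape)))
    flat[idx] = values
    return flat.reshape(shape)

rng = np.random.default_rng(0)
grad = rng.normal(size=(256, 128))           # stand-in for one layer's gradient
k = grad.size // 100                         # communicate roughly 1% of the entries

idx, values = topk_sparsify(grad, k)
approx = densify(idx, values, grad.shape)

# In practice the dropped residual (grad - approx) is accumulated locally and
# added to the next step's gradient (error feedback) to preserve convergence.
print(f"compression ~{grad.size / k:.0f}x, "
      f"relative error {np.linalg.norm(grad - approx) / np.linalg.norm(grad):.2f}")
```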
+
+## Conclusion
+Distributed machine learning has become essential for modern AI. By spreading data and computation across clusters, it enables training on scales that would be impossible on a single machine. We have seen how data parallelism and model parallelism (and their hybrids) allow handling large datasets and massive models, respectively, with numerous studies demonstrating substantial speedups. Architectures like parameter servers and all-reduce clusters (and frameworks such as TensorFlow, PyTorch, Horovod) provide the infrastructure to coordinate this work. However, challenges abound: communication overhead often dominates training time, system faults can derail long jobs, and issues like load imbalance and data security require careful solutions. Overcoming these has driven innovations like gradient compression, fault-tolerant SGD variants, and new networking approaches. Going forward, trends like federated learning, edge training, and smarter optimization protocols will continue to push DML capabilities. The field is moving toward ever larger models and more distributed data, so scalability and efficiency will remain at the forefront. In the end, distributed ML is crucial for AI’s progress: it lets us tackle previously intractable problems, but also demands new system designs and algorithms.
+
+## Links
+1. https://www.comet.com/site/blog/guide-to-distributed-machine-learning/#:~:text=Distributed%20machine%20learning%20is%20the,rather%20than%20a%20single%20machine
+2. https://journalofbigdata.springeropen.com/articles/10.1186/s40537-023-00829-x#:~:text=In%20order%20to%20train%20ML,24%5D.%20Furthermore%2C%20it%20allows
+3. https://arxiv.org/html/2404.06114v1#:~:text=The%20bottleneck%20lies%20in%20communication,challenges%20that%20need%20attention%2C%20including
+4. https://www.uber.com/blog/horovod/#:~:text=The%20second%20issue%20dealt%20with,when%20training%20on%20128%20GPUs
+5. https://www.xenonstack.com/blog/distributed-ml-framework#:~:text=The%20data%20in%20DML%20is,communication%20overhead%20may%20become%20challenging
+6. https://www.cs.cornell.edu/courses/cs4787/2019sp/notes/lecture22.pdf#:~:text=Here%20is%20a%20simple%20diagram,3%20%C2%B7%20%C2%B7%20%C2%B7%20worker
+
diff --git a/article_distributed_machine_learning/images/42bc9114-ece7-45ba-a0c9-2a7c90cf0a64.png b/article_distributed_machine_learning/images/42bc9114-ece7-45ba-a0c9-2a7c90cf0a64.png
new file mode 100644
index 0000000..8e3e6ab
Binary files /dev/null and b/article_distributed_machine_learning/images/42bc9114-ece7-45ba-a0c9-2a7c90cf0a64.png differ
diff --git a/article_distributed_machine_learning/images/image-1.png b/article_distributed_machine_learning/images/image-1.png
new file mode 100644
index 0000000..f24b34f
Binary files /dev/null and b/article_distributed_machine_learning/images/image-1.png differ
diff --git a/article_distributed_machine_learning/images/image-2.png b/article_distributed_machine_learning/images/image-2.png
new file mode 100644
index 0000000..fd01de8
Binary files /dev/null and b/article_distributed_machine_learning/images/image-2.png differ
diff --git a/article_distributed_machine_learning/images/image.png b/article_distributed_machine_learning/images/image.png
new file mode 100644
index 0000000..ed0c224
Binary files /dev/null and b/article_distributed_machine_learning/images/image.png differ