- This project simulates the data-flow mechanism and explores performance and bottlenecks in Infiniband networks, specifically for distributed machine-learning systems in data centers. It is built with OMNeT++ on top of the open-source OMNeT++ Infiniband model.
- The details of the network simulation model are as below:
  - Networking Protocol: Infiniband
  - Data Transmission Principle: Credit-based Flow Control
  - Congestion Control Mechanism: Congestion Control Table Indexing (BECN, FECN)
  - Virtual Lane Selection Algorithm: Weighted Round Robin (WRR)
  - Data Transmission Framework: Ring Allreduce
  - Data Granularity: Flow Control Unit (FLIT)
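To make the Ring Allreduce framework concrete, here is a minimal in-memory sketch of the two phases (reduce-scatter, then all-gather) for a sum reduction. It is not taken from this repository: `ringAllreduce` is a hypothetical helper, the "network" is just a vector of per-node buffers, and the buffer length is assumed to be divisible by the node count.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: sum-allreduce over n nodes arranged in a ring.
// data[i] is node i's local buffer; every buffer has the same length,
// which is assumed to split into n equal chunks.
void ringAllreduce(std::vector<std::vector<double>>& data) {
    const std::size_t n = data.size();
    assert(n > 0);
    const std::size_t len = data[0].size();
    assert(len % n == 0);
    const std::size_t chunk = len / n;

    // Phase 1: reduce-scatter. In step s, node i sends chunk (i - s) mod n
    // to its right neighbour, which adds it to its own copy. A node never
    // sends and receives the same chunk in one step, so in-place is safe.
    for (std::size_t s = 0; s + 1 < n; ++s) {
        for (std::size_t i = 0; i < n; ++i) {
            const std::size_t dst = (i + 1) % n;
            const std::size_t c = (i + n - s) % n;
            for (std::size_t k = 0; k < chunk; ++k)
                data[dst][c * chunk + k] += data[i][c * chunk + k];
        }
    }

    // Phase 2: all-gather. Node i now owns the fully reduced chunk
    // (i + 1) mod n and passes finished chunks around the ring.
    for (std::size_t s = 0; s + 1 < n; ++s) {
        for (std::size_t i = 0; i < n; ++i) {
            const std::size_t dst = (i + 1) % n;
            const std::size_t c = (i + 1 + n - s) % n;
            for (std::size_t k = 0; k < chunk; ++k)
                data[dst][c * chunk + k] = data[i][c * chunk + k];
        }
    }
}
```

Each node sends and receives 2(n - 1) chunks in total, which is why the pattern stresses every link of the ring roughly equally.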
- Task Generator:
  - Creates AI training or computational tasks, initializes the relevant data information and size in every flit, and forwards them to the CPU so that computing tasks can be assigned among the nodes in the networking cluster.
- Central Controller:
  - Manages the entire networking cluster.
  - Breaks down AI training tasks received from Task Generator and distributes them evenly among nodes.
  - Generates routing information for every message and updates the relevant routing tables in Switches.
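As an illustration of distributing a task evenly among nodes, the sketch below splits a total amount of work so that per-node shares differ by at most one unit. `splitEvenly` and the notion of a work "unit" are assumptions made for this example, not names from the simulator.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: split `totalUnits` of work across `nodes` workers so
// that shares differ by at most one unit. The first (totalUnits % nodes)
// nodes receive the extra unit.
std::vector<std::size_t> splitEvenly(std::size_t totalUnits, std::size_t nodes) {
    assert(nodes > 0);
    std::vector<std::size_t> shares(nodes, totalUnits / nodes);
    for (std::size_t i = 0; i < totalUnits % nodes; ++i) {
        ++shares[i];
    }
    return shares;
}
```

For example, 10 units over 3 nodes would give shares of 4, 3, and 3.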
- CPU:
  - Responsible for generating computational tasks for GPU.
  - Receives AI training tasks from Central Controller and forwards them to GPU for computational training.
- GPU:
  - Undertakes AI training and relevant computations.
  - Receives training data from CPU, performs computations, and forwards the processed data to HCA for transmission.
- HCA:
  - Functions similarly to a Network Interface Card (NIC).
  - Integrates App, Gen, Virtual Lane Arbiter, Output Buffer, Input Buffer, and Sink.
  - Encapsulates trained data for transmission through the network link layer to update AI model parameters.
- Switch:
  - Responsible for flit exchange between different nodes in the networking cluster.
  - Integrates Virtual Lane Arbiter, Output Buffer, Input Buffer, and Packet Forwarder.
- App:
  - Coordinates information flow between the HCA and the GPU.
  - Forwards flits from GPU to Gen.
  - Notifies the GPU upon receiving a complete message.
- Gen:
  - Implements the Infiniband protocol.
  - Breaks down messages received from App into finer packets and flits.
  - Controls the injection rate based on the congestion protocol using Backward Explicit Congestion Notification (BECN).
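One way to picture BECN-driven injection control with a congestion control table is the sketch below: each BECN moves an index further into the table (a longer delay between injections), and a periodic timer moves it back. The class name `CongestionThrottle`, the table contents, and the time unit are all assumptions for illustration; the simulator's actual parameters may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical source-side throttle driven by a congestion control table.
class CongestionThrottle {
public:
    // `table[i]` is the delay inserted between injections at index i
    // (arbitrary time units). `step` is how far one BECN advances the index.
    explicit CongestionThrottle(std::vector<unsigned> table, std::size_t step = 1)
        : table_(std::move(table)), step_(step) {
        assert(!table_.empty());
    }

    // A BECN arrived: back off by moving deeper into the table (saturating).
    void onBecn() { index_ = std::min(index_ + step_, table_.size() - 1); }

    // Recovery timer fired: relax the throttle by one table entry.
    void onTimer() {
        if (index_ > 0) --index_;
    }

    // Delay to apply before injecting the next packet.
    unsigned injectionDelay() const { return table_[index_]; }

private:
    std::vector<unsigned> table_;
    std::size_t step_;
    std::size_t index_ = 0;
};
```

With a table such as `{0, 2, 4, 8}`, an uncongested source injects with no added delay, and repeated BECNs saturate at the last entry rather than growing without bound.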
- Virtual Lane Arbiter:
  - Arbitrates and selects virtual lanes based on real-time conditions.
  - Temporarily stores the data before transmission and ensures nodes possess enough credit to forward the flit.
  - Utilizes the Weighted Round Robin (WRR) algorithm for lane selection.
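The following is a small sketch of credit-aware Weighted Round Robin lane selection, written independently of the simulator's code. `VirtualLane`, `WrrArbiter`, and the choice of counting weight in flits per round are assumptions: a lane is served while it has budget left, a queued flit, and at least one credit; otherwise the arbiter moves on and refills the next lane's budget.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical per-lane state seen by the arbiter.
struct VirtualLane {
    unsigned weight;   // flits this lane may send per round
    unsigned queued;   // flits waiting in this lane
    unsigned credits;  // credits available at the receiver for this lane
};

class WrrArbiter {
public:
    explicit WrrArbiter(std::vector<VirtualLane> lanes) : lanes_(std::move(lanes)) {
        assert(!lanes_.empty());
        budget_ = lanes_[0].weight;
    }

    // Returns the index of the lane that sends the next flit, or -1 if no
    // lane is currently eligible (nothing queued, or no credits).
    int selectNext() {
        // n + 1 visits: every lane once, plus a retry of the starting lane
        // with a refilled budget.
        for (std::size_t tries = 0; tries <= lanes_.size(); ++tries) {
            VirtualLane& vl = lanes_[current_];
            if (budget_ > 0 && vl.queued > 0 && vl.credits > 0) {
                --budget_;
                --vl.queued;
                --vl.credits;
                return static_cast<int>(current_);
            }
            current_ = (current_ + 1) % lanes_.size();
            budget_ = lanes_[current_].weight;
        }
        return -1;
    }

private:
    std::vector<VirtualLane> lanes_;
    std::size_t current_ = 0;
    unsigned budget_ = 0;
};
```

With two busy lanes of weight 2 and 1, the arbiter serves them in a 2:1 pattern; a lane without credits is skipped even if it has the higher weight.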
- Output Buffer:
  - Temporarily stores flits before transmission.
  - Updates credit values upon processing data and queues flits if the destination buffer lacks space.
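Credit-based flow control on the sending side can be sketched as follows: a flit leaves only when a credit is available, otherwise it waits in the queue until the receiver returns credits. `CreditedOutput` and its integer flit ids are hypothetical, and one credit is assumed to equal one flit of buffer space at the receiver.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical sender-side buffer governed by credits.
class CreditedOutput {
public:
    explicit CreditedOutput(unsigned initialCredits) : credits_(initialCredits) {}

    // Queue a flit and transmit whatever the current credits allow.
    // Returns the ids of the flits actually sent, in order.
    std::vector<int> push(int flitId) {
        queue_.push_back(flitId);
        return drain();
    }

    // The receiver freed buffer space and returned `n` credits.
    std::vector<int> addCredits(unsigned n) {
        assert(n > 0);
        credits_ += n;
        return drain();
    }

    unsigned credits() const { return credits_; }
    std::size_t queued() const { return queue_.size(); }

private:
    // Send queued flits in FIFO order, one credit per flit.
    std::vector<int> drain() {
        std::vector<int> sent;
        while (credits_ > 0 && !queue_.empty()) {
            sent.push_back(queue_.front());
            queue_.pop_front();
            --credits_;
        }
        return sent;
    }

    std::deque<int> queue_;
    unsigned credits_;
};
```

Because a flit is never sent without a credit, the receiver's buffer cannot overflow, so congestion shows up as queueing at the sender rather than as dropped flits.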
- Input Buffer:
  - Temporarily stores received flits.
  - Encapsulates flits into messages, forwards data to Output Buffer and Virtual Lane Arbiter, and triggers routing via Packet Forwarder.
- Sink:
  - Collects and processes flits received at destination nodes.
  - Alerts Input Buffer and updates credit values upon processing packets.
  - Generates congestion notifications for alleviation when required.
- Packet Forwarder:
  - Routes data through switches.
  - Establishes a routing table and determines output ports for transmission based on destination node information.
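A destination-based routing table of the kind described above can be sketched as a simple map from destination node id to output port. The class shape, the integer ids, and the use of -1 for "no route" are assumptions for this example only.

```cpp
#include <cassert>
#include <unordered_map>

// Hypothetical forwarding table: destination node id -> switch output port.
class ForwardingTable {
public:
    // Install or overwrite the route for `dest`.
    void addRoute(int dest, int port) {
        assert(port >= 0);
        table_[dest] = port;
    }

    // Output port for `dest`, or -1 when no route has been installed.
    int outputPort(int dest) const {
        const auto it = table_.find(dest);
        return it == table_.end() ? -1 : it->second;
    }

private:
    std::unordered_map<int, int> table_;
};
```

In this picture, the controller would call `addRoute` when it updates a switch, and the switch would call `outputPort` once per packet to pick the egress port.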
- Clone the repository: `git clone https://github.com/branchonly/InfinibandSim-Optimized.git`
- Install dependencies, including OMNeT++.
- Follow simulation setup instructions in the repository documentation.
- Run simulations and analyze performance metrics.
- Contributions are welcome! Please fork the repository and submit pull requests for enhancements, bug fixes, or documentation improvements.
This project is under the MIT License. See the LICENSE file for details.