High-performance Infiniband network simulation and bottleneck exploration for distributed machine-learning systems using OMNeT++. Components include task generator, routing controller, and virtual lane arbitration.

branchonly/InfinibandSim-Optimized

Introduction

  • This project simulates data flow in Infiniband networks and explores their performance and bottlenecks for distributed machine-learning systems in data centers. It is built with OMNeT++ on top of the open-source OMNeT++ Infiniband model.
  • The network simulation model is configured as follows:
  1. Networking Protocol: Infiniband
  2. Data Transmission Principle: Credit-based Flow Control
  3. Congestion Control Mechanism: Congestion Control Table Indexing (BECN, FECN)
  4. Virtual Lane Selection Algorithm: Weighted Round Robin (WRR)
  5. Data Transmission Framework: Ring Allreduce
  6. Data Granularity: Flow Control Unit (FLIT)
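
The Ring Allreduce pattern listed above can be illustrated outside the simulator. The following Python sketch is not taken from the repository: the function name `ring_allreduce` and the chunking scheme are assumptions made for illustration. Each of N nodes holds a vector split into N chunks; N-1 reduce-scatter steps followed by N-1 all-gather steps leave every node with the element-wise sum.

```python
def ring_allreduce(node_data):
    """Simulate a sum ring allreduce over equal-length vectors.

    node_data[i] is the vector held by node i (hypothetical input format).
    The vector length must be a multiple of the node count.
    """
    n = len(node_data)
    length = len(node_data[0])
    assert length % n == 0, "vector length must be divisible by node count"
    size = length // n
    # Split every node's vector into n chunks.
    chunks = [[list(vec[c * size:(c + 1) * size]) for c in range(n)]
              for vec in node_data]

    # Reduce-scatter: in step s, node i sends chunk (i - s) mod n to node
    # i + 1, which adds it into its own copy of that chunk.
    for step in range(n - 1):
        sends = [(i, (i - step) % n) for i in range(n)]
        payloads = [chunks[i][c][:] for i, c in sends]  # snapshot first
        for (src, c), payload in zip(sends, payloads):
            dst = (src + 1) % n
            chunks[dst][c] = [a + b for a, b in zip(chunks[dst][c], payload)]

    # All-gather: node i now holds the fully reduced chunk (i + 1) mod n
    # and the reduced chunks are passed around the ring.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n) for i in range(n)]
        payloads = [chunks[i][c][:] for i, c in sends]
        for (src, c), payload in zip(sends, payloads):
            chunks[(src + 1) % n][c] = payload

    return [[x for chunk in node for x in chunk] for node in chunks]
```

In this scheme each node sends 2(N-1) chunks in total, so the traffic per node stays close to twice the vector size regardless of the number of nodes.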

Components

1. Task Generator

  • Creates AI training or computational tasks, initializes the data information and size carried in every flit, and forwards the tasks to the CPU so that the computing work can be assigned among the nodes in the cluster.

2. Central Controller

  • Manages the entire networking cluster.
  • Breaks down AI training tasks received from the Task Generator and distributes them evenly among the nodes.
  • Generates routing information for every message and updates the corresponding routing tables in the switches.
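
As a rough illustration of the even distribution described above, the sketch below splits a task measured in abstract work units across nodes so that share sizes differ by at most one. The function name `split_evenly` and the unit of work are assumptions, not the repository's API.

```python
def split_evenly(total_units, num_nodes):
    """Return per-node shares of total_units that differ by at most one."""
    if num_nodes <= 0:
        raise ValueError("num_nodes must be positive")
    base, extra = divmod(total_units, num_nodes)
    # The first `extra` nodes take one additional unit each.
    return [base + 1 if i < extra else base for i in range(num_nodes)]
```

For example, 10 units over 4 nodes gives shares of 3, 3, 2 and 2.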

3. Central Processing Unit (CPU)

  • Generates computational tasks for the GPU.
  • Receives AI training tasks from the Central Controller and forwards them to the GPU for training.

4. Graphics Processing Unit (GPU)

  • Performs AI training and the related computations.
  • Receives training data from the CPU, performs the computations, and forwards the processed data to the HCA for transmission.

5. Host Channel Adapter (HCA)

6. Switch

7. Application

  • Coordinates information flow between the HCA and the GPU.
  • Forwards flits from the GPU to the Generator.
  • Notifies the GPU upon receiving a complete message.

8. Generator

  • Implements the Infiniband protocol.
  • Breaks down messages received from the Application into packets, and packets into flits.
  • Throttles the injection rate in response to Backward Explicit Congestion Notification (BECN) signals.
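
The segmentation step described above (message to packets to flits) can be sketched as follows. The function name `segment_message` and the default sizes of 2048 bytes per packet and 64 bytes per flit are assumptions for illustration; the simulator's actual parameters may differ.

```python
def segment_message(message_bytes, mtu_bytes=2048, flit_bytes=64):
    """Return a list of packets, each a list of flit sizes in bytes.

    mtu_bytes and flit_bytes are hypothetical defaults.
    """
    packets = []
    remaining = message_bytes
    while remaining > 0:
        packet_bytes = min(remaining, mtu_bytes)
        remaining -= packet_bytes
        flits = []
        left = packet_bytes
        while left > 0:
            flit = min(left, flit_bytes)  # the last flit may be partial
            flits.append(flit)
            left -= flit
        packets.append(flits)
    return packets
```

With these defaults, a 5000-byte message becomes three packets of 32, 32 and 15 flits, the final flit carrying the 8 leftover bytes.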

9. Virtual Lane Arbiter

  • Arbitrates and selects virtual lanes based on real-time conditions.
  • Temporarily stores data before transmission and checks that enough credits are available before a flit is forwarded.
  • Utilizes the Weighted Round Robin (WRR) algorithm for lane selection.
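
A minimal sketch of weighted round robin selection with a credit check is shown below. It is a simplified illustration, not the repository's arbiter: the class name `WrrArbiter`, the per-lane weight expressed in flits per turn, and the `credits` dictionary are all assumptions.

```python
from collections import deque


class WrrArbiter:
    """Toy weighted round robin arbiter over virtual lanes (VLs)."""

    def __init__(self, weights):
        # weights: dict mapping VL id -> flits the lane may send per turn
        self.weights = dict(weights)
        self.queues = {vl: deque() for vl in self.weights}
        self.order = list(self.weights)
        self.pos = 0
        self.budget = self.weights[self.order[0]]

    def enqueue(self, vl, flit):
        self.queues[vl].append(flit)

    def _advance(self):
        # Move to the next lane and refill its per-turn budget.
        self.pos = (self.pos + 1) % len(self.order)
        self.budget = self.weights[self.order[self.pos]]

    def select(self, credits):
        """Return (vl, flit) for the next flit to send, or None.

        credits: dict VL id -> free downstream slots, decremented on send.
        """
        # One extra iteration lets the current lane start a fresh turn
        # when it is the only lane with traffic.
        for _ in range(len(self.order) + 1):
            vl = self.order[self.pos]
            if self.budget > 0 and self.queues[vl] and credits.get(vl, 0) > 0:
                self.budget -= 1
                credits[vl] -= 1
                return vl, self.queues[vl].popleft()
            self._advance()
        return None
```

With weights `{0: 2, 1: 1}` and traffic queued on both lanes, lane 0 is served twice for every flit served on lane 1, and a lane without credits is skipped.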

10. Output Buffer

  • Temporarily stores flits before transmission.
  • Updates credit values as data is processed and queues flits when the destination buffer lacks space.
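
The credit bookkeeping described above can be sketched as a sender that holds one credit per free slot in the receiver's buffer. The class name `CreditLink` and its methods are hypothetical and only illustrate the principle of credit-based flow control.

```python
from collections import deque


class CreditLink:
    """Sender side of a credit-based link; one credit = one free flit slot."""

    def __init__(self, receiver_slots):
        self.credits = receiver_slots
        self.pending = deque()  # flits waiting for a credit
        self.sent = []          # flits already put on the wire

    def submit(self, flit):
        self.pending.append(flit)
        self._drain()

    def credit_return(self, count=1):
        # Called when the receiver frees buffer space.
        self.credits += count
        self._drain()

    def _drain(self):
        # Send queued flits in order while credits remain.
        while self.pending and self.credits > 0:
            self.credits -= 1
            self.sent.append(self.pending.popleft())
```

Because a flit leaves only when a credit is available, the receiver's buffer cannot overflow and no flit is dropped on the link.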

11. Input Buffer

12. Sink

  • Collects and processes flits received at destination nodes.
  • Notifies the Input Buffer and updates credit values after processing packets.
  • Generates congestion notifications when required so that congestion can be relieved.
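
The congestion notifications generated here feed the injection-rate control described for the Generator. The sketch below shows one way a sender could react to them, loosely modeled on indexing into a congestion control table: each BECN raises an index into a table of inter-packet delays, and a periodic timer lowers it again. The class name `InjectionThrottle`, its parameters, and the table values are assumptions, and the repository's logic may differ.

```python
class InjectionThrottle:
    """Toy congestion control table (CCT) index with hypothetical parameters."""

    def __init__(self, cct_delays_ns, ccti_increase=1, ccti_min=0):
        self.cct = list(cct_delays_ns)  # delay inserted between packets
        self.increase = ccti_increase
        self.ccti_min = ccti_min
        self.ccti = ccti_min            # current index into the table

    def on_becn(self):
        # A BECN moves the index up, capped at the last table entry.
        self.ccti = min(self.ccti + self.increase, len(self.cct) - 1)

    def on_timer(self):
        # A periodic timer moves the index back down, restoring the rate.
        self.ccti = max(self.ccti - 1, self.ccti_min)

    def inter_packet_delay_ns(self):
        return self.cct[self.ccti]
```

A larger delay means a lower injection rate, so repeated BECNs slow the sender step by step while quiet periods let it recover.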

13. Packet Forwarder

  • Routes data through the switches.
  • Builds a routing table and selects the output port for each transmission based on the destination node.
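
One way to picture the routing table is as a map from destination node to output port. The sketch below builds such a table for a single switch with a breadth-first search over a small topology. The function name `build_forwarding_table` and the `links` format are assumptions for illustration; shortest-path routing is used here as a stand-in, since the source does not state which routing algorithm the simulator applies.

```python
from collections import deque


def build_forwarding_table(links, switch):
    """Map each reachable node to the output port of `switch` on a shortest path.

    links: dict node -> dict port -> neighbor (hypothetical topology format).
    """
    table = {}
    visited = {switch}
    queue = deque()
    # Direct neighbors are reached through the port they are attached to.
    for port, neighbor in sorted(links[switch].items()):
        if neighbor not in visited:
            visited.add(neighbor)
            table[neighbor] = port
            queue.append(neighbor)
    # Nodes further away inherit the first-hop port of the node before them.
    while queue:
        node = queue.popleft()
        for _, neighbor in sorted(links.get(node, {}).items()):
            if neighbor not in visited:
                visited.add(neighbor)
                table[neighbor] = table[node]
                queue.append(neighbor)
    return table
```

Forwarding a flit then reduces to a lookup such as `table[destination]` to obtain the output port.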

Getting Started

  1. Clone the repository: git clone https://github.com/branchonly/InfinibandSim-Optimized.git
  2. Install the dependencies, including OMNeT++.
  3. Follow simulation setup instructions in the repository documentation.
  4. Run simulations and analyze performance metrics.

Contributing

  • Contributions are welcome! Please fork the repository and submit pull requests for enhancements, bug fixes, or documentation improvements.

License

This project is licensed under the MIT License. See the LICENSE file for details.
