Implement parallel model preloading #211

aybanda · 2024-09-09T11:56:26Z

Implement Parallel Model Preloading

Description

This PR introduces parallel model preloading to significantly reduce startup times for large models distributed across multiple nodes. By leveraging asyncio, we now preload model shards into memory concurrently, followed by a sequential initialization step.

Changes

Added preload_model method to the InferenceEngine abstract class
Implemented preload_model in MLXDynamicShardInferenceEngine
Updated ensure_shard method to work with preloaded models
Modified main.py to use parallel preloading

Implementation Details

InferenceEngine now has an abstract preload_model method
MLXDynamicShardInferenceEngine.preload_model loads model config and weights without full initialization
ensure_shard completes initialization using preloaded data
Main script uses asyncio.gather for parallel preloading

Performance Improvements

Startup time for multi-shard models is expected to decrease significantly
Resource utilization during startup is more efficient

How to Test

Run the main script with a multi-shard model
Observe logs for parallel preloading and sequential initialization
Compare startup times with the previous sequential loading approach

Future Work

Fine-tune the balance between parallel preloading and sequential initialization
Implement similar optimizations for other inference engines (e.g., TinyGrad)

If you feel like supporting me:

https://buymeacoffee.com/aybanda

AlexCheema · 2024-09-09T13:09:47Z

Hey, is this AI generated?

We don't accept AI generated PR's.

This doesn't really achieve its intended purpose: calling preload_model in main.py doesn't really make sense since exo doesn't know up front which shards you are going to use.

aybanda · 2024-09-09T13:33:30Z

Hey @AlexCheema I got your point and yes I have generated using AI

Instead of preloading in main.py, we could modify the ensure_shard method to implement a more efficient loading process. Here's a approach that might work better with your design

In MLXDynamicShardInferenceEngine modifying ensure_shard
This approach will be more suitable I guess
Loads config and weights concurrently
Doesn't require changes to main.py or other parts of exo
Keeps the loading process within the ensure_shard method, maintaining your existing architecture

If you are interested in this let me know, I will change the code accordingly.

Implement parallel model preloading

bbc0182

aybanda mentioned this pull request Sep 9, 2024

[BOUNTY - $100] Parallelise Model Loading #202

Open

aybanda marked this pull request as draft September 9, 2024 17:04

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement parallel model preloading #211

Implement parallel model preloading #211

aybanda commented Sep 9, 2024

AlexCheema commented Sep 9, 2024

aybanda commented Sep 9, 2024

Implement parallel model preloading #211

Are you sure you want to change the base?

Implement parallel model preloading #211

Conversation

aybanda commented Sep 9, 2024

Implement Parallel Model Preloading

Description

Changes

Implementation Details

Performance Improvements

How to Test

Future Work

AlexCheema commented Sep 9, 2024

aybanda commented Sep 9, 2024