Description
DeepSpeed is an excellent framework for training LLMs at large scale, and the mpi-operator is the ideal tool for running it within the Kubernetes ecosystem.
I'm planning to submit a series of PRs to make this project closer to ready-to-use for very large-scale training with DeepSpeed and other MPI-style training frameworks.
The upcoming features may include the following modifications:
Support for IP-style hostfile
This improves performance and, for those who wrap the hostfile into an environment variable, keeps its length from exceeding the limit that the longer svc-style hostnames can hit.
Support for fault tolerance and elasticity
This is only quasi-fault tolerance, since NCCL communication must always be re-created when an error occurs. It is still worth implementing, because re-creating pods can be costly at very large scale.
Configuration decoupling
Some requirements, such as ssh_config and sshd_config, are currently left to whoever builds the Docker image. Perhaps the operator can manage all of these.
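To make the IP-style hostfile item above more concrete, here is a rough Python sketch comparing the two styles. It assumes the usual `<host> slots=<n>` hostfile line format; the job name, namespace, svc-style hostname pattern, and IP addresses are all made up for illustration and are not taken from the operator's code.

```python
def render_hostfile(hosts, slots):
    """Render one `<host> slots=<n>` line per worker."""
    return "".join(f"{host} slots={slots}\n" for host in hosts)


# Hypothetical job: every name and address below is illustrative only.
job, namespace, workers, slots = "ds-train", "default", 4, 8

# svc-style entries (assumed pattern: <pod>.<service>.<namespace>.svc).
dns_hosts = [
    f"{job}-worker-{i}.{job}-worker.{namespace}.svc" for i in range(workers)
]
# IP-style entries (assumed pod IPs).
ip_hosts = [f"10.0.0.{i + 1}" for i in range(workers)]

dns_hostfile = render_hostfile(dns_hosts, slots)
ip_hostfile = render_hostfile(ip_hosts, slots)

# Compare the sizes of the two hostfiles, in characters.
print(f"svc-style: {len(dns_hostfile)} chars, IP-style: {len(ip_hostfile)} chars")
```

The per-line saving is small, but it grows linearly with the number of workers, which is what matters when the whole hostfile has to fit into a single environment variable.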
Some minor changes are also under consideration. Please feel free to share your thoughts on any of this.