Work with DeepSpeed for large scale training #611

@kuizhiqing

Description

DeepSpeed is an excellent framework for training LLMs at very large scale, and the mpi-operator is the ideal tool for running it within the Kubernetes ecosystem.

I'm planning to submit a series of PRs to make this project more ready to use for very large scale training with DeepSpeed and other MPI-style training frameworks.

The upcoming work may include the following changes:

Support for an IP-style hostfile
Using pod IPs instead of svc hostnames improves efficiency and, for users who wish to wrap the hostfile into an environment variable, prevents its length from exceeding the environment variable limit. A sketch of the two styles follows.
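As a rough illustration, here is what the two hostfile styles might look like, assuming the usual `slots=` hostfile syntax shared by OpenMPI and DeepSpeed; the service names and IPs below are placeholders:

```
# svc-style: one DNS name per worker, resolved through the headless service
deepspeed-worker-0.deepspeed.default.svc slots=8
deepspeed-worker-1.deepspeed.default.svc slots=8

# IP-style: shorter lines, no DNS resolution, cheaper to embed in an env var
10.0.1.17 slots=8
10.0.1.18 slots=8
```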

Support for fault tolerance and elasticity
This is quasi-fault tolerance, since the NCCL communicators must always be recreated when an error occurs. It is still worth implementing, however, because recreating pods can be costly at very large scale. A sketch follows below.
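A minimal sketch of the launcher-side behavior this could enable, written in Go since that is the operator's language: on failure, rebuild the hostfile from the workers that are still alive and relaunch, rather than recreating pods. The paths, the mpirun command line, and the refreshHostfile helper are all hypothetical, not the proposed implementation:

```go
package main

import (
	"log"
	"os"
	"os/exec"
	"time"
)

// refreshHostfile is a hypothetical helper: a real version would query the
// Kubernetes API for Ready worker pods and write one "ip slots=N" line each.
func refreshHostfile(path string) error {
	return nil
}

func main() {
	const maxRetries = 3
	for attempt := 1; attempt <= maxRetries; attempt++ {
		// Drop failed workers from the hostfile before each (re)launch.
		if err := refreshHostfile("/etc/mpi/hostfile"); err != nil {
			log.Fatalf("refresh hostfile: %v", err)
		}
		cmd := exec.Command("mpirun", "--hostfile", "/etc/mpi/hostfile", "python", "train.py")
		cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
		if err := cmd.Run(); err == nil {
			return // training finished
		}
		// NCCL communicators cannot survive the error, so the job is
		// relaunched, but the surviving pods are reused as-is.
		log.Printf("attempt %d failed; relaunching on surviving workers", attempt)
		time.Sleep(10 * time.Second)
	}
	log.Fatal("training failed after retries")
}
```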

Configuration decoupling
Some requirements, such as ssh_config and sshd_config, are currently left to the Docker image maker to handle. Perhaps the operator could manage all of these; an illustrative configuration is shown below.
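For reference, these are the kinds of settings image makers typically have to bake into the image today, which the operator could instead inject (for example via a ConfigMap); the values below are illustrative, not a proposed default:

```
# ssh_config (launcher side)
Host *
    Port 2222
    StrictHostKeyChecking no
    UserKnownHostsFile /dev/null

# sshd_config (worker side)
Port 2222
PasswordAuthentication no
```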

There are also some minor changes under consideration. Please feel free to share your thoughts on this topic.
