MPI environment variables are not set #6895

@fabiogeraci

Description

System Info
HPC cluster, Ubuntu 22.04, 2 nodes x 8 H100 GPUs

Scheduler: LSF

[tool.poetry.dependencies]
python = "^3.10"

importlib-metadata = { version = "~=1.0", python = "<3.8" }
tensorboard = "^2.16.2"
sge-data-package = {version = "", source = "sgedata"}
torch = "2.2.1"
torchvision = "0.17.1"
torchaudio = "2.2.1"
transformers = "4.42.0"
datasets = "2.18."
accelerate = "0.28.0"
deepspeed = "0.13.4"
safetensors = "0.4.2"
mpi4py = "^4.0.0"

module load cuda-12.1.1
module load ISG/experimental/fg12/openmpi/5.0.4-cuda12.1-lsf
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH

deepspeed \
    --hostfile=${HOSTFILE_PATH} \
    --launcher=OPENMPI \
    --launcher_args="-bind-to none -map-by slot --mca pml ob1 --oversubscribe --display-allocation --display-map" \
    --master_addr=${MASTER_ADDR} \
    --master_port=${_M_PORT} \
    --no_ssh_check \
    src/dna_mlm/runner.py
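
import os
import typing as tp


def setup_env_ranks() -> tp.Tuple[int, int, int]:
    """Copy Open MPI rank variables to the names DeepSpeed/PyTorch expect."""
    # Map MPI environment variables to those expected by DeepSpeed/PyTorch
    if 'OMPI_COMM_WORLD_LOCAL_RANK' in os.environ:
        os.environ['LOCAL_RANK'] = os.environ['OMPI_COMM_WORLD_LOCAL_RANK']
        os.environ['RANK'] = os.environ['OMPI_COMM_WORLD_RANK']
        os.environ['WORLD_SIZE'] = os.environ['OMPI_COMM_WORLD_SIZE']
    else:
        raise EnvironmentError(
            "MPI environment variables are not set. "
            "Ensure you are running the script with an MPI-compatible launcher."
        )
    # Return (local_rank, rank, world_size) to match the annotated return type
    return (int(os.environ['LOCAL_RANK']), int(os.environ['RANK']),
            int(os.environ['WORLD_SIZE']))


setup_env_ranks()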

The function should set the environment variables, but instead it raises the error.
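To narrow this down, it may help to see which launcher-related variables each process actually receives before the check runs. The following is a minimal diagnostic sketch, not part of the original script: the helper name `find_launcher_vars` and the prefix list are assumptions chosen for illustration, and which prefixes show up depends on the launcher and scheduler in use.

```python
import os
import typing as tp

# Illustrative prefixes only (assumption): Open MPI, PMIx/PMI, and LSF
# commonly export variables with these prefixes, but the exact set
# depends on the installation.
LAUNCHER_PREFIXES: tp.Tuple[str, ...] = ("OMPI_", "PMIX_", "PMI_", "LSB_")


def find_launcher_vars(
    env: tp.Mapping[str, str],
    prefixes: tp.Tuple[str, ...] = LAUNCHER_PREFIXES,
) -> tp.Dict[str, str]:
    """Return the subset of env whose names start with one of the prefixes."""
    return {name: value for name, value in env.items() if name.startswith(prefixes)}


if __name__ == "__main__":
    # Print what this process sees, sorted by name for easier comparison
    # across ranks.
    found = find_launcher_vars(os.environ)
    if not found:
        print("No launcher-related variables found in this process.")
    for name in sorted(found):
        print(f"{name}={found[name]}")
```

If the output contains no `OMPI_COMM_WORLD_*` entries, the process running the check was probably not started by `mpirun`, which would be consistent with the error above.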
