[BUG] Incorrect type check in engine.py for CPU training  #3837

@learning-chip

Description

Describe the bug

The type check inside the function split_half_float_double_sparse in runtime/engine.py does not recognize CPU tensors: it checks for the string "torch.cpu.FloatTensor", while the type string of an actual CPU tensor is "torch.FloatTensor" (there is no "cpu" component in the name).

https://github.com/microsoft/DeepSpeed/blob/fc9e1ee00e673b2ce2a433c4e34c5440e76c9f3e/deepspeed/runtime/engine.py#L115-L124
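A minimal sketch of how the bucketing/assertion logic could be fixed, assuming the check is a simple membership test on tensor type strings. This operates on plain strings (not real tensors) so it runs without torch installed; the function name and list contents here are illustrative, not the exact DeepSpeed code:

```python
# Type strings reported by tensor.type(): CUDA tensors include "cuda",
# but CPU tensors have NO device component (e.g. "torch.FloatTensor").
SUPPORTED_TYPES = [
    "torch.cuda.HalfTensor", "torch.HalfTensor",
    "torch.cuda.FloatTensor", "torch.FloatTensor",
    "torch.cuda.DoubleTensor", "torch.DoubleTensor",
    "torch.cuda.BFloat16Tensor", "torch.BFloat16Tensor",
]

def split_by_grad_type(type_strings):
    """Group gradients (represented here by their type strings) into
    per-dtype buckets, asserting each type is supported."""
    buckets = {}
    for t in type_strings:
        assert t in SUPPORTED_TYPES, \
            f"attempting to reduce an unsupported grad type: {t}"
        buckets.setdefault(t, []).append(t)
    return buckets
```

With "torch.FloatTensor" in the supported list (instead of the nonexistent "torch.cpu.FloatTensor"), CPU gradients pass the assertion.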

To Reproduce

Enable the Intel CPU backend of DeepSpeed:

pip install torch==1.13.1+cpu
pip install deepspeed==0.9.5
pip install intel_extension_for_pytorch==1.13+cpu -f https://developer.intel.com/ipex-whl-stable-cpu
pip install oneccl_bind_pt==1.13+cpu -f https://developer.intel.com/ipex-whl-stable-cpu
# also need to build oneCCL itself

Modify small_model_debugging/test_model.py to work on CPU:

  • set "torch_adam": True in config_dict to skip op builder, and set "enabled": False for "fp16"
  • set dtype=torch.float for train_data to avoid half precision

Running the script leads to the error AssertionError: attempting to reduce an unsupported grad type: torch.FloatTensor. Removing the assertion (lines 123-124 in DeepSpeed 0.9.5) lets it train fine.
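The failure mode can be reproduced without torch by comparing the string a CPU tensor actually reports against the string the buggy check looks for (the list below is a reduced illustration, not the full list from engine.py):

```python
# What the buggy check looks for vs. what torch actually reports.
buggy_supported = ["torch.cuda.FloatTensor", "torch.cpu.FloatTensor"]
fixed_supported = ["torch.cuda.FloatTensor", "torch.FloatTensor"]

# tensor.type() for a CPU float tensor returns "torch.FloatTensor".
cpu_type = "torch.FloatTensor"

print(cpu_type in buggy_supported)  # False -> the assertion fires
print(cpu_type in fixed_supported)  # True  -> CPU training proceeds
```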

Labels: bug (Something isn't working), training
