
[BUG] MP-sharded checkpoint loading does not work for models except BLOOM #2442

Closed
pai4451 opened this issue Oct 24, 2022 · 4 comments
Labels: enhancement (New feature or request), inference


pai4451 commented Oct 24, 2022

Describe the bug

We want to run inference on the EleutherAI/gpt-j-6B model with tensor parallelism on multiple GPUs, similar to what is already done for BLOOM. But the way DeepSpeed inference saves and loads pre-sharded checkpoints does not seem consistent or general enough to cover models other than BLOOM.

To Reproduce

I tried using the DeepSpeed inference script for BLOOM and modifying lines 140-141 to

model = GPTJForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B", revision="float16", torch_dtype=torch.float16, low_cpu_mem_usage=True
)

and line 100 to

model = deepspeed.init_inference(
    model,
    mp_size=world_size,
    base_dir=repo_root,
    dtype=getattr(torch, infer_dtype),
    save_mp_checkpoint_path="<some path to save mp checkpoint>",  # write the TP-sharded checkpoints here on the first run
    **kwargs,
)

After the first run on my 2x A6000 server, I was able to get the tensor-parallel sharded checkpoints under <some path to save mp checkpoint>, together with the configuration file ds_inference_config.json shown below

{"type": "ds_model",
"base_dir": <some path to save mp checkpoint>, 
"checkpoints": {"non-tp":["non-tp.pt"], "tp":["tp_00_00.pt", "tp_01_00.pt", "tp_00_01.pt", "tp_01_01.pt", 
    "tp_00_02.pt", "tp_01_02.pt", "tp_00_03.pt", "tp_01_03.pt", "tp_00_04.pt", "tp_01_04.pt",
    , "tp_00_05.pt", "tp_01_05.pt", "tp_00_06.pt", "tp_01_06.pt", "tp_01_07.pt", "tp_01_07.pt"]},
"version": 1.0, 
"parallelization": "tp", 
"tp_size": 2,
"dtype": "float16}

For the second run, I reverted the changes to lines 140-141, removed save_mp_checkpoint_path, and passed checkpoint=<some path to save mp checkpoint>/ds_inference_config.json to deepspeed.init_inference. This is the standard way to load the pre-sharded model for BLOOM, and it speeds up the loading process considerably.
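A rough sketch of that second-run call (untested; the path is a placeholder, and the meta-tensor initialization is borrowed from the BLOOM script rather than copied from my exact modifications):

import os

import deepspeed
import torch
from transformers import AutoConfig, AutoModelForCausalLM

world_size = int(os.getenv("WORLD_SIZE", "1"))
checkpoint_json = "<some path to save mp checkpoint>/ds_inference_config.json"

# Build only the model structure on meta tensors (as the BLOOM script does),
# so no full copy of the weights is materialized before the shards are loaded.
config = AutoConfig.from_pretrained("EleutherAI/gpt-j-6B")
with deepspeed.OnDevice(dtype=torch.float16, device="meta"):
    model = AutoModelForCausalLM.from_config(config, torch_dtype=torch.float16)

# Let init_inference load the pre-sharded checkpoints described in the JSON.
model = deepspeed.init_inference(
    model,
    mp_size=world_size,
    dtype=torch.float16,
    checkpoint=checkpoint_json,
    replace_with_kernel_inject=True,
)

However, this call raises the following error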

AssertionError: ds_model checkpoint type is not supported

This comes from the code in DeepSpeed inference that picks a state_dict loader based on the checkpoint type in the JSON file; only BLOOM is special-cased there:

if 'bloom' in sd_type.lower():

def get_sd_loader(ckpt_list, checkpoint_engine, sd_type='Megatron', version=None):

I also tried changing the type in ds_inference_config.json to BLOOM, since BLOOM and Megatron are the only checkpoint types supported for JSON checkpoints, but then the following line causes an error

if child.weight.is_meta:

AttributeError: 'NoneType' object has no attribute 'is_meta'

Is the pre-sharded checkpoint loading feature limited to the BLOOM model only? How can I use tensor parallelism to split a single model across multiple GPUs?

Similar threads:
#2379
#2132

@pai4451 pai4451 added bug Something isn't working inference labels Oct 24, 2022
@RezaYazdaniAminabadi
Contributor

Hi @pai4451,

Thanks for pointing this out. I am going to work on this and send a PR for you to try for GPT-J. I just want to better understand how you use this feature on your side. If all you need is to run GPT-J with MP, DeepSpeed-Inference already supports that: you only need to initialize the model on CPU and call init_inference, which does the MP for you and runs the model on a multi-GPU setup. Please let me know if I am missing something here.
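Something along these lines (just an untested sketch; adjust the dtype and kwargs to your setup):

import os

import deepspeed
import torch
from transformers import GPTJForCausalLM

world_size = int(os.getenv("WORLD_SIZE", "1"))

# Load the full model on CPU first...
model = GPTJForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B", revision="float16", torch_dtype=torch.float16, low_cpu_mem_usage=True
)

# ...then let init_inference split it across the GPUs with tensor parallelism.
model = deepspeed.init_inference(
    model,
    mp_size=world_size,
    dtype=torch.float16,
    replace_with_kernel_inject=True,
)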
Thanks,
Reza

@RezaYazdaniAminabadi RezaYazdaniAminabadi self-assigned this Oct 28, 2022
@martincai martincai added enhancement New feature or request and removed bug Something isn't working labels Nov 11, 2022
@RezaYazdaniAminabadi
Contributor

Hi @pai4451,

Sorry for the delay. DeepSpeed-Inference is going through some reorganization, and we are working on a solution to support this feature.

Best,
Reza

@pai4451
Author

pai4451 commented Nov 12, 2022

Thanks @RezaYazdaniAminabadi, I am glad that the DeepSpeed team is working on this feature :D
By the way, I did try the method you suggested (loading on CPU first), but I'm not sure why I always run into CUDA OOM on my 2x A6000 GPUs with GPT-J 6B.

I saw a similar issue #2466, but it has been closed. I will try the latest DeepSpeed version to check whether I can initialize GPT-J on CPU first and then let init_inference handle the tensor parallelism.

@RezaYazdaniAminabadi
Contributor

RezaYazdaniAminabadi commented Nov 29, 2022

Hi @pai4451,

Can you please try this PR and let me know if it works for you?
Thanks,
Reza
