
[BUG] MP-sharded checkpoint loading does not work for models except BLOOM #2442

Closed
pai4451 opened this issue Oct 24, 2022 · 4 comments
Labels: enhancement (New feature or request), inference


pai4451 commented Oct 24, 2022

Describe the bug

We want to run inference on the EleutherAI/gpt-j-6B model with tensor parallelism on multiple GPUs, similar to what is already done for BLOOM. But the way DeepSpeed inference saves and loads pre-sharded checkpoints does not seem consistent or general enough to cover models other than BLOOM.

To Reproduce

I tried using the DeepSpeed inference script for BLOOM and modifying lines 140-141 to

model = GPTJForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B", revision="float16", torch_dtype=torch.float16, low_cpu_mem_usage=True
)

and line 100 to

model = deepspeed.init_inference(
    model,
    mp_size=world_size,
    base_dir=repo_root,
    dtype=getattr(torch, infer_dtype),
    save_mp_checkpoint_path="<some path to save mp checkpoint>",  # write the TP-sharded checkpoints here on the first run
    **kwargs,
)

After the first run on my 2x A6000 server, I was able to get the tensor-parallel sharded checkpoints under <some path to save mp checkpoint>, together with the configuration file ds_inference_config.json shown below

{"type": "ds_model",
"base_dir": <some path to save mp checkpoint>, 
"checkpoints": {"non-tp":["non-tp.pt"], "tp":["tp_00_00.pt", "tp_01_00.pt", "tp_00_01.pt", "tp_01_01.pt", 
    "tp_00_02.pt", "tp_01_02.pt", "tp_00_03.pt", "tp_01_03.pt", "tp_00_04.pt", "tp_01_04.pt",
    , "tp_00_05.pt", "tp_01_05.pt", "tp_00_06.pt", "tp_01_06.pt", "tp_01_07.pt", "tp_01_07.pt"]},
"version": 1.0, 
"parallelization": "tp", 
"tp_size": 2,
"dtype": "float16}

For the second run, I reverted the changes to lines 140-141, removed save_mp_checkpoint_path, and passed checkpoint=<some path to save mp checkpoint>/ds_inference_config.json to deepspeed.init_inference. This is the standard way to load the pre-sharded model for BLOOM, and it speeds up the loading process considerably.
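A rough sketch of that second-run call (untested; the path is a placeholder, and the meta-tensor initialization is borrowed from the BLOOM script rather than copied from my exact modifications):

import os

import deepspeed
import torch
from transformers import AutoConfig, AutoModelForCausalLM

world_size = int(os.getenv("WORLD_SIZE", "1"))
checkpoint_json = "<some path to save mp checkpoint>/ds_inference_config.json"

# Build only the model structure on meta tensors (as the BLOOM script does),
# so no full copy of the weights is materialized before the shards are loaded.
config = AutoConfig.from_pretrained("EleutherAI/gpt-j-6B")
with deepspeed.OnDevice(dtype=torch.float16, device="meta"):
    model = AutoModelForCausalLM.from_config(config, torch_dtype=torch.float16)

# Let init_inference load the pre-sharded checkpoints described in the JSON.
model = deepspeed.init_inference(
    model,
    mp_size=world_size,
    dtype=torch.float16,
    checkpoint=checkpoint_json,
    replace_with_kernel_inject=True,
)

However, this call raises the following error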

AssertionError: ds_model checkpoint type is not supported

This comes from the code in DeepSpeed inference that picks a state_dict loader based on the checkpoint type in the JSON file; only BLOOM is special-cased there:

if 'bloom' in sd_type.lower():

def get_sd_loader(ckpt_list, checkpoint_engine, sd_type='Megatron', version=None):

I also tried changing the type in ds_inference_config.json to BLOOM, since BLOOM and Megatron are the only checkpoint types supported for JSON checkpoints, but then the following line causes an error

if child.weight.is_meta:

AttributeError: 'NoneType' object has no attribute 'is_meta'

Is the pre-sharded checkpoint loading feature limited to the BLOOM model only? How can I use tensor parallelism to split a single model across multiple GPUs?

Similar threads:
#2379
#2132

@pai4451 pai4451 added bug Something isn't working inference labels Oct 24, 2022
@RezaYazdaniAminabadi
Contributor

Hi @pai4451,

Thanks for pointing this out. I am going to work on this and send a PR for you to try for GPT-J. I just want to better understand how you use this feature on your side. If all you need is to run GPT-J with MP, DeepSpeed-Inference already supports that: you only need to initialize the model on CPU and call init_inference, which does the MP for you and runs the model on a multi-GPU setup. Please let me know if I am missing something here.
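Something along these lines (just an untested sketch; adjust the dtype and kwargs to your setup):

import os

import deepspeed
import torch
from transformers import GPTJForCausalLM

world_size = int(os.getenv("WORLD_SIZE", "1"))

# Load the full model on CPU first...
model = GPTJForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B", revision="float16", torch_dtype=torch.float16, low_cpu_mem_usage=True
)

# ...then let init_inference split it across the GPUs with tensor parallelism.
model = deepspeed.init_inference(
    model,
    mp_size=world_size,
    dtype=torch.float16,
    replace_with_kernel_inject=True,
)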
Thanks,
Reza

@RezaYazdaniAminabadi RezaYazdaniAminabadi self-assigned this Oct 28, 2022
@martincai martincai added enhancement New feature or request and removed bug Something isn't working labels Nov 11, 2022
@RezaYazdaniAminabadi
Contributor

Hi @pai4451,

Sorry for the delay. DeepSpeed-Inference is going through some reorganization, and we are working on a solution to support this feature.

Best,
Reza

@pai4451
Author

pai4451 commented Nov 12, 2022

Thanks @RezaYazdaniAminabadi, I am glad that the DeepSpeed team is working on this feature :D
By the way, I did try the method you suggested (loading on CPU first), but I'm not sure why I always run into CUDA OOM on my 2x A6000 GPUs with GPT-J 6B.

I saw a similar issue #2466, but it has been closed. I will try the latest DeepSpeed version to check whether I can initialize GPT-J on CPU first and then let init_inference handle the tensor parallelism.

@RezaYazdaniAminabadi
Contributor

RezaYazdaniAminabadi commented Nov 29, 2022

Hi @pai4451,

Can you please try this PR and let me know if it works for you?
Thanks,
Reza
