Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG][0.6.7] garbage output for multi-gpu with tutorial #2113

Closed
lanking520 opened this issue Jul 19, 2022 · 9 comments
Closed

[BUG][0.6.7] garbage output for multi-gpu with tutorial #2113

lanking520 opened this issue Jul 19, 2022 · 9 comments
Assignees
Labels
bug Something isn't working inference

Comments

@lanking520
Copy link

lanking520 commented Jul 19, 2022

Describe the bug
When running GPU = 2 started to see garbage output generated.

[{'generated_text': 'DeepSpeed is����極��極\\\\\\\\\\ \n\nの (  (  "\n090 nodot\x0c �\n �$, "\xa0\n\n \n\n \\ �\n �\n\n � �\n �osa\n\n � oldaran � � �aran======\\'}

To Reproduce
I am running with 2 GPU instance with V100, also reproducible using A100.

Just follow this example: https://www.deepspeed.ai/tutorials/inference-tutorial/

# Filename: gpt-neo-2.7b-generation.py
import os
import deepspeed
import torch
from transformers import pipeline

local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))
generator = pipeline('text-generation', model='EleutherAI/gpt-neo-2.7B',
                     device=local_rank)



generator.model = deepspeed.init_inference(generator.model,
                                           mp_size=world_size,
                                           dtype=torch.float,
                                           replace_method='auto',
					   replace_with_kernel_inject=True)

string = generator("DeepSpeed is", do_sample=True, min_length=50)
if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
    print(string)
deepspeed --num_gpus 2 gpt-neo-2.7b-generation.py

Expected behavior

Should be just normal?

ds_report output

# ds_report
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
async_io ............... [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.8/dist-packages/torch']
torch version .................... 1.11.0+cu113
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 11.3
deepspeed install path ........... ['/usr/local/lib/python3.8/dist-packages/deepspeed']
deepspeed info ................... 0.6.7, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.11, cuda 11.3

Screenshots

f': False, 'rotate_every_two': True, 'return_tuple': True, 'mlp_after_attn': True, 'specialized_mode': False, 'training_mp_size': 1, 'bigscience_bloom': False}
DeepSpeed Transformer Inference config is  {'layer_id': 29, 'hidden_size': 2560, 'intermediate_size': 10240, 'heads': 20, 'num_hidden_layers': -1, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 2, 'q_int8': False, 'scale_attention': True, 'triangular_masking': True, 'local_attention': True, 'window_size': 256, 'rotary_dim': -1, 'rotate_half': False, 'rotate_every_two': True, 'return_tuple': True, 'mlp_after_attn': True, 'specialized_mode': False, 'training_mp_size': 1, 'bigscience_bloom': False}
DeepSpeed Transformer Inference config is  {'layer_id': 30, 'hidden_size': 2560, 'intermediate_size': 10240, 'heads': 20, 'num_hidden_layers': -1, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 2, 'q_int8': False, 'scale_attention': True, 'triangular_masking': True, 'local_attention': False, 'window_size': 256, 'rotary_dim': -1, 'rotate_half': False, 'rotate_every_two': True, 'return_tuple': True, 'mlp_after_attn': True, 'specialized_mode': False, 'training_mp_size': 1, 'bigscience_bloom': False}
DeepSpeed Transformer Inference config is  {'layer_id': 30, 'hidden_size': 2560, 'intermediate_size': 10240, 'heads': 20, 'num_hidden_layers': -1, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 2, 'q_int8': False, 'scale_attention': True, 'triangular_masking': True, 'local_attention': False, 'window_size': 256, 'rotary_dim': -1, 'rotate_half': False, 'rotate_every_two': True, 'return_tuple': True, 'mlp_after_attn': True, 'specialized_mode': False, 'training_mp_size': 1, 'bigscience_bloom': False}
DeepSpeed Transformer Inference config is  {'layer_id': 31, 'hidden_size': 2560, 'intermediate_size': 10240, 'heads': 20, 'num_hidden_layers': -1, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 2, 'q_int8': False, 'scale_attention': True, 'triangular_masking': True, 'local_attention': True, 'window_size': 256, 'rotary_dim': -1, 'rotate_half': False, 'rotate_every_two': True, 'return_tuple': True, 'mlp_after_attn': True, 'specialized_mode': False, 'training_mp_size': 1, 'bigscience_bloom': False}
DeepSpeed Transformer Inference config is  {'layer_id': 31, 'hidden_size': 2560, 'intermediate_size': 10240, 'heads': 20, 'num_hidden_layers': -1, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 2, 'q_int8': False, 'scale_attention': True, 'triangular_masking': True, 'local_attention': True, 'window_size': 256, 'rotary_dim': -1, 'rotate_half': False, 'rotate_every_two': True, 'return_tuple': True, 'mlp_after_attn': True, 'specialized_mode': False, 'training_mp_size': 1, 'bigscience_bloom': False}
[2022-07-19 23:58:00,131] [INFO] [engine.py:144:__init__] Place model to device: 1
[2022-07-19 23:58:00,153] [INFO] [engine.py:144:__init__] Place model to device: 0
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
[{'generated_text': 'DeepSpeed is����極��極\\\\\\\\\\ \n\nの (  (  "\n090 nodot\x0c �\n �$, "\xa0\n\n \n\n \\ �\n �\n\n � �\n �osa\n\n � oldaran � � �aran======\\'}]
[2022-07-19 23:58:04,674] [INFO] [launch.py:210:main] Process 811 exits successfully.
[2022-07-19 23:58:05,675] [INFO] [launch.py:210:main] Process 810 exits successfully.

System info (please complete the following information):

  • OS:Ubuntu 20.04
  • GPU count and types 2 V100
  • Python version 3.8
@lanking520
Copy link
Author

This is also reproducible on GPT-J-6B model if you simply switch it

@zcrypt0
Copy link

zcrypt0 commented Jul 21, 2022

I am also seeing this, but not with every model. I do see it when using the tutorial model as well though.

@jeffra
Copy link
Collaborator

jeffra commented Jul 25, 2022

Thank you for reporting this! I've verified we can repro this on our side as well, but only when using >1 gpus. There's a gap currently in our CI tests for multi-gpu and certain models. We'll be fixing as soon as possible.

@RezaYazdaniAminabadi
Copy link
Contributor

RezaYazdaniAminabadi commented Aug 9, 2022

Hi @zcrypt0 and @lanking520 ,

Sorry for my delay! I just pushed a fix for this. Could you please try to see if the issue is fixed?
Thanks,
Reza

@zcrypt0
Copy link

zcrypt0 commented Aug 9, 2022

@RezaYazdaniAminabadi

I installed from your PR commit.
pip install git+https://github.com/microsoft/DeepSpeed@73fc0303bf723386df95be0e55259197e540506e

With the bigscience/bloom-350m model I don't see any change in the output, it still doesn't make any sense.

In fact, with that model, i see the issue even when using --num_gpus 1

I also tested the script that @lanking520 posted and I get the following error:

venv/lib/python3.8/site-packages/deepspeed/ops/transformer/inference/transformer_inference.py", line 415, in selfAttention_fp
    qkv_out = qkv_func(
RuntimeError: Fail to create cublas handle.

I double checked by reverting the deepspeed installation to master and the test script still gives that error, so it's possible its something in my environment, although other models seem to work.

@RezaYazdaniAminabadi
Copy link
Contributor

@zcrypt0 I think this must be related to some issue on your CUDA driver/library, since you even did not pass the first phase of creating a CUBLAS handle. Could you please try reinstalling them?
Thanks,
Reza

@zcrypt0
Copy link

zcrypt0 commented Aug 10, 2022

@RezaYazdaniAminabadi
I just saw #2194 and am thinking my issue may be related to that as I ran on 1080tis.

I am going to test out this script on a set of ampere gpus and see how it goes.

EDIT: I installed from master and ran the script on an 2xA100s. This was the output.

[{'generated_text': 'DeepSpeed is a software house that makes software that solves very hard problems\n\n"Why we do what we do"\n\nIn most cases, Fastest.fm\'s original business plan was to monetize the\ncontent its users provided. This'}]

@mallorbc
Copy link

I was also getting junk output following the tutorial. I can confirm that after building DeepSpeed from master that the issue seems resolved from GPT Neo 2.7B.

I am however having another issue with regard to memory usage. Even when I specify torch.half(or torch.float16), the model seems to use the full VRAM on both GPUS. For example, running GPTJ on dual 3090s leads to OOM issues with usage over 24GB on each.

Also, and perhaps I am misunderstanding the use of this tool, but isnt the VRAM usage supposed to be split over the multiple GPUs? So I would expect roughly 6-7GB usage per GPU rather than 24GB for each.

I give more details here #2227

@jeffra
Copy link
Collaborator

jeffra commented Dec 2, 2022

Closing, the original issue is resolved and new issue is moved to #2227

@jeffra jeffra closed this as completed Dec 2, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working inference
Projects
None yet
Development

No branches or pull requests

5 participants