Ds-inference Int8 support through ZeroQuant technology #2217

RezaYazdaniAminabadi · 2022-08-13T01:14:23Z

This PR adds Int8 support using the ZeroQuant technology introduced here. Note that the kernels added in this PR only dequantize the int8 weight matrices; the GeMM operation is still done in FP16 using FP16 tensor-core ops. However, the kernels mentioned in this paper can operate directly on the int8 data, using the int8 tensor cores where available (such as on A100), which can improve GeMM throughput by 2x. Another PR will use these kernels to give much faster inference through the MII system.
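To make the "dequantize, then run the GeMM in floating point" flow concrete, here is a minimal pure-Python sketch. It assumes symmetric per-group quantization with one scale per group; the function names (`quantize_groups`, `dequantize_groups`, `matvec`) are hypothetical and are not the kernels or APIs added by this PR.

```python
# Sketch only: symmetric per-group INT8 weight quantization, followed by
# dequantization and an ordinary floating-point matrix-vector product.
# All names here are illustrative assumptions, not DeepSpeed APIs.

def quantize_groups(weights, num_groups):
    """Quantize a flat list of floats to int8 range, one scale per group."""
    if len(weights) % num_groups != 0:
        raise ValueError("weights must split evenly into groups")
    group_size = len(weights) // num_groups
    q_values, scales = [], []
    for g in range(num_groups):
        group = weights[g * group_size:(g + 1) * group_size]
        max_abs = max(abs(w) for w in group)
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        scales.append(scale)
        # Round to the nearest integer and clamp to the symmetric int8 range.
        q_values.extend(max(-127, min(127, round(w / scale))) for w in group)
    return q_values, scales

def dequantize_groups(q_values, scales):
    """Map int8 values back to floats using the scale of their group."""
    group_size = len(q_values) // len(scales)
    return [q * scales[i // group_size] for i, q in enumerate(q_values)]

def matvec(rows, vector):
    """Plain floating-point matrix-vector product (stands in for the GeMM)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in rows]

# A 2x4 weight matrix stored row-major, quantized with one group per row.
weights = [0.50, -0.25, 0.10, 0.00, 2.00, -1.00, 0.75, 0.30]
q, scales = quantize_groups(weights, num_groups=2)
w_hat = dequantize_groups(q, scales)
y = matvec([w_hat[0:4], w_hat[4:8]], [1.0, 1.0, 1.0, 1.0])
max_error = max(abs(a - b) for a, b in zip(weights, w_hat))
```

In this sketch the weights are stored at 8 bits, but the arithmetic in `matvec` still runs on the dequantized floats, which is the trade-off the description above points out.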

Here is some performance evaluation:

FP16 on 8 A100-80G:

bsz	latency	tput	tput-per-GPU	Tflops/GPU
1	60.8	16.5	2.1	0.7
8	58.7	136.2	17.0	6.0
32	64.6	495.5	61.9	21.8
64	68.4	936.2	117.0	41.2
100	78.4	1274.7	159.3	56.1

INT8 on 4 A100-80G:

bsz	latency	tput	tput-per-GPU	Tflops/GPU
1	160.5	6.2	1.6	0.5
8	165.1	48.5	12.1	4.3
32	178.7	179.1	44.8	15.8
64	190.7	335.6	83.9	29.5
100	213.0	469.4	117.4	41.3

INT8 on 8 A100-80G:

bsz	latency	tput	tput-per-GPU	Tflops/GPU
1	92.9	10.8	1.3	0.5
8	95.5	83.7	10.5	3.7
32	107.5	297.6	37.2	13.1
64	116.5	549.6	68.7	24.2
128	133.4	959.3	119.9	42.2
196	166.7	1175.5	146.9	51.7
210	170.3	1232.8	154.1	54.2
240	177.9	1349.1	168.6	59.4
256	181.8	1408.2	176.0	62.0
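As a reading aid for the tables, the tput-per-GPU column appears to be the aggregate tput divided by the number of GPUs. The check below uses rows copied from the tables; the helper name `tput_per_gpu` is made up for illustration, and the interpretation of the columns is an assumption, since the post does not state units.

```python
# Assumption: "tput" is aggregated over all GPUs, so the per-GPU column
# should equal tput / num_gpus (up to rounding in the table).

def tput_per_gpu(tput, num_gpus):
    return tput / num_gpus

fp16_8gpu = tput_per_gpu(1274.7, 8)  # FP16, 8 GPUs, bsz 100 (table: 159.3)
int8_4gpu = tput_per_gpu(469.4, 4)   # INT8, 4 GPUs, bsz 100 (table: 117.4)
int8_8gpu = tput_per_gpu(1408.2, 8)  # INT8, 8 GPUs, bsz 256 (table: 176.0)

# Per-GPU throughput of INT8 on 4 GPUs relative to FP16 on 8 GPUs at bsz 100.
relative = int8_4gpu / fp16_8gpu
```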

cc: @stas00 @jeffra

sdpmas · 2022-08-19T14:21:42Z

Hey @RezaYazdaniAminabadi, I have been waiting for this PR for a long time and recently tried out your branch for ZeroQuant (with GPT-J). I found a couple of issues: the quantizer here seems to be used without being initialized. I tried using GroupQuantizer and it works, but during the forward pass, why is it not doing selfAttention_int8 here? I get "!!!! kernel execution error." at that point.

RezaYazdaniAminabadi · 2022-08-19T16:06:10Z

Hey @sdpmas
Thanks for letting me know of this issue. I am actually excited to hear someone is using it :) The thing is that we are using FP16 GeMM for now and we have a plan to release the INT8 kernels soon. I will check this and resolve the issue.
Thanks,
Reza

RezaYazdaniAminabadi · 2022-08-19T16:09:08Z

@sdpmas, btw, can you please paste the whole trace here to help me debug this better? Thanks

sdpmas · 2022-08-19T16:25:31Z

yea, sure, here's what the trace looks like:

***************** Creating model in RANK (0) with WORLD_SIZE = 1 *****************
[2022-08-19 16:22:31,093] [INFO] [logging.py:68:log_dist] [Rank -1] DeepSpeed info: version=0.7.1+84e0d03b, git-hash=84e0d03b, git-branch=ds-inference/ZeroQuant-Int8
[2022-08-19 16:22:31,094] [INFO] [logging.py:68:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
Using /home/samip_dahal/.cache/torch_extensions/py37_cu113 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/samip_dahal/.cache/torch_extensions/py37_cu113/transformer_inference/build.ninja...
Building extension module transformer_inference...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.22836589813232422 seconds
[2022-08-19 16:22:35,624] [INFO] [logging.py:68:log_dist] [Rank -1] DeepSpeed-Inference config: {'layer_id': 0, 'hidden_size': 2560, 'intermediate_size': 10240, 'heads': 32, 'num_hidden_layers': -1, 'fp16': True, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 1, 'q_int8': True, 'scale_attention': True, 'triangular_masking': True, 'local_attention': False, 'window_size': 1, 'rotary_dim': 64, 'rotate_half': False, 'rotate_every_two': True, 'return_tuple': True, 'mlp_after_attn': False, 'mlp_act_func_type': <ActivationFuncType.GELU: 1>, 'specialized_mode': False, 'training_mp_size': 1, 'bigscience_bloom': False}
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
!!!! kernel execution error. (m: 2560, n: 22, k: 10240, error: 13) 
Traceback (most recent call last):
  File "deep/infer.py", line 53, in <module>
    output=model.generate(**inp_tokens,max_length=150,min_length=100)
  File "/home/samip_dahal/.deep/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/home/samip_dahal/.deep/lib/python3.7/site-packages/transformers/generation_utils.py", line 1303, in generate
    **model_kwargs,
  File "/home/samip_dahal/.deep/lib/python3.7/site-packages/transformers/generation_utils.py", line 1693, in greedy_search
    output_hidden_states=output_hidden_states,
  File "/home/samip_dahal/.deep/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1120, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/home/samip_dahal/.deep/lib/python3.7/site-packages/transformers/models/gptj/modeling_gptj.py", line 832, in forward
    return_dict=return_dict,
  File "/home/samip_dahal/.deep/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/samip_dahal/.deep/lib/python3.7/site-packages/transformers/models/gptj/modeling_gptj.py", line 682, in forward
    output_attentions=output_attentions,
  File "/home/samip_dahal/.deep/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/samip_dahal/deep/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 872, in forward
    output = output.to(input_type)
RuntimeError: CUDA error: misaligned address
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

and my code looks something like this:

model = deepspeed.init_inference(model,
                                 mp_size=world_size,
                                 dtype=torch.int8,
                                 replace_method='auto',
                                 replace_with_kernel_inject=True,
                                 quantization_setting=1)

model.cuda().to(f'cuda:{local_rank}')
inp_tokens = tokenizer("#function to determine if a number is prime and do a lot of things related to fibonacci\nimport", return_tensors="pt")
for token in inp_tokens:
    if torch.is_tensor(inp_tokens[token]):
        inp_tokens[token] = inp_tokens[token].to(f'cuda:{local_rank}')
output = model.generate(**inp_tokens, max_length=150, min_length=100)
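The trace above suggests passing CUDA_LAUNCH_BLOCKING=1, because the reported Python line may not be where the kernel actually failed. A minimal way to do that from inside a script is sketched below, on the assumption that the variable has to be set before CUDA is initialized, so it goes before the framework import.

```python
import os

# Set before the first CUDA call (simplest: before importing torch), so that
# kernel launches run synchronously and errors surface at the failing call.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # import the framework only after the variable is set
```

Setting the variable on the command line when launching the script should have the same effect.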

sdpmas · 2022-08-19T16:27:36Z

@RezaYazdaniAminabadi any estimates on when you will release int8 kernels? that would be really helpful. I've been trying to speed up generation for GPT-J and ZeroQuant seems to be the way to go!

mayank31398 · 2022-08-31T04:15:28Z

@pai4451 this has been fixed.
Please reinstall deepspeed from master branch
This was a bug in loading checkpoints

pai4451 · 2022-08-31T04:24:23Z

@mayank31398 I did install the latest master branch of DeepSpeed (you can see the output of my pip freeze). The first time, I initialize DeepSpeed with the save_mp_checkpoint_path keyword; on subsequent runs I load the checkpoints from my saved mp checkpoint path, and that is when I get this repetitive output. Am I doing something wrong in these steps? Should I always load from the checkpoint path downloaded from the hub?

pai4451 · 2022-08-31T09:10:32Z

@mayank31398

After upgrading to transformers==4.21.2, the repetitive output seems to be solved for this input (DeepSpeed is a machine learning framework). However, if I change to another input, the output of the model is still very repetitive. For example, this is another input I tried:

in=He has a
out=He has a lot of money.
He has a lot of money.
He has a lot of money.
He has a lot of money.
He has a lot of money.
He has a lot of money.
He has a lot of money.
…(repeated)

I can mitigate the repetitiveness by increasing the value of repetition_penalty in model.generate(), but I’m still wondering whether this repetitiveness from my ZeroQuant BLOOM is normal. Did you try other inputs as well before concluding that the issue has been fixed?
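For readers unfamiliar with repetition_penalty, here is a toy sketch of the rule it is commonly understood to apply: logits of tokens that were already generated are divided by the penalty when positive and multiplied by it when negative. The helper `apply_repetition_penalty` and the logit values are made up for illustration; this is not code from transformers or DeepSpeed.

```python
def apply_repetition_penalty(logits, generated_ids, penalty):
    """Return a copy of logits with already-generated tokens made less likely."""
    adjusted = list(logits)
    for token_id in set(generated_ids):
        score = adjusted[token_id]
        # Dividing a positive score or multiplying a negative one both lower it
        # when penalty > 1.
        adjusted[token_id] = score * penalty if score < 0 else score / penalty
    return adjusted

logits = [2.0, 1.5, -0.5, 0.1]   # hypothetical scores for a 4-token vocabulary
generated = [0, 0, 2]            # token 0 was already emitted twice

adjusted = apply_repetition_penalty(logits, generated, penalty=1.5)
best_before = max(range(len(logits)), key=logits.__getitem__)
best_after = max(range(len(adjusted)), key=adjusted.__getitem__)
```

With these made-up numbers, greedy selection moves from the repeated token 0 to token 1 once the penalty is applied, which is why raising the penalty reduces loops.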

RezaYazdaniAminabadi · 2022-08-31T16:26:19Z

generate_kwargs = {'min_length': 100, 'max_new_tokens': 100, 'do_sample': False} batch_size = 1
in = DeepSpeed is a machine learning framework out = DeepSpeed is a machine learning frameworkRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTCRTC This is the output for me for int8 model
@RezaYazdaniAminabadi This is what is generated for me. :) Sounds about right for a 176 B model

Hi @mayank31398 can you show me how you run this? Thanks, Reza

Hi @RezaYazdaniAminabadi,

I am also getting the repetitive output from the ZeroQuant Int8 version of bloom.
in=DeepSpeed is a machine learning framework
out=DeeSpeed is a machine learning framework for deep deep deep deep deep deep deep deep deep deep deep…(repeated)
My generate_kwargs are max_new_tokens=100, do_sample=False. I installed the latest DeepSpeed from master branch and the versions are
deepspeed==0.7.3+afdc7287
transformers==4.20.1
accelerate==0.12.0
The code I use is from bloom-ds-inference-tp-sharded.txt. Do you know how to solve it or where it comes from?

Hi @pai4451
Unfortunately, I don't see this issue with this input or with others. Can you please tell me how I can reproduce this?
Thanks,
Reza

RezaYazdaniAminabadi · 2022-08-31T18:55:05Z

I have tested this with the "He has a" input text, and it seems the repetition is not related to float16 vs. Int8 or DeepSpeed vs. HF. This is an issue with the greedy mode of text generation for this model. So, if you want to generate good-quality text, you can pass do_sample as True for now.

Here is the text I see when using fp16 version of checkpoint:

in=He has a                                                                                                                                                                                                                                                 
out=He has a very good point.                                                                                                 
I mean, you know, I don't know what I was thinking.                                                                                                                                                                                                         I mean, I just...                                                                                                                                                                                                                                           I just...                                                                                                                                                                                                                                                   
I just...                                                                                                                     
I just...                                                                                                                     
I just...                                                                                                                     
I just...                                                                                                                     
I just...                                                                                                                     
I just...                                                                                                                     
I just...                                                                                                                     
I just...                                                                                                                                                                                                                                                   
I just...                                                                                                                                                                                                                                                   
I just...                                                                                                                                                                                                                                                   
I just...                                                                                                                                                                                                                                                   
I just...                                                                                                                                                                                                                                                   
I just...                                                                                                                     
I just...                                                                                                                                                                                                                                                   
I just...                                                                                                                                                                                                                                                   
I just...                                                                                                                                                                                                                                                   
I just...                                                                                                                                                                                                                                                   
I just...                                                                                                                                                                                                                                                   
I just...                                                                                                                     
I just...                                                                                                                                                                                                                                                   
I just...                                                                                                                                                                                                                                                   
I just...                                                                                                                     
I just...                                                                                                                     
I just.

By passing do_sample as True, you resolve the repetition issue:

in=He has a
out=He has a big head, but...
- But there is something different.
- Yes.
- He can be a good guy.
- Yes.
You just need to find out who he is, what he likes...
So that I find out who he is, and what he likes?
I just got back to town.
I have things to take care of... that haven't been taken of for a while.
You mean your daughter.
Diana...
Don't try to read too much in what I say.
I'm not asking

Here is the INT8 version of the text:

in=He has a
out=He has a bad reputation now as a corrupt cop.I don't know if you know this.You don't know anything about my brother.He killed three of them and a cop, and he's got an attitude, just like you.
I'm on his side.
He wants to end the business.
So now, what about my brother?
Don't worry about that.
We want to know the truth, no matter who it is.
Are you telling me it's my brother?
Of course it's him, who else?
We know that
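The difference between greedy decoding and do_sample=True can be illustrated with a toy next-token table. The vocabulary and probabilities below are invented for the example and have nothing to do with BLOOM; the point is only that always taking the most likely token can lock into a cycle, while sampling can leave it.

```python
import random

# Hypothetical next-token distributions for a toy 3-token vocabulary.
NEXT = {
    "I": {"just": 0.6, "mean": 0.4},
    "just": {"I": 0.55, "mean": 0.45},
    "mean": {"I": 0.5, "just": 0.5},
}

def generate(start, steps, do_sample, seed=0):
    """Generate `steps` tokens after `start`, greedily or by sampling."""
    rng = random.Random(seed)
    tokens = [start]
    for _ in range(steps):
        dist = NEXT[tokens[-1]]
        if do_sample:
            choices, probs = zip(*dist.items())
            tokens.append(rng.choices(choices, weights=probs, k=1)[0])
        else:
            # Greedy: always take the single most likely next token.
            tokens.append(max(dist, key=dist.get))
    return tokens

greedy = generate("I", 8, do_sample=False)
sampled = generate("I", 8, do_sample=True)
print(" ".join(greedy))
print(" ".join(sampled))
```

In this toy table the greedy path alternates between "I" and "just" indefinitely, much like the "I just..." loop above, whereas the sampled path depends on the random seed.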

pai4451 · 2022-09-01T00:34:17Z

Thanks @RezaYazdaniAminabadi for your explanation. I will keep monitoring the output from FP16 and Int8. I guess both of them will return repetitive results in some cases. Thanks a lot!

xk503775229 · 2022-09-07T06:42:54Z

Is there any guide to running inference on compressed models (especially ZeroQuant)?
Any help would be appreciated.

stas00 · 2022-09-07T17:44:05Z

Please see the demo scripts for BLOOM Inference here:
https://github.com/bigscience-workshop/Megatron-DeepSpeed/tree/main/scripts/bloom-inference-scripts#deepspeed-inference
scroll down to --dtype int8

wanghaoshuang · 2022-09-14T07:18:40Z

@RezaYazdaniAminabadi
Will there be another PR for fusing the dequantization with the GeMM schedule and fusing the token-wise activation quantization with GELU?

zcrypt0 · 2022-09-16T20:18:58Z

When trying to generate int8 shards (from bigscience/bloom) on a 7xA6000 node I get the following error:

RuntimeError: The size of tensor a (6144) must match the size of tensor b (14336) at non-singleton dimension 1

Full Traceback

Traceback (most recent call last):
  File "t.py", line 402, in <module>
    model = deepspeed.init_inference(model,
  File "/home/ubuntu/venv/lib/python3.8/site-packages/deepspeed/__init__.py", line 305, in init_inference
    engine = InferenceEngine(model,
  File "/home/ubuntu/venv/lib/python3.8/site-packages/deepspeed/inference/engine.py", line 145, in __init__
    self._apply_injection_policy(
  File "/home/ubuntu/venv/lib/python3.8/site-packages/deepspeed/inference/engine.py", line 342, in _apply_injection_policy
    replace_transformer_layer(client_module,
  File "/home/ubuntu/venv/lib/python3.8/site-packages/deepspeed/module_inject/replace_module.py", line 865, in replace_transformer_layer
    load_model_with_checkpoint(
  File "/home/ubuntu/venv/lib/python3.8/site-packages/deepspeed/module_inject/load_checkpoint.py", line 229, in load_model_with_checkpoint
    load_module_recursive(r_module)
  File "/home/ubuntu/venv/lib/python3.8/site-packages/deepspeed/module_inject/load_checkpoint.py", line 224, in load_module_recursive
    load_module_recursive(
  File "/home/ubuntu/venv/lib/python3.8/site-packages/deepspeed/module_inject/load_checkpoint.py", line 224, in load_module_recursive
    load_module_recursive(
  File "/home/ubuntu/venv/lib/python3.8/site-packages/deepspeed/module_inject/load_checkpoint.py", line 222, in load_module_recursive
    layer_policies[child.__class__](child, prefix + name + '.')
  File "/home/ubuntu/venv/lib/python3.8/site-packages/deepspeed/module_inject/load_checkpoint.py", line 142, in load_transformer_layer
    module.attention.attn_qkvw = mp_replace.copy(module.attention.attn_qkvw,
  File "/home/ubuntu/venv/lib/python3.8/site-packages/deepspeed/module_inject/replace_module.py", line 109, in copy
    dst.data.copy_(weight_split.contiguous())
RuntimeError: The size of tensor a (6144) must match the size of tensor b (14336) at non-singleton dimension 1

My code looks like this (infer_dtype==int8):

model = deepspeed.init_inference(model,
                                 mp_size=world_size,
                                 dtype=getattr(torch, infer_dtype),
                                 checkpoint=init_inference_checkpoints_json,
                                 save_mp_checkpoint_path=save_mp_path,
                                 **kwargs,
                                 )

Is it enough to simply change the dtype parameter on init_inference() or is there some special set of parameters to generate quantized params?

EDIT: I am synced to the tip of master

/cc @stas00 @RezaYazdaniAminabadi

mayank31398 · 2022-09-16T21:30:16Z

@zcrypt0 you can't quantize just by specifying int8 as the dtype.
You can read ZeroQuant for details.
It might be compute heavy, not sure how long it takes.
For BLOOM-176B, Microsoft has released quantized weights

Not sure how these weights are quantized. But ZeroQuant also does layer distillation for quantization (we don't know what data Microsoft used for this). -> is this documentation available @RezaYazdaniAminabadi ?

This is an example to quantize gpt2

zcrypt0 · 2022-09-16T22:12:35Z

Much appreciated @mayank31398

I was hoping it was just magic, but I will read the paper.

When I try to use the quantized weights (microsoft/bloom-deepspeed-inference-int8) on a 7xA6000 node, I get a size mismatch error. I suspect it's because the weights are sharded for 8 gpus. Any insight on this?

The error (note this doesn't occur on an 8xA6000 node):
RuntimeError: The size of tensor a (6144) must match the size of tensor b (4608) at non-singleton dimension 0

Full Traceback

Traceback (most recent call last):
  File "t.py", line 422, in <module>
    model = deepspeed.init_inference(model,
  File "/home/ubuntu/venv/lib/python3.8/site-packages/deepspeed/__init__.py", line 305, in init_inference
    engine = InferenceEngine(model,
  File "/home/ubuntu/venv/lib/python3.8/site-packages/deepspeed/inference/engine.py", line 145, in __init__
    self._apply_injection_policy(
  File "/home/ubuntu/venv/lib/python3.8/site-packages/deepspeed/inference/engine.py", line 342, in _apply_injection_policy
    replace_transformer_layer(client_module,
  File "/home/ubuntu/venv/lib/python3.8/site-packages/deepspeed/module_inject/replace_module.py", line 895, in replace_transformer_layer
    load_model_with_checkpoint(replaced_module,
  File "/home/ubuntu/venv/lib/python3.8/site-packages/deepspeed/module_inject/load_checkpoint.py", line 229, in load_model_with_checkpoint
    load_module_recursive(r_module)
  File "/home/ubuntu/venv/lib/python3.8/site-packages/deepspeed/module_inject/load_checkpoint.py", line 224, in load_module_recursive
    load_module_recursive(
  File "/home/ubuntu/venv/lib/python3.8/site-packages/deepspeed/module_inject/load_checkpoint.py", line 224, in load_module_recursive
    load_module_recursive(
  File "/home/ubuntu/venv/lib/python3.8/site-packages/deepspeed/module_inject/load_checkpoint.py", line 222, in load_module_recursive
    layer_policies[child.__class__](child, prefix + name + '.')
  File "/home/ubuntu/venv/lib/python3.8/site-packages/deepspeed/module_inject/load_checkpoint.py", line 138, in load_transformer_layer
    load_parameters(child, prefix + n + '.')
  File "/home/ubuntu/venv/lib/python3.8/site-packages/deepspeed/module_inject/load_checkpoint.py", line 128, in load_parameters
    p.data.copy_(bias_split)
RuntimeError: The size of tensor a (6144) must match the size of tensor b (4608) at non-singleton dimension 0
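The numbers in these errors are consistent with simple shard arithmetic. The sketch below assumes a fused query/key/value projection of size 3 x 14336 that is split evenly across ranks; the helper `shard_size` is hypothetical, and this is only one plausible reading of the traceback, not a confirmed diagnosis.

```python
HIDDEN_SIZE = 14336           # "tensor b" in the first error message
QKV_DIM = 3 * HIDDEN_SIZE     # assumption: fused query/key/value projection

def shard_size(dim, world_size):
    """Per-rank slice length when `dim` is split evenly across ranks."""
    if dim % world_size != 0:
        raise ValueError(f"{dim} is not divisible by {world_size}")
    return dim // world_size

seven_way = shard_size(QKV_DIM, 7)   # destination slice on a 7-GPU node
eight_way = shard_size(QKV_DIM, 8)   # slice in a checkpoint sharded for 8 GPUs
print(seven_way, eight_way)
```

Under these assumptions the 7-GPU destination slice matches the 6144 reported as "tensor a", while weights pre-sharded for 8 GPUs have a different slice length, so they would not fit a 7-way layout without being merged and re-split.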

mayank31398 · 2022-09-17T10:09:02Z

I haven't tried on 7 GPUs. I can give it a shot.

@RezaYazdaniAminabadi

* fix pytest syntax error * fix another syntax error * update imports * use DistFixture with elastic checkpoint test * missing import * update to shared class tmpdir for elastic test * moved test files * avoid duplicate test file name * last refactor and moving test files * formatting * fix broken import * testing forked AMD tests * update abstract method * use blob storage for accelerate and transformers tests * upgrade torch for acclerate CI Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> * Fix the MLP output tensor's shape (microsoft#2380) * allow building with latest CUDA (11.8), it is backwards compatible (microsoft#2390) * pin transformers version for unit tests (microsoft#2402) * Change type to tuple in replace_wo_policy isinstance check (microsoft#2387) Update the isinstance check inside the `replace_wo_policy` function to `tuple` and `str` instead of `dict`, since the layers are provided as a `tuple` type. Co-authored-by: Lev Kurilenko <lekurile@microsoft.com> Co-authored-by: Molly Smith <mosm@microsoft.com> Co-authored-by: Lok Chand Koppaka <lokoppak@microsoft.com> Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com> * Checkpoint backwards-compatbility workaround (microsoft#2384) * Add predicated global load (microsoft#2373) Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com> * change call site of literal_device, on_accel_device and accel_runtime to get_accelerator() call * add new interface definition from olruwase/accelerator_abstraction * MII blog post (microsoft#2418) Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com> Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com> * Fix figure reference (microsoft#2419) * [docs] update news items * [docs] add mii repo link * Add SLURM Multinode Runner (microsoft#2404) Signed-off-by: Dashiell Stander <dstander@protonmail.com> Co-authored-by: Dashiell Stander <dashiell@ip-172-31-45-20.ec2.internal> Co-authored-by: Jeff Rasley <jerasley@microsoft.com> 
* Fix issue with corrupted output on long generation for GPT (microsoft#2359) Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com> Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> * MII blog title update on Readme * DeepSpeed-MII title change in website * Fix GPT Neo-X multi-gpu inference (microsoft#2401) Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com> Co-authored-by: Jeff Rasley <jerasley@microsoft.com> * MII-Public and MII-Azure subheading in mii post * CI fixes related to triton (microsoft#2422) * [docs] update mii blog title (microsoft#2423) * add SD injection policy (microsoft#2381) Co-authored-by: Reza Yazdani <reyazda@microsoft.com> Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com> * [accelerator abstraction] remove name() from interface, device_name() should be used. * merge with master (ec13da6) * fix checkpoint loading when it is a dictionary (microsoft#2425) * Make error regex more generic in collect_results.py (microsoft#2415) Co-authored-by: Jeff Rasley <jerasley@microsoft.com> * fixes microsoft#2389 (microsoft#2411) truncating expert param storage for checkpointing Co-authored-by: Alexander Jipa <azzhipa@amazon.com> Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com> * Fix for inference gpt-j test (microsoft#2430) * fix for gpt-j failing due to tokenizer error * limit number of gpt-j tokens generated due to low memory * Fixing bug 2361 (microsoft#2410) * fixing bug 2361 * adding pytest for config initialization * chaning expected output to FusedAdam * remove print statement * running yapf on modified files * running pre-commit formatting Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> * Universal checkpoint for zero stage 1 (microsoft#2284) * Refactor universal checkpointing and tensor fragments * Formatting * Support zero stage1; Expand TP dim * Remove debug prints * Detect sharded optimizer state * Format fixes * Encode 
reshaping guide * More symbolic constants Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com> * only add deps if extra is explictly called (microsoft#2432) * Add TestInjectionPolicy inference unittest class for testing custom injection policies (microsoft#2426) This PR adds a TestInjectionPolicy inference unittest class for testing custom injection policies. This test differs from the existing tests in that the injection_policy dictionary is explicitly specified when calling the DeepSpeed init_inference API. The google/t5-v1_1-small text2text-generation model and the roberta-large fill-mask model are added as tests with the injection policy explicitly specified. This is done to expand our unittest coverage to test the path where the replace_wo_policy function is invoked (see microsoftGH-2387). Co-authored-by: Lev Kurilenko <lekurile@microsoft.com> Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com> * [memory estimators] new config args sync (microsoft#2431) Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Jeff Rasley <jerasley@microsoft.com> * parallelize writing of layer checkpoint files across data parallel instances (microsoft#1419) * parallelize layer checkpoints across data parallel groups * use partition_uniform to determine start/end index values * formatting fix * config: add option for parallel write of layer checkpoints in pipeline stage * yapf fixes * enable parallel layer write according to config param * avoid extraneous makedir when rank 0 writes all layers Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> * Fix broken link to DeepSpeed Megatron fork (microsoft#2440) Co-authored-by: Lev Kurilenko <lekurile@microsoft.com> * bump to 0.7.5 * [OpBuilder] Add op builder abstraction * convert op builder usage in merged code * merge diff files from upstream * [OpBuilder] add create_op_builder interface in abstract_accelerator.py * remove files that is deleted from upstream * [OpBuilder] add left over op builder 
usage in tests * [OpBuilder] fix op builder usage in tests * [OpBuilder] fix <op builder>.NAME usage in tests to follow op builder abstraction design * import get_accelerator from deepspeed.accelerator directly * [OpBuilder] remove unused function and sync with main * add missing import * revert changes in device.py to avoid conflict with main * fix alexnet_model to use /tmp instead of /blob * Mingzhi/solve pr108 b (microsoft#115) * move ALL_OPs from __init__.py to all_Op.py to solve circular import * delete deepspeedexamples * fix import * fix regression (microsoft#117) * fix pin_memory * fix regression * fix error Signed-off-by: Dashiell Stander <dstander@protonmail.com> Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com> Co-authored-by: Mikhail Druzhinin <dipetm@gmail.com> Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Minjia Zhang <33713995+minjiaz@users.noreply.github.com> Co-authored-by: Jeff Rasley <jerasley@microsoft.com> Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com> Co-authored-by: Kamal Raj <kamalraj97@gmail.com> Co-authored-by: Conglong Li <conglong.li@gmail.com> Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com> Co-authored-by: Arash Bakhtiari <arashb@users.noreply.github.com> Co-authored-by: Reza Yazdani <reyazda@microsoft.com> Co-authored-by: Zhihong Chen <gdst_czh@163.com> Co-authored-by: Siddharth Singh <siddharth9820@gmail.com> Co-authored-by: Connor Holmes <connorholmes@microsoft.com> Co-authored-by: 叶志晟 <yzs981130@126.com> Co-authored-by: Molly Smith <112220543+molly-smith@users.noreply.github.com> Co-authored-by: trajep <trajepl@gmail.com> Co-authored-by: chenguo <chenguo@microsoft.com> Co-authored-by: Arash Bakhtiari <arash@bakhtiari.org> Co-authored-by: Stas Bekman <stas00@users.noreply.github.com> Co-authored-by: Quentin Anthony <qganthony@yahoo.com> Co-authored-by: anthony.301 <anthony.301@mri.cluster> Co-authored-by: Sam Ade Jacobs <samjacobs@microsoft.com> 
Co-authored-by: Guanhua Wang <alexwgh333@gmail.com> Co-authored-by: Saeyeol Lee <78332687+l4d2boomer@users.noreply.github.com> Co-authored-by: Saeyeol Lee <sylee@si-anlaytics.ai> Co-authored-by: Jean-Louis Queguiner <jean-louis.queguiner@gadz.org> Co-authored-by: Matt Smith <matt@mjksmith.com> Co-authored-by: Thomas-MMJ <112830596+Thomas-MMJ@users.noreply.github.com> Co-authored-by: lekurile <113481193+lekurile@users.noreply.github.com> Co-authored-by: Lev Kurilenko <lekurile@microsoft.com> Co-authored-by: Molly Smith <mosm@microsoft.com> Co-authored-by: Lok Chand Koppaka <lokoppak@microsoft.com> Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com> Co-authored-by: Dashiell Stander <dstander@protonmail.com> Co-authored-by: Dashiell Stander <dashiell@ip-172-31-45-20.ec2.internal> Co-authored-by: Andrey Chernykh <andrew.chernyh@gmail.com> Co-authored-by: Alexander Jipa <alexander.jipa@gmail.com> Co-authored-by: Alexander Jipa <azzhipa@amazon.com> Co-authored-by: Joe Mayer <114769929+jomayeri@users.noreply.github.com> Co-authored-by: Adam Moody <moody20@llnl.gov> Co-authored-by: AGUL <mingzhi.liu@intel.com>

JingfengYang · 2022-12-22T10:41:53Z

I tried to run 8-bit quantized inference of BLOOM-176B on 8 A100 40GB GPUs, but encountered the error: "AttributeError: 'GroupQuantizer' object has no attribute 'num_groups'". I think it's because 'GroupQuantizer' does not initialize the 'num_groups' attribute. Could you please help fix it?
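
For readers unfamiliar with group-wise quantization, here is a minimal sketch of what a group quantizer does and why a missing `num_groups` attribute breaks it. This is not DeepSpeed's `GroupQuantizer`: the class name `ToyGroupQuantizer`, its methods, and the symmetric per-group scaling are assumptions chosen for illustration, written in plain Python so it runs without a GPU.

```python
class ToyGroupQuantizer:
    """Hypothetical stand-in for illustration; not DeepSpeed's GroupQuantizer."""

    def __init__(self, num_groups=1, num_bits=8):
        # Every attribute used later is set here, so no method can hit
        # "object has no attribute 'num_groups'".
        self.num_groups = num_groups
        self.num_bits = num_bits

    def quantize(self, values):
        """Symmetric quantization with one scale per contiguous group."""
        if not values or len(values) % self.num_groups != 0:
            raise ValueError("values must be non-empty and divisible by num_groups")
        group_size = len(values) // self.num_groups
        q_max = 2 ** (self.num_bits - 1) - 1  # 127 for 8 bits
        quantized, scales = [], []
        for g in range(self.num_groups):
            group = values[g * group_size:(g + 1) * group_size]
            max_abs = max(abs(v) for v in group)
            # Guard against an all-zero group to avoid dividing by zero.
            scale = max_abs / q_max if max_abs > 0 else 1.0
            scales.append(scale)
            quantized.extend(
                max(-q_max, min(q_max, round(v / scale))) for v in group
            )
        return quantized, scales

    def dequantize(self, quantized, scales):
        """Map integers back to floats using each group's scale."""
        group_size = len(quantized) // self.num_groups
        return [q * scales[i // group_size] for i, q in enumerate(quantized)]


quantizer = ToyGroupQuantizer(num_groups=2)
weights = [1.0, -2.0, 0.5, 4.0]
q_weights, q_scales = quantizer.quantize(weights)
restored = quantizer.dequantize(q_weights, q_scales)
```

Each group keeps its own scale, so the largest magnitude in a group maps to ±127 and the reconstruction error of any element is at most half of that group's scale.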

jeffra · 2022-12-22T22:13:49Z

@JingfengYang, I've reproduced this and will update this thread once we have a fix. Sorry for the inconvenience and thanks for reporting this to us as well.

JingfengYang · 2022-12-22T23:19:51Z

Thanks! Also, when I use an earlier commit of this repo, that error does not occur, but a new one does: "AttributeError: 'Parameter' object has no attribute 'scale'". FYI, I'm using this script: https://github.com/huggingface/transformers-bloom-inference/blob/main/bloom-inference-scripts/bloom-ds-inference.py, which is recommended on the official Hugging Face website, to run 8-bit quantized inference of BLOOM-176B on 8 A100 40GB GPUs.

jeffra · 2022-12-23T00:50:09Z

@JingfengYang, I think we've come up with a fix for the num_groups issue. I've pushed a PR #2645 that should fix this. I need to consult w. @RezaYazdaniAminabadi after the winter break to ensure I am not missing anything here, but feel free to give it a try on your side.

liangxiaoyun · 2023-03-08T04:30:12Z

Is this problem solved? I also encountered the same problem here.

hey @RezaYazdaniAminabadi I have been waiting for this PR for a long time and recently tried out your branch for ZeroQuant (with GPT-J). I found a couple of issues: quantizer here seems to have been used without initializing. I tried using GroupQuantizer and it works but during forward pass, why is it not doing selfAttention_int8 here. I get !!!! kernel execution error. at that point.

Tracin · 2023-03-22T11:21:47Z

@RezaYazdaniAminabadi After some simple modifications to the code, I ran my model in INT8 with low CUDA memory usage; I really appreciate this. I wonder when the full ZeroQuant, I mean the INT8 GEMM computation, will be released?

kiucho · 2024-07-06T05:58:50Z

Is there any guide on how to use ZeroQuant? And is the following example only about quantizing GPT-2 with ZeroQuant? If so, how can I run inference to see how much the latency improves?
https://github.com/microsoft/DeepSpeedExamples/tree/master/compression/gpt2

kiucho · 2024-07-06T09:52:55Z

Sorry for the extra comment; I'm leaving it because there's no specific guide for running inference with a ZeroQuant model.

https://github.com/microsoft/DeepSpeedExamples/blob/master/compression/gpt2/bash_script/run_zero_quant.sh
I quantized GPT-J 6B with ZeroQuant using the script above.
At the end of run_clm_no_trainer.py (https://github.com/microsoft/DeepSpeedExamples/blob/master/compression/gpt2/run_clm_no_trainer.py)
I appended the following code to benchmark the quantized model, based on gpt-bench.py (https://github.com/microsoft/DeepSpeedExamples/blob/master/benchmarks/inference/gpt-bench.py)

Setting dtype=torch.float16 works, but setting dtype=torch.int8 gives an error, so it seems that the Int8 kernel is not yet supported.
What I want to know is whether there's something wrong with the code I added, or whether the int8 kernel is supported but I'm not using it properly.

Thanks.

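    import time

    import deepspeed
    import torch
    from transformers import pipeline
    from deepspeed.accelerator import get_accelerator

    # `model`, `tokenizer`, and `args` are defined earlier in run_clm_no_trainer.py,
    # which this snippet is appended to.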

    def print_latency(latency_set, title, warmup=3):
        # trim warmup queries
        latency_set = list(latency_set)
        latency_set = latency_set[warmup:]
        count = len(latency_set)
        if count > 0:
            latency_set.sort()
            n50 = (count - 1) * 0.5 + 1
            n90 = (count - 1) * 0.9 + 1
            n95 = (count - 1) * 0.95 + 1
            n99 = (count - 1) * 0.99 + 1
            n999 = (count - 1) * 0.999 + 1

            avg = sum(latency_set) / count
            p50 = latency_set[int(n50) - 1]
            p90 = latency_set[int(n90) - 1]
            p95 = latency_set[int(n95) - 1]
            p99 = latency_set[int(n99) - 1]
            p999 = latency_set[int(n999) - 1]

            print(f"====== latency stats {title} ======")
            print("\tAvg Latency: {0:8.2f} ms".format(avg * 1000))
            print("\tP50 Latency: {0:8.2f} ms".format(p50 * 1000))
            print("\tP90 Latency: {0:8.2f} ms".format(p90 * 1000))
            print("\tP95 Latency: {0:8.2f} ms".format(p95 * 1000))
            print("\tP99 Latency: {0:8.2f} ms".format(p99 * 1000))
            print("\tP999 Latency: {0:8.2f} ms".format(p999 * 1000))

    deepspeed.init_distributed()
    dtype = torch.float16

    pipe = pipeline("text-generation", model=model, framework="pt", tokenizer=tokenizer)

    if True: # using deepspeed
        pipe.model = deepspeed.init_inference(
            pipe.model,
            dtype=dtype,
            tensor_parallel={"tp_size": 1},
            replace_with_kernel_inject=True,
        )
        pipe.model.profile_model_time()
    
    responses = []
    times = []
    mtimes = []
    for i in range(30):
        get_accelerator().synchronize()
        start = time.time()
        r = pipe("DeepSpeed is", do_sample=False, max_new_tokens=50)
        get_accelerator().synchronize()
        end = time.time()
        responses.append(r)
        times.append(end - start)  # / (args.max_tokens - 3))
        if True: # using deepspeed
            mtimes.append(sum(pipe.model.model_times()))
    if args.local_rank == 0:
        print_latency(times, "(e2e) latency")
        if True: # using deepspeed
            print_latency(mtimes, "(model-only) latency")
        print_latency(map(lambda t: t / (50 - 3), times), "(e2e) per token latency")
        print("RESPONSE 0:")
        print("-" * 30)
        print(responses[0][0]["generated_text"])
        print("-" * 30)
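
The percentile indexing in `print_latency` above can be checked in isolation. The sketch below reproduces the same rank arithmetic in a standalone helper; the name `nearest_rank` is hypothetical and not part of gpt-bench.py or DeepSpeed. It uses 27 integer samples to mirror the 30 timed iterations minus 3 warmup runs.

```python
def nearest_rank(sorted_latencies, quantile):
    """Return the element print_latency would report for this quantile."""
    count = len(sorted_latencies)
    if count == 0:
        raise ValueError("need at least one sample")
    # Same arithmetic as n50/n90/... above: a 1-based fractional rank,
    # truncated toward zero when it is used as an index.
    rank = (count - 1) * quantile + 1
    return sorted_latencies[int(rank) - 1]


# 27 samples: the 30 timed runs minus the 3 warmup runs that are trimmed.
samples = list(range(1, 28))
p50 = nearest_rank(samples, 0.5)
p99 = nearest_rank(samples, 0.99)
p999 = nearest_rank(samples, 0.999)
```

With only 27 samples, the P99 and P999 ranks truncate to the same index, so both lines of the report show the second-largest latency. Separating them would need many more iterations.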

Reza Yazdani added 2 commits August 9, 2022 04:02

Fix the layer-past for GPT based models

cf2fe01

add the Int8 support for ds-inference using ZeroQuant technology

c2cf304

RezaYazdaniAminabadi requested review from jeffra, samyam, tjruwase, ShadenSmith, conglongli, awan-10, cli99, eltonzheng, minjiaz, duli2012, mrwyattii, yaozhewei, arashb, xiaoxiawu-microsoft and samadejacobs as code owners August 13, 2022 01:14

Reza Yazdani and others added 6 commits August 15, 2022 07:37

fixing some issue with loading checkpoint and bias-add

d98f1f9

adding the logic to store/restore scale for INT8 checkpoint

ebc82bb

add empty quantization scale for different models to run with fp16

43a7023

Empty-Commit

00aa188

Merge branch 'master' into ds-inference/ZeroQuant-Int8

9bed645

fix sevral issues after merging with master

84e0d03

RezaYazdaniAminabadi mentioned this pull request Aug 19, 2022

ZeroQuant not compressing and making BERT slower #2239

Open

several fixes for generating the INT8 sharded checkpoint

f6cb028

mayank31398 mentioned this pull request Sep 19, 2022

Wrong prediction from "bloom-deepspeed-inference-int8" huggingface/transformers-bloom-inference#10

Closed

zcrypt0 mentioned this pull request Sep 19, 2022

[Question] 4-GPU shard microsoft/bloom-deepspeed-inference-int8 huggingface/transformers-bloom-inference#4

Closed

yaozhewei mentioned this pull request Nov 2, 2022

hi ，when the ZeroQuant inference will be released? #2326

Closed

tomeras91 mentioned this pull request Nov 5, 2022

[BUG] [0.7.4] Attribute error with DeepSpeedTransformerInference #2478

Closed

RezaYazdaniAminabadi mentioned this pull request Nov 18, 2022

Add 4-bit quantized inference to run BLOOM-176B on 2 A100 GPUs #2526

Open

jeffra mentioned this pull request Dec 23, 2022

Fix issue w. bloom int8 when changing tp size #2645

Merged

liangxiaoyun mentioned this pull request Mar 8, 2023

ZeroQuant (with GPT-J)，I get !!!! kernel execution error. #2965

Closed

zhangjun mentioned this pull request Apr 17, 2023

LLM zhangjun/zhangjun.github.io#32

Open
