support autoTP with weight only quantization in DS inference path #4750
ftian1 wants to merge 3 commits into deepspeedai:master
Conversation
@ftian1 If an accelerator other than CUDA wants to support AutoTP WOQ, which set of OpBuilder/kernels needs to be implemented? Can you provide a link to the kernel usage in the code?
Here is the link: https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/inference/quantization/utils.py#L115-L124
It would be better to detect custom kernel existence by checking an attribute of the loaded ops and calling the custom kernel accordingly, so that any accelerator implementing these kernels could be plugged in.
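The attribute-based detection suggested above can be sketched as follows. This is a minimal illustration, not code from the PR: the `ops_module` parameter and the `dequant_int8` kernel name are hypothetical stand-ins for whatever the accelerator's OpBuilder actually loads.

```python
def get_dequant_fn(ops_module):
    """Pick a dequantization implementation.

    Prefer a custom accelerator kernel when the loaded ops module exposes
    one (detected by attribute, instead of hard-coding a CUDA check),
    otherwise fall back to a reference implementation.

    NOTE: `dequant_int8` is a hypothetical kernel name for illustration.
    """
    if ops_module is not None and hasattr(ops_module, "dequant_int8"):
        # Any accelerator that ships this kernel plugs in automatically.
        return ops_module.dequant_int8
    # Reference fallback: elementwise scaling of quantized values
    # (pure Python here; a real path would operate on tensors).
    return lambda qweight, scale: [q * scale for q in qweight]
```

With this pattern, adding a new accelerator only requires implementing the kernel in its ops module; no dispatch code changes are needed.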
```python
ds_output = pipe(query, **inf_kwargs)
#print(local_rank, "baseline", bs_output)
print(local_rank, "deepspeed", ds_output)
```
Hi @ftian1, I have run this test, but the result I got is `deepspeed [{'generated_text': 'DeepSpeed is the greatest,,,,,,,,,,,,,,,'}]`. This result is not right. Can you figure out what's wrong with this test? BTW, I can pass all tests in test_intX_quantization.py.
@baodii May I know which device you are running on, CUDA or CPU?
@ftian1 Is the usage of WoQ with AutoTP similar to that with kernel injection? Can you post sample code showing what WoQ in DeepSpeed looks like with kernel injection?
@loadams I have resolved the merge conflicts. Please check.
Signed-off-by: Feng Tian <feng.tian@intel.com>
This PR makes weight-only quantization (WOQ) work with autoTP. In this way, users can enable WOQ on multiple cards.
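A minimal sketch of how WOQ can be combined with autoTP. This is not the PR's own sample code: the `_init_group_wise_weight_quantization` helper, the `weight_quantization` config keys, the model name, and the `mp_size` value are assumptions based on the `deepspeed/inference/quantization` module and `test_intX_quantization.py` referenced in this thread.

```python
# Hedged sketch; helper and config names are assumptions, not from this PR.
import torch
import deepspeed
from transformers import AutoModelForCausalLM
from deepspeed.inference.quantization.quantization import \
    _init_group_wise_weight_quantization

model_name = "facebook/opt-1.3b"  # hypothetical model choice
model = AutoModelForCausalLM.from_pretrained(model_name,
                                             torch_dtype=torch.float16)

# Group-wise weight-only quantization config (4-bit, per-group scales).
ds_config = {
    "weight_quantization": {
        "post_init_quant": {
            # '*' applies this setting to every quantizable layer
            "*": {"num_bits": 4, "group_size": 32,
                  "group_dim": 1, "symmetric": False}
        }
    }
}

# Quantize the weights first, then shard the model across cards with
# autoTP (replace_with_kernel_inject=False selects the autoTP path).
model = _init_group_wise_weight_quantization(model, ds_config)
model = deepspeed.init_inference(model,
                                 mp_size=2,  # number of cards
                                 dtype=torch.float16,
                                 replace_with_kernel_inject=False)
```

Run under `deepspeed --num_gpus 2 ...` (or the accelerator's equivalent launcher); the quantized weights are then partitioned across the cards by autoTP.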