fix llama metadata error and LLaMA lm_head wrongly loading error #3914
baodii wants to merge 7 commits into deepspeedai:master
Conversation
… wrongly loading error
The master llama container doesn't support meta tensors, so this PR still cannot work.
delete lm_head weight load part
This PR is for use with autoTP, not kernel injection. It will work when the model you pass to the init_inference API is on the meta device.
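For context on why a meta-device model needs special handling during weight loading, here is a toy, hypothetical sketch. `MetaTensor` and `load_weight` are illustrative stand-ins, not DeepSpeed or torch APIs: a meta tensor carries only a shape and no storage, so copying checkpoint data into it must be skipped and deferred (torch raises a RuntimeError in the analogous situation) until autoTP materializes and shards the parameter.

```python
class MetaTensor:
    """Toy stand-in for a tensor on the meta device: shape only, no storage."""

    def __init__(self, shape):
        self.shape = shape

    def copy_(self, data):
        # A meta tensor has no backing storage, so an in-place copy cannot
        # succeed; real torch raises a RuntimeError in this situation.
        raise RuntimeError("cannot copy into a meta tensor: no data")


def load_weight(dst, src):
    """Skip (defer) loading when the destination is still on the meta device;
    the parameter is materialized and filled later, after sharding."""
    if isinstance(dst, MetaTensor):
        return dst  # deferred: nothing to copy into yet
    dst.copy_(src)
    return dst


# Loading into a meta parameter is deferred instead of crashing.
w = MetaTensor((4096, 32000))
assert load_weight(w, [[0.0]]) is w
```

The design point is simply that loading code must branch on "is this parameter still meta?" rather than unconditionally copying, which is the class of bug this PR addresses for lm_head.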
delete debug code
Appreciate this PR.
When will this merge?
Resolved conflict.
@baodii Sorry to bother you. Could you please append a new commit, or create a new PR, for meta tensor loading when using the injected kernel? See the following lines in this PR: https://github.com/microsoft/DeepSpeed/pull/3608/files#diff-ad3c4426f1e24b0f6abe2a5b01757eb9d621f67917d46aec05f5e8bc8d757553L88-L89 and https://github.com/microsoft/DeepSpeed/pull/3608/files#diff-ad3c4426f1e24b0f6abe2a5b01757eb9d621f67917d46aec05f5e8bc8d757553R23

In a nutshell, the problem occurs in these two lines: https://github.com/microsoft/DeepSpeed/blob/94c7233a8bb51e068ff8dd5d3e03f2e9b5ab248e/deepspeed/module_inject/containers/llama.py#L105-L106

It tries to: …

Originally: … That is: …
It is said that kernel injection with meta tensor loading is partially fixed, according to the feedback and tests. When testing with 7B llama, the output equals the huggingface one. But with 65B llama the output differs; it is not garbage, but truly meaningful words that are completely different from huggingface's.

Some extra words: I fixed meta tensor loading (your PR) and kernel-injected inference months before the individual PRs, because I came across these issues myself and debugged them out. But no one reviewed my PR, even though the feedback from others was positive. So I turn to you, hoping that what I found can be fixed, reducing duplicated effort in debugging this codebase.

analysis

**norm_w**

That is, as for …, the gamma parameter is used in …. Then take the Python implementation of llama as reference: https://github.com/huggingface/transformers/blob/904e7e0f3cee944bffc54e2a084dfcab47ef2036/src/transformers/models/llama/modeling_llama.py#L410

**attn_nw**

So, as for the attn_nw and …: its implementation is at …, and the gamma parameter is used at …. Considering that it is an MLP module following the attention, and the rms_norm is right before the MLP operation, taking the Python llama as reference, it is a post-attention layer norm. That is: …
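On the gamma parameter discussed above: llama's normalization is RMSNorm, which divides the hidden vector by its root mean square and then scales elementwise by a learned weight (the gamma, called `weight` in HuggingFace's LlamaRMSNorm). A minimal pure-Python sketch, with the function name and eps default chosen for illustration:

```python
import math


def rms_norm(x, gamma, eps=1e-6):
    """Minimal RMSNorm sketch: normalize x by its root mean square,
    then scale elementwise by the learned gamma vector."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gamma, x)]


# With gamma = 1 the result is just the RMS-normalized vector.
out = rms_norm([3.0, -4.0], [1.0, 1.0])
```

This is why mapping the gamma tensors matters: the input_layernorm gamma is applied before attention and the post_attention_layernorm gamma right before the MLP, so swapping which container slot (e.g. attn_nw) receives which gamma silently changes the model's outputs without crashing, matching the "meaningful but different" behavior reported above.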