[models] Microsoft Phi 1.5 #1664
Conversation
@zhuohan123 Just a heads up to avoid any unnecessary overlap, I'll review this PR!
@maximzubkov Awesome! Thanks for submitting the PR! The PR generally looks good to me, but needs minor modifications before getting merged. Please take a look at my reviews.
state_dict = self.state_dict()
for name, loaded_weight in hf_model_weights_iterator(
        model_name_or_path, cache_dir, load_format, revision):
    if "rotary_emb.inv_freq" in name:
BTW, why is the inv_freq here different from the one in our RotaryEmbedding? According to this line, the code looks the same.
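(For context, this is roughly the standard RoPE frequency computation that both implementations appear to share; a minimal sketch, where base = 10000 and rotary_dim = 32 are assumptions consistent with the 16 values printed in the reply below.)

import torch

# Assumed values (not taken from the PR): base = 10000, rotary_dim = 32.
base, rotary_dim = 10000.0, 32
inv_freq = 1.0 / (base ** (torch.arange(0, rotary_dim, 2, dtype=torch.float32) / rotary_dim))
# -> tensor([1.0000e+00, 5.6234e-01, 3.1623e-01, ..., 1.7783e-04])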
Yeah, I also thought so until I checked the weight:

inv_freq from RotaryEmbedding:
tensor([1.0000e+00, 5.6234e-01, 3.1623e-01, 1.7783e-01, 1.0000e-01, 5.6234e-02,
        3.1623e-02, 1.7783e-02, 1.0000e-02, 5.6234e-03, 3.1623e-03, 1.7783e-03,
        1.0000e-03, 5.6234e-04, 3.1623e-04, 1.7783e-04], device='cuda:0',
        dtype=torch.float32)

loaded_weight of rotary_emb.inv_freq:
tensor([1.0000e+00, 5.6250e-01, 3.1616e-01, 1.7786e-01, 9.9976e-02, 5.6244e-02,
        3.1616e-02, 1.7776e-02, 1.0002e-02, 5.6229e-03, 3.1624e-03, 1.7786e-03,
        1.0004e-03, 5.6219e-04, 3.1614e-04, 1.7786e-04])
But even more surprising is that after I copied the cached weight, the output on Test 4 was still different. However, @Linzecong says that it might be an issue with fp16 (link).
After some investigation, I found that the issue is because inv_freq is stored in FP16 in the weight checkpoint (I guess it is calculated in FP32 and converted to FP16 for some reason). As we use the same logic to calculate inv_freq, I think this slight difference should be acceptable.

Can we remove the special weight loading logic for inv_freq and use our current implementation instead?
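(A quick way to see the FP16 effect numerically; a minimal sketch reusing the assumed base = 10000 and rotary_dim = 32 from the earlier snippet:)

import torch

base, rotary_dim = 10000.0, 32  # assumed values, as in the earlier sketch
inv_freq = 1.0 / (base ** (torch.arange(0, rotary_dim, 2, dtype=torch.float32) / rotary_dim))

print(inv_freq[:4])                 # tensor([1.0000, 0.5623, 0.3162, 0.1778]) -- the RotaryEmbedding values
print(inv_freq.half().float()[:4])  # tensor([1.0000, 0.5625, 0.3162, 0.1779]) -- matches the checkpoint after an FP16 round trip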
@maximzubkov A big PR #1622 just got merged. This will make it easier to implement/debug the tensor parallelism support. Please adapt the PR to the new main branch.
@WoosukKwon, thanks for the above PR, it actually helped. I updated the code with respect to the modifications in this PR, and tested it with
To be completely sure, could you please run your test as well, since I tested on multiple GPUs and sometimes faced bugs not only with phi-1.5 but also with GPT-NeoX. I'm not sure that the code works correctly with
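(For reference, a minimal sketch of the kind of multi-GPU check being discussed, using vLLM's offline API; the model name and tensor_parallel_size value here are only illustrative:)

from vllm import LLM, SamplingParams

# Illustrative only: run phi-1.5 across 2 GPUs and compare the output
# against a single-GPU (tensor_parallel_size=1) run of the same prompt.
llm = LLM(model="microsoft/phi-1_5", tensor_parallel_size=2, trust_remote_code=True)
params = SamplingParams(temperature=0.0, max_tokens=64)
outputs = llm.generate(["Alice and Bob are playing a game."], params)
print(outputs[0].outputs[0].text)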
The Phi model was just updated by Microsoft with some breaking changes (like the model name, etc.): https://huggingface.co/microsoft/phi-1_5/tree/main @maximzubkov Could you update the PR? I'm sorry for the redundant work.
Sure, will check in a few hours.
@WoosukKwon, done 🥳
@maximzubkov Thanks for the quick update! Could you fix the links in the code?
@WoosukKwon done!
LGTM! I've checked that the tensor parallelism is working now. @maximzubkov Thanks again for the great work!
Thank you for reviewing, it was a pleasure! See you in the next PR, since I need one more feature for my projects.
@maximzubkov Looking forward to it!
Hello, vLLM team!
Thanks a lot for the awesome work you do!
I implemented a new model, phi-1.5 from Microsoft (paper, code). I used some comments from this issue and tested the implementation; here are a few notes I have:
I followed the GPT-NeoX implementation, except for some modifications such as:
- query_key_value used in GPT-NeoX (link), since it's not relevant for phi-1.5
- rotary_emb.inv_freq, since the values computed in PagedAttentionWithRoPE differ from the values in phi-1.5
- use_parallel_residual, adjusted a bit to make it more similar to phi-1.5

To ensure that the environment is correct, I used the Dockerfile from your repo, building it using the following command:
After successfully creating the docker container, I ran tests with the command
and got the following error message:
so I installed the latest version of einops via pip, and everything works as expected, but let me know if it makes more sense to pin the einops version to avoid problems in other parts of the repo.

I tested everything running tests/models/test_models.py on a single A100 (40G), CUDA Version: 12.1; however, vLLM keeps generating text that slightly differs from the HF implementation. So I would be super grateful if someone could have a look and check if everything is done correctly. My guess is that there is a problem with RoPE.
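(Not the repo's test itself, but a rough sketch of the kind of vLLM-vs-HF greedy comparison described above; the prompt and generation length are illustrative:)

from transformers import AutoModelForCausalLM, AutoTokenizer
from vllm import LLM, SamplingParams

prompt = "The theory of relativity states that"

# Greedy decoding with vLLM
llm = LLM(model="microsoft/phi-1_5", trust_remote_code=True)
vllm_text = llm.generate([prompt], SamplingParams(temperature=0.0, max_tokens=32))[0].outputs[0].text

# Greedy decoding with HF transformers
tok = AutoTokenizer.from_pretrained("microsoft/phi-1_5", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-1_5", torch_dtype="auto", trust_remote_code=True).cuda()
input_ids = tok(prompt, return_tensors="pt").input_ids.cuda()
gen = model.generate(input_ids, max_new_tokens=32, do_sample=False)
hf_text = tok.decode(gen[0][input_ids.shape[1]:], skip_special_tokens=True)

print("match:", vllm_text == hf_text)
print("vLLM:", vllm_text)
print("HF:  ", hf_text)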
Here is an example of unexpected behavior from tests/models/test_models.py. Note that for the first 3 tests, the outputs of the vLLM and HF implementations match.