LLaMA support #506
Comments
Can you explain the difference between GPT-J and LLaMA?
+1 for this
They look very similar. In HuggingFace's doc page, they say that the implementation is based on the GPT-NeoX codebase, which seems to be supported by FasterTransformer: https://huggingface.co/docs/transformers/main/model_doc/llama. Do you think it'll work?
+1 @byshiue According to our investigation, it is not difficult to port this model to Megatron as well. But I am not sure whether a single conversion script will work.
Thank you for the suggestion and discussion. We may not have time to work on that issue right now. If you are interested, you can try to add support for it.
+1 for this
It seems to be quite a simple implementation @byshiue. All that needs to be done is to implement RMS layer norm in GPT-NeoX, as well as support the SiLU activation. It seems that both of these features are already implemented elsewhere in FasterTransformer. I'd be happy to take the lead if you can help me with the general steps.
+1 for this
+1 for this
I compared the GPT-J and LLaMA models in Huggingface; they have the same attention layer. There are some differences in the FFN: LLaMA uses three weight matrices, and the forward function is as follows (sketched below).
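A minimal reconstruction of that three-weight FFN, mirroring Huggingface's LlamaMLP (the gate_proj / up_proj / down_proj naming follows the Huggingface implementation; this is a sketch, not the commenter's original snippet):

```python
import torch
import torch.nn as nn

class LlamaStyleMLP(nn.Module):
    """Gated-SiLU feed-forward block with three weight matrices,
    mirroring Huggingface's LlamaMLP."""

    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)
        self.act_fn = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # down_proj(silu(gate_proj(x)) * up_proj(x))
        return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
```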
I checked the relevant code of the FFN layer in the FasterTransformer source, and it seems that there is no similar structure. Or perhaps such a layer already exists in the current code and I have not found it; I would appreciate some tips.
It looks like a standard gated SiLU. Can you explain what difference you see?
Thanks for the reminder, I missed this part.
Wow, thank you @moonscar. Want any help? What's the status of your PR?
Need this too.
Have you started this work? Or I can help with it.
Don't think it's been started yet @Anychnn
Given the interest and activity here, I'd like to offer a bounty of $2,500 USD to whoever can get Llama implemented in FT. Please email me at michael@phind.com if you're interested. @moonscar @AnShengqiang @Anychnn @byshiue It seems that all that needs to be done is to copy T5's RMS layer norm (already implemented in FT) and UL2's gated-silu (also already implemented elsewhere in FT) into GPT-NeoX. As per Huggingface's implementation of Llama, it is otherwise completely identical to GPT-NeoX (which is already implemented in FT).
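For reference, a minimal sketch of the RMS layer norm being discussed (the T5/LLaMA-style norm; illustrative only, not FT's actual CUDA kernel):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """T5/LLaMA-style RMSNorm: scales by the root-mean-square of the
    activations instead of subtracting the mean as LayerNorm does."""

    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        variance = x.pow(2).mean(-1, keepdim=True)
        return self.weight * x * torch.rsqrt(variance + self.eps)
```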
The bounty will be $3,000 if a correct and working PR is opened by the end of Friday, April 21st (Pacific Time).
I would be glad to help with part of the work, for example converting the weights to FT.
Made a lot of progress on this, but my current FT model is outputting seemingly random tokens, so there's something wrong with my weight conversion or maybe even the exact layer implementation. If someone wants to pick up the torch (I am done for now 😞), the next step would probably be to compare, layer by layer, the output of the Huggingface model vs. this FT model. Weights conversion: https://github.com/cameronfr/FasterTransformer/blob/main/examples/cpp/llama/huggingface_llama_convert.py Everything is modified from the respective GPTNeoX versions.
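For the layer-by-layer comparison, one possible approach on the Huggingface side is to dump per-layer hidden states with output_hidden_states=True and diff them against tensors printed from the FT layers. A minimal sketch (the model path and prompt are placeholders):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_path = "path/to/llama-7b-hf"  # placeholder: local Huggingface checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16)
model.eval()

inputs = tokenizer("Hello, my name is", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states[0] is the embedding output; index i is the output of decoder layer i.
for i, h in enumerate(out.hidden_states):
    # Print a few values of the last token's hidden state to diff against FT's layer outputs.
    print(i, tuple(h.shape), h[0, -1, :5].tolist())
```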
@cameronfr The default layernorm_eps_ in llama.h is set to 1e-5, but llama-7b-torch sets it to 1e-6 by default. The attention module output is also incorrect; I am fixing this.
@cameronfr I think the reshape of qkv here might not be correct: https://github.com/cameronfr/FasterTransformer/blob/45d48f9d06713cd006f7d95d4b2f99a4bd3abb11/examples/cpp/llama/huggingface_llama_convert.py#L97
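If the mismatch is in the rotary layout, note that Huggingface's convert_llama_weights_to_hf.py permutes the q/k weights into a half-rotated layout, so converting back to an interleaved layout needs the inverse permutation. A hedged sketch of both directions, assuming square q/k projections as in llama-7b (illustrative, not the actual fix):

```python
import torch

def permute_hf(w: torch.Tensor, n_heads: int) -> torch.Tensor:
    """Permutation applied by Huggingface's convert_llama_weights_to_hf.py
    to q/k weights (interleaved rotary layout -> HF half-rotated layout)."""
    dim_out, dim_in = w.shape
    return (w.view(n_heads, dim_out // n_heads // 2, 2, dim_in)
             .transpose(1, 2)
             .reshape(dim_out, dim_in))

def unpermute_hf(w: torch.Tensor, n_heads: int) -> torch.Tensor:
    """Inverse permutation: HF half-rotated layout -> interleaved rotary layout."""
    dim_out, dim_in = w.shape
    return (w.view(n_heads, 2, dim_out // n_heads // 2, dim_in)
             .transpose(1, 2)
             .reshape(dim_out, dim_in))

# Round-trip sanity check on a random weight with llama-7b's q_proj shape.
w = torch.randn(4096, 4096)
assert torch.equal(unpermute_hf(permute_hf(w, 32), 32), w)
```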
Great progress @cameronfr @Anychnn @jinluyang. I'm doubling the bounty to $6k to whoever can get this working and merged in.
Hey @michaelroyzen @cameronfr @Anychnn @jinluyang, I got a self-tested working version and opened a pull request with it. Could you guys please take a look? Any chance we could get it merged?
Nice! Works well so far in limited tests and is consistent with the Huggingface output using beam_size 1. One comment is that it should support max_position_embeddings (max_pos_seq_len in FT), but this is likely a simple change. Will continue testing and post the updates here.
@michaelroyzen Does FT support LLaMA fine-tuned with LoRA? The training code is as follows: https://github.com/tloen/alpaca-lora/blob/main/finetune.py
Use the
Hey community, here are some updates:
|
Is llama able to run correctly with FT now?
Does llama support dynamic batching?
Llama 2 released: https://ai.meta.com/resources/models-and-libraries/llama/ Is it possible to serve it with Triton?
Worth trying; if there is no structural change, llama-2 may be supported. From MetaAI's blog and their paper, they seem to have trained the new parameters with different methods.
Structural changes may exist.
llama-2-70B: the model configuration has one more parameter, and converting the model weights with huggingface_llama_convert.py raises an error. Comparing the two versions of the llama model (llama-2-70B vs. llama-65B), the weight dimensions of k, q and v are different. Can you give some suggestions for changing the model conversion? Doesn't this dimension change also mean that the inference implementation code in FT needs to be modified accordingly?
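For context, the extra parameter in the llama-2-70B configuration is num_key_value_heads (grouped-query attention), which shrinks the k/v projections relative to q. A small sketch of the expected weight shapes, assuming the usual Huggingface config fields:

```python
# Values from a Llama-2-70B style config.json (illustrative).
config = {
    "hidden_size": 8192,
    "num_attention_heads": 64,
    "num_key_value_heads": 8,   # the extra field the old converter does not know about
}

head_dim = config["hidden_size"] // config["num_attention_heads"]   # 128
q_out = config["num_attention_heads"] * head_dim                    # 8192
kv_out = config["num_key_value_heads"] * head_dim                   # 1024

print("q_proj weight:", (q_out, config["hidden_size"]))    # (8192, 8192)
print("k_proj weight:", (kv_out, config["hidden_size"]))   # (1024, 8192)
print("v_proj weight:", (kv_out, config["hidden_size"]))   # (1024, 8192)
```

A converter (and attention kernel) that assumes q, k and v all have hidden_size output rows will therefore fail on the 70B checkpoint, while 7B/13B, where num_key_value_heads equals num_attention_heads, still convert cleanly.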
Hello, I would like to know whether there are any structural changes between llama2 7B and 13B, and whether they can be directly converted and deployed using the FT framework.
I have tested llama2 13B with the FT framework + int8 and I did not encounter any errors.
@void-main will there be work done to implement MQA?
Hey @fmac2000, I'd like to try implementing MQA based on FlashAttention2, but I can't promise when this feature will be ready.
@void-main Maybe you can refer to the implementation in this submission; that project is also implemented using the FT framework and recently added support for the GQA function of llama2-70B. For the llama2 7B and 13B models, the existing implementation should be directly usable.
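For anyone porting GQA onto the existing multi-head attention path, the core change is that each k/v head is shared by a group of query heads; one simple (if memory-inefficient) way to reuse a standard MHA kernel is to repeat the k/v heads, as Huggingface's repeat_kv helper does. A minimal sketch with illustrative shapes:

```python
import torch

def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
    """Expand (batch, num_kv_heads, seq, head_dim) to
    (batch, num_kv_heads * n_rep, seq, head_dim) so a standard MHA kernel
    can be reused for grouped-query attention."""
    if n_rep == 1:
        return x
    b, kv_heads, seq, head_dim = x.shape
    x = x[:, :, None, :, :].expand(b, kv_heads, n_rep, seq, head_dim)
    return x.reshape(b, kv_heads * n_rep, seq, head_dim)

# Llama-2-70B-like shapes: 64 query heads sharing 8 kv heads.
k = torch.randn(1, 8, 16, 128)
k_expanded = repeat_kv(k, 64 // 8)
print(k_expanded.shape)  # torch.Size([1, 64, 16, 128])
```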
@CN-COTER
@void-main - that’s great news, thank you for all the work you’ve put in so far - it’s extremely appreciated. Let us know 👍
Are there any bugs in batching inference? The response of the model always contains garbled characters when requesting batches of input, like:
I have also encountered this issue. There is a problem with the output token id. Have you resolved it?
Can you share your code? My output token id is incorrect, and I would like to compare it. Thank you!
Sorry for the late reply. I have tested Llama2-13b-chat on hf-transformers and FT. The input_id is
So, according to this example, the output from FT is consistent with the HF transformer.
Thank you very much for your reply. When I used the commit from July 2nd, I got the correct results, but there was a problem with the commit from April 23rd. I will use a newer version to solve this problem.
Thank you for your reply. However, even after updating to the latest commits, my 13B model still produces garbled output when multiple requests are made concurrently.
Do we have a working implementation for Llama1 using FlashAttention? I tried to set $FMHA_ENABLE=ON but did not observe any difference in the output or the performance. I'm wondering if anyone has tested this feature and would like to share some more details?
Same error.
#716 |
The same problem, but these two requests did not solve the garbled output problem.
I built FasterTransformer following the llama_guide. The model used is llama-7b-hf, and the device is an A100-80G.
Hi, FYI
Given existing support for GPT-J and its rotary embeddings, is LLaMA supported as well? Huggingface just shipped their implementation: huggingface/transformers@464d420
@byshiue