Mistral 7b and Mixtral 8x7b experience degraded performance (using official docs) #1305
Comments
@iibw do you see the same issue with Mistral-7B-v0.1 (https://huggingface.co/mistralai/Mistral-7B-v0.1)? Just trying to rule out some potential factors that might lead to this.
None of the four prompts I provided experienced degraded performance when I tested them with Mistral-7b-Instruct-v0.1. Three out of the four prompts produced exactly the same output between TensorRT-LLM and Transformers, and the last one wasn't the same as the Transformers output, but it was similar. None of them produced repeating outputs, and after experimenting with other prompts, trying to get the same issue to happen, I failed to do so. So, it seems your hunch was correct, and this problem does not affect Mistral v0.1.
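The "three exact matches, one near match" comparison above can be made programmatic. Here is a minimal sketch, assuming a character-level similarity ratio with an arbitrary 0.9 threshold (both the helper name and the threshold are my own choices, not from this thread):

```python
from difflib import SequenceMatcher

def compare_outputs(reference: str, candidate: str, threshold: float = 0.9) -> str:
    """Classify a candidate generation against a reference generation.

    Returns "exact", "similar", or "different" based on a simple
    character-level similarity ratio; the threshold is arbitrary.
    """
    if candidate == reference:
        return "exact"
    ratio = SequenceMatcher(None, reference, candidate).ratio()
    return "similar" if ratio >= threshold else "different"

# Identical outputs classify as an exact match.
print(compare_outputs("The pizza is hot.", "The pizza is hot."))  # exact
# A close-but-not-identical output falls into "similar" or "different"
# depending on how much of the text overlaps.
print(compare_outputs("The pizza is hot.", "The pizza is warm."))
```

A token-level comparison (on decoded token ids) would be more faithful to how the backends diverge, but a character ratio is enough for a quick triage.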
Does one of your prompts have a sequence length (input + output) larger than 4096? Mistral-instruct-v0.2 doesn't have sliding window attention, so you should remove that line.
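The suggestion above is to drop the sliding-window setting before building the engine. A minimal sketch of what that could look like, assuming the setting in question is the `sliding_window` key of the model's config.json (the helper itself is hypothetical, not code from this thread):

```python
import json

def drop_sliding_window(config: dict) -> dict:
    """Return a copy of a model config with the sliding-window setting removed.

    Mistral-7B-v0.1 ships a `sliding_window` entry in its config.json;
    for v0.2, which does not use sliding-window attention, the suggestion
    in this thread is to drop that setting before building the engine.
    """
    cleaned = dict(config)
    cleaned.pop("sliding_window", None)  # no-op if the key is absent
    return cleaned

# Minimal illustration with a made-up config fragment.
cfg = {"max_position_embeddings": 32768, "sliding_window": 4096}
print(json.dumps(drop_sliding_window(cfg)))
```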
No, because the sequence length is never more than around 2000 tokens due to … Just in case that was the issue, I ran Mistral-7B-Instruct-v0.2 without … For example, given the prompt …
I noticed at the start there are some JSON config errors. Maybe they contribute to this? Although I believe they exist with v0.1 as well.
@PerkzZheng it looks like this bug affects Mixtral as well. Using the same system info as my Mistral testing and these commands:
I built Mixtral for my system (I can't do full precision because I don't have enough VRAM) and ran the four prompts above. Three of the four prompts worked without issue, but the prompt … did not. The output of the broken prompt …
I should probably mention, for all these tests with Transformers, I'm using version 4.36.1. If needed, I can provide the Transformers code as well.
So to summarize,
I have the very same issue with Mixtral on Nvidia H100 in version 0.8.0.
Mistral v0.1 is perfectly fine as far as I have seen. Mixtral 8x7b is the one appearing to also have this problem. I can't test Mixtral 8x7b without int8 weight only applied because it's too large for the A100 GPU I have access to, but according to @bprus, it doesn't seem to make a difference.
I haven't tried Mistral v0.2 with int8 weight only so I can't say for sure, but given what @bprus said, int8 weight only doesn't seem to change anything. So, to recap:
Thanks for the summary. I will see if I can reproduce and find the root cause of this.
I think it's the same as: #722
@PerkzZheng I'm not sure if this helps or not, but it looks like there are more changes between Mistral v0.1 and Mistral v0.2 than just the removal of the sliding window.
Also, it doesn't seem like the Nvidia demo at https://build.nvidia.com/mistralai/mixtral-8x7b-instruct has this issue, and it says that it uses Triton Inference Server, which is probably using TensorRT-LLM as its backend.
@iibw can you have a try with the main branch?
I tried with the main branch on Mistral v0.2 and I'm experiencing the same error. The model repeats text after the max output length is reached.
it should throw an error if the output length exceeds the max_output_length.
Sorry, I made a typo in my original response. What I meant was that the model repeats text continually until the max output length is reached. But I made another try today and the issue is now surprisingly gone.
The relevant bugs/issues might have been fixed in the main branch. Let us know if you find other issues, or you can close this. Thanks.
Hi! I've built the image with … Then converted with:
And built with:
Here are 2 examples of outputs I get:
and
I ran more comprehensive tests, and for ~400 requests the average number of output tokens is 985. For other methods of serving (like TGI or vLLM or Transformers) the same requests generate around 550 tokens. If you need any more information I'm glad to provide it.
@bprus have you observed this issue using fp16 weights instead of int8 weight-only?
So I built with
And unfortunately, it didn't change anything and the results are exactly the same. I'll try to re-run the test when I get the chance.
Thanks. I will see if I can reproduce this issue.
Hi @bprus, does the issue persist on the latest main branch for you? I tried following your steps, but was unable to reproduce it. Here is what I tested:
Output:
Hi, @djns99! However, I stumbled upon another minor issue. In …
Just a heads up that you might want to fix it sometime 😉 Once again, thanks for all the help! |
Most of my prompts aren't repeating endlessly now, but there's still one. Passing "What is machine learning?" as the prompt to Mixtral 8x7b continues to loop endlessly. I don't think this happens with Transformers so the root problem hasn't been fixed yet. |
I tested "What is machine learning?" with transformers and it also loops endlessly for that prompt so it seems like this is an expected output. This clears everything up for me so I'll go ahead and close this issue. Thanks for the help! |
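For anyone who wants to check mechanically whether a generation has entered the kind of endless loop discussed above, here is a small heuristic sketch on token ids; the window size and repeat count are arbitrary choices of mine, not from this thread:

```python
def detect_repetition(token_ids, window: int = 16, min_repeats: int = 3) -> bool:
    """Heuristically detect whether the tail of a generation is looping.

    Checks whether the last `window` tokens repeat back-to-back at least
    `min_repeats` times at the end of the sequence. Both thresholds are
    arbitrary and should be tuned for the tokenizer and task.
    """
    if len(token_ids) < window * min_repeats:
        return False
    tail = token_ids[-window:]
    for k in range(2, min_repeats + 1):
        segment = token_ids[-window * k : -window * (k - 1)]
        if segment != tail:
            return False
    return True

# A looping sequence: the same 4-token phrase repeated over and over.
looping = [1, 2, 3, 4] * 20
print(detect_repetition(looping, window=4))  # True
```

This only catches exact periodic loops; near-repeats (paraphrased loops) would need a fuzzier comparison.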
I am facing the same issue on llama3-8b-instruct and phi3-mini-128k-instruct.
@anubhav-agrawal-mu-sigma Which version of TensorRT-LLM do you use? And can you share more details that can help us reproduce the issue?
@lfr-0531 Using TensorRT-LLM v0.11.0, with the Docker image nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3
@anubhav-agrawal-mu-sigma can you try with the main branch?
Is this output expected? Please also open another issue with more detailed information, like the GPU architecture.
System Info
pip install tensorrt_llm==0.8.0 --extra-index-url https://pypi.nvidia.com
Who can help?
@kaiyux @byshiue
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)
Reproduction
or if you disturb the formatting a bit as with the spaces here
Expected behavior
The LLM will provide a complete response which ends instead of deteriorating into an infinite loop. This problem does not happen with Transformers as far as I can tell. Every one of the above example prompts ends and does not infinitely loop.
For example with Transformers:
prompt:
[INST] Please write an essay on the thermodynamics of pizza. [/INST]
output:
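As an aside on the prompt shape used throughout this issue: the `[INST] ... [/INST]` wrapper is Mistral's single-turn instruct format. A hypothetical helper that applies it might look like the following; in real code the tokenizer's `apply_chat_template` is preferable, since it also inserts the BOS token and handles multi-turn history:

```python
def format_mistral_instruct(user_message: str) -> str:
    """Wrap a single-turn user message in Mistral's [INST] chat format.

    This mirrors the prompt style used in this issue. Production code
    should prefer tokenizer.apply_chat_template from Transformers,
    which also handles special tokens and multi-turn conversations.
    """
    return f"[INST] {user_message} [/INST]"

prompt = format_mistral_instruct(
    "Please write an essay on the thermodynamics of pizza."
)
print(prompt)
```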
actual behavior
Generation does not end until the max_output_len is reached, and the farther it goes, the worse it gets. From what I've seen, it starts repeating itself and then outputting random tokens which decode as random Unicode symbols. For example,
prompt:
[INST] Please write an essay on the thermodynamics of pizza. [/INST]
output:
starts off well with
but after a while it starts a continuous loop until it finally reaches the end
additional notes
Increasing the repetition penalty has an effect, but it doesn't always work and it degrades the output, whereas I've never seen this issue with Mistral 7b using Transformers.
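For context on what the repetition penalty does, here is a pure-Python sketch of the CTRL-style penalty that samplers such as Transformers commonly apply: the logit of every token already generated is divided by the penalty when positive and multiplied by it when negative, so repeats become less likely as the penalty grows (the helper itself is my sketch, not code from this issue):

```python
def apply_repetition_penalty(logits, generated_ids, penalty: float):
    """Apply a CTRL-style repetition penalty to a list of logits.

    For every token id already generated, a positive logit is divided by
    `penalty` and a negative one is multiplied by it, making that token
    less likely. A penalty of 1.0 leaves the distribution unchanged.
    """
    adjusted = list(logits)
    for tid in set(generated_ids):
        if adjusted[tid] > 0:
            adjusted[tid] /= penalty
        else:
            adjusted[tid] *= penalty
    return adjusted

# Token 0 was already generated, so its logit shrinks from 2.0 to 1.0.
print(apply_repetition_penalty([2.0, 1.0, -1.0], [0], penalty=2.0))  # [1.0, 1.0, -1.0]
```

This also illustrates why a large penalty degrades output: it suppresses legitimately repeated tokens (articles, punctuation) along with the looping ones.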