ping @hzhwcmhf
https://qwenlm.github.io/blog/qwen2.5-turbo/
I noticed in the blog that Qwen2.5-7B's TTFT on an A100 is reported to be around 200 seconds in the full-attention setting. However, when I deployed Qwen2.5-7B-1M with vLLM 0.6.4.post1, the TTFT was approximately 10 minutes. Is the Qwen2.5-7B mentioned in the blog the same model as the open-source Qwen2.5-7B? Which inference framework was used for the blog's measurements? Could the vLLM version I am using be the cause of the discrepancy?
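For anyone comparing TTFT numbers across setups, it helps to measure the same thing: the wall-clock time from sending the request to receiving the first generated chunk. A minimal sketch (the helper name `measure_ttft` is ours, not part of vLLM; any iterator of streamed chunks works, such as a streaming response from an OpenAI-compatible client pointed at a vLLM server):

```python
import time

def measure_ttft(stream):
    """Return seconds until the first chunk arrives from a token stream.

    `stream` is any iterator yielding generated chunks, e.g. the
    streaming response from an OpenAI-compatible chat client.
    Returns None if the stream yields nothing.
    """
    start = time.perf_counter()
    for _ in stream:  # the first yielded chunk marks time-to-first-token
        return time.perf_counter() - start
    return None
```

With the `openai` client against vLLM's server, you would pass the iterator returned by `client.chat.completions.create(..., stream=True)`; timing the first streamed chunk avoids conflating TTFT with total generation time.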