I am happily generating at 11-16 t/s. Then, seemingly at random, some messages become very slow and the t/s drops. After a few more messages the speed goes back up.
I checked GPU temps in nvtop, but they were all in the 60s. One GPU is heavily loaded while the others sit at low utilization, as if the work were being processed sequentially.
I'm not sure whether it's related to my machine, but it's new behavior for me since 2.2 and dev. Mostly probing to see if anyone has experienced the same.
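For anyone trying to capture the "one GPU busy, the rest idle" pattern in a log rather than by watching nvtop, here is a minimal sketch. It assumes `nvidia-smi` is on the PATH and that the queried fields return plain integers. The helper names and the `busy`/`idle` thresholds are hypothetical choices for illustration, not anything taken from the project.

```python
import csv
import io
import subprocess

# Fields requested from nvidia-smi; all three are standard --query-gpu properties.
QUERY = "index,utilization.gpu,temperature.gpu"


def parse_gpu_csv(text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into dicts."""
    rows = []
    for fields in csv.reader(io.StringIO(text)):
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        try:
            index, util, temp = (int(f.strip()) for f in fields)
        except ValueError:
            continue  # skip rows with non-numeric values such as "[N/A]"
        rows.append({"index": index, "util": util, "temp": temp})
    return rows


def looks_sequential(rows, busy=80, idle=20):
    """Heuristic (assumed thresholds): one GPU busy, every other GPU near idle."""
    busy_gpus = [r for r in rows if r["util"] >= busy]
    idle_gpus = [r for r in rows if r["util"] <= idle]
    return len(rows) > 1 and len(busy_gpus) == 1 and len(idle_gpus) == len(rows) - 1


def sample_gpus():
    """Take one sample from nvidia-smi (requires the NVIDIA driver tools)."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return parse_gpu_csv(out)
```

Calling `looks_sequential(sample_gpus())` once a second during a slow reply and during a fast one would show whether the utilization pattern really differs between the two.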
Reproduction steps
Generate as normal. Some messages will be slow.
Expected behavior
Consistent speeds.
Logs
No response
Additional context
Acknowledgements
I have looked for similar issues before submitting this one.
I understand that the developers have lives and my issue will be answered when possible.
I understand the developers of this program are human, and I will ask my questions politely.
You're not rolling over the KV cache at ~4K context or something like that, right? I don't experience this with the latest dev pull, but I also usually don't have conversations longer than the pre-allocated sequence length. That said, paged attention is now pretty good at handling rolling cache too... so maybe not relevant.
Nope, it's a 32k model. Haven't seen it crop up lately though.
Well, it happened again, this time on qwen2.5 at about 10k context. Switching to different prompts didn't fix it. I had to stop and restart the server, and then replies with the same long context were suddenly fast again.
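To make reports like this easier to compare, it might help to record the t/s of each reply and flag the ones that fall well below the usual speed. A minimal sketch follows. The function name and the `ratio` cutoff are hypothetical, and the example numbers further down are made up for illustration, not measurements from this issue.

```python
from statistics import median


def flag_slow_replies(speeds, ratio=0.5):
    """Return indices of replies slower than `ratio` times the median t/s."""
    if not speeds:
        return []
    cutoff = median(speeds) * ratio
    return [i for i, tps in enumerate(speeds) if tps < cutoff]
```

For example, `flag_slow_replies([14.0, 12.5, 3.1, 15.8, 11.2])` flags only the third reply, since the median is 12.5 t/s and the cutoff is therefore 6.25 t/s.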
OS
Linux
GPU Library
CUDA 11.8
Python version
3.10
Pytorch version
2.4
Model
Luminum 123b 4.0bpw H6