-
Notifications
You must be signed in to change notification settings - Fork 10.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
src: make tail invalid when kv cell is intersection for mamba #9249
Conversation
src/llama.cpp
Outdated
if ((0 < p0 && p0 <= cell.pos) || (0 < p1 && p1 <= cell.pos)) { | ||
return false; | ||
} | ||
if (p0 <= cell.pos && p1 < cell.pos) { | ||
tail_id = -1; | ||
return false; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does not fix the problem. Building llama-server
from this branch, I still get
llama_kv_cache_find_slot: wrong tail for seq_id 1, (0 instead of -1)
/.../src/llama.cpp:3507: GGML_ASSERT(cell.has_seq_id(seq_id)) failed
When using the first request from #9170 with curl
and then using the server Web UI for something unrelated. I'll investigate the root cause, but it does seem like the tail is not invalidated properly somewhere (which by the way also has to be done in sync with clearing the actual tail cell too, which is why the tail was previously invalidated here only when not returning and when the cell (which was the tail) would be cleared with the range conditions below).
Strangely enough, there is no problem when I try using the server UI directly.
Hmm, re-reading my parenthesis above, the problem in the code could actually simply be
diff --git a/src/llama.cpp b/src/llama.cpp
index 2274296b..6c3d4691 100644
--- a/src/llama.cpp
+++ b/src/llama.cpp
@@ -3689,7 +3689,8 @@ static bool llama_kv_cache_seq_rm(
if ((0 < p0 && p0 <= cell.pos) || (0 < p1 && p1 <= cell.pos)) {
return false;
}
- if (p0 <= cell.pos && p1 < cell.pos) {
+ // invalidate tails which will be cleared
+ if (p0 <= cell.pos && cell.pos < p1) {
tail_id = -1;
}
}
(when diffed from master
)
Re-running the requests from #9170 no longer seems to fail with this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks the feedback! I haven't tried this through curl
, and don't know why there are different behavior between curl
and web ui, maybe because of the different times of response? I have seen the times of response maybe inconsistent for the same request(same parameters).
1082138
to
4e8f9a0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it does! Although I see that the full prompt/history is being reprocessed on each request despite the prefix being same as in the previous request (i.e. no hidden state/context caching). Is the cache supposed to work for Mamba at this point or is it maybe still a work-in-progress? |
This happens when It should be fixed in #7531, but it will probably be "unfixed" before merging because the method used is too over-engineered and too complicated. (tree of sequences to fairly share available cells adds a lot of complexity) The plan is to add explicit checkpoints requested by |
Reset the tail of kv cell for the common prefix tokens for mamba. That maybe fix #9170