server : fix speculative decoding with context shift #10641
|
Do you think we should add a test case for this? Something like:
# test_speculative.py
def test_with_ctx_shift():
    global server
    server.n_ctx = 64
    server.start()
    res = server.make_request("POST", "/completion", data={
        "prompt": "Hello " * 64,
        "temperature": 0.0,
        "top_k": 1,
    })
    assert res.status_code == 200
    assert len(res.body["content"]) > 0 |
|
Yes, the error can be triggered with |
05837cf to b436eda
|
I've been running this PR for about an hour. It seems stable. |
|
Works well for me. Thanks. |
* server : fix speculative decoding with context shift ggml-ci
* server : take into account speculative limits ggml-ci
* server : add tests
fix #10547
Make sure the speculative batch does not exceed the slot's context.
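As a rough illustration of that constraint (not the code from this PR), the idea can be sketched in Python: before drafting, clamp the number of draft tokens to whatever still fits in the slot's context. The function name, its parameters, and the `reserve` margin are all hypothetical and chosen only for this example; the actual server is written in C++ and may compute the limit differently.

```python
def max_draft_tokens(n_ctx: int, n_past: int, n_draft_max: int, reserve: int = 1) -> int:
    """Return how many draft tokens fit in the slot's remaining context.

    Hypothetical helper for illustration only:
      n_ctx       -- total context size of the slot
      n_past      -- tokens already stored in the slot's context
      n_draft_max -- configured upper bound on the draft length
      reserve     -- assumed safety margin kept free (e.g. for the sampled token)
    """
    free = n_ctx - n_past - reserve
    # Never draft more than the configured maximum, and never a negative amount.
    return max(0, min(n_draft_max, free))


# With a 64-token context and 60 tokens already used, only a few draft
# tokens fit, regardless of the configured maximum.
print(max_draft_tokens(n_ctx=64, n_past=60, n_draft_max=16))
```

When the result is 0, a caller following this sketch would skip drafting for that step and fall back to ordinary single-token decoding.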