llama : add infill sampler #9896
Conversation
// normalize probs
for (size_t i = 0; i < cur_p->size; ++i) {
    cur_p->data[i].p /= p_sum;
}
A few samplers do this, but I don't see the point, because every sampler that needs the probabilities calls softmax first anyway and recomputes them.
During the refactor I came to the conclusion that we only really need to store the logits. Every time probabilities are needed, a softmax is done to get them; `llama_token_data::p` is only used as temporary storage for the result of the softmax and could be removed entirely.
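For reference, the recomputation in question is just a softmax over the logits. A minimal sketch, with a hypothetical `token_data` struct standing in for `llama_token_data` (not the actual llama.cpp implementation):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical stand-in for llama_token_data: a logit plus the p
// slot that the discussion argues is only temporary storage.
struct token_data {
    float logit;
    float p;
};

// Recompute probabilities from the logits, as every sampler that
// needs them is expected to do.
void softmax(std::vector<token_data> & cur) {
    float max_l = cur[0].logit;
    for (const auto & td : cur) {
        max_l = std::max(max_l, td.logit);
    }
    float sum = 0.0f;
    for (auto & td : cur) {
        td.p = std::exp(td.logit - max_l); // subtract max for numerical stability
        sum += td.p;
    }
    for (auto & td : cur) {
        td.p /= sum; // probabilities now sum to 1; no separate normalization pass needed
    }
}
```

Since the final division already normalizes the probabilities, a separate "normalize probs" pass like the one above is redundant.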
I think there is currently a scenario that uses the `p`s: calling the `dist` sampler without an explicit `softmax` before it. We don't do that in any of the examples, but it's technically possible?
Anyway, I agree that the `p` should be removed completely.
I think that should be considered a bug in the dist sampler then, because there is no way to know whether the probabilities are valid without calling softmax. So any sampler that needs them must call softmax itself.
Yes, indeed it is a bug. There are a few places where we do:
llama_sampler * smpl = llama_sampler_chain_init(sparams);
llama_sampler_chain_add(smpl, llama_sampler_init_top_k(params.sparams.top_k));
llama_sampler_chain_add(smpl, llama_sampler_init_top_p(params.sparams.top_p, params.sparams.min_keep));
llama_sampler_chain_add(smpl, llama_sampler_init_temp (params.sparams.temp));
llama_sampler_chain_add(smpl, llama_sampler_init_dist (params.sparams.seed));
This would render the temperature sampler useless, as it modifies only the logits. I think we should remove the explicit `softmax` calls in places like `common/sampling.cpp`:
llama_sampler_chain_add(result->chain, llama_sampler_init_softmax()); // remove this
llama_sampler_chain_add(result->chain, llama_sampler_init_dist(params.seed));
And update the `dist` sampler to do a softmax at the start. Sounds good?
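The proposed change could be sketched like this: the dist sampler applies the softmax itself before drawing a token, so the chain no longer needs an explicit softmax stage. Names and structure here are illustrative assumptions, not the actual llama.cpp API:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <random>
#include <vector>

// Illustrative stand-in for llama_token_data.
struct token_data {
    float logit;
    float p;
};

// Normalize logits into probabilities (softmax).
void softmax(std::vector<token_data> & cur) {
    float max_l = cur[0].logit;
    for (const auto & td : cur) {
        max_l = std::max(max_l, td.logit);
    }
    float sum = 0.0f;
    for (auto & td : cur) {
        td.p = std::exp(td.logit - max_l);
        sum += td.p;
    }
    for (auto & td : cur) {
        td.p /= sum;
    }
}

// Sketch of a dist sampler that runs softmax at the start, so the
// probabilities are guaranteed valid regardless of which samplers
// (e.g. temperature, which touches only logits) ran before it.
int sample_dist(std::vector<token_data> & cur, std::mt19937 & rng) {
    softmax(cur); // make the p values valid before sampling
    std::vector<float> probs;
    probs.reserve(cur.size());
    for (const auto & td : cur) {
        probs.push_back(td.p);
    }
    std::discrete_distribution<int> dist(probs.begin(), probs.end());
    return dist(rng); // index of the sampled token
}
```

With this shape, a chain of top-k / top-p / temp / dist works even though the temperature stage only rescales logits, because dist converts logits to probabilities itself.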
Sounds good. We should probably remove `llama_sampler_init_softmax` entirely, since it is useless to applications.
Add a new sampler that I think is suitable for infill tasks. It promotes end-of-generation (EOG) tokens when it is not very confident, and it also combines (merges) tokens with a common prefix.
I think there are more improvements that can be made specifically for fill-in-the-middle, but this version seems to work OK for now.
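A rough sketch of the EOG-promotion idea, under stated assumptions: the confidence threshold and data layout below are made up for illustration, and the common-prefix token merging is not shown. This is not the actual implementation in the PR:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Illustrative token entry: a vocab id, a logit, and a probability slot.
struct token_data {
    int   id;
    float logit;
    float p;
};

// If the most likely token's probability is below a confidence
// threshold, move all probability mass to the EOG token so that
// generation stops instead of producing a low-confidence infill.
void promote_eog_if_unsure(std::vector<token_data> & cur, int eog_id, float thold) {
    // softmax over the logits to get valid probabilities
    float max_l = cur[0].logit;
    for (const auto & td : cur) {
        max_l = std::max(max_l, td.logit);
    }
    float sum = 0.0f;
    for (auto & td : cur) {
        td.p = std::exp(td.logit - max_l);
        sum += td.p;
    }
    for (auto & td : cur) {
        td.p /= sum;
    }

    // confidence check: is any single token clearly preferred?
    float p_max = 0.0f;
    for (const auto & td : cur) {
        p_max = std::max(p_max, td.p);
    }
    if (p_max < thold) {
        // not confident: force the EOG token
        for (auto & td : cur) {
            td.p = (td.id == eog_id) ? 1.0f : 0.0f;
        }
    }
}
```

The threshold trades infill length against quality: a higher value makes the sampler give up (emit EOG) sooner when the model is unsure.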
Also add the following function to the `libllama` API: