
llama : add infill sampler #9896

Merged: ggerganov merged 1 commit into master on Oct 15, 2024

Conversation

@ggerganov (Owner) commented Oct 15, 2024

Add a new sampler that I think is suitable for infill tasks. It promotes end-of-generation (EOG) tokens when the sampler is not very confident, and it also combines (merges) tokens that share a common prefix.

I think there are more improvements that can be made specifically for fill-in-the-middle, but this version seems to work OK for now.

curl \
    --silent --no-buffer --request POST --url http://127.0.0.1:8012/infill \
    --header "Content-Type: application/json" \
    --data '{"input_suffix": "    return 0;\n}\n", "input_prefix": "#include <cstdio>\n\nint main() {\n    printf(", "prompt": "", "top_k": 40, "top_p": 0.99, "samplers": ["top_k", "top_p", "infill"]}' | jq
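
For reference, roughly the same chain can be built directly against libllama. This is a minimal sketch, assuming the new sampler is exposed as llama_sampler_init_infill(model) (it takes the model so it can inspect the vocabulary); it is not the exact implementation:

    #include "llama.h"

    // Sketch: a sampler chain matching the server request above (top_k -> top_p -> infill),
    // with a dist sampler at the end to draw the final token.
    llama_sampler * make_infill_chain(const llama_model * model) {
        llama_sampler_chain_params sparams = llama_sampler_chain_default_params();
        llama_sampler * chain = llama_sampler_chain_init(sparams);

        llama_sampler_chain_add(chain, llama_sampler_init_top_k(40));
        llama_sampler_chain_add(chain, llama_sampler_init_top_p(0.99f, 1));
        llama_sampler_chain_add(chain, llama_sampler_init_infill(model)); // assumed signature
        llama_sampler_chain_add(chain, llama_sampler_init_dist(LLAMA_DEFAULT_SEED));

        return chain; // caller frees with llama_sampler_free(chain)
    }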

Also add the following function to the libllama API:

    // check if token0 is contained as a prefix in token1
    LLAMA_API bool llama_token_is_prefix(
              const struct llama_model * model,
                           llama_token   token0,
                           llama_token   token1);
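
For illustration, a hedged usage example; the token IDs below are made up, and 'model' stands for an already-loaded llama_model pointer:

    llama_token tok_short = 1234; // e.g. a token whose text is "pri"   (hypothetical ID)
    llama_token tok_long  = 5678; // e.g. a token whose text is "print" (hypothetical ID)

    if (llama_token_is_prefix(model, tok_short, tok_long)) {
        // the text of tok_short is a prefix of the text of tok_long,
        // so the infill sampler can merge their probabilities
    }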

Comment on lines +1707 to +1805
// normalize probs
for (size_t i = 0; i < cur_p->size; ++i) {
    cur_p->data[i].p /= p_sum;
}
Collaborator commented:

A few samplers do this, but I don't see the point because every sampler that needs the probabilities calls softmax first anyway and recomputes the probabilities.

During the refactor I came to the conclusion that we only really store logits. Every time probabilities are needed, a softmax is done to get them; llama_token_data::p is only used as temporary storage for the result of the softmax and could be removed entirely.
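
For reference, the softmax pass those samplers perform is roughly the following; this is a simplified sketch over llama_token_data_array (the library's version also sorts the candidates), just to show that p can always be rederived from the logits:

    #include <cmath>     // expf
    #include <algorithm> // std::max

    // Simplified sketch: p is recomputed from the logits whenever it is needed,
    // which is why keeping llama_token_data::p around permanently adds little value.
    static void softmax_sketch(llama_token_data_array * cur_p) {
        float max_l = cur_p->data[0].logit;
        for (size_t i = 1; i < cur_p->size; ++i) {
            max_l = std::max(max_l, cur_p->data[i].logit);
        }
        float sum = 0.0f;
        for (size_t i = 0; i < cur_p->size; ++i) {
            cur_p->data[i].p = expf(cur_p->data[i].logit - max_l);
            sum += cur_p->data[i].p;
        }
        for (size_t i = 0; i < cur_p->size; ++i) {
            cur_p->data[i].p /= sum;
        }
    }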

@ggerganov (Owner, Author) commented Oct 15, 2024

I think there is currently a scenario that uses the ps: calling the dist sampler without an explicit softmax before it. We don't do that in any of the examples, but it's technically possible?

Anyway, I agree that p should be removed completely.

Collaborator commented:

I think that should be considered a bug in the dist sampler then, because there is no way to know if the probabilities are valid without calling softmax. So any sampler that needs them must call softmax.

@ggerganov (Owner, Author) commented Oct 15, 2024

Yes, indeed it is a bug. There are a few places where we do:

llama_sampler * smpl = llama_sampler_chain_init(sparams);

llama_sampler_chain_add(smpl, llama_sampler_init_top_k(params.sparams.top_k));
llama_sampler_chain_add(smpl, llama_sampler_init_top_p(params.sparams.top_p, params.sparams.min_keep));
llama_sampler_chain_add(smpl, llama_sampler_init_temp (params.sparams.temp));
llama_sampler_chain_add(smpl, llama_sampler_init_dist (params.sparams.seed));

This would render the temperature sampler useless as it modifies only the logits. I think we should remove the explicit softmax calls in places like common/sampling.cpp:

llama_sampler_chain_add(result->chain, llama_sampler_init_softmax()); // remove this
llama_sampler_chain_add(result->chain, llama_sampler_init_dist(params.seed));

And update the dist sampler to do softmax at the start. Sounds good?
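
A sketch of what the updated dist sampler could do, reusing the hypothetical softmax_sketch helper from above (not the actual implementation):

    #include <random>
    #include <vector>

    // Sketch: the dist sampler normalizes at the start, so it no longer depends on an
    // explicit softmax sampler having run earlier in the chain.
    static llama_token dist_sample_sketch(llama_token_data_array * cur_p, std::mt19937 & rng) {
        softmax_sketch(cur_p); // hypothetical helper: fills p from the logits

        std::vector<float> probs(cur_p->size);
        for (size_t i = 0; i < cur_p->size; ++i) {
            probs[i] = cur_p->data[i].p;
        }

        std::discrete_distribution<size_t> pick(probs.begin(), probs.end());
        return cur_p->data[pick(rng)].id;
    }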

Collaborator commented:

Sounds good. We should probably remove llama_sampler_init_softmax entirely since it is useless to applications.

ggerganov merged commit 755a9b2 into master on Oct 15, 2024 (58 checks passed).
ggerganov deleted the gg/infill-3 branch on Oct 15, 2024 at 13:35.