Note: This issue was copied from ggml-org#4218
Original Author: @ggerganov
Original Issue Number: ggml-org#4218
Created: 2023-11-25T17:04:06Z
There have been a few reports where grammar sampling can significantly degrade performance.
It would be nice to profile and optimize the implementation - there should be room for improvements.
Already on-going efforts:

- reserve space in `decode_utf8` ggml-org/llama.cpp#4210
- Allow reusing results from `llama_token_to_piece` when sampling grammars ggml-org/llama.cpp#4213
Probably worth looking into multi-threading the implementation as well.