So far we have focused on the pre-filling phase, as most long-context use cases involve a long prompt. This is changing with reasoning models, and kvpress might evolve in this direction in the future.
The default forward_hook method used by all presses starts with the following lines:
I was thinking about a similar solution. This approach would require changing the method names in all presses, splitting the original compress method into two separate functions: compress_prefilling and compress_decoding. It involves some refactoring work, but in terms of code structure and maintainability the solution is clear and intuitive.
If this is currently the most straightforward approach, I'll try it first. Keeping the pre-filling and decoding logic in separate methods should also make the code structure clearer.
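To make the proposed split concrete, here is a minimal, dependency-free sketch. Everything in it is an assumption for illustration: TwoPhasePress, on_forward, max_cache_len and the toy list-based cache are hypothetical and are not kvpress code, and the eviction policies (keep the most recent tokens) are placeholders for whatever scoring a real press would use. It only shows how one hook could dispatch to compress_prefilling or compress_decoding depending on the phase.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Toy cache: one (key, value) pair per token. A real cache holds tensors.
KV = List[Tuple[float, float]]


@dataclass
class TwoPhasePress:
    """Hypothetical base class for illustration; not part of kvpress."""

    compression_ratio: float = 0.5  # fraction of prompt tokens dropped at prefill
    max_cache_len: int = 8          # cache budget enforced while decoding

    def compress_prefilling(self, cache: KV) -> KV:
        # Placeholder policy: keep the most recent part of the prompt.
        n_kept = int(len(cache) * (1 - self.compression_ratio))
        return cache[len(cache) - n_kept:]

    def compress_decoding(self, cache: KV) -> KV:
        # Placeholder policy: sliding window over the cache budget.
        return cache[max(0, len(cache) - self.max_cache_len):]

    def on_forward(self, cache: KV, q_len: int) -> KV:
        # Assumed phase test: during prefill the queries cover the whole
        # cache; during decoding the cache is longer than the query.
        # (A one-token prompt is ambiguous under this simple test.)
        if q_len == len(cache):
            return self.compress_prefilling(cache)
        return self.compress_decoding(cache)


press = TwoPhasePress(compression_ratio=0.5, max_cache_len=8)

# Prefill with a 12-token prompt, then decode 5 tokens one at a time.
cache = [(float(i), float(i)) for i in range(12)]
cache = press.on_forward(cache, q_len=12)
for step in range(5):
    cache.append((100.0 + step, 100.0 + step))
    cache = press.on_forward(cache, q_len=1)
```

With this layout, an existing press would move its current compress logic into compress_prefilling, and compress_decoding could default to returning the cache unchanged until a decoding policy is implemented.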
Do you know how to do decoding-time compression?
Is there a code example?