Decoding time compression #55

Open · Dominic789654 opened this issue Mar 5, 2025 · 2 comments
Labels: feature request (New feature or request)

Comments

@Dominic789654 (Contributor)

Do you know how to do decoding-time compression?
Is there a code example?

Dominic789654 added the feature request label on Mar 5, 2025
@SimJeg (Collaborator) commented Mar 6, 2025

So far we have focused on the pre-filling phase, as most long-context use cases revolve around a long prompt. This is changing with reasoning models, and kvpress might evolve in that direction in the future too.

The default forward_hook method used by all presses starts with the following lines:

        # Don't compress after pre-filling
        if kwargs["cache_position"][-1] > q_len:
            return output

        [...]

        keys, values = self.compress(module, hidden_states, keys, values, output[1], kwargs)

This could be replaced by something like:

        [...]

        if kwargs["cache_position"][-1] <= q_len:
            keys, values = self.compress_prefilling(module, hidden_states, keys, values, output[1], kwargs)
        else:
            keys, values = self.compress_decoding(module, hidden_states, keys, values, output[1], kwargs)
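
For illustration only, a press could then implement the two methods along these lines (nothing of this exists in kvpress yet; the class name, the import path, and the sliding-window policy are just placeholders for the sketch):

    from kvpress import KnormPress  # any existing press would do; import path assumed here


    class KnormWithDecodingWindowPress(KnormPress):
        """Hypothetical press: KnormPress at pre-filling plus a sliding window at decoding."""

        max_decoding_cache_len: int = 4096  # arbitrary budget for this sketch

        def compress_prefilling(self, module, hidden_states, keys, values, attentions, kwargs):
            # Prompt-time behaviour is unchanged: reuse the parent's existing compress.
            return self.compress(module, hidden_states, keys, values, attentions, kwargs)

        def compress_decoding(self, module, hidden_states, keys, values, attentions, kwargs):
            # Placeholder policy: once the cache exceeds the budget, keep only the most recent
            # entries. keys / values have shape (batch, num_kv_heads, seq_len, head_dim).
            if keys.shape[2] > self.max_decoding_cache_len:
                keys = keys[:, :, -self.max_decoding_cache_len:]
                values = values[:, :, -self.max_decoding_cache_len:]
            return keys, values

A press that only targets the prompt would simply leave compress_decoding as a no-op.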

What do you have in mind?

@Dominic789654 (Contributor, Author)

Thank you for your detailed explanation!

I was thinking of a similar solution. It would indeed require touching all presses, splitting the original compress method into two functions: compress_prefilling and compress_decoding. That is some refactoring work, but from a code-structure and maintainability perspective it seems clear and intuitive.

If this is currently the most straightforward approach, I'll try it first. Separating the pre-filling and decoding logic should result in a clearer code structure.
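
To make that concrete, here is a rough sketch of what the split could look like on the base-class side (names follow the snippet above; the no-op decoding default is only an assumption so that existing presses keep their current behaviour):

    class BasePressSketch:
        """Sketch of the proposed split, not the actual kvpress base class."""

        def compress_prefilling(self, module, hidden_states, keys, values, attentions, kwargs):
            # Each press moves its current `compress` logic here (prompt-time pruning).
            raise NotImplementedError

        def compress_decoding(self, module, hidden_states, keys, values, attentions, kwargs):
            # Assumed default: leave the cache untouched during decoding, so presses
            # that only compress the prompt do not need to change.
            return keys, values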
