
Implement Llama longest prefix cache #158

Closed
@abetlen

Description

Opening this up to track the development of the new caching behaviour I'm planning to implement. This will leverage two significant improvements:

  • Reduced llama state size, which is now a function of the number of evaluated tokens
  • Improved efficiency of Llama.generate, which now only evals prompt tokens that are not already in the context window (see the sketch after this list)
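
Below is a minimal sketch of the longest-prefix lookup I have in mind, assuming the cache maps the tuple of evaluated tokens to a saved llama state blob. `LongestPrefixCache`, `common_prefix_length`, and `tokens_to_eval` are illustrative names only, not the library's actual API.

```python
# Sketch of a longest-prefix cache: keys are the tokens that were evaluated
# when the state was saved; a saved state is reusable only if its key is a
# prefix of the new prompt. Names here are hypothetical, not the final API.
from typing import Dict, List, Optional, Tuple


def common_prefix_length(a: List[int], b: List[int]) -> int:
    """Number of leading tokens the two sequences share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class LongestPrefixCache:
    def __init__(self) -> None:
        # key: tokens evaluated when the state was saved -> saved state blob
        self._states: Dict[Tuple[int, ...], bytes] = {}

    def put(self, tokens: List[int], state: bytes) -> None:
        self._states[tuple(tokens)] = state

    def longest_prefix(self, prompt: List[int]) -> Tuple[int, Optional[bytes]]:
        """Return (n, state) where `state` was saved after evaluating the first
        n tokens of `prompt`, choosing the largest such n (0, None if no match)."""
        best_n, best_state = 0, None
        for key, state in self._states.items():
            n = common_prefix_length(list(key), prompt)
            # The state covers exactly the tokens in `key`, so it is only
            # reusable when the whole key is a prefix of the new prompt.
            if n == len(key) and n > best_n:
                best_n, best_state = n, state
        return best_n, best_state


def tokens_to_eval(cache: LongestPrefixCache, prompt: List[int]) -> Tuple[Optional[bytes], List[int]]:
    """Pick the best cached state and return the prompt suffix that still needs eval."""
    n, state = cache.longest_prefix(prompt)
    return state, prompt[n:]
```

With something like this in place, Llama.generate would load the returned state (if any) and only eval the remaining suffix, which is what makes repeated prompts that share a long prefix (e.g. chat histories) cheap to re-run.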

Metadata

Assignees: No one assigned
Labels: enhancement (New feature or request)
Projects: No projects
Milestone: No milestone
Relationships: None yet
Development: No branches or pull requests