Description
Opening this up to track the development of the new caching behaviour I'm planning to implement. It will leverage two significant improvements:
- Reduced llama state size, which is now a function of the number of evaluated tokens
- Improved efficiency of `Llama.generate`, which now evaluates only the prompt tokens that are not already in the context window (see the sketch after this list)
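
To illustrate the second point, here is a minimal sketch of prompt prefix reuse. It assumes a hypothetical model object with a `cached_tokens` list (tokens already evaluated into the context window) and an `eval(tokens)` method; these names are illustrative, not the library's actual API.

```python
from typing import List


def longest_common_prefix(cached: List[int], prompt: List[int]) -> int:
    """Return the number of leading tokens shared by the cache and the prompt."""
    n = 0
    for a, b in zip(cached, prompt):
        if a != b:
            break
        n += 1
    return n


def generate_with_reuse(model, prompt_tokens: List[int]) -> None:
    """Evaluate only the prompt tokens that are not already in the context window."""
    shared = longest_common_prefix(model.cached_tokens, prompt_tokens)
    new_tokens = prompt_tokens[shared:]
    if new_tokens:
        model.eval(new_tokens)  # evaluate only the unseen suffix of the prompt
        model.cached_tokens.extend(new_tokens)
    # Sampling would continue from here, with the full prompt now in context.
```

In the common chat case, where each new prompt is the previous conversation plus a few appended messages, almost the entire prompt matches the cached prefix, so only the newly appended tokens need to be evaluated.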