streaming slowdown when context rises #93

Open
PierrunoYT opened this issue Nov 17, 2024 · 0 comments

The streaming slowdown as the context grows can be attributed to several factors in the current implementation:

  1. Context management mechanism:
  • Uses a static window approach instead of a dynamic sliding window in order to preserve prompt caching
  • When the context grows too large, it triggers truncateHalfConversation, which can cause noticeable delays
  • The system waits until the context is already too large before compressing, rather than managing it preemptively (a rough sketch of this reactive pattern follows the list)
  2. Streaming implementation bottlenecks:
  • The current debouncer uses a fixed 25ms delay for processing chunks
  • All chunks are processed in sequence, which can cause backpressure when the context is large
  • The system retries up to 3 times when the context is too long, and each retry adds latency
  3. Memory management:
  • Large contexts are kept in memory until they hit the maximum token limit
  • The smart truncation system keeps 8 recent messages intact, which can be excessive for very large contexts
  • Context compression only happens reactively when hitting limits, rather than proactively
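
For illustration, here is a minimal TypeScript sketch of that reactive truncation pattern. The message shape, token estimate, and constants are assumptions made for the example, not the project's actual implementation:

```typescript
// Minimal sketch of reactive truncation (all names and numbers illustrative).
interface Message {
  role: "user" | "assistant";
  content: string;
}

const RECENT_MESSAGES_TO_PRESERVE = 8; // recent messages kept intact
const MAX_CONTEXT_TOKENS = 128_000;    // assumed hard model limit

function estimateTokens(messages: Message[]): number {
  // Rough heuristic: ~4 characters per token.
  return Math.ceil(messages.reduce((n, m) => n + m.content.length, 0) / 4);
}

// Drops the older half of the conversation in one pass, keeping the most
// recent messages untouched.
function truncateHalfConversation(messages: Message[]): Message[] {
  const preserved = messages.slice(-RECENT_MESSAGES_TO_PRESERVE);
  const older = messages.slice(0, -RECENT_MESSAGES_TO_PRESERVE);
  const kept = older.slice(Math.floor(older.length / 2)); // drop the older half
  return [...kept, ...preserved];
}

function prepareContext(messages: Message[]): Message[] {
  // Reactive: compression only runs once the limit has already been hit,
  // so the expensive pass lands exactly when the context is at its largest.
  return estimateTokens(messages) > MAX_CONTEXT_TOKENS
    ? truncateHalfConversation(messages)
    : messages;
}
```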

The slowdown is primarily caused by:

  • The reactive nature of context compression (only happens when hitting limits)
  • Sequential processing of chunks with fixed delays
  • Keeping too many recent messages intact during truncation
  • Multiple retry attempts when context is too long

To improve performance, consider:

  1. Implementing proactive context compression before hitting limits
  2. Adjusting the RECENT_MESSAGES_TO_PRESERVE count based on context size
  3. Using a dynamic debouncer delay based on context size
  4. Implementing parallel chunk processing for large contexts
  5. Adding progressive context compression instead of waiting for full truncation

These changes would help maintain consistent streaming performance even as context size increases.
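
As a rough illustration of suggestions 1 and 3, here is a hedged TypeScript sketch of proactive compression plus a debounce delay that scales with context size. The names and thresholds are assumptions for the example, not part of the existing codebase:

```typescript
// Illustrative constants and helpers; not the project's API.
const MAX_CONTEXT_TOKENS = 128_000;
const PROACTIVE_THRESHOLD = 0.8; // start compressing at 80% of the limit
const BASE_DELAY_MS = 25;        // today's fixed chunk-processing delay
const MAX_DELAY_MS = 100;        // cap so the UI never waits too long per flush

function shouldCompress(contextTokens: number): boolean {
  // Proactive: trigger compression before the hard limit is reached, so the
  // expensive truncation pass never lands in the middle of a streaming reply.
  return contextTokens >= MAX_CONTEXT_TOKENS * PROACTIVE_THRESHOLD;
}

function debounceDelay(contextTokens: number): number {
  // Dynamic debounce: with a small context, flush chunks quickly; as the
  // context (and per-update render cost) grows, batch chunks into fewer,
  // larger UI updates instead of paying a fixed 25ms per chunk.
  const scale = Math.min(contextTokens / MAX_CONTEXT_TOKENS, 1);
  return Math.round(BASE_DELAY_MS + (MAX_DELAY_MS - BASE_DELAY_MS) * scale);
}
```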
