streaming slowdown when context rises #93

Open
PierrunoYT opened this issue Nov 17, 2024 · 0 comments

The streaming slowdown as the context grows can be attributed to several factors in the current implementation:

  1. Context management mechanism:
  • Uses a static window approach instead of a dynamic sliding window in order to preserve prompt caching
  • When the context grows too large, it triggers truncateHalfConversation, which can cause noticeable delays
  • The system waits until the context is already too large before compressing, rather than managing it preemptively (a rough sketch of this reactive pattern follows the list)
  2. Streaming implementation bottlenecks:
  • The current debouncer uses a fixed 25ms delay for processing chunks
  • All chunks are processed in sequence, which can cause backpressure when the context is large
  • The system retries up to 3 times when the context is too long, and each retry adds latency
  3. Memory management:
  • Large contexts are kept in memory until they hit the maximum token limit
  • The smart truncation system keeps 8 recent messages intact, which can be excessive for very large contexts
  • Context compression only happens reactively when hitting limits, rather than proactively
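
For illustration, here is a minimal TypeScript sketch of that reactive truncation pattern. The message shape, token estimate, and constants are assumptions made for the example, not the project's actual implementation:

```typescript
// Minimal sketch of reactive truncation (all names and numbers illustrative).
interface Message {
  role: "user" | "assistant";
  content: string;
}

const RECENT_MESSAGES_TO_PRESERVE = 8; // recent messages kept intact
const MAX_CONTEXT_TOKENS = 128_000;    // assumed hard model limit

function estimateTokens(messages: Message[]): number {
  // Rough heuristic: ~4 characters per token.
  return Math.ceil(messages.reduce((n, m) => n + m.content.length, 0) / 4);
}

// Drops the older half of the conversation in one pass, keeping the most
// recent messages untouched.
function truncateHalfConversation(messages: Message[]): Message[] {
  const preserved = messages.slice(-RECENT_MESSAGES_TO_PRESERVE);
  const older = messages.slice(0, -RECENT_MESSAGES_TO_PRESERVE);
  const kept = older.slice(Math.floor(older.length / 2)); // drop the older half
  return [...kept, ...preserved];
}

function prepareContext(messages: Message[]): Message[] {
  // Reactive: compression only runs once the limit has already been hit,
  // so the expensive pass lands exactly when the context is at its largest.
  return estimateTokens(messages) > MAX_CONTEXT_TOKENS
    ? truncateHalfConversation(messages)
    : messages;
}
```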

The slowdown is primarily caused by:

  • The reactive nature of context compression (only happens when hitting limits)
  • Sequential processing of chunks with fixed delays
  • Keeping too many recent messages intact during truncation
  • Multiple retry attempts when context is too long

To improve performance, consider:

  1. Implementing proactive context compression before hitting limits
  2. Adjusting the RECENT_MESSAGES_TO_PRESERVE count based on context size
  3. Using a dynamic debouncer delay based on context size
  4. Implementing parallel chunk processing for large contexts
  5. Adding progressive context compression instead of waiting for full truncation

These changes would help maintain consistent streaming performance even as context size increases.
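
As a rough illustration of suggestions 1 and 3, here is a hedged TypeScript sketch of proactive compression plus a debounce delay that scales with context size. The names and thresholds are assumptions for the example, not part of the existing codebase:

```typescript
// Illustrative constants and helpers; not the project's API.
const MAX_CONTEXT_TOKENS = 128_000;
const PROACTIVE_THRESHOLD = 0.8; // start compressing at 80% of the limit
const BASE_DELAY_MS = 25;        // today's fixed chunk-processing delay
const MAX_DELAY_MS = 100;        // cap so the UI never waits too long per flush

function shouldCompress(contextTokens: number): boolean {
  // Proactive: trigger compression before the hard limit is reached, so the
  // expensive truncation pass never lands in the middle of a streaming reply.
  return contextTokens >= MAX_CONTEXT_TOKENS * PROACTIVE_THRESHOLD;
}

function debounceDelay(contextTokens: number): number {
  // Dynamic debounce: with a small context, flush chunks quickly; as the
  // context (and per-update render cost) grows, batch chunks into fewer,
  // larger UI updates instead of paying a fixed 25ms per chunk.
  const scale = Math.min(contextTokens / MAX_CONTEXT_TOKENS, 1);
  return Math.round(BASE_DELAY_MS + (MAX_DELAY_MS - BASE_DELAY_MS) * scale);
}
```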
