fix: Manual compaction does not update context window. #6682
Conversation
Pull request overview
Fixes manual compaction so that session token/context metrics are updated alongside the conversation history, and adds tests intended to validate compaction behavior across manual, auto-threshold, and context-limit-recovery triggers.
Changes:
- Update the `/compact` command to apply the compaction `ProviderUsage` to session metrics (input/output/total + accumulated).
- Add a mock provider and new integration-style tests asserting conversation visibility and token accounting after compaction paths.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| crates/goose/src/agents/execute_commands.rs | Updates manual compaction to record usage into session metrics after replacing the conversation. |
| crates/goose/tests/agent.rs | Adds new compaction-focused tests and a mock provider to validate conversation state + token metrics. |
crates/goose/tests/agent.rs
Outdated
```rust
// Note: The current session state reflects the compaction operation,
// as the agent records compaction metrics before retrying
let final_input = updated_session.input_tokens.unwrap();
let final_output = updated_session.output_tokens;
let final_total = updated_session.total_tokens.unwrap();

// After compaction during recovery, the session shows the compaction tokens
// Input: system (6000) + long_tool_call messages (~15,400) + new message (100) = ~21,500
// Output: 200 (compaction summary)
// Total: ~21,700
assert!(
    final_input >= 21000 && final_input <= 22000,
    "Final input should reflect compaction input (~21,500). Got: {}",
    final_input
);

assert_eq!(
    final_output,
    Some(200),
    "Final output should be compaction output (200). Got: {:?}",
    final_output
);

assert_eq!(
    final_total,
    final_input + final_output.unwrap(),
    "Final total should equal input + output"
);
```
`test_context_limit_recovery_compaction` asserts that the final session metrics reflect the compaction call (e.g., `output_tokens == Some(200)`), but the agent loop retries after recovery and calls `update_session_metrics(..., is_compaction_usage=false)` for the successful reply, so the final session state should reflect the reply usage (e.g., output tokens ~100) rather than the compaction usage.
crates/goose/tests/agent.rs
Outdated
```rust
// Setup session with many messages to have substantial context
// 20 exchanges = 40 messages * 100 tokens = ~4000 tokens in conversation
let mut messages = vec![];
for i in 0..20 {
    messages.push(Message::user().with_text(format!("User message {}", i)));
    messages.push(Message::assistant().with_text(format!("Assistant response {}", i)));
}

let session =
    setup_test_session(&agent, &temp_dir, "auto-compact-test", messages).await?;

// Capture initial context size before triggering reply
// Should be: system (6000) + 40 messages (4000) = ~10000 tokens
let initial_session = agent
    .config
    .session_manager
    .get_session(&session.id, true)
    .await?;
let initial_input_tokens = initial_session.input_tokens.unwrap_or(0);

// Setup mock provider (no context limit enforcement)
let provider = Arc::new(MockCompactionProvider::new());
agent.update_provider(provider, &session.id).await?;
```
`test_auto_compaction_during_reply` is unlikely to ever exercise auto-compaction because `setup_test_session` sets `session.total_tokens` to `Some(1000)`, and `check_if_compaction_needed` uses that metadata (instead of estimating from the 40 messages) against a large default context limit (128k for an unknown model), so `needs_auto_compact` will stay false even with lots of messages.
crates/goose/tests/agent.rs
Outdated
```rust
// For compaction: system prompt length is a good proxy for conversation size
// Base: 6000 (system) + conversation content embedded in prompt
6000 + (system_prompt.len() as i32 / 4).max(400)
```
`MockCompactionProvider::calculate_input_tokens` derives compaction input tokens from `system_prompt.len()`, which makes these tests brittle to unrelated edits in the compaction.md template or prompt formatting (token assertions/ranges may fail even if compaction logic is correct); consider making the mock's compaction token accounting independent of the rendered prompt text (e.g., based on number/size of messages or a fixed value).
Suggested change:
```diff
-// For compaction: system prompt length is a good proxy for conversation size
-// Base: 6000 (system) + conversation content embedded in prompt
-6000 + (system_prompt.len() as i32 / 4).max(400)
+// For compaction: use a fixed heuristic value so tests are stable
+// Approximate previous minimum of 6000 (system) + 400 (conversation content)
+6400
```
```rust
schedule_id: session.schedule_id,
max_turns: None,
retry_config: None,
};
```
uhm, that doesn't seem great. change `update_session_metrics` to just take a `session_id` or get your hands on the actual `SessionConfig` (I think we can do the first thing)
Done, although this `schedule_id` leaked out? We had it in all places, so just continuing to pass it along.
crates/goose/tests/agent.rs
Outdated
| } | ||
|
|
||
| #[cfg(test)] | ||
| mod compaction_tests { |
this is a lot of lines; should this live in its own test module, maybe? also, the setup of the three tests seems similar; can we unify?
Moved it. There's a lot of shared code in the setup of the provider and session, plus validations on post-compaction session state. Maybe we could squeeze out a little more with the messages, but I think the individual tests tell a clear story of the context before/after compaction, and I'm seeding the one that triggers out-of-context with extra messages.
```rust
if compaction_occurred {
    // Verify that current input context decreased after compaction
    let tokens_after = input_tokens_after_compaction
        .expect("Should have captured tokens after compaction");

    // Before compaction: system (6000) + 40 messages (4000) = 10,000 tokens
    // After compaction: only the summary (200 tokens) - this becomes the new input
    assert!(
        tokens_after < initial_input_tokens,
        "Input tokens should decrease after compaction. Before: {}, After: {}",
        initial_input_tokens,
        tokens_after
    );

    // After compaction, input should be exactly the summary: 200 tokens
    assert_eq!(
        tokens_after, 200,
        "Input tokens after compaction should be exactly 200 (summary). Got: {}",
        tokens_after
    );
```
`test_auto_compaction_during_reply` doesn't actually exercise the auto-compaction path: `setup_test_session` hard-codes `session.total_tokens` to 1000 and `MockCompactionProvider` uses the default model context limit (128k), so `check_if_compaction_needed` will never request compaction and the `compaction_occurred` branch is effectively dead; make this deterministic by setting the mock model's `context_limit` (e.g., via `ModelConfig::with_context_limit`) and initializing `session.total_tokens` high enough (or `None` to force estimation) so the threshold is exceeded.
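The arithmetic behind this comment can be sketched with a simplified threshold check. The function below is illustrative only — the real `check_if_compaction_needed` lives in goose's agent code and its signature differs; the 0.8 threshold is an assumed value for the sketch.

```rust
// Illustrative threshold check: compaction is requested when recorded session
// totals exceed threshold * context_limit. The real implementation would fall
// back to estimating from messages when no totals are recorded.
fn check_if_compaction_needed(total_tokens: Option<i32>, context_limit: i32, threshold: f64) -> bool {
    match total_tokens {
        Some(total) => (total as f64) > (context_limit as f64) * threshold,
        None => false, // real code would estimate from the conversation here
    }
}

fn main() {
    // As the test stands: total_tokens = Some(1000) vs the 128k default limit,
    // so auto-compaction never fires and the branch is dead.
    assert!(!check_if_compaction_needed(Some(1_000), 128_000, 0.8));
    // With an explicit small context limit on the mock model, the same session
    // exceeds the threshold (10,000 > 0.8 * 12,000 = 9,600) and compaction fires.
    assert!(check_if_compaction_needed(Some(10_000), 12_000, 0.8));
    println!("threshold check behaves as described");
}
```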
```rust
// as the agent records compaction metrics before retrying
let final_input = updated_session.input_tokens.unwrap();
let final_output = updated_session.output_tokens;
let final_total = updated_session.total_tokens.unwrap();

// After compaction during recovery, the session shows the compaction tokens
// Input: system (6000) + long_tool_call messages (~15,400) + new message (100) = ~21,500
// Output: 200 (compaction summary)
// Total: ~21,700
assert!(
    (21000..=22000).contains(&final_input),
    "Final input should reflect compaction input (~21,500). Got: {}",
    final_input
);

assert_eq!(
    final_output,
    Some(200),
    "Final output should be compaction output (200). Got: {:?}",
```
In `test_context_limit_recovery_compaction`, the assertions on `updated_session.input_tokens`/`output_tokens` after the stream finishes assume the session metrics still reflect the compaction call, but agent.rs continues the loop after `HistoryReplaced` and then calls `update_session_metrics(..., is_compaction_usage=false)` for the retry response, which overwrites the current token window; assert the post-compaction state using the `input_tokens_after_compaction` snapshot and expect the final session metrics to match the successful retry instead.
Suggested change:
```diff
-// as the agent records compaction metrics before retrying
-let final_input = updated_session.input_tokens.unwrap();
-let final_output = updated_session.output_tokens;
-let final_total = updated_session.total_tokens.unwrap();
-// After compaction during recovery, the session shows the compaction tokens
-// Input: system (6000) + long_tool_call messages (~15,400) + new message (100) = ~21,500
-// Output: 200 (compaction summary)
-// Total: ~21,700
-assert!(
-    (21000..=22000).contains(&final_input),
-    "Final input should reflect compaction input (~21,500). Got: {}",
-    final_input
-);
-assert_eq!(
-    final_output,
-    Some(200),
-    "Final output should be compaction output (200). Got: {:?}",
+// as the agent records compaction metrics before retrying.
+// After compaction during recovery, the snapshot should show the compaction tokens:
+// Input: system (6000) + long_tool_call messages (~15,400) + new message (100) = ~21,500
+assert!(
+    (21000..=22000).contains(&input_tokens_after_compaction),
+    "Compaction input should be ~21,500. Got: {}",
+    input_tokens_after_compaction
+);
+// After the retry, the session window should reflect the successful reply, not the compaction call.
+let final_input = updated_session.input_tokens.unwrap();
+let final_output = updated_session.output_tokens;
+let final_total = updated_session.total_tokens.unwrap();
+// Reply: ~6,300 input + 100 output = 6,400
+assert!(
+    (6000..=6500).contains(&final_input),
+    "Final input should reflect reply input (~6,300). Got: {}",
+    final_input
+);
+assert_eq!(
+    final_output,
+    Some(100),
+    "Final output should be reply output (100). Got: {:?}",
```
Summary
Simple fix, but this adds some fairly extensive tests with assertions about conversation state + token counts after all three compaction trigger types.