
fix: Manual compaction does not update context window. #6682

Merged
katzdave merged 11 commits into main from dkatz/compaction-usage-fix
Jan 26, 2026
Conversation

@katzdave
Collaborator

Summary

Simple fix, but I added fairly extensive tests with assertions about conversation state and token counts after all three compaction trigger types.

@katzdave katzdave marked this pull request as ready for review January 23, 2026 21:37
Copilot AI review requested due to automatic review settings January 23, 2026 21:37
Contributor

Copilot AI left a comment


Pull request overview

Fixes manual compaction so that session token/context metrics are updated alongside the conversation history, and adds tests intended to validate compaction behavior across manual, auto-threshold, and context-limit-recovery triggers.

Changes:

  • Update /compact command to apply compaction ProviderUsage to session metrics (input/output/total + accumulated).
  • Add a mock provider and new integration-style tests asserting conversation visibility and token accounting after compaction paths.
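The first bullet's fix can be sketched roughly as follows. This is a minimal stand-in, not the real goose code: `SessionMetrics`, `ProviderUsage`, and `apply_compaction_usage` here are simplified illustrations of the session-metrics update the PR adds after a manual `/compact`.

```rust
// Stand-in types for illustration; the real goose ProviderUsage and session
// structs differ. This only sketches the metric update the fix adds.
#[derive(Default, Debug)]
struct SessionMetrics {
    input_tokens: Option<i64>,
    output_tokens: Option<i64>,
    total_tokens: Option<i64>,
    accumulated_total_tokens: i64,
}

struct ProviderUsage {
    input_tokens: i64,
    output_tokens: i64,
}

impl SessionMetrics {
    // Overwrite the current context window with the compaction call's usage
    // and fold it into the accumulated counter.
    fn apply_compaction_usage(&mut self, usage: &ProviderUsage) {
        let total = usage.input_tokens + usage.output_tokens;
        self.input_tokens = Some(usage.input_tokens);
        self.output_tokens = Some(usage.output_tokens);
        self.total_tokens = Some(total);
        self.accumulated_total_tokens += total;
    }
}

fn main() {
    let mut metrics = SessionMetrics::default();
    metrics.apply_compaction_usage(&ProviderUsage {
        input_tokens: 21_500,
        output_tokens: 200,
    });
    println!("window total = {:?}", metrics.total_tokens); // prints Some(21700)
}
```

The key point is that before the fix, manual compaction replaced the conversation history but never wrote the compaction call's usage back into the session, leaving the displayed context window stale.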

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

Files changed:
crates/goose/src/agents/execute_commands.rs: Updates manual compaction to record usage into session metrics after replacing the conversation.
crates/goose/tests/agent.rs: Adds new compaction-focused tests and a mock provider to validate conversation state + token metrics.

Comment on lines 1284 to 1311
// Note: The current session state reflects the compaction operation,
// as the agent records compaction metrics before retrying
let final_input = updated_session.input_tokens.unwrap();
let final_output = updated_session.output_tokens;
let final_total = updated_session.total_tokens.unwrap();

// After compaction during recovery, the session shows the compaction tokens
// Input: system (6000) + long_tool_call messages (~15,400) + new message (100) = ~21,500
// Output: 200 (compaction summary)
// Total: ~21,700
assert!(
    final_input >= 21000 && final_input <= 22000,
    "Final input should reflect compaction input (~21,500). Got: {}",
    final_input
);

assert_eq!(
    final_output,
    Some(200),
    "Final output should be compaction output (200). Got: {:?}",
    final_output
);

assert_eq!(
    final_total,
    final_input + final_output.unwrap(),
    "Final total should equal input + output"
);

Copilot AI Jan 23, 2026


test_context_limit_recovery_compaction asserts that the final session metrics reflect the compaction call (e.g., output_tokens == Some(200)), but the agent loop retries after recovery and calls update_session_metrics(..., is_compaction_usage=false) for the successful reply, so the final session state should reflect the reply usage (e.g., output tokens ~100) rather than the compaction usage.
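The ordering this comment describes can be shown with a toy simulation. This is an assumed flow, not the real agent loop, and `update_session_metrics` here is a stand-in with a simplified signature: both the compaction call and the retried reply write through the same path, so the later write wins for the current token window.

```rust
// Toy simulation of the sequencing the review comment describes (assumed
// flow, not the real goose agent loop).
#[derive(Debug, Clone, Copy, PartialEq)]
struct Window {
    input: i64,
    output: i64,
}

// Stand-in for the real update_session_metrics: compaction and normal replies
// both overwrite the current-window fields, regardless of the flag.
fn update_session_metrics(window: &mut Window, input: i64, output: i64, _is_compaction_usage: bool) {
    *window = Window { input, output };
}

fn main() {
    let mut window = Window { input: 0, output: 0 };
    update_session_metrics(&mut window, 21_500, 200, true); // compaction during recovery
    update_session_metrics(&mut window, 6_300, 100, false); // successful retry reply
    // The final session state mirrors the retry reply, not the compaction call.
    println!("{:?}", window); // prints Window { input: 6300, output: 100 }
}
```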

Comment on lines 1005 to 1027
// Setup session with many messages to have substantial context
// 20 exchanges = 40 messages * 100 tokens = ~4000 tokens in conversation
let mut messages = vec![];
for i in 0..20 {
    messages.push(Message::user().with_text(format!("User message {}", i)));
    messages.push(Message::assistant().with_text(format!("Assistant response {}", i)));
}

let session = setup_test_session(&agent, &temp_dir, "auto-compact-test", messages).await?;

// Capture initial context size before triggering reply
// Should be: system (6000) + 40 messages (4000) = ~10000 tokens
let initial_session = agent
    .config
    .session_manager
    .get_session(&session.id, true)
    .await?;
let initial_input_tokens = initial_session.input_tokens.unwrap_or(0);

// Setup mock provider (no context limit enforcement)
let provider = Arc::new(MockCompactionProvider::new());
agent.update_provider(provider, &session.id).await?;

Copilot AI Jan 23, 2026


test_auto_compaction_during_reply is unlikely to ever exercise auto-compaction because setup_test_session sets session.total_tokens to Some(1000), and check_if_compaction_needed uses that metadata (instead of estimating from the 40 messages) against a large default context limit (128k for an unknown model), so needs_auto_compact will stay false even with lots of messages.
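The check the comment refers to can be sketched like this. The real `check_if_compaction_needed` in goose has a different signature; the names, the 80% threshold, and the 128k default are assumptions drawn from this comment, used only to show why recorded metadata short-circuits the estimate.

```rust
// Hedged sketch of the auto-compaction threshold check; names and values are
// assumptions for illustration, not goose's actual implementation.
fn needs_auto_compact(
    session_total_tokens: Option<i64>, // recorded session metadata, if any
    estimated_tokens: i64,             // fallback estimate from message sizes
    context_limit: i64,                // e.g. 128_000 for an unknown model
    threshold: f64,                    // e.g. 0.8 => compact at 80% of the window
) -> bool {
    // Recorded metadata wins over the estimate, which is why a hard-coded
    // total_tokens of 1000 keeps this false no matter how many messages exist.
    let used = session_total_tokens.unwrap_or(estimated_tokens);
    (used as f64) >= (context_limit as f64) * threshold
}

fn main() {
    // Mirrors the test setup: total_tokens = Some(1000) vs a 128k default limit.
    println!("{}", needs_auto_compact(Some(1_000), 10_000, 128_000, 0.8)); // prints false
    // Forcing estimation (None) with a large conversation would trigger compaction.
    println!("{}", needs_auto_compact(None, 110_000, 128_000, 0.8)); // prints true
}
```

Under this model, the fix the reviewer suggests (shrink the mock model's context limit or leave `total_tokens` unset) makes the branch deterministic.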

Comment on lines 615 to 617
// For compaction: system prompt length is a good proxy for conversation size
// Base: 6000 (system) + conversation content embedded in prompt
6000 + (system_prompt.len() as i32 / 4).max(400)

Copilot AI Jan 23, 2026


MockCompactionProvider::calculate_input_tokens derives compaction input tokens from system_prompt.len(), which makes these tests brittle to unrelated edits in the compaction.md template or prompt formatting (token assertions/ranges may fail even if compaction logic is correct); consider making the mock's compaction token accounting independent of the rendered prompt text (e.g., based on number/size of messages or a fixed value).

Suggested change:
- // For compaction: system prompt length is a good proxy for conversation size
- // Base: 6000 (system) + conversation content embedded in prompt
- 6000 + (system_prompt.len() as i32 / 4).max(400)
+ // For compaction: use a fixed heuristic value so tests are stable
+ // Approximate previous minimum of 6000 (system) + 400 (conversation content)
+ 6400
schedule_id: session.schedule_id,
max_turns: None,
retry_config: None,
};
Collaborator


uhm, that doesn't seem great. change update_session_metrics to just take a session_id or get your hands on the actual SessionConfig (I think we can do the first thing)

Collaborator Author


Done. Although, this schedule_id leaked out? We had it in all places, so I'm just continuing to pass it along.

}

#[cfg(test)]
mod compaction_tests {
Collaborator


this is a lot of lines; should this live in its own test module maybe? also the setup of three tests seems similar; can we unify?

Collaborator Author


Moved it. There's a lot of shared code in the setup of the provider and session, plus the validations on post-compaction session state. We could maybe squeeze a little more out of the messages, but I think the individual tests tell a clear story of the context before/after compaction, and I'm seeding the one that triggers out-of-context with extra messages.

Copilot AI review requested due to automatic review settings January 23, 2026 22:26
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

Comment on lines 500 to 520
if compaction_occurred {
    // Verify that current input context decreased after compaction
    let tokens_after = input_tokens_after_compaction
        .expect("Should have captured tokens after compaction");

    // Before compaction: system (6000) + 40 messages (4000) = 10,000 tokens
    // After compaction: only the summary (200 tokens) - this becomes the new input
    assert!(
        tokens_after < initial_input_tokens,
        "Input tokens should decrease after compaction. Before: {}, After: {}",
        initial_input_tokens,
        tokens_after
    );

    // After compaction, input should be exactly the summary: 200 tokens
    assert_eq!(
        tokens_after, 200,
        "Input tokens after compaction should be exactly 200 (summary). Got: {}",
        tokens_after
    );


Copilot AI Jan 23, 2026


test_auto_compaction_during_reply doesn’t actually exercise the auto-compaction path: setup_test_session hard-codes session.total_tokens to 1000 and MockCompactionProvider uses the default model context limit (128k), so check_if_compaction_needed will never request compaction and the compaction_occurred branch is effectively dead; make this deterministic by setting the mock model’s context_limit (e.g., via ModelConfig::with_context_limit) and initializing session.total_tokens high enough (or None to force estimation) so the threshold is exceeded.

Comment on lines +712 to +730
// as the agent records compaction metrics before retrying
let final_input = updated_session.input_tokens.unwrap();
let final_output = updated_session.output_tokens;
let final_total = updated_session.total_tokens.unwrap();

// After compaction during recovery, the session shows the compaction tokens
// Input: system (6000) + long_tool_call messages (~15,400) + new message (100) = ~21,500
// Output: 200 (compaction summary)
// Total: ~21,700
assert!(
    (21000..=22000).contains(&final_input),
    "Final input should reflect compaction input (~21,500). Got: {}",
    final_input
);

assert_eq!(
    final_output,
    Some(200),
    "Final output should be compaction output (200). Got: {:?}",

Copilot AI Jan 23, 2026


In test_context_limit_recovery_compaction, the assertions on updated_session.input_tokens/output_tokens after the stream finishes assume the session metrics still reflect the compaction call, but agent.rs continues the loop after HistoryReplaced and then calls update_session_metrics(..., is_compaction_usage=false) for the retry response, which overwrites the current token window; assert the post-compaction state using the input_tokens_after_compaction snapshot and expect the final session metrics to match the successful retry instead.

Suggested change:
- // as the agent records compaction metrics before retrying
- let final_input = updated_session.input_tokens.unwrap();
- let final_output = updated_session.output_tokens;
- let final_total = updated_session.total_tokens.unwrap();
- // After compaction during recovery, the session shows the compaction tokens
- // Input: system (6000) + long_tool_call messages (~15,400) + new message (100) = ~21,500
- // Output: 200 (compaction summary)
- // Total: ~21,700
- assert!(
-     (21000..=22000).contains(&final_input),
-     "Final input should reflect compaction input (~21,500). Got: {}",
-     final_input
- );
- assert_eq!(
-     final_output,
-     Some(200),
-     "Final output should be compaction output (200). Got: {:?}",
+ // as the agent records compaction metrics before retrying.
+ // After compaction during recovery, the snapshot should show the compaction tokens:
+ // Input: system (6000) + long_tool_call messages (~15,400) + new message (100) = ~21,500
+ assert!(
+     (21000..=22000).contains(&input_tokens_after_compaction),
+     "Compaction input should be ~21,500. Got: {}",
+     input_tokens_after_compaction
+ );
+ // After the retry, the session window should reflect the successful reply, not the compaction call.
+ let final_input = updated_session.input_tokens.unwrap();
+ let final_output = updated_session.output_tokens;
+ let final_total = updated_session.total_tokens.unwrap();
+ // Reply: ~6,300 input + 100 output = 6,400
+ assert!(
+     (6000..=6500).contains(&final_input),
+     "Final input should reflect reply input (~6,300). Got: {}",
+     final_input
+ );
+ assert_eq!(
+     final_output,
+     Some(100),
+     "Final output should be reply output (100). Got: {:?}",

@katzdave katzdave merged commit 214f7a2 into main Jan 26, 2026
31 of 34 checks passed
@katzdave katzdave deleted the dkatz/compaction-usage-fix branch January 26, 2026 20:44
zanesq added a commit that referenced this pull request Jan 27, 2026
…upport

* origin/main: (79 commits)
  fix[format/openai]: return error on empty msg. (#6511)
  Fix: ElevenLabs API Key Not Persisting (#6557)
  Logging uplift for model training purposes (command injection model) [Small change] (#6330)
  fix(goose): only send agent-session-id when a session exists (#6657)
  BERT-based command injection detection in tool calls (#6599)
  chore: [CONTRIBUTING.md] add Hermit to instructions (#6518)
  fix: update Gemini context limits (#6536)
  Document r slash command (#6724)
  Upgrade GitHub Actions to latest versions (#6700)
  fix: Manual compaction does not update context window. (#6682)
  Removed the Acceptable Usage Policy (#6204)
  Document spellcheck toggle (#6721)
  fix: docs workflow cleanup and prevent cancellations (#6713)
  Docs: file bug directly (#6718)
  fix: dispatch ADD_ACTIVE_SESSION event before navigating from "View All" (#6679)
  Speed up Databricks provider init by removing fetch of supported models (#6616)
  fix: correct typos in documentation and Justfile (#6686)
  docs: frameDomains and baseUriDomains for mcp apps (#6684)
  docs: add Remotion video creation tutorial (#6675)
  docs: export recipe and copy yaml (#6680)
  ...

# Conflicts:
#	ui/desktop/src/hooks/useChatStream.ts
katzdave added a commit that referenced this pull request Jan 27, 2026
…ovider

* 'main' of github.com:block/goose:
  fix: Manual compaction does not update context window. (#6682)
  Removed the Acceptable Usage Policy (#6204)
  Document spellcheck toggle (#6721)
  fix: docs workflow cleanup and prevent cancellations (#6713)
  Docs: file bug directly (#6718)
  fix: dispatch ADD_ACTIVE_SESSION event before navigating from "View All" (#6679)
  Speed up Databricks provider init by removing fetch of supported models (#6616)
  fix: correct typos in documentation and Justfile (#6686)
  docs: frameDomains and baseUriDomains for mcp apps (#6684)
zanesq added a commit that referenced this pull request Jan 27, 2026
* 'main' of github.com:block/goose:
  Create default gooseignore file when missing (#6498)
  fix slash and @ keyboard navigation popover background color (#6550)
  fix[format/openai]: return error on empty msg. (#6511)
  Fix: ElevenLabs API Key Not Persisting (#6557)
  Logging uplift for model training purposes (command injection model) [Small change] (#6330)
  fix(goose): only send agent-session-id when a session exists (#6657)
  BERT-based command injection detection in tool calls (#6599)
  chore: [CONTRIBUTING.md] add Hermit to instructions (#6518)
  fix: update Gemini context limits (#6536)
  Document r slash command (#6724)
  Upgrade GitHub Actions to latest versions (#6700)
  fix: Manual compaction does not update context window. (#6682)
  Removed the Acceptable Usage Policy (#6204)
  Document spellcheck toggle (#6721)
  fix: docs workflow cleanup and prevent cancellations (#6713)
  Docs: file bug directly (#6718)