🤖 fix: make OpenAI truncation integration test more robust #253

ammar-agent · 2025-10-14T19:18:03Z

Problem

Integration test was flaking: OpenAI auto truncation integration > should include full file_edit diff in UI/history but redact it from the next provider request

Failure mode: the stream completes its tool calls but never emits a stream-end event, so the test times out.

Root cause: AI models can finish tool execution without generating any text output. This behavior is non-deterministic: sometimes the model responds with text after its tool calls, and sometimes it doesn't.

See full analysis in E2E_FLAKE_ANALYSIS.md.
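
To make the failure mode concrete, here is a minimal sketch that classifies a recorded event sequence. The event shapes and the `classifyStream` helper are hypothetical and exist only for illustration; the real stream events in this codebase may look different.

```typescript
// Hypothetical event shapes (assumption, not the project's actual types).
type ObservedEvent =
  | { type: "tool-call-end"; toolName: string }
  | { type: "text-delta"; text: string }
  | { type: "stream-end" };

type StreamOutcome = "completed" | "tool-only-stall" | "incomplete";

// Classify what a test would have observed by the time it gave up waiting.
function classifyStream(events: ObservedEvent[]): StreamOutcome {
  // A stream-end event means the test can stop waiting.
  if (events.some((e) => e.type === "stream-end")) {
    return "completed";
  }
  const sawTool = events.some((e) => e.type === "tool-call-end");
  const sawText = events.some((e) => e.type === "text-delta");
  // Tool calls finished, no text followed, and no stream-end arrived:
  // this is the case described above that leads to the timeout.
  return sawTool && !sawText ? "tool-only-stall" : "incomplete";
}

// Usage: the flaky run, where the model stopped after the tool call.
const flakyRun: ObservedEvent[] = [
  { type: "tool-call-end", toolName: "file_edit_replace" },
];
console.log(classifyStream(flakyRun)); // "tool-only-stall"
```

A check like this could help tell a tool-only stall apart from an unrelated timeout when triaging a failed run.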

Solution

Modified test prompt to explicitly request confirmation after tool execution:

- "Open and replace 'line2' with 'LINE2' using file_edit_replace"
+ "Open and replace 'line2' with 'LINE2' using file_edit_replace, then confirm the change was successfully applied."

This encourages the AI to generate text output after completing tools, ensuring the stream finishes properly.

Trade-offs

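Short-term fix: the prompt change makes a text response after the tool call more likely, which reduces flakiness but does not guarantee that the model produces text
Long-term fix: the stream manager should detect tool-only responses and auto-emit stream-end (tracked for future work)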
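
The long-term idea could look roughly like the sketch below. It is not the stream manager's actual code: the `StreamEvent` type and the `ensureStreamEnd` helper are assumptions made for illustration, and the sketch assumes the provider stream has already finished and its events were collected into an array.

```typescript
// Hypothetical event shapes (assumption, not the project's actual types).
type StreamEvent =
  | { type: "tool-call-end"; toolName: string }
  | { type: "text-delta"; text: string }
  | { type: "stream-end" };

// Given the events of a finished provider stream, guarantee that consumers
// see exactly one terminating stream-end, even for tool-only responses.
function ensureStreamEnd(events: StreamEvent[]): StreamEvent[] {
  const alreadyEnded = events.some((e) => e.type === "stream-end");
  if (alreadyEnded) {
    // Nothing to do: return a copy so callers can't mutate the input.
    return [...events];
  }
  // The provider finished without text and without stream-end:
  // synthesize the terminating event.
  return [...events, { type: "stream-end" }];
}

// Usage: a tool-only response gains a synthesized stream-end.
const toolOnly: StreamEvent[] = [
  { type: "tool-call-end", toolName: "file_edit_replace" },
];
const normalized = ensureStreamEnd(toolOnly);
console.log(normalized[normalized.length - 1].type); // "stream-end"
```

With a guarantee like this in place, the test would no longer depend on the prompt wording to finish.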

Testing

Integration tests should be more stable
Test still validates the actual truncation/redaction behavior
No changes to production code

Generated with cmux

The integration test was flaky because AI models sometimes complete tool calls without generating text output, causing the stream to never emit stream-end. Fix: Modified test prompt to request confirmation after tool execution. This encourages the AI to generate text output, ensuring the stream completes properly. Added analysis document explaining the root cause and potential solutions.

ammario enabled auto-merge October 14, 2025 19:23

ammario added this pull request to the merge queue Oct 14, 2025

github-merge-queue bot removed this pull request from the merge queue due to failed status checks Oct 14, 2025

ammar-agent added 2 commits October 14, 2025 15:03

remove stray analysis doc

d87478d

ammar-agent force-pushed the investigate-e2e-flake branch from 42c093d to d87478d Compare October 14, 2025 20:04
