fix: suppress spurious timeout error in Slack when streaming finalization fails #1999
```diff
@@ -16,6 +16,8 @@ const logger = getLogger('slack-streaming');
 const STREAM_TIMEOUT_MS = 120_000;
 const CHATSTREAM_OP_TIMEOUT_MS = 10_000;
+/** Shorter timeout for best-effort cleanup in error paths to bound total error handling time. */
+const CLEANUP_TIMEOUT_MS = 3_000;
 
 /**
  * Wrap a promise with a timeout to prevent indefinite blocking on Slack API calls.
```
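The `withTimeout` wrapper used throughout this diff is not shown in full. A minimal sketch of what such a helper might look like, assuming it rejects with a labeled error once the deadline passes (this is illustrative, not the project's actual implementation):

```typescript
/**
 * Illustrative sketch of a withTimeout helper (assumed, not the project's
 * actual code): race the operation against a timer, include the label in
 * the timeout error for readable logs, and always clear the timer.
 */
function withTimeout<T>(promise: Promise<T>, timeoutMs: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${timeoutMs}ms`)),
      timeoutMs
    );
  });
  return Promise.race([promise, timeout]).finally(() => {
    if (timer !== undefined) clearTimeout(timer);
  });
}
```

With this shape, `withTimeout(streamer.stop(), CHATSTREAM_OP_TIMEOUT_MS, 'streamer.stop')` rejects after 10 s with an error naming the operation that stalled.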
```diff
@@ -260,11 +262,21 @@ export async function streamAgentResponse(params: {
     clearTimeout(timeoutId);
 
     const contextBlock = createContextBlock({ agentName });
-    await withTimeout(
-      streamer.stop({ blocks: [contextBlock] }),
-      CHATSTREAM_OP_TIMEOUT_MS,
-      'streamer.stop'
-    );
+    try {
+      await withTimeout(
+        streamer.stop({ blocks: [contextBlock] }),
+        CHATSTREAM_OP_TIMEOUT_MS,
+        'streamer.stop'
+      );
+    } catch (stopError) {
+      // If content was already delivered to the user, a streamer.stop() timeout
+      // is a non-critical finalization error — log it but don't surface to user.
+      span.setAttribute(SLACK_SPAN_KEYS.STREAM_FINALIZATION_FAILED, true);
+      logger.warn(
+        { stopError, channel, threadTs, responseLength: fullText.length },
+        'Failed to finalize chatStream — content was already delivered'
+      );
+    }
```
**Contributor** commented on lines 271 to 279:

💭 Consider: Add span attribute for finalization failures

**Issue:** When `streamer.stop()` times out here, the failure is only logged; nothing on the span marks the degraded outcome.

**Why:** Being able to query for "streams that succeeded but had finalization issues" would help identify patterns (e.g., specific agents, time-of-day, message length) that correlate with slow finalization.

**Fix:** Add a span attribute to mark degraded success:

```ts
} catch (stopError) {
  span.setAttribute('slack.finalization_failed', true);
  logger.warn(
    { stopError, channel, threadTs, responseLength: fullText.length },
    'Failed to finalize chatStream — content was already delivered'
  );
}
```
```diff
 
     if (thinkingMessageTs) {
       try {
```
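The essence of the fix, treating a finalization failure as non-fatal once content has reached the user, can be sketched in isolation. The `Streamer` and `Span` interfaces below are minimal stand-ins for the project's real types, and the attribute key string is an assumption:

```typescript
// Minimal stand-ins for the real streamer and tracing span (assumptions).
interface Streamer {
  stop(opts?: { blocks?: unknown[] }): Promise<void>;
}
interface Span {
  setAttribute(key: string, value: boolean): void;
}

// Finalize the stream; on failure, mark the span and record a warning
// instead of rethrowing, so an already-delivered response is not
// reported back to the user as an error.
async function finalizeStream(
  streamer: Streamer,
  span: Span,
  warnings: string[]
): Promise<void> {
  try {
    await streamer.stop();
  } catch (stopError) {
    span.setAttribute('slack.stream_finalization_failed', true);
    warnings.push(`Failed to finalize chatStream: ${String(stopError)}`);
  }
}
```

A streamer whose `stop()` rejects no longer propagates the error; the caller sees a marked span and a warning instead of a user-facing failure.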
```diff
@@ -287,8 +299,36 @@ export async function streamAgentResponse(params: {
   } catch (streamError) {
     clearTimeout(timeoutId);
     if (streamError instanceof Error) setSpanWithError(span, streamError);
 
+    const contentAlreadyDelivered = fullText.length > 0;
+
+    if (contentAlreadyDelivered) {
+      // Content was already streamed to the user — a late error (e.g. streamer.append
+      // timeout on the final chunk) should not surface as a user-facing error message.
+      span.setAttribute(SLACK_SPAN_KEYS.CONTENT_ALREADY_DELIVERED, true);
+      logger.warn(
+        { streamError, channel, threadTs, responseLength: fullText.length },
+        'Error during Slack streaming after content was already delivered — suppressing user-facing error'
+      );
+      await withTimeout(streamer.stop(), CLEANUP_TIMEOUT_MS, 'streamer.stop-cleanup').catch((e) =>
+        logger.warn({ error: e }, 'Failed to stop streamer during error cleanup')
+      );
+
+      if (thinkingMessageTs) {
+        try {
+          await slackClient.chat.delete({ channel, ts: thinkingMessageTs });
+        } catch {
+          // Ignore delete errors in error path
+        }
+      }
+
+      span.end();
+      return { success: true };
```
**Contributor** commented on lines 303 to 326:

💭 Consider: Add span attribute for partial failure tracking

**Issue:** When content has been delivered but a late error occurs, the function returns `{ success: true }` without distinguishing this degraded path from a clean success.

**Why:** Being able to distinguish "clean success" from "graceful degradation after partial failure" in dashboards would help track reliability trends without false-alarming on user-facing errors.

**Fix:** Add a span attribute before returning success:

```ts
if (contentAlreadyDelivered) {
  span.setAttribute('slack.partial_failure', true);
  logger.warn(
    { streamError, channel, threadTs, responseLength: fullText.length },
    'Error during Slack streaming after content was already delivered — suppressing user-facing error'
  );
  // ... rest of cleanup
}
```
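Both review suggestions converge on recording degraded outcomes as span attributes. The diff routes these through a `SLACK_SPAN_KEYS` registry; a hypothetical sketch of such a registry follows (only the key names appear in the diff, the string values here are assumptions):

```typescript
// Hypothetical span-key registry in the style of the SLACK_SPAN_KEYS that
// the diff imports. Centralizing the strings keeps trace queries and
// dashboards consistent across call sites. Values are assumed.
const SLACK_SPAN_KEYS = {
  OUTCOME: 'slack.outcome',
  STREAM_FINALIZATION_FAILED: 'slack.stream_finalization_failed',
  CONTENT_ALREADY_DELIVERED: 'slack.content_already_delivered',
} as const;

// Union of the attribute-key strings, derived from the registry.
type SlackSpanKey = (typeof SLACK_SPAN_KEYS)[keyof typeof SLACK_SPAN_KEYS];
```

A `const` object with `as const` keeps the values as literal types, so a misspelled key fails at compile time rather than silently fragmenting trace data.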
```diff
+    }
+
+    // No content was delivered — surface the error to the user
     logger.error({ streamError }, 'Error during Slack streaming');
-    await withTimeout(streamer.stop(), CHATSTREAM_OP_TIMEOUT_MS, 'streamer.stop').catch((e) =>
+    await withTimeout(streamer.stop(), CLEANUP_TIMEOUT_MS, 'streamer.stop-cleanup').catch((e) =>
       logger.warn({ error: e }, 'Failed to stop streamer during error cleanup')
     );
```
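The branch taken in the catch block reduces to a single predicate, whether any content reached the user. Factoring it out (a hypothetical helper, not present in the diff) makes the policy trivially testable:

```typescript
// Hypothetical helper mirroring the contentAlreadyDelivered check: a late
// streaming error is surfaced to the user only when nothing was delivered.
function shouldSurfaceError(fullText: string): boolean {
  return fullText.length === 0;
}
```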
**Contributor** commented:

🟡 Minor: Retry deduplication bypasses tracing span

**Issue:** This code path returns early before entering `tracer.startActiveSpan()`, making retry acknowledgments invisible to distributed tracing. All other early-return paths in this handler (`url_verification`, `signature_invalid`, `ignored_bot_message`) set `span.setAttribute(SLACK_SPAN_KEYS.OUTCOME, outcome)` before ending the span.

**Why:** Without span tracking, retry frequency is invisible to observability tooling. This makes it harder to detect Slack delivery issues or understand why initial acks are slow, which is important context for debugging latency issues.

**Fix:** Move the retry check inside the `tracer.startActiveSpan()` block and add a new outcome value (e.g., `'acknowledged_retry'`) to the `SlackOutcome` type in `tracer.ts`.
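The restructuring the reviewer describes might look like the sketch below. The handler shape and the stand-in tracer are assumptions; only `'acknowledged_retry'` and the outcome-attribute pattern come from the comment. (Slack does send an `x-slack-retry-num` header on event redeliveries.)

```typescript
// Stand-in span/tracer (assumptions), just enough to show the pattern;
// the real code would use the project's tracer from tracer.ts.
type SlackOutcome = 'acknowledged_retry' | 'processed';

interface Span {
  setAttribute(key: string, value: string): void;
  end(): void;
}

const recordedAttrs: Array<[string, string]> = [];
const tracer = {
  startActiveSpan<T>(name: string, fn: (span: Span) => T): T {
    const span: Span = {
      setAttribute: (k, v) => { recordedAttrs.push([k, v]); },
      end: () => {},
    };
    return fn(span);
  },
};

// Retry deduplication moved inside the span: the early ack now records
// its own outcome value instead of bypassing tracing entirely.
function handleSlackEvent(headers: Record<string, string>): SlackOutcome {
  return tracer.startActiveSpan('slack.event', (span) => {
    if (headers['x-slack-retry-num'] !== undefined) {
      span.setAttribute('slack.outcome', 'acknowledged_retry');
      span.end();
      return 'acknowledged_retry';
    }
    span.setAttribute('slack.outcome', 'processed');
    span.end();
    return 'processed';
  });
}
```

With this shape, a redelivered event still gets acknowledged immediately, but the ack shows up in traces with a queryable outcome instead of vanishing before the span starts.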