Fix #312: Extract output_tokens from message_delta for Anthropic SSE streams (#313)

Merged
ding113 merged 1 commit into dev from fix/issue-312-anthropic-output-tokens on Dec 10, 2025

Conversation

ding113 (Owner) commented Dec 10, 2025

Summary

  • Fixed incorrect output token counting for Anthropic-type providers in SSE streaming responses
  • Now properly extracts output_tokens from message_delta event (at stream end) instead of message_start (at stream beginning)
  • Maintains extraction of input tokens and cache fields (including 5m/1h differentiated billing) from message_start

Problem

Fixes #312

The v0.3.27 update introduced differentiated cache billing (5m/1h) and changed the usage extraction logic to read from message_start. However, message_start appears at the beginning of the stream, before the actual response content has been generated, so its output_tokens is typically 1 (or very low).

Example message_start (incorrect source for output_tokens):

```json
{"type":"message_start","message":{"usage":{"input_tokens":8,"output_tokens":1,...}}}
```

Example message_delta (correct source for final output_tokens):

```json
{"type":"message_delta","delta":{"stop_reason":"tool_use"},"usage":{"output_tokens":356}}
```
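To illustrate the distinction, here is a minimal self-contained sketch (a hypothetical helper, not the project's actual parseSSEData) that scans raw SSE text and keeps only the output_tokens value from the last message_delta event:

```typescript
// Hypothetical helper for illustration: returns the final output_tokens
// reported by a message_delta event, or null if none was seen.
function finalOutputTokens(sseText: string): number | null {
  let result: number | null = null;
  for (const line of sseText.split("\n")) {
    if (!line.startsWith("data:")) continue;
    try {
      const data = JSON.parse(line.slice(5));
      if (data.type === "message_delta" && typeof data.usage?.output_tokens === "number") {
        result = data.usage.output_tokens; // last one wins (emitted at stream end)
      }
    } catch {
      // ignore non-JSON data lines such as "[DONE]"
    }
  }
  return result;
}
```

Run against the two example events above, this would ignore the misleading `output_tokens: 1` in message_start and report 356.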

Solution

Modified parseUsageFromResponseText in src/app/v1/_lib/proxy/response-handler.ts to:

  1. Extract input tokens and cache fields from message_start event (required for 5m/1h cache billing)
  2. Extract final output_tokens from message_delta event (appears at stream end with accurate count)
  3. Merge both metrics to provide complete and accurate usage data

This approach preserves the 5m/1h cache differentiation feature while fixing the output token counting bug.
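The merge described above can be sketched with simplified types (Usage here is a stand-in for the project's UsageMetrics, and mergeUsage is a hypothetical helper, not the actual implementation):

```typescript
// Simplified stand-in for the project's UsageMetrics shape.
interface Usage {
  input_tokens: number;
  output_tokens: number;
  cache_creation_input_tokens?: number;
}

// Keep input tokens and cache fields from message_start; override
// output_tokens with the message_delta value when one was observed.
function mergeUsage(start: Usage, deltaOutputTokens: number | null): Usage {
  return deltaOutputTokens !== null
    ? { ...start, output_tokens: deltaOutputTokens }
    : start;
}
```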

Changes

  • src/app/v1/_lib/proxy/response-handler.ts: Rewrote SSE parsing logic for Claude format to handle message_start and message_delta separately, then merge the results

Testing

  • TypeScript compilation passes
  • Biome lint passes
  • Manual testing with real Anthropic SSE streams (recommended)

Created by Claude AI in response to @claude mention

The previous implementation extracted output_tokens from the message_start event, which appears at the beginning of the SSE stream. This resulted in incorrect (usually 1) output token counts, since the main content had not yet been generated.

This fix:
- Extracts input tokens and cache fields (5m/1h) from message_start
- Extracts final output_tokens from message_delta (at stream end)
- Merges both to get accurate usage metrics

Closes #312

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
gemini-code-assist bot (Contributor) commented:
Summary of Changes

Hello @ding113, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request addresses an issue where Anthropic SSE streaming responses reported incorrect output token counts because usage was extracted prematurely from the message_start event. The change separates the extraction of input tokens and cache information (from message_start) from the final, accurate output token count (from message_delta), then merges the two so that all usage metrics are captured correctly without disrupting the existing cache billing functionality.

Highlights

  • Accurate Anthropic Output Token Counting: Corrected the logic for counting output tokens in Anthropic SSE streaming responses by extracting the final output_tokens value from the message_delta event, which occurs at the end of the stream and provides the accurate count.
  • Preservation of Input Token and Cache Billing Logic: Ensured that the extraction of input tokens and cache-related fields (including the 5m/1h differentiated billing) continues to correctly occur from the message_start event at the beginning of the stream.
  • Unified Usage Metrics: Implemented a merging mechanism to combine the input token and cache data from message_start with the accurate output token count from message_delta, providing complete and correct usage metrics for Anthropic SSE streams.

ding113 merged commit ae29020 into dev Dec 10, 2025 (6 checks passed)
github-project-automation bot moved this from Backlog to Done in Claude Code Hub Roadmap Dec 10, 2025
github-actions bot added labels bug (Something isn't working) and size/XS (Extra Small PR (< 50 lines)) Dec 10, 2025
gemini-code-assist bot (Contributor) left a comment:
Code Review

This pull request correctly fixes an issue with output token counting for Anthropic SSE streams by extracting usage data from message_delta events. The implementation is sound. I've added one suggestion to refactor the logic to make it more robust and maintainable. By explicitly checking the providerType instead of implicitly detecting the provider based on SSE event names, the code becomes easier to understand and less prone to future bugs if other providers adopt similar event names.

Comment on lines 1357 to 1437
```typescript
const events = parseSSEData(responseText);

// Claude SSE special handling:
// - message_start carries input tokens and cache-creation fields (5m/1h differentiated billing)
// - message_delta carries the final output_tokens
// Extract both separately, then merge.
let messageStartUsage: UsageMetrics | null = null;
let messageDeltaOutputTokens: number | null = null;

for (const event of events) {
  if (typeof event.data !== "object" || !event.data) {
    continue;
  }

  const data = event.data as Record<string, unknown>;

  // Claude message_start format: data.message.usage
  // Extract input tokens and cache fields
  if (event.event === "message_start" && data.message && typeof data.message === "object") {
    const messageObj = data.message as Record<string, unknown>;
    if (messageObj.usage && typeof messageObj.usage === "object") {
      const extracted = extractUsageMetrics(messageObj.usage);
      if (extracted) {
        messageStartUsage = extracted;
        logger.debug("[ResponseHandler] Extracted usage from message_start", {
          source: "sse.message_start.message.usage",
          usage: extracted,
        });
      }
    }
  }

  // Claude message_delta format: data.usage.output_tokens
  // Extract the final output_tokens (emitted at stream end)
  if (event.event === "message_delta" && data.usage && typeof data.usage === "object") {
    const deltaUsage = data.usage as Record<string, unknown>;
    if (typeof deltaUsage.output_tokens === "number") {
      messageDeltaOutputTokens = deltaUsage.output_tokens;
      logger.debug("[ResponseHandler] Extracted output_tokens from message_delta", {
        source: "sse.message_delta.usage.output_tokens",
        outputTokens: messageDeltaOutputTokens,
      });
    }
  }

  // Non-Claude SSE handling (Gemini, etc.)
  if (!messageStartUsage && !messageDeltaOutputTokens) {
    // Standard usage fields (data.usage)
    applyUsageValue(data.usage, `sse.${event.event}.usage`);

    // Gemini usageMetadata
    applyUsageValue(data.usageMetadata, `sse.${event.event}.usageMetadata`);

    // Handle response wrapping in SSE
    if (!usageMetrics && data.response && typeof data.response === "object") {
      const responseObj = data.response as Record<string, unknown>;
      applyUsageValue(responseObj.usage, `sse.${event.event}.response.usage`);
      applyUsageValue(responseObj.usageMetadata, `sse.${event.event}.response.usageMetadata`);
    }
  }
}

// Merge the Claude SSE data from message_start and message_delta
if (messageStartUsage) {
  // Override the message_start value with the output_tokens from message_delta
  if (messageDeltaOutputTokens !== null) {
    messageStartUsage.output_tokens = messageDeltaOutputTokens;
    logger.debug(
      "[ResponseHandler] Merged output_tokens from message_delta into message_start usage",
      {
        finalOutputTokens: messageDeltaOutputTokens,
      }
    );
  }
  usageMetrics = adjustUsageForProviderType(messageStartUsage, providerType);
  usageRecord = messageStartUsage as unknown as Record<string, unknown>;
  logger.debug("[ResponseHandler] Final merged usage from Claude SSE", {
    providerType,
    usage: usageMetrics,
  });
}
```
gemini-code-assist bot (Contributor) commented:

Severity: medium

The current implementation correctly handles the Anthropic SSE stream parsing. However, it implicitly detects a Claude stream by checking for message_start or message_delta events. This could be brittle if other providers adopt similar event names in the future.

A more robust and maintainable approach would be to explicitly check the providerType to distinguish between Claude/Anthropic streams and other providers like Gemini. This makes the separation of parsing logic clearer and safer.

I suggest refactoring this block to have a clear if/else based on providerType. This also allows us to restore the more efficient break statement for non-Claude providers.

```typescript
const events = parseSSEData(responseText);

if (providerType === "claude" || providerType === "claude-auth") {
  // Claude SSE special handling:
  // - message_start carries input tokens and cache-creation fields (5m/1h differentiated billing)
  // - message_delta carries the final output_tokens
  // Extract both separately, then merge.
  let messageStartUsage: UsageMetrics | null = null;
  let messageDeltaOutputTokens: number | null = null;

  for (const event of events) {
    if (typeof event.data !== "object" || !event.data) {
      continue;
    }

    const data = event.data as Record<string, unknown>;

    // Claude message_start format: data.message.usage
    // Extract input tokens and cache fields
    if (event.event === "message_start" && data.message && typeof data.message === "object") {
      const messageObj = data.message as Record<string, unknown>;
      if (messageObj.usage && typeof messageObj.usage === "object") {
        const extracted = extractUsageMetrics(messageObj.usage);
        if (extracted) {
          messageStartUsage = extracted;
          logger.debug("[ResponseHandler] Extracted usage from message_start", {
            source: "sse.message_start.message.usage",
            usage: extracted,
          });
        }
      }
    }

    // Claude message_delta format: data.usage.output_tokens
    // Extract the final output_tokens (emitted at stream end)
    if (event.event === "message_delta" && data.usage && typeof data.usage === "object") {
      const deltaUsage = data.usage as Record<string, unknown>;
      if (typeof deltaUsage.output_tokens === "number") {
        messageDeltaOutputTokens = deltaUsage.output_tokens;
        logger.debug("[ResponseHandler] Extracted output_tokens from message_delta", {
          source: "sse.message_delta.usage.output_tokens",
          outputTokens: messageDeltaOutputTokens,
        });
      }
    }
  }

  // Merge the Claude SSE data from message_start and message_delta
  if (messageStartUsage) {
    // Override the message_start value with the output_tokens from message_delta
    if (messageDeltaOutputTokens !== null) {
      messageStartUsage.output_tokens = messageDeltaOutputTokens;
      logger.debug(
        "[ResponseHandler] Merged output_tokens from message_delta into message_start usage",
        {
          finalOutputTokens: messageDeltaOutputTokens,
        }
      );
    }
    usageMetrics = adjustUsageForProviderType(messageStartUsage, providerType);
    usageRecord = messageStartUsage as unknown as Record<string, unknown>;
    logger.debug("[ResponseHandler] Final merged usage from Claude SSE", {
      providerType,
      usage: usageMetrics,
    });
  }
} else {
  // Non-Claude SSE handling (Gemini, etc.)
  for (const event of events) {
    if (usageMetrics) {
      break;
    }

    if (typeof event.data !== "object" || !event.data) {
      continue;
    }

    const data = event.data as Record<string, unknown>;

    // Standard usage fields (data.usage)
    applyUsageValue(data.usage, `sse.${event.event}.usage`);

    // Gemini usageMetadata
    applyUsageValue(data.usageMetadata, `sse.${event.event}.usageMetadata`);

    // Handle response wrapping in SSE
    if (!usageMetrics && data.response && typeof data.response === "object") {
      const responseObj = data.response as Record<string, unknown>;
      applyUsageValue(responseObj.usage, `sse.${event.event}.response.usage`);
      applyUsageValue(responseObj.usageMetadata, `sse.${event.event}.response.usageMetadata`);
    }
  }
}
```

github-actions bot (Contributor) left a comment:

Code Review Summary

No significant issues identified in this PR. The implementation correctly addresses the bug where output_tokens was incorrectly extracted from message_start (which contains the initial/low value) instead of message_delta (which contains the final accurate count).

PR Size: XS

  • Lines changed: 82 (66 additions, 16 deletions)
  • Files changed: 1

Analysis Notes

The refactored parseUsageFromResponseText function now:

  1. ✅ Correctly separates Claude SSE handling from other formats (Gemini, etc.)
  2. ✅ Extracts input_tokens and cache fields from message_start event
  3. ✅ Extracts final output_tokens from message_delta event
  4. ✅ Merges both metrics appropriately
  5. ✅ Maintains backward compatibility with non-Claude SSE formats

The logic gate if (!messageStartUsage && !messageDeltaOutputTokens) correctly ensures non-Claude SSE processing only occurs when no Claude-specific events are detected, preventing format conflicts.
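The gate's effect can be illustrated with a simplified stand-in for the real event loop (the Event shape and the extractUsageSource helper are hypothetical, for illustration only):

```typescript
// Simplified event shape for illustration.
type Event = { event: string; data: Record<string, any> };

// Generic (Gemini-style) extraction runs only while no Claude-specific
// event has been seen, mirroring the PR's logic gate.
function extractUsageSource(events: Event[]): string {
  let sawClaude = false;
  let source = "none";
  for (const ev of events) {
    if (ev.event === "message_start" || ev.event === "message_delta") {
      sawClaude = true;
      source = "claude";
    } else if (!sawClaude && ev.data.usageMetadata) {
      source = "gemini"; // generic path, gated off once Claude events appear
    }
  }
  return source;
}
```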

Review Coverage

  • Logic and correctness - Clean
  • Security (OWASP Top 10) - Clean
  • Error handling - Clean (appropriate debug logging added)
  • Type safety - Clean
  • Documentation accuracy - Clean (comments match implementation)
  • Test coverage - No automated tests for this function (pre-existing gap, not introduced by this PR)
  • Code clarity - Good (well-named variables, clear separation of concerns)

Recommendation

Approve - The implementation is sound. Manual testing with real Anthropic SSE streams is recommended before deployment as noted in the PR description.


Automated review by Claude AI

github-actions bot mentioned this pull request Dec 10, 2025
ding113 deleted the fix/issue-312-anthropic-output-tokens branch December 11, 2025 10:35
github-actions bot mentioned this pull request Dec 12, 2025

Labels

bug (Something isn't working), size/XS (Extra Small PR (< 50 lines))

Projects

Status: Done
