release#216

Merged
ding113 merged 20 commits into main from dev
Nov 27, 2025
Conversation


@ding113 ding113 commented Nov 27, 2025

Summary

This release introduces a comprehensive provider availability monitoring system, smart circuit breaker probing for faster provider recovery, and enhanced provider testing with three-tier validation.

Problem

  • No real-time visibility into provider health or availability trends
  • Recovering a circuit breaker from the OPEN state required manual intervention or a fixed wait time
  • Provider testing lacked configurable validation and preset templates for verifying relay services
  • The provider pages suffered from N+1 queries and SQL full-table scans, degrading performance

Solution

🆕 New Features

Provider Availability Monitoring (/dashboard/availability)

  • Real-time dashboard showing provider health (green/red/unknown)
  • Availability metrics: success rate, average/P50/P95/P99 latency, request count (see the sketch below)
  • Time-range filtering (15 minutes, 1 hour, 6 hours, 24 hours, 7 days)
  • Heatmap visualization with configurable bucket sizes
  • Auto-refresh
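
A minimal TypeScript sketch of how success rate and the latency percentiles above can be derived from raw request samples; the RequestSample shape is assumed for illustration and this is not the project's actual implementation:

interface RequestSample {
  statusCode: number;
  durationMs: number;
}

// Nearest-rank percentile over a pre-sorted latency array
function percentile(sorted: number[], p: number): number {
  if (sorted.length === 0) return 0;
  const idx = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.min(sorted.length - 1, Math.max(0, idx))];
}

function summarize(samples: RequestSample[]) {
  const ok = samples.filter((s) => s.statusCode >= 200 && s.statusCode < 400);
  const latencies = samples.map((s) => s.durationMs).sort((a, b) => a - b);
  return {
    requests: samples.length,
    successRate: samples.length ? ok.length / samples.length : 0,
    avgMs: latencies.reduce((a, b) => a + b, 0) / (latencies.length || 1),
    p50: percentile(latencies, 50),
    p95: percentile(latencies, 95),
    p99: percentile(latencies, 99),
  };
}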

Smart Circuit Breaker Probing

  • Periodic health probes while a circuit breaker is in the OPEN state
  • Configurable via environment variables (see the sketch below):
    • ENABLE_SMART_PROBING (default: false)
    • PROBE_INTERVAL_MS (default: 30000 ms)
    • PROBE_TIMEOUT_MS (default: 5000 ms)
  • A successful probe can move the circuit breaker to HALF_OPEN early
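
A sketch of how these variables might be parsed at startup; the defaults match the values documented above, though the actual parsing in src/lib/circuit-breaker-probe.ts may differ:

// Read smart-probing configuration from the environment, with documented defaults
const ENABLE_SMART_PROBING = process.env.ENABLE_SMART_PROBING === "true"; // default: false
const PROBE_INTERVAL_MS = Number(process.env.PROBE_INTERVAL_MS ?? 30_000); // default: 30s
const PROBE_TIMEOUT_MS = Number(process.env.PROBE_TIMEOUT_MS ?? 5_000); // default: 5s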

Enhanced Provider Testing

  • Three-tier validation: HTTP status → latency threshold → content keyword matching (sketched below)
  • Preset templates with realistic CLI request patterns for relay service verification
  • Custom JSON payload support for advanced test scenarios
  • Detailed result cards showing the validation breakdown, timing, and token usage
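
A minimal sketch of the three-tier order; the real validators live in src/lib/provider-testing/ and are more elaborate, and the TestOutcome shape here is an assumption:

interface TestOutcome {
  status: number;
  latencyMs: number;
  body: string;
}

// Tiers run in order; the first failing tier is reported.
function validate(o: TestOutcome, maxLatencyMs: number, keyword?: string) {
  if (o.status < 200 || o.status >= 300) return { ok: false, failedTier: "http-status" };
  if (o.latencyMs > maxLatencyMs) return { ok: false, failedTier: "latency" };
  if (keyword && !o.body.includes(keyword)) return { ok: false, failedTier: "content" };
  return { ok: true };
}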

🔧 Improvements

  • Performance: fixed N+1 queries and SQL full-table scans on the provider pages
  • Error parsing: better support for nested error structures from relay services
  • Streaming timeout: adjusted the idle timeout range from 1-120 seconds to 60-600 seconds (0 disables it)

🐛 Bug Fixes

  • Fixed an incorrect JSONB field name in provider daily usage statistics
  • Fixed response content validation failures in provider testing
  • Fixed login redirects showing a duplicated locale prefix
  • Fixed 8 missing apiTest translation keys in zh-TW
  • Removed the hardcoded sm:max-w-lg width limit from DialogContent

📚 Documentation

Changes

  • New availability monitoring module (src/lib/availability/)
  • New provider testing library with parsers and validators (src/lib/provider-testing/)
  • New smart circuit breaker probing service (src/lib/circuit-breaker-probe.ts)
  • New availability dashboard page and components
  • Updated i18n translations for EN, JA, RU, ZH-CN, ZH-TW
  • Optimized provider repository queries

Testing

  • Manual testing of the provider availability monitoring dashboard
  • Provider tests run with a variety of preset configurations
  • Verified the circuit breaker probing behavior
  • Confirmed no breaking changes to existing APIs

Screenshots (if applicable)

N/A - UI components follow existing design patterns

Silentely and others added 18 commits November 26, 2025 23:19
- Prefer extracting error information from upstream_error
- Fixes error display for relay services such as cubence
- Fixes the frontend showing "you appear to be making a request using a non-Claude code cli client, which is prohibited"
- Keeps the error extraction logic backward compatible

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Reduce nesting to improve readability
- Use optional chaining to simplify property access
- Keep the functional logic unchanged

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
fix: enhance error parsing to support nested error structures from relay services
- Add a provider availability monitoring page with heatmap visualization
- Add sorting (by availability/name/request count), defaulting to availability
- Simplify status logic: drop the "fluctuating" state; successful requests = green, failed requests = red
- Responsive heatmap width that adapts to the time span
- i18n translations for 5 languages (zh-CN, en, ja, ru, zh-TW)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Problem analysis:
- Provider quota page: each provider triggered 1 DB + 5 Redis queries (N+1 problem)
- Provider management page: the getProviderStatistics SQL lacked date filtering (full-table scan)

Fixes:
1. New batch query methods to avoid N+1 (a pipeline sketch follows this commit):
   - RateLimitService.getCurrentCostBatch() - fetches quota usage in bulk via a Redis pipeline
   - SessionTracker.getProviderSessionCountBatch() - fetches session counts in bulk
   - getProviderLimitUsageBatch() - unified batch query entry point

2. Optimized getProviderStatistics SQL:
   - provider_stats CTE: added a created_at >= CURRENT_DATE filter to the LEFT JOIN
   - latest_call CTE: added a 7-day time-range limit

Performance gains (50 providers):
- Quota page: 52 DB + 250 Redis → 2 DB + 2 Redis pipelines
- Management page: full-table scan → only today's/7-day data scanned

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
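
A sketch of the pipeline idea behind getCurrentCostBatch(); the Redis key names are illustrative, not the project's actual schema:

import Redis from "ioredis";

async function getCurrentCostBatch(redis: Redis, providerIds: number[]): Promise<Map<number, number>> {
  const pipeline = redis.pipeline();
  for (const id of providerIds) {
    pipeline.get(`provider:${id}:cost:daily`); // all GETs share a single round-trip
  }
  const results = await pipeline.exec();
  const costs = new Map<number, number>();
  results?.forEach(([err, value], i) => {
    costs.set(providerIds[i], err ? 0 : Number(value ?? 0));
  });
  return costs;
}
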
Improved content extraction, based on the relay-pulse implementation:

- Improved isSSEResponse detection: requires both event: and data: markers
- Added an aggregateResponseText function with an SSE → JSON → raw-response fallback chain (sketched after this commit)
- test-service now uses aggregateResponseText for content validation

Fixed scenario: even when SSE parsing yields empty content, keywords can still be extracted via the JSON or raw-response fallback

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
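
A rough sketch of such a fallback chain; the actual aggregateResponseText parses SSE frames more carefully, and the JSON field paths below are assumptions:

function aggregateResponseText(raw: string): string {
  // 1) SSE: join the payloads of data: lines
  if (raw.includes("event:") && raw.includes("data:")) {
    const text = raw
      .split("\n")
      .filter((line) => line.startsWith("data:"))
      .map((line) => line.slice(5).trim())
      .filter((d) => d && d !== "[DONE]")
      .join("");
    if (text) return text;
  }
  // 2) JSON: try common content fields
  try {
    const parsed = JSON.parse(raw);
    const candidate = parsed?.content?.[0]?.text ?? parsed?.choices?.[0]?.message?.content;
    if (typeof candidate === "string" && candidate) return candidate;
  } catch {
    // not JSON; fall through to the raw body
  }
  // 3) Raw response as the last resort
  return raw;
}
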
The SQL query used the key name 'providerId', while ProviderChainItem actually defines 'id'. When a request involved retry switching, the JSONB query returned NULL, so today's usage displayed as 0.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Changed the provider availability status labels from Badge variants to custom classNames, adopting the same light-background color scheme as the request log table for UI consistency.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
The frontend hint said "range 1-120 seconds, default 10 seconds", but the backend validation rule is actually 60-600 seconds. Updated the streamingIdle description in every language to the correct range, noting that 0 disables the timeout.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Use pathWithoutLocale instead of pathname as the from parameter, so the locale-aware router does not prepend the locale again and produce broken URLs like /zh-CN/zh-CN/dashboard (an illustrative sketch follows this commit).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
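
An illustrative sketch of the idea; SUPPORTED_LOCALES and the exact implementation are assumptions for this example:

const SUPPORTED_LOCALES = ["en", "ja", "ru", "zh-CN", "zh-TW"];

// Strip a leading locale segment so the redirect target is locale-neutral
function pathWithoutLocale(pathname: string): string {
  const [, first = "", ...rest] = pathname.split("/");
  return SUPPORTED_LOCALES.includes(first) ? `/${rest.join("/")}` : pathname;
}

// pathWithoutLocale("/zh-CN/dashboard") === "/dashboard"
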
Core changes:
- Removed the complex SSE/JSON parsing logic; keyword matching now uses responseBody.includes() directly
- Added a rawResponse field (5000 characters) exposing the full response body for inspection
- Frontend shows Raw Response Body instead of Stream Info so users can diagnose issues themselves
- Added rawResponse translations for 5 languages

Benefit: whatever the response format (SSE/JSON/plain text), validation passes as long as the keyword is present

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Add Traditional Chinese translations for the provider testing feature:
- streamInfo, chunksReceived, streamFormat, streamResponse
- chunksCount, truncatedPreview, truncatedBrief, copyFormat

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
The provider testing feature draws on the implementation of the prehisle/relay-pulse project.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Allow callers to fully control dialog width via className, fixing dialogs (such as the provider editor) that could not use max-w-6xl.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@gemini-code-assist
Contributor

Summary of Changes

Hello @ding113, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request delivers a comprehensive upgrade to provider management and monitoring capabilities. It introduces a dedicated dashboard for real-time provider availability, streamlines API testing with a unified and robust validation framework, and enhances system resilience through smart probing for circuit breakers. Additionally, it optimizes backend data retrieval for improved performance and ensures all new features are fully localized.

Highlights

  • Provider Availability Monitoring: A new 'Provider Availability Monitor' dashboard has been introduced, offering real-time tracking of provider performance and availability metrics through a visual heatmap. This feature is accessible to administrators.
  • Unified Provider Testing: The API testing functionality has been unified under a new testProviderUnified action, implementing a robust three-tier validation system (HTTP status, latency, and content matching) inspired by the relay-pulse project. This includes support for custom payloads and predefined test presets.
  • Smart Probing for Circuit Breakers: New configuration options enable 'smart probing' for circuit breakers. When a provider's circuit is in an 'OPEN' state, the system will periodically probe it. A successful probe can transition the circuit to 'HALF_OPEN' faster, improving recovery times.
  • Optimized Data Fetching: Performance for fetching provider quota usage and active session counts has been significantly improved by implementing batch queries using Redis pipelines, effectively addressing N+1 query issues.
  • Internationalization Updates: Extensive new translation keys have been added across multiple languages (English, Japanese, Russian, Simplified Chinese, Traditional Chinese) to support the new availability monitoring and unified provider testing features.

@ding113 ding113 added the size/XL Extra Large PR (> 1000 lines) label Nov 27, 2025

ding113 commented Nov 27, 2025

📊 PR Size Analysis

This PR is XL with 6,920 lines changed across 51 files.

Large PRs are harder to review and more likely to introduce bugs.

🔀 Suggested Split:

Based on the changes, this PR could be split into:

  1. PR 1: Provider Testing Framework (~13 files)

    • src/lib/provider-testing/* (parsers, validators, presets, types, utils)
    • src/actions/providers.ts
    • src/app/[locale]/settings/providers/_components/forms/api-test-button.tsx
    • src/app/[locale]/settings/providers/_components/forms/test-result-card.tsx
  2. PR 2: Availability Service (~7 files)

    • src/lib/availability/*
    • src/app/api/availability/*
    • src/app/[locale]/dashboard/availability/*
    • src/app/[locale]/dashboard/quotas/providers/page.tsx
  3. PR 3: i18n Updates (~12 files)

    • messages/*/dashboard.json
    • messages/*/settings.json
  4. PR 4: Infrastructure & Misc (~9 files)

    • src/lib/circuit-breaker-probe.ts
    • src/lib/rate-limit/service.ts
    • src/lib/session-tracker.ts
    • src/middleware.ts
    • src/instrumentation.ts
    • src/repository/provider.ts
    • Documentation updates

Why Split?

  • Easier to review each feature independently
  • Faster CI feedback per change
  • Easier to revert specific features if needed
  • Better git history for tracking feature additions
  • Allows parallel review of different features

🤖 Automated analysis by Claude AI

@ding113 ding113 added the enhancement New feature or request label Nov 27, 2025

ding113 commented Nov 27, 2025

🔒 Security Scan Results

No security vulnerabilities detected

This PR has been scanned against OWASP Top 10, CWE Top 25, and common security anti-patterns. No security issues were identified in the code changes.

Scanned Categories

  • Injection attacks (SQL, NoSQL, Command, LDAP, etc.) - The code uses parameterized queries via Drizzle ORM. Raw SQL queries in provider.ts use parameterized placeholders (${timezone}) safely. No string concatenation for SQL queries detected.
  • Authentication and session management - All server actions properly check admin session via getSession() before executing sensitive operations. Session validation is consistently applied.
  • Sensitive data exposure - API keys are properly masked in logs using sanitizeErrorTextForLogging(). No hardcoded credentials detected. Sensitive data is properly redacted in error messages.
  • Access control and authorization - Admin-only routes are properly protected. The new availability monitoring endpoints properly validate user roles.
  • Security misconfiguration - No debug mode or verbose error exposure detected in production code paths.
  • Cross-site scripting (XSS) - React components use proper escaping. No dangerouslySetInnerHTML or direct DOM manipulation with user input detected.
  • Insecure deserialization - JSON parsing includes proper error handling. No unsafe deserialization patterns detected.
  • SSRF and path traversal - The validateProviderUrlForConnectivity() function properly blocks internal network addresses (localhost, 127.x.x.x, 10.x.x.x, 172.16-31.x.x, 192.168.x.x, etc.) and dangerous ports (22, 23, 25, 3306, 5432, 6379, 27017, 9200).
  • Cryptographic weaknesses - No weak cryptographic algorithms detected. No predictable token generation.

Security-Positive Patterns Observed

  1. SSRF Prevention (providers.ts:1553-1593)

    • Comprehensive blocklist for internal IP ranges
    • Protocol validation (HTTP/HTTPS only)
    • Dangerous port blocking
  2. Rate Limiting & DoS Protection (providers.ts:73-77)

    • Stream buffer size limits (10MB max)
    • Chunk count limits (1000 max)
    • Iteration limits (10000 max)
  3. Authentication Consistency

    • All new server actions (testProviderUnified, getProviderTestPresets, getProviderLimitUsageBatch) properly check admin session
  4. Error Message Sanitization (providers.ts:1033-1053)

    • API keys redacted from logs
    • Email addresses sanitized
    • Bearer tokens removed
    • Paths and sensitive config files sanitized
  5. Input Validation

    • Zod schemas used for provider creation/updates
    • URL validation before making external requests
    • JSON payload validation in test service

Files Reviewed

File | Changes | Security Notes
src/actions/providers.ts | +393 lines | SSRF protection in place, proper auth checks
src/lib/provider-testing/test-service.ts | +271 lines | Clean request handling, no injection risks
src/lib/rate-limit/service.ts | +136 lines | Redis pipeline queries, safe batch operations
src/lib/session-tracker.ts | +109 lines | Safe Redis operations with proper error handling
src/repository/provider.ts | +13/-5 lines | SQL field name fix (providerId→id), uses parameterized queries
src/middleware.ts | +2/-2 lines | Fixes double-locale prefix issue
src/app/api/availability/*.ts | New routes | Would benefit from admin auth check (see recommendations)

Recommendations (Non-blocking)

  1. API Route Authentication (Low priority)

    • The new /api/availability routes should verify that the routes are properly protected by middleware or add explicit admin session checks similar to server actions.
  2. Request Timeout Consistency

    • Consider standardizing timeout values across different test scenarios for consistency.

🤖 Automated security scan by Claude AI - OWASP Top 10 & CWE coverage

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces several major features, including a provider availability monitoring dashboard, a smart probing mechanism for circuit breakers, and a unified provider testing framework. It also includes significant performance optimizations by batching database and Redis queries. My review identified one high-severity issue in the new smart probing logic. A successful probe incorrectly resets the circuit breaker to a fully CLOSED state instead of the safer HALF_OPEN state. This could reduce system stability when dealing with flaky providers.

Comment on lines 122 to 124
// Reset circuit to closed (will transition through half-open on next real request)
// Using resetCircuit here since we want to give the provider another chance
resetCircuit(providerId);

Severity: high

Resetting the circuit directly to CLOSED upon a single successful probe is too aggressive. It bypasses the HALF_OPEN state, which is a crucial part of the circuit breaker pattern for safely verifying a provider's recovery. A single successful lightweight probe might not guarantee the provider can handle production load, and immediately opening it to full traffic could cause it to fail again, leading to a flapping state where the circuit rapidly opens and closes.

The log message on line 115 ("transitioning to half-open") and the comment on line 122 are also misleading, as the current implementation with resetCircuit sets the state directly to CLOSED.

A better approach would be to transition the circuit to the HALF_OPEN state. This could be done by creating a new function, e.g., tripToHalfOpen(providerId), that correctly sets the circuit state, or by adjusting the circuitOpenUntil timestamp to allow the natural transition to HALF_OPEN on the next request.

@ding113 ding113 left a comment

📋 Code Review Summary

This PR adds multiple new features including provider availability monitoring, unified provider testing with three-tier validation, smart circuit breaker probing, and batch operations for provider limits. The implementation is comprehensive but has several issues that should be addressed.

🔍 Issues Found

  • Critical (🔴): 0 issues
  • High (🟠): 3 issues
  • Medium (🟡): 2 issues
  • Low (🟢): 1 issue

🎯 Priority Actions

See inline comments for specific issues identified:

  1. src/lib/provider-testing/test-service.ts: Missing timeout cleanup in finally block - potential memory leak on fetch abort
  2. src/lib/circuit-breaker-probe.ts: Infinite setInterval without proper cleanup mechanism
  3. src/lib/availability/availability-service.ts: Unbounded data loading - potential memory exhaustion
  4. src/lib/session-tracker.ts: N+1 Redis pipeline pattern with unnecessary second phase
  5. src/app/api/availability/current/route.ts: Inconsistent authentication mechanism

💡 General Observations

  • The new provider testing infrastructure is well-structured with good separation of concerns
  • The availability monitoring adds useful observability features
  • Consider adding rate limiting to the availability API endpoints to prevent abuse
  • The three-tier validation (HTTP status, latency, content) follows industry best practices

🤖 Automated review by Claude AI - focused on identifying issues for improvement


const controller = new AbortController();
const timeoutId = setTimeout(() => controller.abort(), timeoutMs);

// Execute request

🟠 High: Missing timeout cleanup in error path leading to potential memory leak

Why this is a problem: When the fetch request throws an error (abort, network error, etc.), the clearTimeout(timeoutId) on line 77 is never executed because the error is thrown before reaching it. The timeout continues running, potentially causing issues if many requests fail.

Suggested fix:

const controller = new AbortController();
const timeoutId = setTimeout(() => controller.abort(), timeoutMs);

try {
  const response = await fetch(url, {
    method: "POST",
    headers,
    body: JSON.stringify(body),
    signal: controller.signal,
  });
  // ... success handling
} catch (error) {
  // ... error handling
} finally {
  clearTimeout(timeoutId);
}

if (health.circuitState === "open") {
  openCircuits.push(parseInt(providerId, 10));
}
}

🟠 High: Infinite setInterval without proper cleanup mechanism on graceful shutdown

Why this is a problem: The probeIntervalId is stored but there's no integration with process shutdown handlers. When the Node.js process receives SIGTERM/SIGINT (common in containerized deployments), the interval keeps running until forceful termination, which could leave probes in an inconsistent state.

Suggested fix:

export function startProbeScheduler(): void {
  if (!ENABLE_SMART_PROBING) return;
  if (probeIntervalId) return;

  logger.info("[SmartProbe] Starting probe scheduler", { intervalMs: PROBE_INTERVAL_MS });
  probeIntervalId = setInterval(runProbeCycle, PROBE_INTERVAL_MS);
  runProbeCycle();

  // Register cleanup handlers for graceful shutdown
  const cleanup = () => {
    stopProbeScheduler();
    process.removeListener('SIGTERM', cleanup);
    process.removeListener('SIGINT', cleanup);
  };
  process.on('SIGTERM', cleanup);
  process.on('SIGINT', cleanup);
}

endTime: endDate.toISOString(),
bucketSizeMinutes: explicitBucketSize ?? 60,
providers: [],
systemAvailability: 0,

🟠 High: Unbounded query loading all requests into memory could cause OOM

Why this is a problem: This query loads ALL requests within the time range into memory without pagination. For production systems with high traffic, querying 7 days of data could return millions of rows, causing memory exhaustion and application crashes.

Suggested fix:

// Add LIMIT to prevent unbounded loading
const MAX_REQUESTS_PER_QUERY = 100000; // Reasonable limit

const requests = await db
  .select({
    id: messageRequest.id,
    providerId: messageRequest.providerId,
    statusCode: messageRequest.statusCode,
    durationMs: messageRequest.durationMs,
    errorMessage: messageRequest.errorMessage,
    createdAt: messageRequest.createdAt,
  })
  .from(messageRequest)
  .where(and(...requestConditions))
  .orderBy(messageRequest.createdAt)
  .limit(MAX_REQUESTS_PER_QUERY);

// Consider logging a warning if limit is reached
if (requests.length === MAX_REQUESTS_PER_QUERY) {
  logger.warn('Availability query hit max limit, results may be incomplete');
}

Alternatively, consider using SQL aggregation instead of in-memory processing for better performance.

providerSessionMap.set(providerId, sessionIds);
allSessionIds.push(...sessionIds);
} else {
providerSessionMap.set(providerId, []);

🟡 Medium: Redundant pipeline phase - second EXISTS check is unnecessary

Why this is a problem: The second pipeline verifies session existence by checking session:{sessionId}:info, but the zrange already returned valid session IDs from the sorted set. If the session was removed, zremrangebyscore would have already cleaned it. This adds extra Redis round-trips without benefit.

Suggested fix:

// Skip the second verification phase - just count what zrange returned
for (const providerId of providerIds) {
  const sessionIds = providerSessionMap.get(providerId) || [];
  result.set(providerId, sessionIds.length);
}

return result;

If the verification IS needed (e.g., session:info can expire independently), document why in a comment.

// Verify admin authentication
const authHeader = request.headers.get("Authorization");
const token = authHeader?.replace("Bearer ", "");


🟡 Medium: Inconsistent authentication mechanism between availability endpoints

Why this is a problem: This endpoint uses Bearer token + validateKey() while /api/availability/route.ts uses session cookies via getSession(). This inconsistency could confuse API consumers and make authorization harder to maintain.

Suggested fix: Use the same authentication mechanism as the main availability route:

import { getSession } from "@/lib/auth";

export async function GET(request: NextRequest) {
  const session = await getSession();
  if (!session || session.user.role !== "admin") {
    return NextResponse.json({ error: "Unauthorized" }, { status: 401 });
  }
  // ...
}

Or document why Bearer token auth is intentionally used here (e.g., for API access from external services).

(mr.provider_chain IS NULL OR jsonb_array_length(mr.provider_chain) = 0) AND mr.provider_id = p.id
OR
-- Case 2: retries occurred, use the providerId of the last providerChain item
-- Case 2: retries occurred, use the id of the last providerChain item

🟢 Low: SQL field name change from 'providerId' to 'id' may break existing data

Why this is a problem: The change from provider_chain->-1->>'providerId' to provider_chain->-1->>'id' assumes all provider_chain JSON objects use the 'id' field. If there's existing data with 'providerId', statistics will be incorrect.

Suggested fix: Add COALESCE to handle both field names during migration:

COALESCE(
  (mr.provider_chain->-1->>'id')::int,
  (mr.provider_chain->-1->>'providerId')::int
) = p.id

Or confirm via migration that all existing data has been updated.


ding113 commented Nov 27, 2025

📝 Documentation Review

📊 Issues Summary

  • Critical (🔴): 0 - Must fix before merge
  • High (🟠): 0 - Should fix before merge
  • Medium (🟡): 1 - Consider addressing
  • Low (🟢): 0 - Optional improvements

⚡ Priority Fixes

  1. Terminology inconsistency in acknowledgments - Minor wording improvement suggested

📋 Review Coverage

  • Technical accuracy - 1 issue
  • Completeness - 0 issues
  • Code examples - N/A (no code examples in docs changes)
  • Links and references - 0 issues (all links verified: 200 OK)
  • Clarity and organization - 0 issues

🔍 Detailed Findings

🟡 Terminology: Description may not accurately reflect functionality

Files: README.md (line 285), README.en.md (line 285)

Problem: The acknowledgments describe relay-pulse as providing "provider detection functionality", but the actual relay-pulse project and the implemented code provide "provider availability monitoring" / "provider testing" functionality with three-tier validation.

Current text (README.en.md):

...and [prehisle/relay-pulse](https://github.com/prehisle/relay-pulse) for provider detection functionality.

Suggested text:

...and [prehisle/relay-pulse](https://github.com/prehisle/relay-pulse) for provider availability monitoring.

Or alternatively:

...and [prehisle/relay-pulse](https://github.com/prehisle/relay-pulse) for provider testing functionality.

Impact: Minor - readers who visit relay-pulse expecting "detection" features may be confused to find availability monitoring instead.

Reference: relay-pulse README describes itself as "LLM 服务可用性监控" (LLM service availability monitoring). The implemented code in this PR (src/lib/provider-testing/, src/lib/availability/) aligns with monitoring/testing, not detection.


✅ Verified Items

  1. External links all return HTTP 200:

  2. Consistent bilingual updates: Both README.md (Chinese) and README.en.md (English) were updated with matching acknowledgments

  3. Localization files: All locale message files (en, ja, ru, zh-CN, zh-TW) contain consistent translations for new features


🤖 Automated docs review by Claude AI

ding113 and others added 2 commits November 27, 2025 11:44
1. circuit-breaker: add a tripToHalfOpen function; a successful smart probe now transitions the circuit to HALF_OPEN instead of directly to CLOSED
   - Fixes the mismatch between the "transitioning to half-open" log message and the actual behavior
   - Safer recovery that validates the provider through the HALF_OPEN phase

2. availability-service: add a MAX_REQUESTS_PER_QUERY limit (100000)
   - Prevents out-of-memory errors on large time-range queries
   - Logs a warning when the limit is reached

3. test-service: use a finally block to guarantee timeout cleanup
   - clearTimeout now runs even if the request throws

4. current/route.ts: unify the authentication mechanism
   - Switched from Bearer token + validateKey to getSession
   - Consistent with /api/availability

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@ding113 ding113 added the bug Something isn't working label Nov 27, 2025

ding113 commented Nov 27, 2025

📊 PR Size Analysis

This PR is XL with 6,968 lines changed across 52 files (6,555 additions, 413 deletions).

Large PRs are harder to review and more likely to introduce bugs.

🔀 Suggested Split:

Based on the changes, this PR could be split into 5 focused PRs:

1. Provider Availability Monitoring (~2,500 lines)

Core availability feature with dashboard:

  • src/lib/availability/* (service, types, index)
  • src/app/api/availability/* (API routes)
  • src/app/[locale]/dashboard/availability/* (UI components)
  • Related i18n keys in messages/*/dashboard.json

2. Smart Circuit Breaker Probing (~800 lines)

Circuit breaker enhancements:

  • src/lib/circuit-breaker-probe.ts
  • src/lib/circuit-breaker.ts (modifications)
  • src/instrumentation.ts
  • .env.example (new env vars)

3. Enhanced Provider Testing (~2,500 lines)

Provider testing library with parsers and validators:

  • src/lib/provider-testing/* (test-service, types, presets)
  • src/lib/provider-testing/parsers/*
  • src/lib/provider-testing/validators/*
  • src/lib/provider-testing/utils/*
  • src/lib/provider-testing/data/* (test fixtures)
  • src/app/[locale]/settings/providers/_components/forms/*
  • Related i18n keys in messages/*/settings.json

4. Performance & Query Optimizations (~300 lines)

Database and query improvements:

  • src/repository/provider.ts
  • src/actions/providers.ts
  • src/app/[locale]/dashboard/quotas/providers/page.tsx

5. Bug Fixes & Minor Improvements (~400 lines)

Independent fixes:

  • src/lib/session-tracker.ts
  • src/lib/rate-limit/service.ts
  • src/middleware.ts (login redirect fix)
  • src/components/ui/dialog.tsx (width fix)
  • README.md, README.en.md (docs)

Why Split?

  • ✅ Easier to review - Reviewers can focus on one concern at a time
  • ✅ Faster CI feedback - Smaller PRs run tests faster
  • ✅ Easier to revert - If one feature has issues, it can be reverted independently
  • ✅ Better git history - Clear commits showing feature evolution
  • ✅ Reduced merge conflicts - Smaller surface area for conflicts

Recommended Merge Order:

  1. Bug Fixes & Minor Improvements (foundation)
  2. Performance & Query Optimizations (needed for new features)
  3. Smart Circuit Breaker Probing (standalone feature)
  4. Provider Availability Monitoring (depends on optimizations)
  5. Enhanced Provider Testing (depends on availability service)

🤖 Automated analysis by Claude AI

@ding113 ding113 left a comment

📋 Code Review Summary

This PR introduces a comprehensive provider availability monitoring system, intelligent circuit breaker probing, and enhanced provider testing functionality. The implementation adds significant monitoring capabilities but has several issues that should be addressed.

🔍 Issues Found

  • Critical (🔴): 1 issue
  • High (🟠): 2 issues
  • Medium (🟡): 4 issues
  • Low (🟢): 0 issues

🎯 Priority Actions

  1. 🔴 Critical - API Key Exposure in URL (src/lib/provider-testing/utils/test-prompts.ts:291): The Gemini API key is appended as a URL query parameter (?key=${apiKey}). This is a security risk as URLs with query parameters are commonly logged in server access logs, proxy logs, and browser history. Consider using header-based authentication (x-goog-api-key header) instead.

  2. 🟠 High - Missing Error Handling in Probe Scheduler (src/lib/circuit-breaker-probe.ts:232-234): The runProbeCycle() is called without await in the interval callback, but errors are only caught at the .catch() level. If loadProviderConfigs() throws synchronously before the async operations, it could crash the scheduler.

  3. 🟠 High - Unbounded Memory Growth in Provider Config Cache (src/lib/circuit-breaker-probe.ts:31-33): The providerConfigCache Map grows indefinitely as providers are added but never cleaned up when providers are deleted. Consider adding cache invalidation when providers are removed or implementing a maximum cache size.

  4. 🟡 Medium - Missing Input Validation for Custom JSON Payload (src/lib/provider-testing/test-service.ts:67-71): The custom payload parsing catches JSON parse errors but doesn't validate the structure of the parsed object. Malformed payloads could cause unexpected behavior downstream.

  5. 🟡 Medium - Potential SQL Performance Issue with Window Queries (src/repository/provider.ts:408-416): While the 7-day limit on latest_call CTE is a good optimization, the window function ROW_NUMBER() still needs to process all rows within that window. Consider adding an index on (final_provider_id, created_at DESC) if not already present.

  6. 🟡 Medium - Race Condition in Batch Redis Pipeline (src/lib/rate-limit/service.ts:888-970): The batch cost query assumes all Redis operations will return in order. If Redis has issues with specific keys, the queryMeta mapping could become misaligned with results.

  7. 🟡 Medium - Missing Timeout for Provider Config Loading (src/lib/circuit-breaker-probe.ts:69-87): The database query in loadProviderConfigs() has no timeout. If the database is slow or unresponsive, it could block the probe scheduler indefinitely.

💡 General Observations

  • The N+1 query fix using Redis batch operations (getCurrentCostBatch and getProviderSessionCountBatch) is a good performance improvement.
  • The availability monitoring system is well-structured with proper separation of concerns.
  • The three-tier validation system (HTTP status → latency → content) follows good patterns from the relay-pulse reference implementation.
  • Consider adding rate limiting to the /api/availability endpoint to prevent abuse.

🤖 Automated review by Claude AI - focused on identifying issues for improvement


ding113 commented Nov 27, 2025

🔒 Security Scan Results

No critical security vulnerabilities detected

This PR has been scanned against OWASP Top 10, CWE Top 25, and common security anti-patterns. No security issues requiring immediate attention were identified in the code changes.


📋 OWASP Top 10 Coverage

Category | Status | Notes
A01: Injection (SQL, Command) | ✅ Clean | Drizzle ORM with parameterized queries used throughout. No raw SQL concatenation.
A02: Broken Authentication | ✅ Clean | Admin session validation via getSession() consistently applied to new API endpoints.
A03: Sensitive Data Exposure | ✅ Clean | API keys handled appropriately, no logging of secrets detected.
A04: XML External Entities | ✅ N/A | No XML parsing in changes.
A05: Broken Access Control | ✅ Clean | All new endpoints (/api/availability, /api/availability/current) properly check admin role.
A06: Security Misconfiguration | ✅ Clean | No debug modes enabled, proper error handling.
A07: Cross-Site Scripting | ✅ Clean | React auto-escaping used, no dangerouslySetInnerHTML.
A08: Insecure Deserialization | ✅ Clean | JSON parsing has proper error handling with try-catch.
A09: Known Vulnerabilities | ⚪ N/A | No new dependencies added.
A10: Insufficient Logging | ✅ Clean | Proper logging added for new features.

🛡️ Security Highlights

Positive Security Patterns Found:

  1. SSRF Protection (src/actions/providers.ts:2355-2365)

    • Provider URL validation via validateProviderUrlForConnectivity() before making outbound requests
    • Proper URL scheme validation (http/https only)
  2. SQL Injection Prevention (src/lib/availability/availability-service.ts, src/repository/provider.ts)

    • Drizzle ORM with proper parameterized queries
    • inArray(), eq(), gte(), lte() operators prevent SQL injection
    • Timezone parameter properly escaped in template literal
  3. Authentication Consistency

    • All new API routes check getSession() and verify role === 'admin'
    • Consistent authorization pattern across:
      • /api/availability/route.ts
      • /api/availability/current/route.ts
      • testProviderUnified() server action
      • getProviderLimitUsageBatch() server action
  4. Input Validation

    • Query parameters parsed safely with parseInt() and isNaN() checks
    • Provider IDs validated before use in queries
    • Custom payload JSON parsing wrapped in try-catch
  5. Resource Limits (DoS Prevention)

    • MAX_REQUESTS_PER_QUERY = 100000 limit in availability-service.ts
    • Request timeout enforcement in provider testing
    • Proper abort controller usage with timeout cleanup in finally block

⚠️ Minor Observations (Not Security Issues)

  1. Information Disclosure (Low/Informational)

    • Error messages returned to admin users contain some internal details
    • This is acceptable since these endpoints are admin-only
  2. Rate Limiting

    • New availability API endpoints rely on session-based auth but don't have explicit rate limiting
    • Mitigated by admin-only access requirement

📝 Files Scanned

  • src/actions/providers.ts - Provider testing, batch queries
  • src/app/api/availability/*.ts - New availability API endpoints
  • src/lib/availability/*.ts - Availability service implementation
  • src/lib/circuit-breaker-probe.ts - Smart probing scheduler
  • src/lib/circuit-breaker.ts - Circuit breaker state management
  • src/lib/provider-testing/*.ts - Provider testing service
  • src/lib/rate-limit/service.ts - Rate limit batch queries
  • src/lib/session-tracker.ts - Session tracking batch queries
  • src/repository/provider.ts - Provider statistics SQL
  • src/middleware.ts - Auth middleware

🏁 Conclusion

Security Posture: Acceptable

This PR follows secure coding practices:

  • Uses parameterized queries via Drizzle ORM
  • Implements proper authentication/authorization checks
  • Validates user input before use
  • Has SSRF protection for outbound requests
  • Includes resource limits to prevent DoS

No blocking security concerns. Safe to merge from a security perspective.


🤖 Automated security scan by Claude AI - OWASP Top 10 & CWE coverage

@ding113 ding113 left a comment

📝 Documentation Review

Documentation looks good

No significant issues found. The documentation changes are:

  • Technically accurate
  • Complete and clear
  • Well-structured
  • Free of broken links

📋 Review Coverage

  • Technical accuracy - 0 issues
  • Completeness - 0 issues
  • Code examples - 0 issues (no code examples in changes)
  • Links and references - 0 issues (all 3 GitHub links verified: zsio/claude-code-hub ✅, router-for-me/CLIProxyAPI ✅, prehisle/relay-pulse ✅)
  • Clarity and organization - 0 issues

✅ Changes Validated

  1. README.md & README.en.md: New acknowledgment for prehisle/relay-pulse is accurate and consistent between English and Chinese versions.

  2. .env.example: Smart probing configuration documentation correctly documents:

    • ENABLE_SMART_PROBING (default: false) ✅
    • PROBE_INTERVAL_MS (default: 30000ms = 30s) ✅
    • PROBE_TIMEOUT_MS (default: 5000ms = 5s) ✅

    All values match the implementation in src/lib/circuit-breaker-probe.ts.

  3. i18n files (settings.json): The streaming idle timeout documentation fix from "1-120 seconds" to "60-600 seconds" is correct - matches the validation constants in src/lib/constants/provider.constants.ts:34 (STREAMING_IDLE_TIMEOUT_MS: { MIN: 60000, MAX: 600000 }).


🤖 Automated docs review by Claude AI

@ding113 ding113 left a comment

Inline comments for PR review

}

return url;
}

🔴 Critical: Potential API Key Exposure in URL

Why this is a problem: The API key is appended as a URL query parameter (?key=${apiKey}). URLs with query parameters are commonly logged in:

  • Server access logs
  • Reverse proxy logs
  • Browser history
  • Network debugging tools
  • Error tracking services (like Sentry)

This could lead to API key exposure.

Suggested fix:

// Use x-goog-api-key header instead of URL parameter
case "gemini":
case "gemini-cli":
  return {
    ...baseHeaders,
    ...GEMINI_TEST_HEADERS,
    "x-goog-api-key": apiKey,
  };

Then remove the URL parameter logic and pass the API key via headers in getTestHeaders() instead.

name: string;
url: string;
key: string;
providerType: ProviderType;

🟠 High: Unbounded Memory Growth in Provider Config Cache

Why this is a problem: The providerConfigCache Map grows indefinitely as providers are added but is never cleaned up when providers are deleted. Over time, this could lead to memory issues in long-running processes.

Suggested fix:

// Option 1: Add maximum cache size with LRU eviction
const MAX_CACHE_SIZE = 1000;

async function loadProviderConfigs(): Promise<void> {
  // ... existing code ...
  
  // If cache exceeds max size, clear old entries
  if (providerConfigCache.size > MAX_CACHE_SIZE) {
    const keysToDelete = Array.from(providerConfigCache.keys())
      .slice(0, providerConfigCache.size - MAX_CACHE_SIZE);
    keysToDelete.forEach(key => providerConfigCache.delete(key));
  }
}

// Option 2: Invalidate cache for deleted providers
// Build cache fresh each time from DB query (which already filters deleted_at IS NULL)
providerConfigCache.clear();
providerConfigCache = new Map(providerList.map(...));

}
} else if (config.preset) {
// Use preset configuration
body = getPresetPayload(config.preset, config.model);

🟡 Medium: Missing Input Validation for Custom JSON Payload

Why this is a problem: The code catches JSON parse errors but doesn't validate the structure of the parsed object. A malformed payload (e.g., missing required fields, wrong types) could cause unexpected behavior or cryptic errors downstream.

Suggested fix:

if (config.customPayload) {
  try {
    body = JSON.parse(config.customPayload);
    // Validate basic structure
    if (typeof body !== 'object' || body === null) {
      throw new Error('Custom payload must be a JSON object');
    }
  } catch (e) {
    if (e instanceof SyntaxError) {
      throw new Error('Invalid custom payload JSON: ' + e.message);
    }
    throw e;
  }
}

@ding113 ding113 left a comment

Additional inline comments

logger.error("[SmartProbe] Scheduled probe cycle failed", {
  error: error instanceof Error ? error.message : String(error),
});
});

🟠 High: Missing Error Handling for Synchronous Errors

Why this is a problem: The runProbeCycle() function is called in the interval callback without proper error handling for synchronous throws. While .catch() handles async rejections, if loadProviderConfigs() or early parts of the function throw synchronously (e.g., due to import failures), it won't be caught.

Suggested fix:

probeIntervalId = setInterval(() => {
  void (async () => {
    try {
      await runProbeCycle();
    } catch (error) {
      logger.error('[SmartProbe] Scheduled probe cycle failed', {
        error: error instanceof Error ? error.message : String(error),
      });
    }
  })();
}, PROBE_INTERVAL_MS);

Alternatively, use a more explicit async wrapper pattern.

window24h.toString()
);
queryMeta.push({ providerId, period: "daily", isRolling: true });
} else {

🟡 Medium: Potential Index Mismatch in Pipeline Results

Why this is a problem: The code assumes pipeline results will always be returned in the exact order they were submitted. While Redis pipelines generally preserve order, if any command fails in an unexpected way, the queryMeta[i] mapping could become misaligned.

Suggested fix:

// Add defensive check to verify array lengths match
if (pipelineResults.length !== queryMeta.length) {
  logger.error('[RateLimit] Pipeline results count mismatch', {
    expected: queryMeta.length,
    received: pipelineResults.length,
  });
  return result;
}

ELSE (provider_chain->-1->>'id')::int
END AS final_provider_id,
created_at AS last_call_time,
model AS last_call_model

🟡 Medium: Window Function Performance Consideration

Why this is a problem: While the 7-day limit is a good optimization, the ROW_NUMBER() OVER (PARTITION BY final_provider_id ORDER BY created_at DESC) still requires sorting all rows within each partition. For high-traffic systems, this could be slow.

Recommendation: Ensure there's a composite index on (final_provider_id, created_at DESC) in the database schema to optimize this query. If not present, consider adding:

CREATE INDEX idx_message_request_provider_time 
ON message_request (final_provider_id, created_at DESC) 
WHERE deleted_at IS NULL;

@ding113 ding113 merged commit 06311b1 into main Nov 27, 2025
10 checks passed
ding113 pushed a commit that referenced this pull request Nov 27, 2025
sususu98 pushed a commit to sususu98/claude-code-hub that referenced this pull request Nov 28, 2025
ding113 pushed a commit that referenced this pull request Nov 28, 2025
ding113 pushed a commit that referenced this pull request Nov 28, 2025
@ding113 ding113 mentioned this pull request Nov 28, 2025

Labels

bug (Something isn't working) · enhancement (New feature or request) · size/XL (Extra Large PR (> 1000 lines))


Development

Successfully merging this pull request may close these issues.

bug: Provider streaming idle timeout range is incorrect · Feat: Provider status page
