Conversation
- Prioritize extracting the error message from upstream_error - Fix error display for relay services such as cubence - Fix the frontend showing "you appear to be making a request using a non-Claude code cli client, which is prohibited" - Keep the error-extraction logic backward compatible 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
- Reduce nesting depth to improve readability - Use optional chaining to simplify property access - Keep the same functional behavior 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
fix: enhance error parsing to support nested error structures from relay services
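To make the nested-error handling above concrete, here is a minimal TypeScript sketch, assuming a relay wraps the original provider error under an `upstream_error` field; the type and helper names (`RelayErrorBody`, `extractErrorMessage`) are illustrative, not the project's actual code:

```typescript
// Hypothetical shapes for a relay-wrapped error response.
interface UpstreamErrorPayload {
  error?: { message?: string; type?: string };
  message?: string;
}

interface RelayErrorBody {
  // Relay services (e.g. cubence) may nest the original provider error here.
  upstream_error?: UpstreamErrorPayload | string;
  error?: { message?: string };
  message?: string;
}

// Prefer the upstream error when present, then fall back to the legacy fields,
// so older response shapes keep working (backward-compatible extraction).
export function extractErrorMessage(body: RelayErrorBody): string | undefined {
  const upstream = body.upstream_error;
  if (typeof upstream === "string") return upstream;
  return (
    upstream?.error?.message ??
    upstream?.message ??
    body.error?.message ??
    body.message
  );
}
```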
- Add a provider availability monitoring page with heatmap visualization - Add sorting (by availability/name/request count), defaulting to availability - Simplify status logic: drop the "fluctuating" state; successful requests = green, failed requests = red - Responsive heatmap width that adapts to the time span - i18n translations for 5 languages (zh-CN, en, ja, ru, zh-TW) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Problem analysis: - Provider limits page: each provider triggered 1 DB + 5 Redis queries (N+1 problem) - Provider management page: the getProviderStatistics SQL lacked date filtering (full table scan) Fixes: 1. Add batch query methods to avoid N+1: - RateLimitService.getCurrentCostBatch() - fetch limit usage in bulk via a Redis pipeline - SessionTracker.getProviderSessionCountBatch() - fetch session counts in bulk - getProviderLimitUsageBatch() - single entry point combining the batch queries 2. Optimize the getProviderStatistics SQL: - provider_stats CTE: add a created_at >= CURRENT_DATE filter to the LEFT JOIN - latest_call CTE: add a 7-day time-range limit Performance gain (50 providers): - Limits page: 52 DB + 250 Redis → 2 DB + 2 Redis pipelines - Management page: full table scan → only today's / last 7 days' data 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
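As an illustration of the batching idea (not the project's actual `getCurrentCostBatch` implementation), a sketch using an ioredis pipeline to fetch per-provider daily cost counters in one round trip; the `cost:{providerId}:daily` key format is a placeholder assumption:

```typescript
import Redis from "ioredis";

const redis = new Redis();

// Fetch the current daily cost for many providers in a single Redis round trip,
// instead of one query per provider (the N+1 pattern the commit removes).
export async function getCurrentCostBatchSketch(
  providerIds: number[]
): Promise<Map<number, number>> {
  const pipeline = redis.pipeline();
  for (const id of providerIds) {
    pipeline.get(`cost:${id}:daily`); // placeholder key naming
  }
  const results = (await pipeline.exec()) ?? [];

  const costs = new Map<number, number>();
  results.forEach(([err, value], i) => {
    // A failed command yields an error entry; treat failed or missing keys as zero.
    costs.set(providerIds[i], err ? 0 : Number(value ?? 0));
  });
  return costs;
}
```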
Improve content extraction based on relay-pulse: - Improve isSSEResponse detection: require both the event: and data: markers - Add an aggregateResponseText function providing an SSE → JSON → raw-response fallback chain - Switch test-service to aggregateResponseText for content validation Fixed scenario: even when SSE parsing returns empty content, keywords can still be extracted via the JSON or raw-response fallback 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
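A rough sketch of the fallback chain this commit describes, with deliberately simplified helpers; the real parsers in src/lib/provider-testing are more involved:

```typescript
// Heuristic SSE detection: require both field markers, as described above.
export function isSSEResponse(body: string): boolean {
  return body.includes("event:") && body.includes("data:");
}

// Collect the text carried in SSE `data:` lines (very simplified).
function extractSSEText(body: string): string {
  return body
    .split("\n")
    .filter((line) => line.startsWith("data:"))
    .map((line) => line.slice("data:".length).trim())
    .filter((chunk) => chunk !== "" && chunk !== "[DONE]")
    .join("");
}

// SSE → JSON → raw response fallback, so keyword validation still has
// something to match even when SSE parsing yields no content.
export function aggregateResponseText(body: string): string {
  if (isSSEResponse(body)) {
    const sseText = extractSSEText(body);
    if (sseText.length > 0) return sseText;
  }
  try {
    return JSON.stringify(JSON.parse(body));
  } catch {
    return body; // final fallback: the raw response
  }
}
```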
The SQL query used the wrong key name 'providerId', while ProviderChainItem actually defines 'id'. When a request involved retry failover, the JSONB lookup returned NULL and today's usage displayed as 0. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
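For context, a minimal sketch of the mismatch: ProviderChainItem exposes `id`, so both the JSONB expression (`provider_chain->-1->>'id'`) and any application-side logic should read the last chain entry's `id`. The extra field and helper below are illustrative, not the project's actual code:

```typescript
interface ProviderChainItem {
  id: number;    // the field the JSONB lookup must read
  name?: string; // illustrative extra field
}

// Same rule in application code: take the last chain entry's `id` when retries
// occurred, otherwise fall back to the request's own provider id.
function finalProviderId(
  providerChain: ProviderChainItem[] | null,
  requestProviderId: number
): number {
  if (!providerChain || providerChain.length === 0) return requestProviderId;
  return providerChain[providerChain.length - 1].id;
}
```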
Switch the provider availability monitoring status labels from Badge variants to custom classNames, using the same light-background color scheme as the request log table for better UI consistency. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
The frontend hint said "range 1-120 seconds, default 10 seconds", but the backend validation rule is actually 60-600 seconds. Update the streamingIdle description in every language to the correct range and note that 0 disables the timeout. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
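A sketch of the corrected rule (0 disables the timeout, otherwise 60-600 seconds), using the constants confirmed later in this thread (STREAMING_IDLE_TIMEOUT_MS: { MIN: 60000, MAX: 600000 }); the validation function name is hypothetical:

```typescript
const STREAMING_IDLE_TIMEOUT_MS = { MIN: 60_000, MAX: 600_000 };

// 0 disables the idle timeout; any other value must be 60-600 seconds.
export function isValidStreamingIdleSeconds(seconds: number): boolean {
  if (seconds === 0) return true;
  const ms = seconds * 1000;
  return ms >= STREAMING_IDLE_TIMEOUT_MS.MIN && ms <= STREAMING_IDLE_TIMEOUT_MS.MAX;
}
```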
Use pathWithoutLocale instead of pathname as the from parameter, so the locale-aware router does not prepend the locale again and produce broken URLs like /zh-CN/zh-CN/dashboard. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
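Illustrative sketch only; stripLocalePrefix and the login-redirect usage below are assumptions, not the project's actual helper:

```typescript
const SUPPORTED_LOCALES = ["zh-CN", "en", "ja", "ru", "zh-TW"];

// Remove a leading locale segment so a locale-aware router can add it back
// exactly once (avoids /zh-CN/zh-CN/dashboard).
function stripLocalePrefix(pathname: string): string {
  for (const locale of SUPPORTED_LOCALES) {
    if (pathname === `/${locale}` || pathname.startsWith(`/${locale}/`)) {
      return pathname.slice(locale.length + 1) || "/";
    }
  }
  return pathname;
}

// e.g. a login redirect uses the locale-free path as the `from` parameter.
const from = stripLocalePrefix("/zh-CN/dashboard"); // "/dashboard"
const loginUrl = `/login?from=${encodeURIComponent(from)}`;
```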
Core changes: - Remove the complex SSE/JSON parsing logic and match keywords directly with responseBody.includes() - Add a rawResponse field (5000 characters) exposing the full response body for inspection - Replace Stream Info with Raw Response Body in the frontend so users can judge issues themselves - Add rawResponse translations for 5 languages Benefit: regardless of the response format (SSE/JSON/plain text), validation passes as long as the keyword is present 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
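A minimal sketch of the simplified validation described above; the function name and return shape are illustrative:

```typescript
// Simplified validation: any response format passes as long as it contains
// the expected keyword; the truncated raw body is returned for inspection.
const RAW_RESPONSE_MAX_CHARS = 5000;

export function validateTestResponse(responseBody: string, keyword: string) {
  return {
    contentOk: responseBody.includes(keyword),
    rawResponse: responseBody.slice(0, RAW_RESPONSE_MAX_CHARS),
  };
}
```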
Add Traditional Chinese translations for the provider testing feature: - streamInfo, chunksReceived, streamFormat, streamResponse - chunksCount, truncatedPreview, truncatedBrief, copyFormat 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
The provider testing feature references the implementation of the prehisle/relay-pulse project. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Allow callers to fully control dialog width via className, fixing dialogs (such as the provider edit dialog) that could not use max-w-6xl. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Summary of Changes

Hello @ding113, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request delivers a comprehensive upgrade to provider management and monitoring capabilities. It introduces a dedicated dashboard for real-time provider availability, streamlines API testing with a unified and robust validation framework, and enhances system resilience through smart probing for circuit breakers. Additionally, it optimizes backend data retrieval for improved performance and ensures all new features are fully localized.

Highlights
📊 PR Size Analysis

This PR is XL with 6,920 lines changed across 51 files. Large PRs are harder to review and more likely to introduce bugs.

🔀 Suggested Split:

Based on the changes, this PR could be split into:
Why Split?
🤖 Automated analysis by Claude AI
🔒 Security Scan Results

✅ No security vulnerabilities detected

This PR has been scanned against OWASP Top 10, CWE Top 25, and common security anti-patterns. No security issues were identified in the code changes.

Scanned Categories
Security-Positive Patterns Observed
Files Reviewed
Recommendations (Non-blocking)
🤖 Automated security scan by Claude AI - OWASP Top 10 & CWE coverage
Code Review
This pull request introduces several major features, including a provider availability monitoring dashboard, a smart probing mechanism for circuit breakers, and a unified provider testing framework. It also includes significant performance optimizations by batching database and Redis queries. My review identified one high-severity issue in the new smart probing logic. A successful probe incorrectly resets the circuit breaker to a fully CLOSED state instead of the safer HALF_OPEN state. This could reduce system stability when dealing with flaky providers.
src/lib/circuit-breaker-probe.ts
Outdated
// Reset circuit to closed (will transition through half-open on next real request)
// Using resetCircuit here since we want to give the provider another chance
resetCircuit(providerId);
Resetting the circuit directly to CLOSED upon a single successful probe is too aggressive. It bypasses the HALF_OPEN state, which is a crucial part of the circuit breaker pattern for safely verifying a provider's recovery. A single successful lightweight probe might not guarantee the provider can handle production load, and immediately opening it to full traffic could cause it to fail again, leading to a flapping state where the circuit rapidly opens and closes.
The log message on line 115 ("transitioning to half-open") and the comment on line 122 are also misleading, as the current implementation with resetCircuit sets the state directly to CLOSED.
A better approach would be to transition the circuit to the HALF_OPEN state. This could be done by creating a new function, e.g., tripToHalfOpen(providerId), that correctly sets the circuit state, or by adjusting the circuitOpenUntil timestamp to allow the natural transition to HALF_OPEN on the next request.
ding113 left a comment
📋 Code Review Summary
This PR adds multiple new features including provider availability monitoring, unified provider testing with three-tier validation, smart circuit breaker probing, and batch operations for provider limits. The implementation is comprehensive but has several issues that should be addressed.
🔍 Issues Found
- Critical (🔴): 0 issues
- High (🟠): 3 issues
- Medium (🟡): 2 issues
- Low (🟢): 1 issue
🎯 Priority Actions
See inline comments for specific issues identified:
- src/lib/provider-testing/test-service.ts: Missing timeout cleanup in finally block - potential memory leak on fetch abort
- src/lib/circuit-breaker-probe.ts: Infinite setInterval without proper cleanup mechanism
- src/lib/availability/availability-service.ts: Unbounded data loading - potential memory exhaustion
- src/lib/session-tracker.ts: N+1 Redis pipeline pattern with unnecessary second phase
- src/app/api/availability/current/route.ts: Inconsistent authentication mechanism
💡 General Observations
- The new provider testing infrastructure is well-structured with good separation of concerns
- The availability monitoring adds useful observability features
- Consider adding rate limiting to the availability API endpoints to prevent abuse
- The three-tier validation (HTTP status, latency, content) follows industry best practices
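To make the three-tier validation mentioned in the last bullet concrete, here is a minimal sketch; the names, thresholds, and return shape are illustrative, not the PR's actual API:

```typescript
// Hedged sketch of a three-tier check: HTTP status, then latency, then content.
interface TierResult {
  tier: "http" | "latency" | "content";
  ok: boolean;
  detail: string;
}

export function evaluateThreeTiers(
  status: number,
  durationMs: number,
  responseText: string,
  opts: { maxLatencyMs: number; expectedKeyword: string }
): TierResult[] {
  return [
    { tier: "http", ok: status >= 200 && status < 300, detail: `status ${status}` },
    {
      tier: "latency",
      ok: durationMs <= opts.maxLatencyMs,
      detail: `${durationMs}ms (limit ${opts.maxLatencyMs}ms)`,
    },
    {
      tier: "content",
      ok: responseText.includes(opts.expectedKeyword),
      detail: `keyword "${opts.expectedKeyword}"`,
    },
  ];
}
```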
🤖 Automated review by Claude AI - focused on identifying issues for improvement
const controller = new AbortController();
const timeoutId = setTimeout(() => controller.abort(), timeoutMs);

// Execute request
🟠 High: Missing timeout cleanup in error path leading to potential memory leak
Why this is a problem: When the fetch request throws an error (abort, network error, etc.), the clearTimeout(timeoutId) on line 77 is never executed because the error is thrown before reaching it. The timeout continues running, potentially causing issues if many requests fail.
Suggested fix:
const controller = new AbortController();
const timeoutId = setTimeout(() => controller.abort(), timeoutMs);
try {
const response = await fetch(url, {
method: "POST",
headers,
body: JSON.stringify(body),
signal: controller.signal,
});
// ... success handling
} catch (error) {
// ... error handling
} finally {
clearTimeout(timeoutId);
}

if (health.circuitState === "open") {
  openCircuits.push(parseInt(providerId, 10));
}
}
🟠 High: Infinite setInterval without proper cleanup mechanism on graceful shutdown
Why this is a problem: The probeIntervalId is stored but there's no integration with process shutdown handlers. When the Node.js process receives SIGTERM/SIGINT (common in containerized deployments), the interval keeps running until forceful termination, which could leave probes in an inconsistent state.
Suggested fix:
export function startProbeScheduler(): void {
if (!ENABLE_SMART_PROBING) return;
if (probeIntervalId) return;
logger.info("[SmartProbe] Starting probe scheduler", { intervalMs: PROBE_INTERVAL_MS });
probeIntervalId = setInterval(runProbeCycle, PROBE_INTERVAL_MS);
runProbeCycle();
// Register cleanup handlers for graceful shutdown
const cleanup = () => {
stopProbeScheduler();
process.removeListener('SIGTERM', cleanup);
process.removeListener('SIGINT', cleanup);
};
process.on('SIGTERM', cleanup);
process.on('SIGINT', cleanup);
}

endTime: endDate.toISOString(),
bucketSizeMinutes: explicitBucketSize ?? 60,
providers: [],
systemAvailability: 0,
🟠 High: Unbounded query loading all requests into memory could cause OOM
Why this is a problem: This query loads ALL requests within the time range into memory without pagination. For production systems with high traffic, querying 7 days of data could return millions of rows, causing memory exhaustion and application crashes.
Suggested fix:
// Add LIMIT to prevent unbounded loading
const MAX_REQUESTS_PER_QUERY = 100000; // Reasonable limit
const requests = await db
.select({
id: messageRequest.id,
providerId: messageRequest.providerId,
statusCode: messageRequest.statusCode,
durationMs: messageRequest.durationMs,
errorMessage: messageRequest.errorMessage,
createdAt: messageRequest.createdAt,
})
.from(messageRequest)
.where(and(...requestConditions))
.orderBy(messageRequest.createdAt)
.limit(MAX_REQUESTS_PER_QUERY);
// Consider logging a warning if limit is reached
if (requests.length === MAX_REQUESTS_PER_QUERY) {
logger.warn('Availability query hit max limit, results may be incomplete');
}

Alternatively, consider using SQL aggregation instead of in-memory processing for better performance.

providerSessionMap.set(providerId, sessionIds);
allSessionIds.push(...sessionIds);
} else {
  providerSessionMap.set(providerId, []);
🟡 Medium: Redundant pipeline phase - second EXISTS check is unnecessary
Why this is a problem: The second pipeline verifies session existence by checking session:{sessionId}:info, but the zrange already returned valid session IDs from the sorted set. If the session was removed, zremrangebyscore would have already cleaned it. This adds extra Redis round-trips without benefit.
Suggested fix:
// Skip the second verification phase - just count what zrange returned
for (const providerId of providerIds) {
const sessionIds = providerSessionMap.get(providerId) || [];
result.set(providerId, sessionIds.length);
}
return result;

If the verification IS needed (e.g., session:info can expire independently), document why in a comment.

// Verify admin authentication
const authHeader = request.headers.get("Authorization");
const token = authHeader?.replace("Bearer ", "");
🟡 Medium: Inconsistent authentication mechanism between availability endpoints
Why this is a problem: This endpoint uses Bearer token + validateKey() while /api/availability/route.ts uses session cookies via getSession(). This inconsistency could confuse API consumers and make authorization harder to maintain.
Suggested fix: Use the same authentication mechanism as the main availability route:
import { getSession } from "@/lib/auth";
export async function GET(request: NextRequest) {
const session = await getSession();
if (!session || session.user.role !== "admin") {
return NextResponse.json({ error: "Unauthorized" }, { status: 401 });
}
// ...
}

Or document why Bearer token auth is intentionally used here (e.g., for API access from external services).

(mr.provider_chain IS NULL OR jsonb_array_length(mr.provider_chain) = 0) AND mr.provider_id = p.id
OR
-- Case 2: retries occurred, use the 'providerId' of the last providerChain item
-- Case 2: retries occurred, use the 'id' of the last providerChain item
🟢 Low: SQL field name change from 'providerId' to 'id' may break existing data
Why this is a problem: The change from provider_chain->-1->>'providerId' to provider_chain->-1->>'id' assumes all provider_chain JSON objects use the 'id' field. If there's existing data with 'providerId', statistics will be incorrect.
Suggested fix: Add COALESCE to handle both field names during migration:
COALESCE(
(mr.provider_chain->-1->>'id')::int,
(mr.provider_chain->-1->>'providerId')::int
) = p.id

Or confirm via migration that all existing data has been updated.
📝 Documentation Review

📊 Issues Summary
⚡ Priority Fixes
📋 Review Coverage
🔍 Detailed Findings

🟡 Terminology: Description may not accurately reflect functionality

Files: README.md, README.en.md

Problem: The acknowledgments describe relay-pulse as providing "provider detection functionality", but the actual relay-pulse project and the implemented code provide "provider availability monitoring" / "provider testing" functionality with three-tier validation.

Current text (README.en.md): ...and [prehisle/relay-pulse](https://github.com/prehisle/relay-pulse) for provider detection functionality.

Suggested text: ...and [prehisle/relay-pulse](https://github.com/prehisle/relay-pulse) for provider availability monitoring.

Or alternatively: ...and [prehisle/relay-pulse](https://github.com/prehisle/relay-pulse) for provider testing functionality.

Impact: Minor - readers who visit relay-pulse expecting "detection" features may be confused to find availability monitoring instead.

Reference: relay-pulse README describes itself as "LLM 服务可用性监控" (LLM service availability monitoring). The implemented code in this PR (

✅ Verified Items
🤖 Automated docs review by Claude AI
1. circuit-breaker: add a tripToHalfOpen function so a successful smart probe transitions to HALF_OPEN instead of directly to CLOSED - fixes the mismatch between the "transitioning to half-open" log message and the actual behavior - safer recovery that still verifies the provider through the HALF_OPEN stage 2. availability-service: add a MAX_REQUESTS_PER_QUERY limit (100000) - prevents out-of-memory on large time-range queries - logs a warning when the limit is hit 3. test-service: use a finally block to guarantee timeout cleanup - clearTimeout runs even if the request throws 4. current/route.ts: unify the authentication mechanism - switch from Bearer token + validateKey to getSession - consistent with /api/availability 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
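A hedged sketch of what a tripToHalfOpen-style transition could look like over an in-memory circuit map; the state shape and map here are assumptions, not the project's actual circuit-breaker module:

```typescript
type CircuitState = "closed" | "open" | "half_open";

interface CircuitRecord {
  state: CircuitState;
  failureCount: number;
  openUntil?: number; // epoch ms until which the circuit stays open
}

const circuits = new Map<number, CircuitRecord>();

// After a successful probe, move to HALF_OPEN rather than CLOSED so the next
// real request still has to succeed before full traffic is restored.
export function tripToHalfOpen(providerId: number): void {
  const record = circuits.get(providerId) ?? { state: "closed", failureCount: 0 };
  record.state = "half_open";
  record.openUntil = undefined;
  circuits.set(providerId, record);
}
```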
📊 PR Size Analysis

This PR is XL with 6,968 lines changed across 52 files (6,555 additions, 413 deletions). Large PRs are harder to review and more likely to introduce bugs.

🔀 Suggested Split:

Based on the changes, this PR could be split into 5 focused PRs:

1. Provider Availability Monitoring (~2,500 lines)

Core availability feature with dashboard:

2. Smart Circuit Breaker Probing (~800 lines)

Circuit breaker enhancements:

3. Enhanced Provider Testing (~2,500 lines)

Provider testing library with parsers and validators:

4. Performance & Query Optimizations (~300 lines)

Database and query improvements:

5. Bug Fixes & Minor Improvements (~400 lines)

Independent fixes:
Why Split?
Recommended Merge Order:
🤖 Automated analysis by Claude AI
ding113 left a comment
📋 Code Review Summary
This PR introduces a comprehensive provider availability monitoring system, intelligent circuit breaker probing, and enhanced provider testing functionality. The implementation adds significant monitoring capabilities but has several issues that should be addressed.
🔍 Issues Found
- Critical (🔴): 1 issue
- High (🟠): 2 issues
- Medium (🟡): 4 issues
- Low (🟢): 0 issues
🎯 Priority Actions
- 🔴 Critical - API Key Exposure in URL (src/lib/provider-testing/utils/test-prompts.ts:291): The Gemini API key is appended as a URL query parameter (?key=${apiKey}). This is a security risk as URLs with query parameters are commonly logged in server access logs, proxy logs, and browser history. Consider using header-based authentication (the x-goog-api-key header) instead.
- 🟠 High - Missing Error Handling in Probe Scheduler (src/lib/circuit-breaker-probe.ts:232-234): runProbeCycle() is called without await in the interval callback, but errors are only caught at the .catch() level. If loadProviderConfigs() throws synchronously before the async operations, it could crash the scheduler.
- 🟠 High - Unbounded Memory Growth in Provider Config Cache (src/lib/circuit-breaker-probe.ts:31-33): The providerConfigCache Map grows indefinitely as providers are added but is never cleaned up when providers are deleted. Consider adding cache invalidation when providers are removed or implementing a maximum cache size.
- 🟡 Medium - Missing Input Validation for Custom JSON Payload (src/lib/provider-testing/test-service.ts:67-71): The custom payload parsing catches JSON parse errors but doesn't validate the structure of the parsed object. Malformed payloads could cause unexpected behavior downstream.
- 🟡 Medium - Potential SQL Performance Issue with Window Queries (src/repository/provider.ts:408-416): While the 7-day limit on the latest_call CTE is a good optimization, the window function ROW_NUMBER() still needs to process all rows within that window. Consider adding an index on (final_provider_id, created_at DESC) if not already present.
- 🟡 Medium - Race Condition in Batch Redis Pipeline (src/lib/rate-limit/service.ts:888-970): The batch cost query assumes all Redis operations will return in order. If Redis has issues with specific keys, the queryMeta mapping could become misaligned with results.
- 🟡 Medium - Missing Timeout for Provider Config Loading (src/lib/circuit-breaker-probe.ts:69-87): The database query in loadProviderConfigs() has no timeout. If the database is slow or unresponsive, it could block the probe scheduler indefinitely.

💡 General Observations

- The N+1 query fix using Redis batch operations (getCurrentCostBatch and getProviderSessionCountBatch) is a good performance improvement.
- The availability monitoring system is well-structured with proper separation of concerns.
- The three-tier validation system (HTTP status → latency → content) follows good patterns from the relay-pulse reference implementation.
- Consider adding rate limiting to the /api/availability endpoint to prevent abuse.
🤖 Automated review by Claude AI - focused on identifying issues for improvement
🔒 Security Scan Results

✅ No critical security vulnerabilities detected

This PR has been scanned against OWASP Top 10, CWE Top 25, and common security anti-patterns. No security issues requiring immediate attention were identified in the code changes.

📋 OWASP Top 10 Coverage

🛡️ Security Highlights

Positive Security Patterns Found:
ding113 left a comment
📝 Documentation Review
✅ Documentation looks good
No significant issues found. The documentation changes are:
- Technically accurate
- Complete and clear
- Well-structured
- Free of broken links
📋 Review Coverage
- Technical accuracy - 0 issues
- Completeness - 0 issues
- Code examples - 0 issues (no code examples in changes)
- Links and references - 0 issues (all 3 GitHub links verified: zsio/claude-code-hub ✅, router-for-me/CLIProxyAPI ✅, prehisle/relay-pulse ✅)
- Clarity and organization - 0 issues
✅ Changes Validated
- README.md & README.en.md: New acknowledgment for prehisle/relay-pulse is accurate and consistent between English and Chinese versions.
- .env.example: Smart probing configuration documentation correctly documents:
  - ENABLE_SMART_PROBING (default: false) ✅
  - PROBE_INTERVAL_MS (default: 30000 ms = 30 s) ✅
  - PROBE_TIMEOUT_MS (default: 5000 ms = 5 s) ✅
  All values match the implementation in src/lib/circuit-breaker-probe.ts.
- i18n files (settings.json): The streaming idle timeout documentation fix from "1-120 seconds" to "60-600 seconds" is correct - it matches the validation constants in src/lib/constants/provider.constants.ts:34 (STREAMING_IDLE_TIMEOUT_MS: { MIN: 60000, MAX: 600000 }).
ding113 left a comment
Inline comments for PR review
}

return url;
}
🔴 Critical: Potential API Key Exposure in URL
Why this is a problem: The API key is appended as a URL query parameter (?key=${apiKey}). URLs with query parameters are commonly logged in:
- Server access logs
- Reverse proxy logs
- Browser history
- Network debugging tools
- Error tracking services (like Sentry)
This could lead to API key exposure.
Suggested fix:
// Use x-goog-api-key header instead of URL parameter
case "gemini":
case "gemini-cli":
return {
...baseHeaders,
...GEMINI_TEST_HEADERS,
"x-goog-api-key": apiKey,
};

Then remove the URL parameter logic and pass the API key via headers in getTestHeaders() instead.

name: string;
url: string;
key: string;
providerType: ProviderType;
🟠 High: Unbounded Memory Growth in Provider Config Cache
Why this is a problem: The providerConfigCache Map grows indefinitely as providers are added but is never cleaned up when providers are deleted. Over time, this could lead to memory issues in long-running processes.
Suggested fix:
// Option 1: Add maximum cache size with LRU eviction
const MAX_CACHE_SIZE = 1000;
async function loadProviderConfigs(): Promise<void> {
// ... existing code ...
// If cache exceeds max size, clear old entries
if (providerConfigCache.size > MAX_CACHE_SIZE) {
const keysToDelete = Array.from(providerConfigCache.keys())
.slice(0, providerConfigCache.size - MAX_CACHE_SIZE);
keysToDelete.forEach(key => providerConfigCache.delete(key));
}
}
// Option 2: Invalidate cache for deleted providers
// Build cache fresh each time from DB query (which already filters deleted_at IS NULL)
providerConfigCache.clear();
providerConfigCache = new Map(providerList.map(...));

}
} else if (config.preset) {
  // Use preset configuration
  body = getPresetPayload(config.preset, config.model);
🟡 Medium: Missing Input Validation for Custom JSON Payload
Why this is a problem: The code catches JSON parse errors but doesn't validate the structure of the parsed object. A malformed payload (e.g., missing required fields, wrong types) could cause unexpected behavior or cryptic errors downstream.
Suggested fix:
if (config.customPayload) {
try {
body = JSON.parse(config.customPayload);
// Validate basic structure
if (typeof body !== 'object' || body === null) {
throw new Error('Custom payload must be a JSON object');
}
} catch (e) {
if (e instanceof SyntaxError) {
throw new Error('Invalid custom payload JSON: ' + e.message);
}
throw e;
}
}
ding113 left a comment
Additional inline comments
logger.error("[SmartProbe] Scheduled probe cycle failed", {
  error: error instanceof Error ? error.message : String(error),
});
});
🟠 High: Missing Error Handling for Synchronous Errors
Why this is a problem: The runProbeCycle() function is called in the interval callback without proper error handling for synchronous throws. While .catch() handles async rejections, if loadProviderConfigs() or early parts of the function throw synchronously (e.g., due to import failures), it won't be caught.
Suggested fix:
probeIntervalId = setInterval(() => {
void (async () => {
try {
await runProbeCycle();
} catch (error) {
logger.error('[SmartProbe] Scheduled probe cycle failed', {
error: error instanceof Error ? error.message : String(error),
});
}
})();
}, PROBE_INTERVAL_MS);

Alternatively, use a more explicit async wrapper pattern.

window24h.toString()
);
queryMeta.push({ providerId, period: "daily", isRolling: true });
} else {
🟡 Medium: Potential Index Mismatch in Pipeline Results
Why this is a problem: The code assumes pipeline results will always be returned in the exact order they were submitted. While Redis pipelines generally preserve order, if any command fails in an unexpected way, the queryMeta[i] mapping could become misaligned.
Suggested fix:
// Add defensive check to verify array lengths match
if (pipelineResults.length !== queryMeta.length) {
logger.error('[RateLimit] Pipeline results count mismatch', {
expected: queryMeta.length,
received: pipelineResults.length,
});
return result;
}

ELSE (provider_chain->-1->>'id')::int
END AS final_provider_id,
created_at AS last_call_time,
model AS last_call_model
🟡 Medium: Window Function Performance Consideration
Why this is a problem: While the 7-day limit is a good optimization, the ROW_NUMBER() OVER (PARTITION BY final_provider_id ORDER BY created_at DESC) still requires sorting all rows within each partition. For high-traffic systems, this could be slow.
Recommendation: Ensure there's a composite index on (final_provider_id, created_at DESC) in the database schema to optimize this query. If not present, consider adding:
CREATE INDEX idx_message_request_provider_time
ON message_request (final_provider_id, created_at DESC)
WHERE deleted_at IS NULL;
Summary

This release introduces a comprehensive provider availability monitoring system, smart circuit breaker probing for faster provider recovery, and enhanced provider testing with three-tier validation.

Problem

Solution

🆕 New Features

Provider Availability Monitoring (/dashboard/availability)

Smart Circuit Breaker Probing (configuration shown in the sketch below)

- ENABLE_SMART_PROBING (default: false)
- PROBE_INTERVAL_MS (default: 30000 ms)
- PROBE_TIMEOUT_MS (default: 5000 ms)
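For reference, a minimal sketch of reading these settings (defaults as documented in .env.example); the actual parsing in src/lib/circuit-breaker-probe.ts may differ:

```typescript
// Sketch of how the documented environment variables could be read; the
// defaults match .env.example (probing off, 30s interval, 5s timeout).
const ENABLE_SMART_PROBING = process.env.ENABLE_SMART_PROBING === "true";
const PROBE_INTERVAL_MS = Number(process.env.PROBE_INTERVAL_MS ?? 30_000);
const PROBE_TIMEOUT_MS = Number(process.env.PROBE_TIMEOUT_MS ?? 5_000);

export const probeConfig = { ENABLE_SMART_PROBING, PROBE_INTERVAL_MS, PROBE_TIMEOUT_MS };
```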
Enhanced Provider Testing

🔧 Improvements

🐛 Bug Fixes

- sm:max-w-lg width limit

📚 Documentation

Changes

- src/lib/availability/
- src/lib/provider-testing/
- src/lib/circuit-breaker-probe.ts

Testing

Screenshots (if applicable)

N/A - UI components follow existing design patterns