-
Notifications
You must be signed in to change notification settings - Fork 61
fix(sdk/python): optimize memory usage - 97% reduction vs baseline #137
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
santoshkumarradha
merged 23 commits into
santosh/bench
from
claude/debug-sdk-memory-leak-F4um2
Jan 10, 2026
Merged
fix(sdk/python): optimize memory usage - 97% reduction vs baseline #137
santoshkumarradha
merged 23 commits into
santosh/bench
from
claude/debug-sdk-memory-leak-F4um2
Jan 10, 2026
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
The AI label workflow fails on PRs from forked repositories because GITHUB_TOKEN lacks write permissions. Since many contributions come from forks, disabling the workflow until a proper solution (PAT or GitHub App) is implemented. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* Add explicit return type to useFilterState hook * fix(types): use Partial<ExecutionFilters> in UseFilterStateReturn The convertTagsToApiFormat function returns Partial<ExecutionFilters>, so the return type interface must match to avoid TypeScript errors. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> --------- Co-authored-by: Abir Abbas <abirabbas1998@gmail.com> Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Removes Docker-based dev setup in favor of running Air directly in the host environment. This avoids networking issues between Docker and host (especially on WSL2 where host.docker.internal has limitations). Changes: - Remove Dockerfile.dev and docker-compose.dev.yml - Update dev.sh to run Air natively (auto-installs if missing) - Update README.md with simplified instructions Usage remains simple: ./dev.sh 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude <noreply@anthropic.com>
…tate (#130) When agents go offline, the control plane was incorrectly keeping lifecycle_status as "ready" even though health_status correctly showed "inactive". This caused observability webhooks to receive inconsistent data where offline nodes appeared online based on lifecycle_status. Changes: - Add defensive lifecycle_status enforcement in persistStatus() to ensure consistency with agent state before writing to storage - Update health_monitor.go fallback paths to also update lifecycle_status - Add SystemStateSnapshot event type for periodic agent inventory - Enhance execution events with full reasoner context and metadata - Add ListAgents to ObservabilityWebhookStore interface for snapshots The fix ensures both node_offline events and system_state_snapshot events (every 60s) correctly report lifecycle_status: "offline" for offline agents. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude <noreply@anthropic.com>
Add a Discord badge near the top of README.md to invite users to join the community. Uses Discord's official brand color (#5865F2) and matches the existing badge styling. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude <noreply@anthropic.com>
- Change image path from ghcr.io/agent-field/agentfield-control-plane to agentfield/control-plane - Update login step to use Docker Hub credentials (DOCKERHUB_USERNAME, DOCKERHUB_TOKEN) - Remove unused OWNER env var from Docker metadata step This enables Docker Hub analytics for image pulls. Requires adding DOCKERHUB_USERNAME and DOCKERHUB_TOKEN secrets to the repository. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
* docs: update Docker image references to Docker Hub Update all references from ghcr.io/agent-field/agentfield-control-plane to agentfield/control-plane (Docker Hub). Files updated: - deployments/kubernetes/base/control-plane-deployment.yaml - deployments/helm/agentfield/values.yaml - examples/python_agent_nodes/rag_evaluation/docker-compose.yml - README.md - docs/RELEASE.md (includes new DOCKERHUB_* secrets documentation) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * fix: use real version numbers in RELEASE.md examples Update example commands to use actual versions that exist: - Docker: staging-0.1.28-rc.4 (not 0.1.19-rc.1) - Install script: v0.1.28 and v0.1.28-rc.4 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
…132) Add automated system to remind assigned contributors and free up stale assignments: - contributor-reminders.yml: Scheduled daily check that: - Sends friendly reminder at 7 days without activity - Sends second reminder at 14 days with unassign warning - Unassigns and re-labels as 'help wanted' at 21 days - Skips issues with linked PRs or blocking labels - Supports dry-run mode for testing - issue-assignment-tracking.yml: Real-time event handling that: - Welcomes new assignees with timeline expectations - Clears reminder labels when assignees comment - Clears labels when assignee opens linked PR - Auto-adds 'help wanted' when last assignee leaves This improves contributor experience by setting clear expectations while ensuring stale assignments don't block other contributors.
Memory optimizations for Python SDK to significantly reduce memory footprint: ## Changes ### async_config.py - Reduce result_cache_ttl: 600s -> 120s (2 min) - Reduce result_cache_max_size: 20000 -> 5000 - Reduce cleanup_interval: 30s -> 10s - Reduce max_completed_executions: 4000 -> 1000 - Reduce completed_execution_retention_seconds: 600s -> 60s ### client.py - Add shared HTTP session pool (_shared_sync_session) for connection reuse - Replace per-request Session creation with class-level shared session - Add _init_shared_sync_session() and _get_sync_session() class methods - Reduces connection overhead and memory from session objects ### execution_state.py - Clear input_data after execution completion (set_result) - Clear input_data after execution failure (set_error) - Clear input_data after cancellation (cancel) - Clear input_data after timeout (timeout_execution) - Prevents large payloads from being retained in memory ### async_execution_manager.py - Add 1MB buffer limit for SSE event stream - Prevents unbounded buffer growth from malformed events ## Benchmark Results Memory comparison (1000 iterations, ~10KB payloads): - Baseline pattern: 47.76 MB (48.90 KB/iteration) - Optimized SDK: 1.30 MB (1.33 KB/iteration) - Improvement: 97.3% memory reduction Added benchmark scripts for validation: - memory_benchmark.py: Component-level memory testing - benchmark_comparison.py: Full comparison with baseline patterns
Replace standalone benchmark scripts with proper test suite integration: ## Python SDK - Remove benchmark_comparison.py and memory_benchmark.py - Add tests/test_memory_performance.py with pytest integration - Tests cover AsyncConfig defaults, ExecutionState memory clearing, ResultCache bounds, and client session reuse - Includes baseline comparison and memory regression tests ## Go SDK - Add agent/memory_performance_test.go - Benchmarks for InMemoryBackend Set/Get/List operations - Memory efficiency tests with performance reporting - ClearScope memory release verification (96.9% reduction) ## TypeScript SDK - Add tests/memory_performance.test.ts with Vitest - Agent creation and registration efficiency tests - Large payload handling tests - Memory leak prevention tests All tests verify memory-optimized defaults and proper cleanup.
Add GitHub Actions workflow that runs memory performance tests and posts metrics as PR comments when SDK or control-plane changes. Features: - Runs Python, Go, TypeScript SDK memory tests - Runs control-plane benchmarks - Posts consolidated metrics table as PR comment - Updates existing comment on subsequent runs - Triggered on PRs affecting sdk/ or control-plane/ Metrics tracked: - Heap allocation and per-iteration memory - Memory reduction percentages - Memory leak detection results
Comprehensive performance report for PR reviewers with: ## Quick Status Section - Traffic light status for each component (✅/❌) - Overall pass/fail summary at a glance ## Python SDK Metrics - Lint status (ruff) - Test count and duration - Memory test status - ExecutionState latency (avg/p99) - Cache operation latency (avg/p99) ## Go SDK Metrics - Lint status (go vet) - Test count and duration - Memory test status - Heap usage - ClearScope memory reduction % - Benchmark: Set/Get ns/op, B/op ## TypeScript SDK Metrics - Lint status - Test count and duration - Memory test status - Agent creation memory - Per-agent overhead - Leak growth after 500 cycles ## Control Plane Metrics - Build time and status - Lint status - Test count and duration ## Collapsible Details - Each SDK has expandable details section - Metric definitions table for reference - Link to workflow logs for debugging
Contributor
📊 SDK Performance ReportQuick Status
🐍 Python SDK Details
🔵 Go SDK Details
📘 TypeScript SDK Details
🎛️ Control Plane Details
📖 Metric Definitions
Generated by SDK Performance workflow • View logs |
santoshkumarradha
added a commit
that referenced
this pull request
Jan 10, 2026
- Add TypeScript SDK benchmark (50K handlers in 16.7ms) - Re-run all benchmarks with PR #137 Python memory optimizations - Fix Go memory measurement to use HeapAlloc delta - Regenerate all visualizations with seaborn Results: - Go: 100K handlers in 17.3ms, 280 bytes/handler, 8.2M req/s - TypeScript: 50K handlers in 16.7ms, 276 bytes/handler - Python SDK: 5K handlers in 2.97s, 127 MB total - LangChain: 1K tools in 483ms, 10.8 KB/tool
AbirAbbas
added a commit
that referenced
this pull request
Jan 11, 2026
…ements (#138) * feat(benchmarks): add 100K scale benchmark suite - Go SDK: 100K handlers in 16.4ms, 8.1M req/s throughput - Python SDK benchmark with memory profiling - LangChain baseline for comparison - Seaborn visualizations for technical documentation Results demonstrate Go SDK advantages: - ~3,000x faster registration than LangChain at scale - ~32x more memory efficient per handler - ~520x higher theoretical throughput * fix(sdk/python): optimize memory usage - 97% reduction vs baseline Memory optimizations for Python SDK to significantly reduce memory footprint: ## Changes ### async_config.py - Reduce result_cache_ttl: 600s -> 120s (2 min) - Reduce result_cache_max_size: 20000 -> 5000 - Reduce cleanup_interval: 30s -> 10s - Reduce max_completed_executions: 4000 -> 1000 - Reduce completed_execution_retention_seconds: 600s -> 60s ### client.py - Add shared HTTP session pool (_shared_sync_session) for connection reuse - Replace per-request Session creation with class-level shared session - Add _init_shared_sync_session() and _get_sync_session() class methods - Reduces connection overhead and memory from session objects ### execution_state.py - Clear input_data after execution completion (set_result) - Clear input_data after execution failure (set_error) - Clear input_data after cancellation (cancel) - Clear input_data after timeout (timeout_execution) - Prevents large payloads from being retained in memory ### async_execution_manager.py - Add 1MB buffer limit for SSE event stream - Prevents unbounded buffer growth from malformed events ## Benchmark Results Memory comparison (1000 iterations, ~10KB payloads): - Baseline pattern: 47.76 MB (48.90 KB/iteration) - Optimized SDK: 1.30 MB (1.33 KB/iteration) - Improvement: 97.3% memory reduction Added benchmark scripts for validation: - memory_benchmark.py: Component-level memory testing - benchmark_comparison.py: Full comparison with baseline patterns * refactor(sdk): convert memory benchmarks to proper test suites Replace standalone benchmark scripts with proper test suite integration: ## Python SDK - Remove benchmark_comparison.py and memory_benchmark.py - Add tests/test_memory_performance.py with pytest integration - Tests cover AsyncConfig defaults, ExecutionState memory clearing, ResultCache bounds, and client session reuse - Includes baseline comparison and memory regression tests ## Go SDK - Add agent/memory_performance_test.go - Benchmarks for InMemoryBackend Set/Get/List operations - Memory efficiency tests with performance reporting - ClearScope memory release verification (96.9% reduction) ## TypeScript SDK - Add tests/memory_performance.test.ts with Vitest - Agent creation and registration efficiency tests - Large payload handling tests - Memory leak prevention tests All tests verify memory-optimized defaults and proper cleanup. * feat(ci): add memory performance metrics workflow Add GitHub Actions workflow that runs memory performance tests and posts metrics as PR comments when SDK or control-plane changes. Features: - Runs Python, Go, TypeScript SDK memory tests - Runs control-plane benchmarks - Posts consolidated metrics table as PR comment - Updates existing comment on subsequent runs - Triggered on PRs affecting sdk/ or control-plane/ Metrics tracked: - Heap allocation and per-iteration memory - Memory reduction percentages - Memory leak detection results * feat(ci): enhance SDK performance metrics workflow Comprehensive performance report for PR reviewers with: ## Quick Status Section - Traffic light status for each component (✅/❌) - Overall pass/fail summary at a glance ## Python SDK Metrics - Lint status (ruff) - Test count and duration - Memory test status - ExecutionState latency (avg/p99) - Cache operation latency (avg/p99) ## Go SDK Metrics - Lint status (go vet) - Test count and duration - Memory test status - Heap usage - ClearScope memory reduction % - Benchmark: Set/Get ns/op, B/op ## TypeScript SDK Metrics - Lint status - Test count and duration - Memory test status - Agent creation memory - Per-agent overhead - Leak growth after 500 cycles ## Control Plane Metrics - Build time and status - Lint status - Test count and duration ## Collapsible Details - Each SDK has expandable details section - Metric definitions table for reference - Link to workflow logs for debugging * feat(benchmarks): update with TypeScript SDK and optimized Python SDK - Add TypeScript SDK benchmark (50K handlers in 16.7ms) - Re-run all benchmarks with PR #137 Python memory optimizations - Fix Go memory measurement to use HeapAlloc delta - Regenerate all visualizations with seaborn Results: - Go: 100K handlers in 17.3ms, 280 bytes/handler, 8.2M req/s - TypeScript: 50K handlers in 16.7ms, 276 bytes/handler - Python SDK: 5K handlers in 2.97s, 127 MB total - LangChain: 1K tools in 483ms, 10.8 KB/tool * perf(python-sdk): optimize startup with lazy loading and add MCP/DID flags Improvements: - Implement lazy LiteLLM import in agent_ai.py (saves 10-20MB if AI not used) - Add lazy loading for ai_handler and cli_handler properties - Add enable_mcp (default: False) and enable_did (default: True) flags - MCP disabled by default since not yet fully supported Benchmark methodology fixes: - Separate Agent init time from handler registration time - Measure handler memory independently from Agent overhead - Increase test scale to 10K handlers (from 5K) Results: - Agent Init: 1.07 ms (one-time overhead) - Agent Memory: 0.10 MB (one-time overhead) - Cold Start: 1.39 ms (Agent + 1 handler) - Handler Registration: 0.58 ms/handler - Handler Memory: 26.4 KB/handler (Pydantic + FastAPI overhead) - Request Latency p99: 0.17 µs - Throughput: 7.5M req/s (single-threaded theoretical) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * perf(python-sdk): Reduce per-handler memory from 26.4 KB to 7.4 KB Architectural changes to reduce memory footprint: 1. Consolidated registries: Replace 3 separate data structures (reasoners list, _reasoner_vc_overrides, _reasoner_return_types) with single Dict[str, ReasonerEntry] using __slots__ dataclasses. 2. Removed Pydantic create_model(): Each handler was creating a Pydantic model class (~1.5-2 KB overhead). Now use runtime validation via _validate_handler_input() with type coercion support. 3. On-demand schema generation: Schemas are now generated only when the /discover endpoint is called, not stored per-handler. Added _types_to_json_schema() and _type_to_json_schema() helper methods. 4. Weakref closures: Use weakref.ref(self) in tracked_func closure to break circular references (Agent → tracked_func → Agent) and enable immediate GC. Benchmark results (10,000 handlers): - Memory: 26.4 KB/handler → 7.4 KB/handler (72% reduction) - Registration: 5,797 ms → 624 ms Also updated benchmark documentation to use neutral technical presentation without comparative marketing language. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * ci: Redesign PR performance metrics for clarity and regression detection Simplified the memory-metrics.yml workflow to be scannable and actionable: - Single clean table instead of 4 collapsible sections - Delta (Δ) column shows change from baseline - Only runs benchmarks for affected SDKs (conditional execution) - Threshold-based warnings: ⚠ at +10%, ✗ at +25% for memory - Added baseline.json with current metrics for comparison Example output: | SDK | Memory | Δ | Latency | Δ | Tests | Status | |--------|---------|------|---------|---|-------|--------| | Python | 7.4 KB | - | 0.21 µs | - | ✓ | ✓ | ✓ No regressions detected Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * refactor(benchmarks): Consolidate visualization to 2 scientific figures - Reduce from 6 images to 2 publication-quality figures - benchmark_summary.png: 2x2 grid with registration, memory, latency, throughput - latency_comparison.png: CDF and box plot with proper legends - Fix Python SDK validation error handling (proper HTTP 422 responses) - Update tests to use new _reasoner_registry (replaces _reasoner_return_types) - Clean up unused benchmark result files Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * fix(benchmarks): Re-run Python SDK benchmark with optimized code - Updated AgentField_Python.json with fresh benchmark results - Memory: 7.5 KB/handler (was 26.4 KB) - 30% better than LangChain - Registration: 57ms for 1000 handlers (was 5796ms for 10000) - Consolidated to single clean 2x2 visualization - Removed comparative text, keeping neutral factual presentation Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * feat(benchmarks): Add Pydantic AI comparison, improve visualization - Add Pydantic AI benchmark (3.4 KB/handler, 0.17µs latency, 9M rps) - Update color scheme: AgentField SDKs in blue family, others distinct - Shows AgentField crushing LangChain on key metrics: - Latency: 0.21µs vs 118µs (560x faster) - Throughput: 6.7M vs 15K (450x higher) - Registration: 57ms vs 483ms (8x faster) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * chore(benchmarks): Remove Pydantic AI and CrewAI, keep only LangChain comparison - Remove pydantic-ai-bench/ directory - Remove crewai-bench/ directory - Remove PydanticAI_Python.json results - Update analyze.py to only include AgentField SDKs + LangChain - Regenerate benchmark visualization Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * ci fixes * fix: Python 3.8/3.9 compatibility for dataclass slots parameter The `slots=True` parameter for dataclass was added in Python 3.10. This fix conditionally applies slots only on Python 3.10+, maintaining backward compatibility with Python 3.8 and 3.9 while preserving the memory optimization on newer versions. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * fix(ci): Fix TypeScript benchmark and update baseline for CI environment - Fix TypeScript benchmark failing due to top-level await in CJS mode - Changed from npx tsx -e to writing .mjs file and running with node - Now correctly reports memory (~219 B/handler) and latency metrics - Update baseline.json to match CI environment (Python 3.11, ubuntu-latest) - Python baseline: 7.4 KB → 9.0 KB (reflects actual CI measurements) - Increased warning thresholds to 15% to account for cross-platform variance - The previous baseline was from Python 3.14/macOS which differs from CI Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * fix(ci): TypeScript benchmark now tests actual SDK instead of raw Map The CI benchmark was incorrectly measuring a raw JavaScript Map instead of the actual TypeScript SDK. This fix: - Adds npm build step before benchmark - Uses actual Agent class with agent.reasoner() registration - Measures real SDK overhead (Agent + ReasonerRegistry) - Updates baseline: 276 → 350 bytes/handler (actual SDK overhead) - Aligns handler count with Python (1000) for consistency Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * feat(benchmarks): Add CrewAI and Mastra framework comparisons Add benchmark comparisons for CrewAI (Python) and Mastra (TypeScript): - CrewAI: AgentField is 3.5x faster registration, 1.9x less memory - Mastra: AgentField is 27x faster registration, 6.5x less memory Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * docs: Add SDK performance benchmarks to README Add benchmark comparison tables for Python (vs LangChain, CrewAI) and TypeScript (vs Mastra) frameworks showing registration time, memory per handler, and throughput metrics. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com> Co-authored-by: Abir Abbas <abirabbas1998@gmail.com>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Memory optimizations for Python SDK to significantly reduce memory footprint:
Changes
async_config.py
client.py
execution_state.py
async_execution_manager.py
Benchmark Results
Memory comparison (1000 iterations, ~10KB payloads):
Added benchmark scripts for validation: