@xiaguan
Collaborator

@xiaguan xiaguan commented Aug 12, 2025

In this PR, we introduce client-side support for both transfer and RPC-related metrics.

Metrics are available in two formats: Prometheus-style and human-readable. We've also implemented a background thread that prints the metrics periodically (disabled by default).

I20250812 17:07:51.023622 117228 client.cpp:1417] Client Metrics Report:
Client Metrics Summary
=== Transfer Metrics Summary ===
Total Read: 5.00 MB
Total Write: 5.00 MB

=== Latency Summary (microseconds) ===
Get: count=103, p95<250μs, max<15000μs
Put: count=4, p95<1500μs, max<15000μs
Batch Get: count=1, p95<125μs, max<15000μs
Batch Put: count=3, p95<5000μs, max<20000μs

=== RPC Metrics Summary ===
GetReplicaList: count=104, p95<150μs, max<200μs
PutStart: count=5, p95<250μs, max<250μs
PutEnd: count=4, p95<200μs, max<1500μs
Remove: count=29, p95<300μs, max<400μs
MountSegment: count=1, p95<125μs, max<7000μs
GetFsdir: count=1, p95<125μs, max<250μs
BatchGetReplicaList: count=1, p95<125μs, max<300μs
BatchPutStart: count=3, p95<250μs, max<500μs
BatchPutEnd: count=3, p95<150μs, max<200μs
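
The `p95<250μs` notation in the report above is what a bucketed histogram naturally produces: only an upper bound on the percentile is known, not its exact value. A minimal sketch of that idea follows. The class name `LatencyHistogram`, the bucket boundaries in `kBoundsMicros`, and the method names are all hypothetical and are not taken from this PR.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>

// Hypothetical bucket upper bounds in microseconds (illustrative values only).
constexpr std::array<uint64_t, 6> kBoundsMicros = {125, 250, 500, 1000, 5000, 15000};

// Hypothetical bucketed latency histogram; not the PR's implementation.
class LatencyHistogram {
public:
    // Count the sample in the first bucket whose upper bound exceeds it.
    // Samples at or above the last bound land in an extra overflow bucket.
    void Observe(uint64_t micros) {
        std::size_t i = 0;
        while (i < kBoundsMicros.size() && micros >= kBoundsMicros[i]) ++i;
        counts_[i].fetch_add(1, std::memory_order_relaxed);
    }

    uint64_t Count() const {
        uint64_t total = 0;
        for (const auto& c : counts_) total += c.load(std::memory_order_relaxed);
        return total;
    }

    // Smallest bucket bound B such that at least `percent` percent of the
    // samples are below B. Returns 0 when empty and UINT64_MAX when the
    // percentile falls into the overflow bucket.
    uint64_t PercentileUpperBound(unsigned percent) const {
        const uint64_t total = Count();
        if (total == 0) return 0;
        const uint64_t target = (total * percent + 99) / 100;  // ceil without floats
        uint64_t cumulative = 0;
        for (std::size_t i = 0; i < kBoundsMicros.size(); ++i) {
            cumulative += counts_[i].load(std::memory_order_relaxed);
            if (cumulative >= target) return kBoundsMicros[i];
        }
        return UINT64_MAX;
    }

private:
    std::array<std::atomic<uint64_t>, kBoundsMicros.size() + 1> counts_{};
};
```

With 96 samples at 100 μs and 4 samples at 300 μs, `PercentileUpperBound(95)` returns 125, which a report in the style above would print as `p95<125μs`.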

@xiaguan xiaguan requested a review from Copilot August 12, 2025 10:40
Contributor

Copilot AI left a comment

Pull Request Overview

This PR introduces comprehensive client-side metrics for transfer and RPC operations, adding both Prometheus-style and human-readable metric collection interfaces. The changes enable real-time monitoring of client performance with support for configurable background reporting.

Key changes include:

  • Implementation of client-side metrics collection for transfer (read/write bytes, latency) and RPC operations
  • Addition of background metrics reporting thread with environment variable configuration
  • Integration of metrics tracking throughout the client operation flow

Reviewed Changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

File                      Description
client_metric.h           New header defining metric structures for transfer and RPC operations with summary formatting
client.cpp                Integration of metrics collection throughout client operations and background reporting thread
master_client.cpp         RPC call tracking with latency measurement and metric collection
transfer_task.cpp         Transfer operation byte counting and metrics integration
client.h                  Client class extension with metrics members and reporting thread management
master_client.h           Header updates for metrics parameter passing
transfer_task.h           Header updates for metrics integration in transfer operations
client_metrics_test.cpp   Comprehensive test suite for all metric functionality
CMakeLists.txt            Build configuration for the new test file

@xiaguan xiaguan requested a review from ykwd August 12, 2025 11:13
@stmatengss stmatengss mentioned this pull request Aug 12, 2025
29 tasks
Collaborator

@ykwd ykwd left a comment

Overall, this looks good to me. Just two points:

  1. Is there a way to skip recording metrics when the metrics feature is disabled? This would allow users to avoid the overhead of metrics if they don’t use this feature.
  2. Perhaps we could also add a section in the documentation explaining how to use client metrics.

xiaguan and others added 8 commits August 13, 2025 11:31
This commit introduces a comprehensive metrics system for the client component, tracking transfer byte counts and operation latencies with both human-readable summaries and Prometheus-style serialization. Key features include:
- New TransferMetric for tracking read/write bytes and latency histograms
- MasterClientMetric for RPC call counting and latency tracking
- Environment-controlled metrics reporting (MC_STORE_METRIC_REPORT)
- Automatic periodic metrics collection thread
- Enhanced test coverage for metrics validation
- Unified metrics interface across all client operations

The implementation provides detailed latency percentiles (P50/P95) and total byte tracking with automatic unit conversion (B/KB/MB/GB).
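
The automatic unit conversion mentioned in the commit message can be illustrated with a short helper. This is a sketch only: `FormatBytes` is a hypothetical name, and both the use of binary units (1 KB = 1024 B) and the two-decimal formatting are assumptions chosen to match the `5.00 MB` style of the sample report.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper. Assumes binary units (1 KB = 1024 B) and caps at GB.
inline std::string FormatBytes(uint64_t bytes) {
    static const char* const kUnits[] = {"B", "KB", "MB", "GB"};
    double value = static_cast<double>(bytes);
    int unit = 0;
    // Step up one unit at a time until the value fits or GB is reached.
    while (value >= 1024.0 && unit < 3) {
        value /= 1024.0;
        ++unit;
    }
    char buf[32];
    // Plain byte counts are printed without decimals; larger units get two.
    std::snprintf(buf, sizeof(buf), unit == 0 ? "%.0f %s" : "%.2f %s", value, kUnits[unit]);
    return std::string(buf);
}
```

Under these assumptions, `FormatBytes(5 * 1024 * 1024)` yields `5.00 MB` and `FormatBytes(512)` yields `512 B`.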
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@xiaguan xiaguan force-pushed the master_client_metric branch from 5eed4ec to a98b0b4 Compare August 13, 2025 06:08
@xiaguan
Collaborator Author

xiaguan commented Aug 13, 2025

We can now disable metric collection through an environment variable.

As for the documentation, I don't think we currently have a suitable document for this. It could go in a "Mooncake Configuration Guide" or a "Mooncake Deployment Guide", for example. I'll update the documentation later.
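
One common way to make disabled metrics nearly free is to leave the metric object unallocated and have every recording site check for null. The sketch below shows that pattern under stated assumptions: `OpMetric`, `MakeMetricIfEnabled`, and `ScopedLatency` are hypothetical names, and the convention that a value of `"1"` in the given environment variable disables collection is an assumption, not the behavior of this PR.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <cstdlib>
#include <memory>
#include <string>

// Hypothetical per-operation metric: call count and accumulated latency.
struct OpMetric {
    std::atomic<uint64_t> count{0};
    std::atomic<uint64_t> total_micros{0};
};

// Assumption: the variable named by `disable_env` set to "1" turns metrics off.
// Returns nullptr in that case so callers can skip all recording work.
inline std::unique_ptr<OpMetric> MakeMetricIfEnabled(const char* disable_env) {
    const char* value = std::getenv(disable_env);
    if (value != nullptr && std::string(value) == "1") return nullptr;
    return std::unique_ptr<OpMetric>(new OpMetric());
}

// Records elapsed time on destruction; does nothing when `metric` is null.
class ScopedLatency {
public:
    explicit ScopedLatency(OpMetric* metric) : metric_(metric) {
        // Only read the clock when metrics are enabled.
        if (metric_ != nullptr) start_ = std::chrono::steady_clock::now();
    }

    ~ScopedLatency() {
        if (metric_ == nullptr) return;
        const auto elapsed = std::chrono::duration_cast<std::chrono::microseconds>(
            std::chrono::steady_clock::now() - start_);
        metric_->count.fetch_add(1, std::memory_order_relaxed);
        metric_->total_micros.fetch_add(static_cast<uint64_t>(elapsed.count()),
                                        std::memory_order_relaxed);
    }

    ScopedLatency(const ScopedLatency&) = delete;
    ScopedLatency& operator=(const ScopedLatency&) = delete;

private:
    OpMetric* metric_;
    std::chrono::steady_clock::time_point start_{};
};
```

An operation would then wrap its body in `ScopedLatency timer(metric.get());`, so the disabled path costs a single pointer comparison and no clock reads.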

@xiaguan xiaguan merged commit 57d1c71 into kvcache-ai:main Aug 13, 2025
11 checks passed
@xiaguan xiaguan deleted the master_client_metric branch August 13, 2025 08:47