[FEATURE] Support ShuffleServer latency metrics #309

@leixm

Code of Conduct

Search before asking

  • I have searched in the issues and found no similar issues.

Describe the feature

When the ShuffleServer is under high load, the existing metrics do not tell us directly whether client reads and writes have been significantly affected.

Motivation

Determine accurately whether the current service load is adding significant latency to client reads and writes.

Describe the solution

Latency monitoring has two parts. The first is the time spent in the ShuffleServer's processing logic, which we can instrument directly with metrics. The second is the time spent before the processing logic runs, including network delay and RPC queue waiting time.
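A minimal sketch of the first part, assuming the processing logic can be wrapped in a callback. `ProcessingTimer` and the `recordMicros` sink are hypothetical names used only for illustration; they are not existing ShuffleServer classes, and a real change would feed the value into whatever metrics library the server already uses.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.LongConsumer;
import java.util.function.Supplier;

// Hypothetical helper, not an existing ShuffleServer class.
final class ProcessingTimer {
    private ProcessingTimer() {}

    // Runs the handler and reports its elapsed time in microseconds to the sink.
    // The finally block ensures failed requests are measured as well.
    static <T> T timed(Supplier<T> handler, LongConsumer recordMicros) {
        long startNanos = System.nanoTime();
        try {
            return handler.get();
        } finally {
            long elapsedNanos = System.nanoTime() - startNanos;
            recordMicros.accept(TimeUnit.NANOSECONDS.toMicros(elapsedNanos));
        }
    }
}
```

`System.nanoTime()` is used because it is monotonic, so the measurement is not disturbed by wall-clock adjustments on the server.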

For the second part, the client could record a timestamp just before it issues a read or write request and include that timestamp in the request. When the ShuffleServer receives the request, it can compute how long the request was delayed and record that value in its metrics. gRPC may also offer related support.
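The timestamp idea could look roughly like the sketch below. `TimedRequest` and `TransportDelayRecorder` are hypothetical names, and the timestamp field is an assumption about how the request might be extended; none of this is existing ShuffleServer or gRPC API. Because the client and server clocks are separate, the measured value includes any clock skew between the two hosts, so the sketch clamps negative results to zero.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

// Hypothetical request carrying the client-side send time (epoch milliseconds).
final class TimedRequest {
    final long sendTimestampMs;

    TimedRequest(long sendTimestampMs) {
        this.sendTimestampMs = sendTimestampMs;
    }
}

// Hypothetical server-side recorder for the delay before processing starts
// (network transfer plus RPC queue wait).
final class TransportDelayRecorder {
    private final LongSupplier clockMs; // injectable so the logic can be tested
    private final AtomicLong count = new AtomicLong();
    private final AtomicLong totalDelayMs = new AtomicLong();
    private final AtomicLong maxDelayMs = new AtomicLong();

    TransportDelayRecorder(LongSupplier clockMs) {
        this.clockMs = clockMs;
    }

    // Call when the handler first sees the request; returns the observed delay.
    long onReceive(TimedRequest request) {
        // Clock skew can make the difference negative; clamp it to zero.
        long delayMs = Math.max(0L, clockMs.getAsLong() - request.sendTimestampMs);
        count.incrementAndGet();
        totalDelayMs.addAndGet(delayMs);
        maxDelayMs.accumulateAndGet(delayMs, Math::max);
        return delayMs;
    }

    long count() {
        return count.get();
    }

    long maxDelayMs() {
        return maxDelayMs.get();
    }

    double meanDelayMs() {
        long n = count.get();
        return n == 0 ? 0.0 : (double) totalDelayMs.get() / n;
    }
}
```

On the server, the recorder would be constructed with `System::currentTimeMillis`, and the client would fill the field from the same kind of clock. The accuracy of the result then depends on how closely the hosts' clocks are synchronized.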

We can then measure the ShuffleServer's processing latency through percentile metrics such as p95 and p99.
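To make p95 and p99 concrete, here is a small sketch that computes a nearest-rank percentile over a batch of latency samples. `LatencyPercentiles` is a hypothetical helper for illustration only. A production metric would more likely rely on a summary or histogram type from the server's metrics library instead of sorting raw samples.

```java
import java.util.Arrays;

// Hypothetical helper illustrating what a p95 or p99 value means.
final class LatencyPercentiles {
    private LatencyPercentiles() {}

    // Nearest-rank percentile: the smallest sample such that at least
    // p percent of all samples are less than or equal to it.
    static long percentile(long[] samplesMs, double p) {
        if (samplesMs.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (p <= 0.0 || p > 100.0) {
            throw new IllegalArgumentException("p must be in (0, 100]");
        }
        long[] sorted = samplesMs.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0); // 1-based rank
        return sorted[rank - 1];
    }
}
```

For example, with 20 samples of 1 ms through 20 ms, `percentile(samples, 95)` selects the 19th smallest sample and `percentile(samples, 99)` selects the 20th.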

Additional context

No

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
