
Perf observability: Encoder data throughput and blob size breakdown #716

Merged
merged 6 commits into Layr-Labs:master on Aug 22, 2024

Conversation

@jianoaix (Contributor) commented Aug 21, 2024

Why are these changes needed?

  • Better visibility into the encoder's data throughput
  • Better visibility into encoding latency, broken down by blob size buckets

Checks

  • I've made sure the lint is passing in this PR.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; in that case, please comment that they are not relevant.
  • Testing Strategy
    • Unit tests
    • Integration tests
    • This PR is not tested :(

@jianoaix jianoaix requested review from dmanc and bxue-l2 August 21, 2024 19:52
@@ -42,7 +42,7 @@ func NewMetrics(httpPort string, logger logging.Logger) *Metrics {
 			Name: "request_total",
 			Help: "the number and size of total encode blob request at server side per state",
 		},
-		[]string{"state"}, // state is either success, ratelimited, canceled, or failure
+		[]string{"type", "state"}, // type is either number or size; state is either success, ratelimited, canceled, or failure
Contributor:
Can we have a separate metric instead of using a label?
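
(For reference, a minimal sketch of how the widened type/state label set from the diff above might be consumed, assuming prometheus/client_golang; the IncrementSuccessfulBlobRequestNum method name and field names here are illustrative assumptions, not necessarily the PR's actual code.)

package metrics

import "github.com/prometheus/client_golang/prometheus"

type Metrics struct {
	// Counter carrying both the request count and the request bytes,
	// distinguished by the "type" label.
	NumEncodeBlobRequest *prometheus.CounterVec
}

func NewMetrics(registry *prometheus.Registry) *Metrics {
	m := &Metrics{
		NumEncodeBlobRequest: prometheus.NewCounterVec(
			prometheus.CounterOpts{
				Name: "request_total",
				Help: "the number and size of total encode blob request at server side per state",
			},
			[]string{"type", "state"}, // type: number | size; state: success | ratelimited | canceled | failure
		),
	}
	registry.MustRegister(m.NumEncodeBlobRequest)
	return m
}

// IncrementSuccessfulBlobRequestNum (name assumed) records one successful request
// and its payload size under the same metric via the "type" label.
func (m *Metrics) IncrementSuccessfulBlobRequestNum(blobSize int) {
	m.NumEncodeBlobRequest.WithLabelValues("number", "success").Inc()
	m.NumEncodeBlobRequest.WithLabelValues("size", "success").Add(float64(blobSize))
}

With this shape, a PromQL query such as rate(request_total{type="size", state="success"}[5m]) would give bytes-per-second throughput, while type="number" gives the request rate.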

@@ -126,7 +126,7 @@ func (s *Server) handleEncoding(ctx context.Context, req *pb.EncodeBlobRequest)
 	}
 
 	totalTime := time.Since(begin)
-	s.metrics.TakeLatency(encodingTime, totalTime)
+	s.metrics.TakeLatency(len(req.GetData()), encodingTime, totalTime)
Contributor:

Since encoding time actually depends on the size of the encoded blob, which is a function of the input blob size and the current stake distribution, should we measure against the coded size instead, i.e. ChunkLength*NumChunks? @dmanc

Contributor:

Why not add ChunkLength and NumChunks as labels too?

Contributor Author:

We are not trying to add every factor that affects latency as a metric label.
Also, for a given operator set, every blob is encoded under the same conditions.

Contributor:

Blobs have the same conditions only within the same reference block number; those conditions can change with the stake distribution. I guess the real question is what we want to get out of this metric. My understanding is that we want to know the encoding speed. Here is my concern: suppose in one stake distribution the minimum stake is 0.01, and in another it is 0.001. Although both blobs are the same size, the proof generation time would differ, because the encoded sizes are different. We might misinterpret that as a performance degradation, whereas in reality it is just a change in the stake distribution.

Contributor Author:

Ideally we want to track a variable from the user's perspective, like "I have a blob of size X, and it takes Y ms to process."
I'm fine with dropping this label if blobSize is a poor variable to monitor.

Contributor Author:

Actually I should have kept it: we are not using blobSize to determine latency here, and the blobSize label will help detect significant changes in the stake distribution.
E.g. if the same blob size starts seeing much larger encoding latency, we will know there has been a fundamental change in the stake distribution.
Without this label we cannot tell whether a latency change is due to a stake distribution change or something else, since all blob sizes, which can have very different latencies, would be mixed together.
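
(For illustration, a minimal sketch of a TakeLatency that attaches a coarse blob-size bucket label, so latency for similar blob sizes can be compared over time as described above; it assumes prometheus/client_golang, and the metric name, bucket boundaries, and label names are illustrative assumptions, not the PR's exact implementation.)

package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

type Metrics struct {
	EncodingLatency *prometheus.SummaryVec
}

func NewMetrics(registry *prometheus.Registry) *Metrics {
	m := &Metrics{
		EncodingLatency: prometheus.NewSummaryVec(
			prometheus.SummaryOpts{
				Name:       "encoding_latency_seconds",
				Help:       "encoding latency at the encoder server, by stage and blob size bucket",
				Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
			},
			[]string{"stage", "blob_size_bucket"}, // stage: encoding | total
		),
	}
	registry.MustRegister(m.EncodingLatency)
	return m
}

// blobSizeBucket maps a raw blob size (bytes) to a low-cardinality bucket label.
func blobSizeBucket(blobSize int) string {
	switch {
	case blobSize <= 32*1024:
		return "<=32KiB"
	case blobSize <= 256*1024:
		return "<=256KiB"
	case blobSize <= 1024*1024:
		return "<=1MiB"
	default:
		return ">1MiB"
	}
}

// TakeLatency records per-stage latency, labeled by the blob's size bucket.
func (m *Metrics) TakeLatency(blobSize int, encoding, total time.Duration) {
	bucket := blobSizeBucket(blobSize)
	m.EncodingLatency.WithLabelValues("encoding", bucket).Observe(encoding.Seconds())
	m.EncodingLatency.WithLabelValues("total", bucket).Observe(total.Seconds())
}

Keeping the bucket set small bounds label cardinality while still letting a dashboard plot latency per blob-size bucket, which is what makes a stake-distribution shift visible as a latency jump at constant blob size.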

@jianoaix jianoaix merged commit 1cdaebd into Layr-Labs:master Aug 22, 2024
6 checks passed