
Reader fragment #749

Merged · 8 commits · Oct 3, 2024

Conversation

@cody-littley (Contributor) commented on Sep 5, 2024:

Why are these changes needed?

PR 8 of 9 for the traffic generator project. This PR adds a worker that reads blobs.

Checks

  • I've made sure the lint is passing in this PR.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, in that case, please comment that they are not relevant.
  • Testing Strategy
    • Unit tests
    • Integration tests
    • This PR is not tested :(

@cody-littley self-assigned this on Sep 5, 2024
@ian-shim (Contributor) left a comment:

Looks good! Just a few comments.

Resolved review threads:

  • tools/traffic/workers/mock_chain_client.go (outdated)
  • api/clients/mock/retrieval_client.go
  • tools/traffic/workers/blob_reader.go (six threads, most outdated)
	assignments := chunks.Assignments

	data, err := reader.retriever.CombineChunks(chunks)
	if err != nil {
Contributor:

This is an interesting case and may warrant different error handling than the read failure above. Read failures may happen for different reasons but mostly suggest issues with network health. A blob recovery failure here almost certainly suggests a bug, so we may want to set up a noisier alert for this one.

Contributor Author:

This makes an error log and updates a metric. What did you have in mind for a noisier alert?
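(For illustration: a minimal sketch of one way to give recovery failures their own signal, reusing the NewCountMetric/Increment API that appears elsewhere in this PR; the metric name and surrounding variables are illustrative, not taken from the actual change. An alert with a low threshold could then key on the recovery-failure counter alone.)

	// Hedged sketch, not the PR's actual code: give blob-recovery failures a
	// dedicated counter, separate from ordinary read failures, so alerting can
	// treat them as a likely bug rather than transient network trouble.
	recoveryFailure := generatorMetrics.NewCountMetric("recovery_failure")

	data, err := reader.retriever.CombineChunks(chunks)
	if err != nil {
		// Recovery failing after chunks were successfully read almost certainly
		// indicates a bug, so it increments its own, noisier counter.
		reader.logger.Error("failed to recombine blob from chunks", "err:", err)
		recoveryFailure.Increment()
		return
	}
	_ = data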

	metrics: &blobReaderMetrics{
		generatorMetrics:        generatorMetrics,
		fetchBatchHeaderMetric:  generatorMetrics.NewLatencyMetric("fetch_batch_header"),
		fetchBatchHeaderSuccess: generatorMetrics.NewCountMetric("fetch_batch_header_success"),
Contributor:

How does it compare to having a metric with a "status" label for success/failure or valid/invalid (instead of one metric for each status)?

@cody-littley (Contributor Author) commented on Sep 10, 2024:

The way this code is currently set up, all of the latency metrics use the same base metric name but with different labels. The same is true for the count metrics and the gauge metrics. If anything, I feel like I'm using labels too much right now. 🙃

The way I should have set up this metrics API would have been to allow each entry to specify both a metric name and a label. Since this pattern extends to all three components of the traffic generator (the writer, the status checker, and the reader), should I work on this as part of this PR or split that work into a separate one?
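(For context, a hedged sketch of the name-plus-label shape being discussed, written against the standard Prometheus Go client rather than this repo's metrics wrapper; the metric name and help text are made up for illustration.)

	package metrics

	import "github.com/prometheus/client_golang/prometheus"

	// One counter, partitioned by a "status" label, instead of one metric per
	// outcome. Each outcome becomes a label value on the same series family.
	var fetchBatchHeaderTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "fetch_batch_header_total",
			Help: "Number of batch header fetches, partitioned by outcome.",
		},
		[]string{"status"},
	)

	func init() {
		prometheus.MustRegister(fetchBatchHeaderTotal)
	}

	// Success path:  fetchBatchHeaderTotal.WithLabelValues("success").Inc()
	// Failure path:  fetchBatchHeaderTotal.WithLabelValues("failure").Inc()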

Contributor:

Maybe we can work on this + refactoring existing metrics in a separate PR?

Contributor Author:

sounds like a plan

Contributor:

Proper wrapping may help avoid boilerplate, but over-wrapping can also be an issue, making it less clear what's included. I think two things are worth considering here:

  • labels, useful for both readability and visibility of cardinality
  • documentation, useful for understanding what this metric is about

Also, I'd suggest dropping the "Metric" suffix in naming; it doesn't provide information. E.g. invalidBlobMetric would be much more informative as numInvalidBlobs, which makes clear it's a counter for invalid blobs.

That said, it sounds good to me to do this in a separate PR -- smaller/modular PRs are good.

Contributor Author:

Makes sense. Will address this in the follow up PR.


	ctxTimeout, cancel := context.WithTimeout(*r.ctx, r.config.FetchBatchHeaderTimeout)
	defer cancel()
	batchHeader, err := metrics.InvokeAndReportLatency(r.metrics.fetchBatchHeaderMetric,
Contributor:

I wouldn't consider mixing core logic with monitoring code quite clean; I'd prefer a separation.

Contributor Author:

There are a few patterns I could utilize for reporting latency. Which do you prefer?

Pattern A

Wrap the base function in another function that calls the base function and records the amount of time it took to execute. (This is the pattern currently in the PR)

Pattern B

Measure the time before and after the function call.

start := time.Now()
foo()
end := time.Now()
metrics.reportDurationOfFoo(start, end)

Pattern C

Embed the measurement of the time to execute the method inside the method itself.

func foo() {
    start := time.Now()
    // business logic
    end := time.Now()
    metrics.reportDurationOfFoo(start, end)
}

Pattern D

Is there a strategy you like that I didn't mention?

Contributor:

Both B & C seem reasonable and clear. I'd choose one or the other based on which method is the more appropriate for reporting the metrics.

Contributor Author:

I've removed the invokeAndReportLatency() helper in favor of pattern B.
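(A minimal sketch of pattern B as applied to the batch header fetch, reusing the metric fields that appear elsewhere in this PR; fetchBatchHeader below is a placeholder for the real call, and the exact success-path code may differ.)

	start := time.Now()
	batchHeader, err := fetchBatchHeader(ctxTimeout) // placeholder for the actual fetch
	if err != nil {
		r.logger.Error("failed to get batch header", "err:", err)
		r.metrics.fetchBatchHeaderFailure.Increment()
		return
	}
	// Measure around the call site, then hand the duration to the metrics layer.
	r.metrics.fetchBatchHeaderSuccess.Increment()
	r.metrics.fetchBatchHeaderMetric.ReportLatency(time.Since(start))
	_ = batchHeader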

		batchHeader.BlobHeadersRoot,
		core.QuorumID(0))
	})
	cancel()
Contributor:

Why cancel the context here? Do you want to wait for timeout?

Contributor Author:

Yes, the intention was to wait for a timeout. My understanding of this block was that the context would be automatically cancelled if the timeout elapsed, and that RetrieveBlobChunks would block until either the work completed or the context was cancelled. It's very possible I'm not using a good coding pattern here. What would be the proper way to implement these semantics?

Contributor:

Do we just need to ensure that the ctxTimeout is canceled within the method's scope?
Why not defer cancel() like we do in L135?

Contributor Author:

change made
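(For clarity, the shape the suggestion points at: create the timeout context, defer cancel() immediately so the context is released on every exit path, and let the retrieval call return once the work finishes or the deadline passes. retrieveChunks and the config field name below are placeholders, not the PR's identifiers.)

	ctxTimeout, cancel := context.WithTimeout(*r.ctx, r.config.RetrieveBlobChunksTimeout) // field name illustrative
	defer cancel() // released whether the call succeeds, fails, or times out

	// The call is expected to observe ctxTimeout internally; no explicit wait
	// on the timeout is needed in the caller.
	chunks, err := retrieveChunks(ctxTimeout) // placeholder for the real retrieval call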

func (r *BlobReader) reportChunk(operatorId core.OperatorID) {
	metric, exists := r.metrics.operatorSuccessMetrics[operatorId]
	if !exists {
		metric = r.metrics.generatorMetrics.NewCountMetric(fmt.Sprintf("operator_%x_returned_chunk", operatorId))
Contributor:

So there will be hundreds of metrics; I'm not sure that's quite usable.

Contributor Author:

I agree that this is going to be noisy. Is it critical for this utility to determine which nodes are failing to return chunks, or would it be ok just to report the fraction of nodes that return chunks?

If knowledge of which nodes are failing to return chunks is important, do you have any recommendations for how to extract this sort of data without using metrics this way?

Contributor:

Hmm, would keeping track of counts only for failed retrievals reduce the number of metrics here?

Contributor Author:

Based on our in-person discussion, the plan is to address which metrics we enable in production as a follow-up task.

Contributor:

I think this could be updated in this PR

Contributor:

This part needs to be careful about cardinality. Sounds good to follow up and cap the scope here. Let me know if you need some input on refactoring this later.

@cody-littley (Contributor Author) commented on Oct 3, 2024:

The primary reason I'd like to address metrics as a follow-up is that there are other metrics I added in prior PRs that also need to be considered, and it makes sense to me to do all of the metrics work at once. I'm willing to do it in this PR if you feel strongly, but want to make you aware that the scope of this PR will grow if we address all of the metrics issues here.
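(For illustration, a hedged sketch of the earlier suggestion to count only failures per operator, mirroring the reportChunk structure above; operatorFailureMetrics and the metric name are hypothetical. Healthy operators then emit no extra series, which caps the metric count at the number of misbehaving operators.)

	func (r *BlobReader) reportMissingChunk(operatorId core.OperatorID) {
		// Only failures get a per-operator counter, bounding cardinality.
		metric, exists := r.metrics.operatorFailureMetrics[operatorId]
		if !exists {
			metric = r.metrics.generatorMetrics.NewCountMetric(
				fmt.Sprintf("operator_%x_missing_chunk", operatorId))
			r.metrics.operatorFailureMetrics[operatorId] = metric
		}
		metric.Increment()
	}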

@ian-shim (Contributor) left a comment:

lgtm, @jianoaix can you chime in on Cody's responses?

@ian-shim (Contributor) left a comment:

lgtm


	case <-(*r.ctx).Done():
		err := (*r.ctx).Err()
		if err != nil {
			r.logger.Info("blob r context closed", "err:", err)
Contributor:

Typo: this should read "blob reader context closed".

Contributor Author:

fixed


	} else {
		end := time.Now()
		duration := end.Sub(start)
		r.metrics.fetchBatchHeaderMetric.ReportLatency(duration)
Contributor:

These 3 lines can be simplified as r.metrics.fetchBatchHeaderMetric.ReportLatency(time.Since(start))

Contributor Author:

change made

r.logger.Error("failed to get batch header", "err:", err)
r.metrics.fetchBatchHeaderFailure.Increment()
return
} else {
Contributor:

No need for an else after a return.

Contributor Author:

simplified
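(For reference, roughly how the error path reads once the else is dropped; the success-path lines are assumed from the snippets above and may not match the final code exactly.)

	if err != nil {
		r.logger.Error("failed to get batch header", "err:", err)
		r.metrics.fetchBatchHeaderFailure.Increment()
		return
	}

	// Success path continues at the same indentation level, no else needed.
	r.metrics.fetchBatchHeaderSuccess.Increment()
	r.metrics.fetchBatchHeaderMetric.ReportLatency(time.Since(start))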

r.logger.Error("failed to read chunks", "err:", err)
r.metrics.readFailureMetric.Increment()
return
} else {
Contributor:

similar here

Contributor Author:

done


	if err != nil {
		tracker.getStatusErrorCountMetric.Increment()
		return nil, err
	} else {
Contributor:

Here and below: can be simplified as mentioned above

Contributor Author:

done

@cody-littley merged commit b3d1c35 into Layr-Labs:master on Oct 3, 2024 (6 checks passed).
@cody-littley deleted the reader-fragment branch on October 3, 2024.