Reader fragment #749
Conversation
Signed-off-by: Cody Littley <cody@eigenlabs.org>
Signed-off-by: Cody Littley <cody@eigenlabs.org>
Signed-off-by: Cody Littley <cody@eigenlabs.org>
Looks good! Just a few comments.
assignments := chunks.Assignments

data, err := reader.retriever.CombineChunks(chunks)
if err != nil {
This is an interesting case and may warrant different error handling vs. the read failure above.
Read failures may happen for different reasons but mostly suggest issues with network health.
A blob recovery failure here almost certainly suggests a bug, so we may want to set up a noisier alert for this one.
This makes an error log and updates a metric. What did you have in mind for a noisier alert?
Signed-off-by: Cody Littley <cody@eigenlabs.org>
Signed-off-by: Cody Littley <cody@eigenlabs.org>
metrics: &blobReaderMetrics{
	generatorMetrics:        generatorMetrics,
	fetchBatchHeaderMetric:  generatorMetrics.NewLatencyMetric("fetch_batch_header"),
	fetchBatchHeaderSuccess: generatorMetrics.NewCountMetric("fetch_batch_header_success"),
How does it compare to having a metric with a "status" label for success/failure or valid/invalid (instead of one metric for each status)?
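For illustration only, here is a minimal sketch of the labeled-counter approach using prometheus/client_golang directly; the metric name and wiring are placeholders, not the project's generatorMetrics API:

package metrics

import "github.com/prometheus/client_golang/prometheus"

// One counter vector with a "status" label replaces the separate
// fetch_batch_header_success / fetch_batch_header_failure counters.
var fetchBatchHeaderCount = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "fetch_batch_header_total",
		Help: "Number of batch header fetch attempts, partitioned by status.",
	},
	[]string{"status"},
)

func init() {
	prometheus.MustRegister(fetchBatchHeaderCount)
}

// Callers record one increment per attempt, e.g.:
//   fetchBatchHeaderCount.WithLabelValues("success").Inc()
//   fetchBatchHeaderCount.WithLabelValues("failure").Inc()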
The way this code is currently set up, all of the latency metrics use the same base metric name but with different labels. The same is true for the count metrics and the gauge metrics. If anything, I feel like I'm using labels too much right now. 🙃
A better way to set up this metrics API would have been to allow each entry to specify both a metric name and a label. Since this pattern extends to all three components of the traffic generator (the writer, the status checker, and the reader), should I work on this as part of this PR or split that work into a separate one?
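Purely as a sketch of what that API change could look like (all names here are hypothetical, not the existing generatorMetrics interface):

package metrics

// Each constructor takes both a base metric name and a label value, so
// related counters share one metric name and differ only by label.
type CountMetric interface {
	Increment()
}

type GeneratorMetrics interface {
	NewCountMetric(name string, label string) CountMetric
}

// Usage sketch:
//   success := m.NewCountMetric("fetch_batch_header", "success")
//   failure := m.NewCountMetric("fetch_batch_header", "failure")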
Maybe we can work on this + refactoring existing metrics in a separate PR?
sounds like a plan
Proper wrapping may help avoid boilerplate, but over-wrapping can also be an issue, making it unclear what's included. I think two things are worth considering here:
- labels, useful for both readability and visibility into cardinality
- documentation, useful for understanding what each metric is about
Also, I'd suggest dropping "Metric" from the names; it doesn't add information. E.g. invalidBlobMetric would be much more informative as numInvalidBlobs, which makes clear it's a counter of invalid blobs.
That said, it sounds good to me to do this in a separate PR -- smaller/modular PRs are good.
Makes sense. Will address this in the follow-up PR.
tools/traffic/workers/blob_reader.go
Outdated
ctxTimeout, cancel := context.WithTimeout(*r.ctx, r.config.FetchBatchHeaderTimeout)
defer cancel()
batchHeader, err := metrics.InvokeAndReportLatency(r.metrics.fetchBatchHeaderMetric,
I wouldn't consider mixing core logic with monitoring code quite clean, and would prefer a separation.
There are a few patterns I could utilize for reporting latency. Which do you prefer?
Pattern A
Wrap the base function in another function that calls the base function and records the amount of time it took to execute. (This is the pattern currently in the PR)
Pattern B
Measure the time before and after the function call.
start := time.Now()
foo()
end := time.Now()
metrics.reportDurationOfFoo(start, end)
Pattern C
Embed the measurement of the time to execute the method inside the method itself.
func foo() {
	start := time.Now()
	// business logic
	end := time.Now()
	metrics.reportDurationOfFoo(start, end)
}
Pattern D
Is there a strategy you like that I didn't mention?
Both B & C seem reasonable and clear. I'd choose one or the other based on which method is the more appropriate for reporting the metrics.
I've removed invokeAndReportLatency() in favor of pattern B.
tools/traffic/workers/blob_reader.go
Outdated
batchHeader.BlobHeadersRoot,
core.QuorumID(0))
})
cancel()
Why cancel the context here? Do you want to wait for timeout?
Yes, the intention was to wait for a timeout. My understanding was that the context would be automatically cancelled if the timeout elapsed, and that RetrieveBlobChunks would block until either the work completed or the context was cancelled. It's very possible I'm not using a good coding pattern here. What would be the proper way to implement these semantics?
Do we just need to ensure that the ctxTimeout is canceled within the method's scope? Why not defer cancel() like we do in L135?
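For reference, a minimal sketch of the suggested pattern (doFetch and the durations are placeholders). Deferring cancel() does not shorten the wait: the call still blocks until it finishes or the timeout-driven cancellation fires; the deferred cancel() just releases the timer's resources once the method returns.

package main

import (
	"context"
	"fmt"
	"time"
)

// doFetch stands in for a call like RetrieveBlobChunks: it returns when
// its work is done or when the context is cancelled, whichever comes first.
func doFetch(ctx context.Context) error {
	select {
	case <-time.After(50 * time.Millisecond): // pretend work
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	// The timeout cancels the context on its own once it elapses;
	// deferring cancel() only ensures resources are freed when we return.
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()

	if err := doFetch(ctx); err != nil {
		fmt.Println("fetch failed:", err)
		return
	}
	fmt.Println("fetch succeeded")
}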
change made
func (r *BlobReader) reportChunk(operatorId core.OperatorID) {
	metric, exists := r.metrics.operatorSuccessMetrics[operatorId]
	if !exists {
		metric = r.metrics.generatorMetrics.NewCountMetric(fmt.Sprintf("operator_%x_returned_chunk", operatorId))
So there will be hundreds of metrics; I'm not sure that's quite usable.
I agree that this is going to be noisy. Is it critical for this utility to determine which nodes are failing to return chunks, or would it be ok just to report the fraction of nodes that return chunks?
If knowledge of which nodes are failing to return chunks is important, do you have any recommendations for how to extract this sort of data without using metrics this way?
Hmm, would keeping track of counts only for failed retrievals reduce the number of metrics here?
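One possible shape, combining the failure-only idea with an operator label (a hedged sketch using prometheus/client_golang directly; the names are placeholders): since a series is only created when an operator actually fails, the number of active series stays bounded by the set of misbehaving operators rather than the whole operator set.

package metrics

import "github.com/prometheus/client_golang/prometheus"

// A single counter vector labelled by operator ID, incremented only when
// a retrieval from that operator fails. Successful operators never create
// a series, which keeps cardinality tied to the failing set.
var operatorChunkFailures = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "operator_chunk_retrieval_failures_total",
		Help: "Chunk retrieval failures, partitioned by operator ID.",
	},
	[]string{"operator_id"},
)

func init() {
	prometheus.MustRegister(operatorChunkFailures)
}

// Call-site sketch (matching the %x formatting used above):
//   operatorChunkFailures.WithLabelValues(fmt.Sprintf("%x", operatorId)).Inc()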
Based on our in-person discussion, the plan is to address which metrics we enable in production as a follow-up task.
I think this could be updated in this PR
This part needs to be careful about cardinality. SG to follow up to cap the scope here. LMK if you need some inputs on refactoring this later.
The primary reason I'd like to address metrics as a follow-up is that there are other metrics I added in prior PRs that also need to be considered. It makes sense to me to do all the metrics work at once. I'm willing to do it in this PR if you feel strongly, but want to make you aware that the scope of this PR will grow if we want to address all of the metrics issues.
lgtm, @jianoaix can you chime in on Cody's responses?
Signed-off-by: Cody Littley <cody@eigenlabs.org>
Signed-off-by: Cody Littley <cody@eigenlabs.org>
lgtm
tools/traffic/workers/blob_reader.go
Outdated
case <-(*r.ctx).Done():
	err := (*r.ctx).Err()
	if err != nil {
		r.logger.Info("blob r context closed", "err:", err)
blob reader context closed
fixed
tools/traffic/workers/blob_reader.go
Outdated
} else {
	end := time.Now()
	duration := end.Sub(start)
	r.metrics.fetchBatchHeaderMetric.ReportLatency(duration)
These 3 lines can be simplified as r.metrics.fetchBatchHeaderMetric.ReportLatency(time.Since(start))
change made
tools/traffic/workers/blob_reader.go
Outdated
r.logger.Error("failed to get batch header", "err:", err)
r.metrics.fetchBatchHeaderFailure.Increment()
return
} else {
No need for else after a return.
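For illustration, the shape being suggested, with placeholder names:

package main

import (
	"errors"
	"fmt"
)

// fetchValue stands in for a call that can fail.
func fetchValue() (int, error) {
	return 0, errors.New("not available")
}

func run() {
	v, err := fetchValue()
	if err != nil {
		fmt.Println("fetch failed:", err)
		return
	}
	// No else needed: this point is only reached when err == nil.
	fmt.Println("got value:", v)
}

func main() {
	run()
}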
simplified
tools/traffic/workers/blob_reader.go
Outdated
r.logger.Error("failed to read chunks", "err:", err)
r.metrics.readFailureMetric.Increment()
return
} else {
similar here
done
if err != nil {
	tracker.getStatusErrorCountMetric.Increment()
	return nil, err
} else {
Here and below: can be simplified as mentioned above
done
Signed-off-by: Cody Littley <cody@eigenlabs.org>
Why are these changes needed?
PR 8 of 9 for the traffic generator project. This PR adds a worker that reads blobs.