segment writer service #3498

Merged
korniltsev merged 9 commits into main from korniltsev/segmentwriter on Aug 19, 2024
Conversation

korniltsev (Collaborator):

Bring back segment writer service.
Add push protobuf api for segment writer.

The service is still detached: nobody is pushing to it yet.
The service is not as optimized as in the POC.

This will be addressed in follow-ups.

korniltsev requested review from a team as code owners on August 18, 2024 at 19:28
kolesnikovae (Collaborator) left a comment:

LGTM – I've left a few notes, but those are just my thoughts / topics for discussion. Please feel free to ignore them; we'll figure those out along the way.

Comment on lines +43 to +44
f.DurationVar(&cfg.SegmentDuration, prefix+"segment.duration", 500*time.Millisecond, "Timeout when flushing segments to bucket.")
f.BoolVar(&cfg.Async, prefix+"async", false, "Enable async mode for segment writer.")
kolesnikovae (Collaborator):

I think these should be tenant options (limits) rather than global ones
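
To make the suggestion concrete, here is a minimal sketch of how these settings could be resolved per tenant instead of from global flags. The `Limits` interface and method names below are illustrative assumptions, not the existing Pyroscope limits API:

```go
package segmentwriter

import "time"

// Limits is a hypothetical per-tenant overrides interface; the real
// interface and method names in the limits package may differ.
type Limits interface {
    SegmentDuration(tenantID string) time.Duration
    AsyncIngestion(tenantID string) bool
}

// segmentDurationFor picks the per-tenant override when one is set and
// falls back to the global flag default otherwise.
func segmentDurationFor(limits Limits, tenantID string, flagDefault time.Duration) time.Duration {
    if d := limits.SegmentDuration(tenantID); d > 0 {
        return d
    }
    return flagDefault
}
```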

Comment on lines 188 to 191
func ContextWithHeadMetrics(ctx context.Context, reg prometheus.Registerer, prefix string) context.Context {
    return contextWithHeadMetrics(ctx, newHeadMetrics2(reg, prefix))
}

kolesnikovae (Collaborator):

Not for this PR: I saw your attempt to make the dependency on metrics explicit 👍🏻 I really hope we won't pass it via the context
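
For comparison, a rough sketch of the explicit alternative: the head receives its metrics through the constructor rather than through context.Context. The `headMetrics` struct and `NewHead` signature here are stand-ins, not the actual phlaredb API:

```go
package phlaredb

import "github.com/prometheus/client_golang/prometheus"

// headMetrics stands in for the real metrics struct; collectors omitted.
type headMetrics struct{}

func newHeadMetrics(reg prometheus.Registerer, prefix string) *headMetrics {
    // Real code would create and register collectors against reg here.
    return &headMetrics{}
}

// Head takes its metrics as an explicit dependency instead of pulling
// them out of a context value.
type Head struct {
    metrics *headMetrics
}

func NewHead(reg prometheus.Registerer, prefix string) *Head {
    return &Head{metrics: newHeadMetrics(reg, prefix)}
}
```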

pkg/phlaredb/metrics.go (outdated comment, resolved)
Comment on lines 194 to 210
err = pprof.FromBytes(sample.RawProfile, func(p *profilev1.Profile, size int) error {
    if err = segment.ingest(ctx, tenantID, p, id, series.Labels); err != nil {
        reason := validation.ReasonOf(err)
        if reason != validation.Unknown {
            validation.DiscardedProfiles.WithLabelValues(string(reason), tenantID).Add(float64(1))
            validation.DiscardedBytes.WithLabelValues(string(reason), tenantID).Add(float64(size))
            switch validation.ReasonOf(err) {
            case validation.SeriesLimit:
                return connect.NewError(connect.CodeResourceExhausted, err)
            }
        }
    }
    return nil
})
if err != nil {
    return err
}
kolesnikovae (Collaborator):

As we won't have SeriesLimit in segment writer, we can simplify this piece
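
For illustration, the fragment above could shrink to something like the following once the SeriesLimit case is gone: discarded profiles and bytes are still counted, but no error needs to be propagated to the caller. This is a sketch of the same fragment, not final code:

```go
err = pprof.FromBytes(sample.RawProfile, func(p *profilev1.Profile, size int) error {
    if err := segment.ingest(ctx, tenantID, p, id, series.Labels); err != nil {
        // Without a series limit there is nothing to surface to the caller;
        // just account for the discarded data.
        if reason := validation.ReasonOf(err); reason != validation.Unknown {
            validation.DiscardedProfiles.WithLabelValues(string(reason), tenantID).Add(1)
            validation.DiscardedBytes.WithLabelValues(string(reason), tenantID).Add(float64(size))
        }
    }
    return nil
})
if err != nil {
    return err
}
```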

Comment on lines 177 to 179
i.segmentWriter.metrics.segmentFlushTimeouts.WithLabelValues(tenantID).Inc()
i.segmentWriter.metrics.segmentFlushWaitDuration.WithLabelValues(tenantID).Observe(time.Since(t1).Seconds())
level.Error(i.logger).Log("msg", "flush timeout", "err", err)
kolesnikovae (Collaborator):

We assume that the error indicates a timeout. We probably want to check the error type here (or context.Err())
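
A sketch of the kind of check meant here, building on the snippet above and assuming the flush wait surfaces the context error on timeout, so that only genuine deadline expirations are counted as flush timeouts (requires the standard errors, context, and time packages):

```go
if errors.Is(err, context.DeadlineExceeded) || errors.Is(ctx.Err(), context.DeadlineExceeded) {
    i.segmentWriter.metrics.segmentFlushTimeouts.WithLabelValues(tenantID).Inc()
    i.segmentWriter.metrics.segmentFlushWaitDuration.WithLabelValues(tenantID).Observe(time.Since(t1).Seconds())
    level.Error(i.logger).Log("msg", "flush timeout", "err", err)
} else {
    // Not a timeout: report the failure without touching the timeout metrics.
    level.Error(i.logger).Log("msg", "flush failed", "err", err)
}
```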

Comment on lines 160 to 170
var waits = make(map[segmentWaitFlushed]struct{}, len(req.Msg.Series))
for _, series := range req.Msg.Series {
    var shard = shardKey(series.Shard)
    wait, err := i.segmentWriter.ingest(shard, func(segment segmentIngest) error {
        return i.ingestToSegment(ctx, segment, series, tenantID)
    })
    if err != nil {
        return nil, err
    }
    waits[wait] = struct{}{}
}
kolesnikovae (Collaborator):

NB: If we moved the pprof split from distributors to segment writers and restricted requests to a single profile, we would not need to wait for multiple segments to flush (which may result in up to 2 * segment_duration latency)
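
To illustrate, with a single profile per request the handler above would touch exactly one shard and wait on a single flush. In this sketch, `req.Msg.Series` is assumed to become a single series rather than a repeated field, and `waitFlushed` is an assumption about the wait handle's interface, not the actual method name:

```go
// One profile per request: one shard, one segment flush to wait for.
wait, err := i.segmentWriter.ingest(shardKey(req.Msg.Series.Shard), func(segment segmentIngest) error {
    return i.ingestToSegment(ctx, segment, req.Msg.Series, tenantID)
})
if err != nil {
    return nil, err
}
// waitFlushed stands in for however segmentWaitFlushed is awaited.
if err := wait.waitFlushed(ctx); err != nil {
    return nil, err
}
```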

Comment on lines 14 to 17
message PushRequest {
  // series is a set of raw pprof profiles and accompanying labels
  repeated RawProfileSeries series = 1;
}
kolesnikovae (Collaborator):

We discussed this internally at some point, and I recall the consensus was that batching does not benefit us here. On the contrary, it introduces several issues:

  1. Callers have to wait for all the affected shard segment writers to flush, which badly impacts latency and may also impact resource usage on the distributor side.
  2. It complicates error handling. I'm not 100% sure that partial success is handled properly.
  3. It complicates retries on the distributor end.

I hope we'll amend the API and implementation accordingly in follow-up PRs.

korniltsev (Collaborator, Author):

I removed repeated series, but kept repeated samples

korniltsev merged commit 987f743 into main on Aug 19, 2024
18 checks passed
korniltsev deleted the korniltsev/segmentwriter branch on August 19, 2024 at 08:35