
storage,admission: investigate read-only batch latency during high-volume snapshot ingest #89788

Open
irfansharif opened this issue Oct 11, 2022 · 2 comments
Labels
C-investigation (Further steps needed to qualify. C-label will change.), T-admission-control (Admission Control)

Comments

@irfansharif (Contributor) commented on Oct 11, 2022:

Describe the problem

Experiment discussed internally here. When trying to reproduce snapshot-induced latency hits using the roachtest added in #89191, we noticed that p99.9 latencies increase for read traffic over data that's not currently receiving snapshots. When looking at outlier traces, the time is spent entirely below pebble, and there's little trace information from within pebble to understand why; this issue tracks investigating just that.

To Reproduce

Using #89191-ish:

[screenshot: p99.9 service latency over time, with two red annotations]

First red annotation is leases for foreground load being transferred to the node that's going to start receiving snapshots. Second red annotation is when it starts receiving snapshots, and service latencies start going through the roof. A set of outlier traces can be found here: trace-snapshot-latency.tar.gz. They look roughly like the one below:

[screenshot: outlier trace with the time spent below pebble]

+cc @andrewbaptist, @sumeerbhola.

Jira issue: CRDB-20434

@irfansharif added the C-investigation label on Oct 11, 2022
@nicktrav added the T-admission-control label on Dec 21, 2023
@nicktrav (Collaborator) commented:

@sumeerbhola - this smells a little like what we saw over in cockroachdb/pebble#3792 where a flushable ingest could cause a large filter block read. That has since been addressed. However, this part of the original description gives me some pause:

latencies for read traffic over data that's not currently receiving snapshots see an increase

If the seek was outside the bounds of this flushable, would we even need to load the filter block? If we wouldn't, the #3792 fix may not explain the latency on data that isn't receiving snapshots, and this still needs some investigation.
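To make the question concrete, here's a rough sketch of the short-circuit being asked about. This is purely illustrative Go with hypothetical types (`flushableIngest`, `mayContainPrefix`) — not pebble's actual code — and only shows the idea that a point lookup whose prefix lies outside a flushable ingest's key bounds shouldn't need that sstable's filter block at all:

```go
// Hypothetical illustration, not pebble internals: skip the filter block
// entirely when the sought prefix falls outside the flushable's bounds.
package main

import (
	"bytes"
	"fmt"
)

// flushableIngest stands in for an ingested sstable sitting in the
// flushable queue; smallest/largest are its key bounds.
type flushableIngest struct {
	smallest, largest []byte
}

// mayContainPrefix returns false without touching any filter block when
// the sought prefix is wholly outside the flushable's bounds; only when
// the prefix is inside the bounds would a (potentially large) filter
// block need to be loaded and consulted.
func (f *flushableIngest) mayContainPrefix(prefix []byte, loadFilter func() bool) bool {
	if bytes.Compare(prefix, f.smallest) < 0 || bytes.Compare(prefix, f.largest) > 0 {
		return false // out of bounds: no filter block read required
	}
	return loadFilter() // in bounds: fall back to the bloom filter
}

func main() {
	f := &flushableIngest{smallest: []byte("m"), largest: []byte("t")}
	expensiveFilterLoad := func() bool {
		fmt.Println("loading filter block...")
		return true
	}
	fmt.Println(f.mayContainPrefix([]byte("a"), expensiveFilterLoad)) // false, no filter load
	fmt.Println(f.mayContainPrefix([]byte("p"), expensiveFilterLoad)) // loads the filter
}
```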

Linking this to cockroachdb/pebble#3728, as that would likely give us more insight into where the time is going inside pebble.

@sumeerbhola (Collaborator) commented:

If the seek was outside the bounds of this flushable, would we even need to load the filter block?

I think CRDB's use of SeekPrefixGE, which doesn't set iterator bounds, can hit a path where a RANGEDEL can cause some sstable iterators in the stack to be seeked beyond the prefix. This may get improved as part of cockroachdb/pebble#3329 and cockroachdb/pebble#3845.
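For concreteness, a minimal sketch of the two access patterns in play — SeekPrefixGE on an unbounded iterator vs. an explicitly bounded iterator — assuming a recent pebble where NewIter returns (iter, error). With pebble's default comparer the "prefix" is the whole key (CRDB's own comparer splits off the MVCC suffix), so this only illustrates the shape of the calls, not CRDB's iterator plumbing:

```go
package main

import (
	"fmt"
	"log"

	"github.com/cockroachdb/pebble"
	"github.com/cockroachdb/pebble/vfs"
)

func main() {
	// In-memory store just to have something to iterate over.
	db, err := pebble.Open("", &pebble.Options{FS: vfs.NewMem()})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	_ = db.Set([]byte("a/1"), []byte("v"), pebble.Sync)
	_ = db.Set([]byte("z/9"), []byte("v"), pebble.Sync)
	// A range deletion between the two point keys, standing in for the
	// RANGEDELs mentioned above.
	_ = db.DeleteRange([]byte("b"), []byte("y"), pebble.Sync)

	// Pattern 1: unbounded iterator + SeekPrefixGE (roughly the shape of a
	// CRDB point lookup). The call constrains the returned keys to the
	// prefix, but the iterator carries no LowerBound/UpperBound.
	it, err := db.NewIter(nil)
	if err != nil {
		log.Fatal(err)
	}
	if it.SeekPrefixGE([]byte("m/5")) {
		fmt.Println("found:", string(it.Key()))
	} else {
		fmt.Println("prefix m/5 not found")
	}
	_ = it.Close()

	// Pattern 2: explicit bounds covering just the key of interest, which
	// caps how far any iterator in the stack can be positioned.
	it2, err := db.NewIter(&pebble.IterOptions{
		LowerBound: []byte("m/5"),
		UpperBound: []byte("m/6"),
	})
	if err != nil {
		log.Fatal(err)
	}
	if it2.SeekGE([]byte("m/5")) {
		fmt.Println("found:", string(it2.Key()))
	}
	_ = it2.Close()
}
```

The open question for the reproduction is whether the unbounded form, combined with the RANGEDELs laid down during snapshot ingest, is what drags the internal sstable iterators (and their filter-block reads) past the sought prefix.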

Given we have reproduction steps, we should retry this experiment on master.
