
storage,admission: investigate read-only batch latency during high-volume snapshot ingest #89788

Open
irfansharif opened this issue Oct 11, 2022 · 2 comments
Labels
C-investigation (Further steps needed to qualify. C-label will change.), T-admission-control (Admission Control)

Comments

@irfansharif (Contributor) commented on Oct 11, 2022:

Describe the problem

Experiment discussed internally here. When trying to reproduce snapshot-induced latency hits using the roachtest added in #89191, we noticed that p99.9 latencies increase for read traffic over data that's not currently receiving snapshots. When looking at outlier traces, the time is spent entirely below pebble, and there's little trace information from within pebble to understand why; this issue tracks investigating just that.

To Reproduce

Using #89191-ish:

[screenshot: p99.9 service latency over time, with two red annotations]

First red annotation is leases for foreground load being transferred to the node that's going to start receiving snapshots. Second red annotation is when it starts receiving snapshots, and service latencies start going through the roof. A set of outlier traces can be found here: trace-snapshot-latency.tar.gz. They look roughly like the one below:

[screenshot: outlier trace with the time spent below pebble]

+cc @andrewbaptist, @sumeerbhola.

Jira issue: CRDB-20434

@irfansharif added the C-investigation label on Oct 11, 2022
@nicktrav added the T-admission-control label on Dec 21, 2023
@nicktrav (Collaborator) commented:

@sumeerbhola - this smells a little like what we saw over in cockroachdb/pebble#3792 where a flushable ingest could cause a large filter block read. That has since been addressed. However, this part of the original description gives me some pause:

latencies for read traffic over data that's not currently receiving snapshots see an increase

If the seek was outside the bounds of this flushable, would we even need to load the filter block? If we wouldn't, the #3792 fix may not explain the latency on data that isn't receiving snapshots, and this still needs some investigation.
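To make the question concrete, here's a rough sketch of the short-circuit being asked about. This is purely illustrative Go with hypothetical types (`flushableIngest`, `mayContainPrefix`) — not pebble's actual code — and only shows the idea that a point lookup whose prefix lies outside a flushable ingest's key bounds shouldn't need that sstable's filter block at all:

```go
// Hypothetical illustration, not pebble internals: skip the filter block
// entirely when the sought prefix falls outside the flushable's bounds.
package main

import (
	"bytes"
	"fmt"
)

// flushableIngest stands in for an ingested sstable sitting in the
// flushable queue; smallest/largest are its key bounds.
type flushableIngest struct {
	smallest, largest []byte
}

// mayContainPrefix returns false without touching any filter block when
// the sought prefix is wholly outside the flushable's bounds; only when
// the prefix is inside the bounds would a (potentially large) filter
// block need to be loaded and consulted.
func (f *flushableIngest) mayContainPrefix(prefix []byte, loadFilter func() bool) bool {
	if bytes.Compare(prefix, f.smallest) < 0 || bytes.Compare(prefix, f.largest) > 0 {
		return false // out of bounds: no filter block read required
	}
	return loadFilter() // in bounds: fall back to the bloom filter
}

func main() {
	f := &flushableIngest{smallest: []byte("m"), largest: []byte("t")}
	expensiveFilterLoad := func() bool {
		fmt.Println("loading filter block...")
		return true
	}
	fmt.Println(f.mayContainPrefix([]byte("a"), expensiveFilterLoad)) // false, no filter load
	fmt.Println(f.mayContainPrefix([]byte("p"), expensiveFilterLoad)) // loads the filter
}
```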

Linking this to cockroachdb/pebble#3728, as that would likely give us more insight into where the time is going inside pebble.

@sumeerbhola (Collaborator) commented:

If the seek was outside the bounds of this flushable, would we even need to load the filter block?

I think CRDB's use of SeekPrefixGE, which doesn't set iterator bounds, can hit a path where a RANGEDEL can cause some sstable iterators in the stack to be seeked beyond the prefix. This may get improved as part of cockroachdb/pebble#3329 and cockroachdb/pebble#3845.
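For concreteness, a minimal sketch of the two access patterns in play — SeekPrefixGE on an unbounded iterator vs. an explicitly bounded iterator — assuming a recent pebble where NewIter returns (iter, error). With pebble's default comparer the "prefix" is the whole key (CRDB's own comparer splits off the MVCC suffix), so this only illustrates the shape of the calls, not CRDB's iterator plumbing:

```go
package main

import (
	"fmt"
	"log"

	"github.com/cockroachdb/pebble"
	"github.com/cockroachdb/pebble/vfs"
)

func main() {
	// In-memory store just to have something to iterate over.
	db, err := pebble.Open("", &pebble.Options{FS: vfs.NewMem()})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	_ = db.Set([]byte("a/1"), []byte("v"), pebble.Sync)
	_ = db.Set([]byte("z/9"), []byte("v"), pebble.Sync)
	// A range deletion between the two point keys, standing in for the
	// RANGEDELs mentioned above.
	_ = db.DeleteRange([]byte("b"), []byte("y"), pebble.Sync)

	// Pattern 1: unbounded iterator + SeekPrefixGE (roughly the shape of a
	// CRDB point lookup). The call constrains the returned keys to the
	// prefix, but the iterator carries no LowerBound/UpperBound.
	it, err := db.NewIter(nil)
	if err != nil {
		log.Fatal(err)
	}
	if it.SeekPrefixGE([]byte("m/5")) {
		fmt.Println("found:", string(it.Key()))
	} else {
		fmt.Println("prefix m/5 not found")
	}
	_ = it.Close()

	// Pattern 2: explicit bounds covering just the key of interest, which
	// caps how far any iterator in the stack can be positioned.
	it2, err := db.NewIter(&pebble.IterOptions{
		LowerBound: []byte("m/5"),
		UpperBound: []byte("m/6"),
	})
	if err != nil {
		log.Fatal(err)
	}
	if it2.SeekGE([]byte("m/5")) {
		fmt.Println("found:", string(it2.Key()))
	}
	_ = it2.Close()
}
```

The open question for the reproduction is whether the unbounded form, combined with the RANGEDELs laid down during snapshot ingest, is what drags the internal sstable iterators (and their filter-block reads) past the sought prefix.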

Given we have reproduction steps, we should retry this experiment on master.
