
sql, backupccl: P90,P99 SQL Service Latency spikes when a backup is running #69009

Closed
adityamaru opened this issue Aug 16, 2021 · 21 comments
Labels
A-admission-control · C-performance (Perf of queries or internals; solution not expected to change functional behavior) · T-kv (KV Team)

Comments

@adityamaru
Contributor

adityamaru commented Aug 16, 2021

Describe the problem

Recently, we have seen a few instances where a spike in the P90 and P99 SQL Service Latency coincides with running backups (v21.1 and master). There is a 3-4x impact on foreground latency while a full or incremental backup is running. This behavior is reproducible on an 8-node roachprod cluster running a TPCC workload and has also been reported in production by a customer: https://github.com/cockroachlabs/support/issues/1150.

To Reproduce

  1. Import the data:
roachprod create -n 8 $CLUSTER
roachprod stage $CLUSTER cockroach
roachprod start $CLUSTER
roachprod run $CLUSTER:1 "./cockroach workload fixtures import tpcc --warehouses=5000"
  2. Start a TPCC workload (in screen or tmux):
roachprod ssh $CLUSTER:1
./cockroach workload run tpcc --warehouses=5000 --conns=200 --max-rate=250 <output of roachprod pgurl $CLUSTER>
  3. Start a backup schedule with daily fulls and 15-minute incrementals:
CREATE SCHEDULE foo
FOR BACKUP INTO <s3/gs storage location>
WITH revision_history
RECURRING '*/15 * * * *' FULL BACKUP '@daily'
WITH SCHEDULE OPTIONS first_run = 'now';

The service latency graphs should indicate a 3-4x spike every time an incremental backup is run.
[screenshot: SQL service latency spike, 2021-08-13 11:50 AM]
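
(Side note, not part of the original repro: before reading the graphs, it can help to confirm the schedule exists and that incremental backups are actually firing every 15 minutes. A minimal sketch, assuming scheduled backups show up in SHOW JOBS with job_type = 'BACKUP':)

-- Sanity check: list schedules and the most recent backup jobs.
SHOW SCHEDULES;
SELECT job_id, status, created, finished
FROM [SHOW JOBS] WHERE job_type = 'BACKUP'
ORDER BY created DESC LIMIT 10;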

Expected behavior

It is expected that backups will cause spikes in CPU, IOPS, and disk read throughput, since they read data from the cluster and write it to external storage. A 3-4x impact on foreground traffic, however, is unexpected and unintended. This issue should serve as a placeholder for any investigation into why SQL queries are slow while a backup is running. It could very well be that the backup is hogging resources or doing something that causes contention, but to remedy that we need an indicator of the root cause. Folks on the SQL Queries (or Observability) team are better equipped to debug slow SQL queries, so we will need their help to identify the cause of these latency spikes.

Epic CRDB-10556

@adityamaru added the C-bug (Code not up to spec/doc, specs & docs deemed correct; solution expected to change code/behavior) and C-performance (Perf of queries or internals; solution not expected to change functional behavior) labels Aug 16, 2021
@rytaft
Collaborator

rytaft commented Aug 17, 2021

@adityamaru I'm assigning BulkIO as well since it seems like this needs to be a joint investigation. In the meantime, moving this to the backlog for SQL Queries.

@jordanlewis
Member

Did someone fetch a statement diagnostic bundle during the spike? That would be the best way to start this investigation.

@adityamaru
Contributor Author

I have a long-running roachprod cluster at adityamaru-backup-returned running a TPCC workload that is exhibiting this behavior. It takes daily fulls and hourly incrementals. I'm finding it a little hard to identify a particular statement that is exhibiting the increased latency. Do we have any indicator on the statements page that would signal a query is taking longer than expected, so that I can grab a bundle the next time around?

@jordanlewis
Member

Probably not... I guess just a bundle for any of the statements could do, since it seems like the latency spikes are uniform across all of SQL?
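
(Assuming the statement can also be run manually from a SQL shell during a spike, one way to grab a bundle for a single execution is EXPLAIN ANALYZE (DEBUG); the UPDATE below is a hypothetical placeholder rather than a statement taken from this workload:)

-- Illustrative only: collect a statement diagnostics bundle for one execution
-- of a suspect statement while the latency spike is ongoing.
EXPLAIN ANALYZE (DEBUG)
  UPDATE warehouse SET w_ytd = w_ytd + 100 WHERE w_id = 1;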

@adityamaru
Contributor Author

adityamaru commented Aug 24, 2021

stmt-bundle-687301840919265284.zip

This is a statement bundle for an UPDATE, collected at 11:17 EST, which corresponds to the spike shown below (the graph is in UTC).

[screenshot: SQL latency spike around 11:17 EST, 2021-08-24]

@jordanlewis
Member

The captured statement only took 18ms, so I guess we didn't find a spiky one yet. Can you capture a few? Sorry, I don't think we have a good way to do this right now. You could try the trace-when-slow cluster setting, but that makes everything else a bit slower too, so it might produce confounding effects.
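
(For reference, a sketch of the trace-when-slow approach, assuming the setting is named sql.trace.stmt.enable_threshold on this version; the threshold value is arbitrary:)

-- Trace any statement slower than the threshold; this adds overhead
-- cluster-wide, so reset it once enough slow traces have been captured.
SET CLUSTER SETTING sql.trace.stmt.enable_threshold = '250ms';
-- ...reproduce the spike, then:
RESET CLUSTER SETTING sql.trace.stmt.enable_threshold;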

@adityamaru
Contributor Author

[screenshot: P90/P99 SQL latency alongside KV latency, 2021-08-24]

@jordanlewis and I poked around for a bit, just noting that the P90/99 SQL Latency lines up with KV latency spikes.

@blathers-crl bot added the T-kv (KV Team) label Aug 25, 2021
@jordanlewis
Member

After investigating with @adityamaru, we're removing SQL Queries from the investigation and adding KV, since the query latency lines up exactly with the KV request latency.

Next, we need to figure out why KV request latency is slower while backups are running.

@jordanlewis
Member

@andreimatei, can you please take a look?

@pbardea assigned adityamaru and unassigned pbardea Aug 30, 2021
@jordanlewis
Member

@adityamaru, we're wondering whether we could conceivably use @sumeerbhola's work on admission control to improve this issue.

@sumeerbhola, do you think reducing the priority of backup-related work would do anything useful for this case, given that this latency spike is observable even when the cluster is not in an overload scenario?

@blathers-crl bot added the T-sql-queries (SQL Queries Team) label Oct 11, 2021
@exalate-issue-sync bot removed the T-sql-queries (SQL Queries Team) label Oct 11, 2021
@aayushshah15
Contributor

Took a look at this, and my main takeaway is that our observations here are merely a symptom of hardware saturation when the incremental backups run. The roachprod cluster created by the repro steps above uses 4 vCPU nodes, so we don't have a ton of headroom. Every incremental backup run causes a significant corresponding spike in CPU utilization.

8 x 4 vCPU nodes

[screenshots: CPU utilization and tail latency during incremental backups, 8 x 4 vCPU nodes, 2021-11-16]

8 x 16 vCPU nodes

As we would expect, running this workload on a cluster with beefier nodes (16 vCPUs each vs. 4) eliminates these tail latency spikes during incremental backup runs, thanks to the additional CPU headroom.
[screenshots: CPU utilization and tail latency during incremental backups, 8 x 16 vCPU nodes, 2021-11-16]

@nvanbenschoten
Member

Thanks for taking a look at this @aayushshah15. Your testing on 4 vCPU machines matches the observations @AlexTalks and I made when we reproduced this. We also saw what appeared to be CPU saturation during the backup, even after dropping the load down to give the nodes plenty of headroom.

[screenshot: CPU saturation during backup, 2021-10-29]

I grabbed the following CPU profile at the time, which shows more than 1 of the 4 vCPUs completely pinned evaluating an Export request in ExportMVCCToSst:

[screenshot: CPU profile showing ExportMVCCToSst dominating, 2021-10-29]

The takeaway from your testing and ours on 4 vCPU machines may be as boring as "Backup is CPU intensive, which can lead to latency spikes on machines without much headroom". Your testing on 16 vCPU machines adds more weight to this, because it shows that with sufficient headroom, the latency spike is not seen. That's a good thing, as it means that operators can provision appropriately to avoid foreground traffic disruption from backups.

So it doesn't sound like anything is necessarily going wrong here in terms of logical contention between backups and foreground traffic. Unless there are ways we can make this LSM scan cheaper, this mostly becomes a question of resource prioritization. Should we be throttling or pacing the iteration in ExportMVCCToSst? This is fairly analogous to stats collection, which does its own pacing, but backup is more latency-sensitive, so the answer isn't as clear-cut. I also think custom pacing mechanisms have fallen out of favor, given all the recent improvements to admission control.

@nvanbenschoten removed the C-bug (Code not up to spec/doc, specs & docs deemed correct; solution expected to change code/behavior) label Nov 17, 2021
@nvanbenschoten
Member

I removed the KV and SQL Queries projects and the C-bug label from this investigation, as this no longer looks like a bug.

@sumeerbhola
Collaborator

Was this running on a recent master with admission control enabled and #71109, which subjects exports to admission control? There was debate on that PR about whether we should use LowPri instead of NormalPri; an experiment with that change may be interesting. I suspect the P90/P99 will still increase, though maybe not as much, since the default value of the admission.kv_slot_adjuster.overload_threshold cluster setting is 32 (we wanted to be able to saturate CPU even with admission control on). Experimenting with reducing that threshold may also be worthwhile.
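
(For anyone rerunning this, a minimal sketch of the knobs mentioned above; the values are illustrative rather than recommendations, and the admission.kv.enabled setting name is an assumption worth double-checking on your build:)

-- Confirm admission control is enabled on the cluster under test.
SHOW CLUSTER SETTING admission.kv.enabled;
-- Experiment: lower the slot adjuster's overload threshold from its default of 32.
SET CLUSTER SETTING admission.kv_slot_adjuster.overload_threshold = 8;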

@aayushshah15
Contributor

Was this running on a recent master with admission control enabled and #71109 which subjects exports to admission control?

Yes, this was run on master (as of a few days ago) so it should've included that patch. I'll try and re-run it on the smaller cluster with the changes you mentioned.

@AlexTalks
Contributor

AlexTalks commented Nov 22, 2021

My findings lined up with what @aayushshah15 posted as well, and it was clear that lowering the rate of read/write operations had a lot to do with decreasing latency spikes during backups.

TPCC without backups, max-rate=100, 4vCPU machines

[screenshot]

TPCC with backups, max-rate=100, 4vCPU machines

[screenshot]

I also did a quick test with the patch @sumeerbhola mentioned, using LowPri for backups with admission control enabled, but I did not see any significant change in the latency spikes. I should note that we do still see CPU usage spike to around 75% during backups on these 4 vCPU nodes.

TPCC with backups, LowPri exports under Admission Control, max-rate=100, 4vCPU machines

[screenshot]

@aayushshah15
Contributor

I'm closing this issue as we've conclusively attributed the latency spikes to resource saturation and do not have further leads to pursue.

@sumeerbhola
Collaborator

@adityamaru since the problem still exists, even though the issue is closed: I think admission control is a reasonable solution to this, but it requires a deeper understanding of what is happening during the experiment in terms of queueing latency, so that we know exactly what is behaving unexpectedly. If someone from bulk-io wants to investigate further and run an experiment, I am happy to synchronously look at the experiment to try to understand what is going on.

@adityamaru
Contributor Author

@sumeerbhola sorry for the late response; this is definitely something I am interested in investigating with you. I might not have the bandwidth in the next few weeks, but I think this is a good candidate for stability work. I'm going to assign this issue to myself and ping you for some time once I have a running experiment.

@irfansharif
Contributor

@sumeerbhola, @adityamaru: we have some residual work here to augment admission control to help clamp down p99s when detecting resource overload. I think both of you have a lot more helpful context than I do; can we file an issue with some details? I understand that we're planning to do something here in 22.2.

@irfansharif
Contributor

Connecting some dots, future work here is happening under the context of #75066.
