
sql, backupccl: P90,P99 SQL Service Latency spikes when a backup is running #69009

Closed
adityamaru opened this issue Aug 16, 2021 · 21 comments
Labels
A-admission-control · C-performance (Perf of queries or internals; solution not expected to change functional behavior) · T-kv (KV Team)

Comments

@adityamaru
Contributor

adityamaru commented Aug 16, 2021

Describe the problem

Recently, we have seen a few instances where a spike in the P90 and P99 SQL Service Latency coincides with running backups (v21.1 and master). There is a 3-4x impact on foreground latency while a full or incremental backup is running. This behavior is reproducible on an 8-node roachprod cluster running a TPCC workload and has also been reported in production by a customer: https://github.com/cockroachlabs/support/issues/1150.

To Reproduce

  1. Import the data:
roachprod create -n 8 $CLUSTER
roachprod stage $CLUSTER cockroach
roachprod start $CLUSTER
roachprod run $CLUSTER:1 "./cockroach workload fixtures import tpcc --warehouses=5000"
  2. Start a TPCC workload (in screen or tmux):
roachprod ssh $CLUSTER:1
./cockroach workload run tpcc --warehouses=5000 --conns=200 --max-rate=250 <output of roachprod pgurl $CLUSTER>
  3. Start a backup schedule with daily fulls and 15-minute incrementals:
CREATE SCHEDULE foo
FOR BACKUP INTO <s3/gs storage location>
WITH revision_history
RECURRING '*/15 * * * *' FULL BACKUP '@daily'
WITH SCHEDULE OPTIONS first_run = 'now';

The service latency graphs should indicate a 3-4x spike every time an incremental backup is run.
[screenshot: SQL service latency spike, 2021-08-13 11:50 AM]
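
(Side note, not part of the original repro: before reading the graphs, it can help to confirm the schedule exists and that incremental backups are actually firing every 15 minutes. A minimal sketch, assuming scheduled backups show up in SHOW JOBS with job_type = 'BACKUP':)

-- Sanity check: list schedules and the most recent backup jobs.
SHOW SCHEDULES;
SELECT job_id, status, created, finished
FROM [SHOW JOBS] WHERE job_type = 'BACKUP'
ORDER BY created DESC LIMIT 10;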

Expected behavior

It is expected that backups will cause spikes in CPU, IOPS, and disk read throughput, since they read data from the cluster and write it to external storage. A 3-4x impact on foreground traffic, however, is unexpected and unintended. This issue should serve as a placeholder for any investigation into why SQL queries are slow while a backup is running. It could very well be that the backup is hogging resources or doing something that causes contention, but to remedy that we need an indicator of the root cause. Folks on the SQL Queries (or Observability) team are better equipped to debug slow SQL queries, so we will need their help to identify the cause of these latency spikes.

Epic CRDB-10556

@adityamaru added the C-bug (Code not up to spec/doc, specs & docs deemed correct; solution expected to change code/behavior) and C-performance (Perf of queries or internals; solution not expected to change functional behavior) labels Aug 16, 2021
@rytaft
Collaborator

rytaft commented Aug 17, 2021

@adityamaru I'm assigning BulkIO as well since it seems like this needs to be a joint investigation. In the meantime, moving this to the backlog for SQL Queries.

@jordanlewis
Member

Did someone fetch a statement diagnostic bundle during the spike? That would be the best way to start this investigation.

@adityamaru
Contributor Author

I have a long-running roachprod cluster at adityamaru-backup-returned running a TPCC workload that is exhibiting this behavior. It takes daily fulls and hourly incrementals. I'm finding it a little hard to identify a particular statement that is exhibiting the increased latency. Do we have any indicator on the statements page that would signal a query is taking longer than expected, so that I can grab a bundle the next time around?

@jordanlewis
Member

Probably not... I guess just a bundle for any of the statements could do, since it seems like the latency spikes are uniform across all of SQL?
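
(Assuming the statement can also be run manually from a SQL shell during a spike, one way to grab a bundle for a single execution is EXPLAIN ANALYZE (DEBUG); the UPDATE below is a hypothetical placeholder rather than a statement taken from this workload:)

-- Illustrative only: collect a statement diagnostics bundle for one execution
-- of a suspect statement while the latency spike is ongoing.
EXPLAIN ANALYZE (DEBUG)
  UPDATE warehouse SET w_ytd = w_ytd + 100 WHERE w_id = 1;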

@adityamaru
Contributor Author

adityamaru commented Aug 24, 2021

stmt-bundle-687301840919265284.zip

This is a statement bundle for an UPDATE, collected at 11:17 EST, which corresponds to the spike shown below (the graph is in UTC).

[screenshot: SQL latency spike around 11:17 EST, 2021-08-24]

@jordanlewis
Member

The captured statement only took 18ms, so I guess we didn't find a spiky one yet. Can you capture a few? Sorry, I don't think we have a good way to do this right now. You could try the trace-when-slow cluster setting, but that makes everything else a bit slower too, so it might produce confounding effects.
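
(For reference, a sketch of the trace-when-slow approach, assuming the setting is named sql.trace.stmt.enable_threshold on this version; the threshold value is arbitrary:)

-- Trace any statement slower than the threshold; this adds overhead
-- cluster-wide, so reset it once enough slow traces have been captured.
SET CLUSTER SETTING sql.trace.stmt.enable_threshold = '250ms';
-- ...reproduce the spike, then:
RESET CLUSTER SETTING sql.trace.stmt.enable_threshold;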

@adityamaru
Contributor Author

[screenshot: P90/P99 SQL latency alongside KV latency, 2021-08-24]

@jordanlewis and I poked around for a bit, just noting that the P90/99 SQL Latency lines up with KV latency spikes.

@blathers-crl bot added the T-kv (KV Team) label Aug 25, 2021
@jordanlewis
Member

After investigating with @adityamaru, we're removing SQL Queries from the investigation and adding KV, since the query latency lines up exactly with the KV request latency.

Next, we need to figure out why KV request latency is slower while backups are running.

@jordanlewis
Member

@andreimatei, can you please take a look?

@pbardea assigned adityamaru and unassigned pbardea Aug 30, 2021
@jordanlewis
Member

@adityamaru, we're wondering whether we could conceivably use @sumeerbhola's work on admission control to improve this issue.

@sumeerbhola, do you think reducing the priority of backup-related work would do anything useful for this case, given that this latency spike is observable even when the cluster is not in an overload scenario?

@blathers-crl bot added the T-sql-queries (SQL Queries Team) label Oct 11, 2021
@exalate-issue-sync bot removed the T-sql-queries (SQL Queries Team) label Oct 11, 2021
@aayushshah15
Contributor

Took a look at this, and my main takeaway is that our observations here are merely a symptom of hardware saturation when the incremental backups run. The roachprod cluster created by the repro steps above uses 4 vCPU nodes, so we don't have a ton of headroom. Every incremental backup run causes a significant corresponding spike in CPU utilization.

8 x 4 vCPU nodes

[screenshots: CPU utilization and tail latency during incremental backups, 8 x 4 vCPU nodes, 2021-11-16]

8 x 16 vCPU nodes

As we would expect, running this workload on a cluster with beefier nodes (16 vCPUs each vs. 4) eliminates these tail latency spikes during incremental backup runs, thanks to the additional CPU headroom.
[screenshots: CPU utilization and tail latency during incremental backups, 8 x 16 vCPU nodes, 2021-11-16]

@nvanbenschoten
Member

Thanks for taking a look at this @aayushshah15. Your testing on 4 vCPU machines matches the observations @AlexTalks and I made when we reproduced this. We also saw what appeared to be CPU saturation during the backup, even after dropping the load down to give the nodes plenty of headroom.

[screenshot: CPU saturation during backup, 2021-10-29]

I grabbed the following CPU profile at the time, which shows more than 1 of the 4 vCPUs completely pinned evaluating an Export request in ExportMVCCToSst:

[screenshot: CPU profile showing ExportMVCCToSst dominating, 2021-10-29]

The takeaway from your testing and ours on 4 vCPU machines may be as boring as "Backup is CPU intensive, which can lead to latency spikes on machines without much headroom". Your testing on 16 vCPU machines adds more weight to this, because it shows that with sufficient headroom, the latency spike is not seen. That's a good thing, as it means that operators can provision appropriately to avoid foreground traffic disruption from backups.

So it doesn't sound like anything is necessarily going wrong here in terms of logical contention between backups and foreground traffic. Unless there are ways we can make this LSM scan cheaper, this mostly becomes a question of resource prioritization. Should we be throttling or pacing the iteration in ExportMVCCToSst? This is fairly analogous to stats collection, which does its own pacing, but backup is more latency-sensitive, so the answer isn't as clear-cut. I also think custom pacing mechanisms have fallen out of favor, given all the recent improvements to admission control.

@nvanbenschoten removed the C-bug (Code not up to spec/doc, specs & docs deemed correct; solution expected to change code/behavior) label Nov 17, 2021
@nvanbenschoten
Member

I removed the KV and SQL Queries projects and the C-bug label from this investigation, as this no longer looks like a bug.

@sumeerbhola
Collaborator

Was this running on a recent master with admission control enabled and #71109, which subjects exports to admission control? There was debate on that PR about whether we should use LowPri instead of NormalPri; an experiment with that change may be interesting. I suspect the P90/P99 will still increase, though maybe not as much, since the default value of the admission.kv_slot_adjuster.overload_threshold cluster setting is 32 (we wanted to be able to saturate CPU even with admission control on). Experimenting with reducing that threshold may also be worthwhile.
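
(For anyone rerunning this, a minimal sketch of the knobs mentioned above; the values are illustrative rather than recommendations, and the admission.kv.enabled setting name is an assumption worth double-checking on your build:)

-- Confirm admission control is enabled on the cluster under test.
SHOW CLUSTER SETTING admission.kv.enabled;
-- Experiment: lower the slot adjuster's overload threshold from its default of 32.
SET CLUSTER SETTING admission.kv_slot_adjuster.overload_threshold = 8;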

@aayushshah15
Contributor

Was this running on a recent master with admission control enabled and #71109 which subjects exports to admission control?

Yes, this was run on master (as of a few days ago) so it should've included that patch. I'll try and re-run it on the smaller cluster with the changes you mentioned.

@AlexTalks
Contributor

AlexTalks commented Nov 22, 2021

My findings lined up with what @aayushshah15 posted as well, and it was clear that lowering the rate of read/write operations had a lot to do with decreasing latency spikes during backups.

TPCC without backups, max-rate=100, 4vCPU machines

[screenshot]

TPCC with backups, max-rate=100, 4vCPU machines

[screenshot]

I also did a quick test with the patch @sumeerbhola mentioned, using LowPri for backups with admission control enabled, but I did not see any significant change in the latency spikes. I should note that we do still see CPU usage spike to around 75% during backups on these 4 vCPU nodes.

TPCC with backups, LowPri exports under Admission Control, max-rate=100, 4vCPU machines

[screenshot]

@aayushshah15
Contributor

I'm closing this issue as we've conclusively attributed the latency spikes to resource saturation and do not have further leads to pursue.

@sumeerbhola
Collaborator

@adityamaru since the problem still exists, even though the issue is closed: I think admission control is a reasonable solution to this, but it requires a deeper understanding of what is happening during the experiment in terms of queueing latency, so that we know exactly what is behaving unexpectedly. If someone from bulk-io wants to investigate further and run an experiment, I am happy to synchronously look at the experiment to try to understand what is going on.

@adityamaru
Contributor Author

@sumeerbhola sorry for the late response; this is definitely something I am interested in investigating with you. I might not have the bandwidth in the next few weeks, but I think this is a good candidate for stability work. I'm going to assign this issue to myself and ping you for some time once I have a running experiment.

@irfansharif
Contributor

@sumeerbhola, @adityamaru: we have some residual work here to augment admission control to help clamp down p99s when detecting resource overload. I think both of you have a lot more helpful context than I do; can we file an issue with some details? I understand that we're planning to do something here in 22.2.

@irfansharif
Contributor

Connecting some dots, future work here is happening under the context of #75066.
