sql, backupccl: P90,P99 SQL Service Latency spikes when a backup is running #69009
Comments
@adityamaru I'm assigning BulkIO as well since it seems like this needs to be a joint investigation. In the meantime, moving this to the backlog for SQL Queries.
Did someone fetch a statement diagnostics bundle during the spike? That would be the best way to start this investigation.
I have a long-running roachprod cluster at …
Probably not... I guess just a bundle for any of the statements could do, since it seems like the latency spikes are uniform across all of SQL?
stmt-bundle-687301840919265284.zip This is a statement bundle for an update that was collected at 11:17 EST, which corresponds to this spike (graph timestamps in UTC).
The captured bundle only took 18ms, so I guess we haven't found a spiky one yet. Can you capture a few? Sorry, I don't think we have a good way to do this right now. You could try the trace-when-slow cluster setting, but that also makes everything else a bit slower, so it might produce confounding effects.
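For anyone retracing this investigation, both capture paths mentioned above can be expressed as SQL commands. This is a sketch assuming a recent CockroachDB version; the `UPDATE` is an illustrative TPCC-style statement, not the exact one from the workload:

```sql
-- Request a statement diagnostics bundle for one execution of a statement:
EXPLAIN ANALYZE (DEBUG) UPDATE warehouse SET w_ytd = w_ytd + 1 WHERE w_id = 1;

-- The "trace-when-slow" option: trace any statement slower than the threshold.
-- Note this adds tracing overhead cluster-wide while enabled.
SET CLUSTER SETTING sql.trace.stmt.enable_threshold = '100ms';
```

Setting the threshold back to `0s` disables the trace-when-slow behavior again.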
@jordanlewis and I poked around for a bit, just noting that the P90/P99 SQL latency lines up with KV latency spikes.
After investigating with @adityamaru, we're removing SQL Queries from the investigation and adding KV, since the query latency lines up exactly with the KV request latency. Next, we need to figure out why KV request latency is slower while backups are running.
@andreimatei, can you please take a look?
@adityamaru, we're wondering whether we could conceivably use @sumeerbhola's work on admission control to improve this issue. @sumeerbhola, do you think that reducing priority for backup-related work would do anything useful for this case, given that this latency spike is observable even when not in an overload scenario?
Thanks for taking a look at this @aayushshah15. Your testing on 4 vCPU machines matches @AlexTalks' and my observations when we reproduced this. We also saw what appeared to be CPU saturation during the backup, even after dropping the load down to give the nodes plenty of headroom. I grabbed a CPU profile at the time, which shows more than 1 of the 4 vCPUs completely pinned evaluating an …

The takeaway from your testing and ours on 4 vCPU machines may be as boring as "backup is CPU intensive, which can lead to latency spikes on machines without much headroom". Your testing on 16 vCPU machines adds more weight to this, because it shows that with sufficient headroom, the latency spike is not seen. That's a good thing, as it means that operators can provision appropriately to avoid foreground traffic disruption from backup.

So it doesn't sound like anything is necessarily going wrong here in terms of logical contention between backups and foreground traffic. Unless there are ways that we can make this LSM scan cheaper, this mostly becomes a question of resource prioritization. Should we be throttling or pacing the iteration in …
I removed the KV and SQL Queries projects and the …
Was this running on a recent master with admission control enabled and #71109, which subjects exports to admission control? There was debate on that PR about whether we should use LowPri instead of NormalPri. An experiment with that change may be interesting. I suspect the P90/P99 will still increase, though maybe not as much, since the default value of the …
Yes, this was run on …
My findings lined up with what @aayushshah15 posted as well, and it was clear that lowering the rate of read/write operations had a lot to do with decreasing latency spikes during backups. [charts elided: TPCC without backups; TPCC with backups] I also did a quick test with the patch @sumeerbhola mentioned, using TPCC with backups. [chart elided]
I'm closing this issue as we've conclusively attributed the latency spikes to resource saturation and do not have further leads to pursue. |
@adityamaru since the problem still exists, even though the issue is closed: I think admission control is a reasonable solution to this, but it requires a deeper understanding of what is happening during the experiment in terms of queueing latency, so that we know exactly what is behaving unexpectedly. If someone from bulk-io wants to investigate further and run an experiment, I am happy to synchronously look at the experiment to try to understand what is going on.
@sumeerbhola sorry for the late response; this is definitely something I am interested in investigating with you. I might not have the bandwidth in the next few weeks, but I think this is a good candidate for stability work. I'm going to assign this issue to myself and ping you for some time when I have a running experiment.
@sumeerbhola, @adityamaru: we have some residual work here to augment admission control to help clamp down p99s when detecting resource overload. I think both of you have a lot more helpful context than me; can we file an issue with some details? I understand that we're planning to do something here in 22.2.
Connecting some dots: future work here is happening in the context of #75066.
Describe the problem
Recently, we have seen a few instances where a spike in the P90 and P99 SQL Service Latency coincides with running backups (v21.1 and master). There is a 3-4x impact on foreground latency when a full/incremental backup is running. This behavior is reproducible on an 8-node roachprod cluster running a TPCC workload, and has also been reported in production by a customer: https://github.com/cockroachlabs/support/issues/1150.

To Reproduce
The service latency graphs should indicate a 3-4x spike every time an incremental backup is run.
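The reproduction steps above are abbreviated; the backup side of such a reproduction can be driven with a backup schedule along these lines (a sketch only: the schedule label and storage URI are placeholders, not the ones used in this experiment):

```sql
-- Hourly incremental backups on top of a daily full backup (placeholder URI).
CREATE SCHEDULE 'tpcc_backup'
  FOR BACKUP INTO 's3://my-bucket/tpcc-backups?AUTH=implicit'
  RECURRING '@hourly'
  FULL BACKUP '@daily';
```

Each time the hourly schedule fires, the service latency graphs can be checked for the spike described above.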
Expected behavior
It is expected that backups will cause spikes in CPU, IOPS, and disk read throughput, since a backup reads data from the cluster and writes it to external storage. A 3-4x impact on foreground traffic is, however, unexpected/unintended. This issue should serve as a placeholder for any investigations into why SQL queries are slow while a backup is running. It could very well be that the backup is hogging resources, or doing something that is causing contention, but to be able to remedy that, we need an indicator/root cause. Folks on the SQL Queries (or Observability) team are better equipped to debug slow SQL queries, so we will need their help to identify the cause of these latency spikes.
Epic CRDB-10556