Encountering Transaction Deadline Exceeded Errors while benchmarking TPC-C #26529
I executed two test runs; before each run I restarted the cluster to keep the logs short.
UI from 1st run
UI from 2nd run
The cluster itself is stable and keeps running; the problem just manifests as termination of the workload program with a deadline exceeded error. Let me know if you want a run with the --ignore option and fresh logs.
Hi @HeikoOnnebrink, are you using our latest workload generator? Thanks!
Hi @jordanlewis, I am using your latest workload generator. The haproxy.cfg is generated by Cockroach; I just had to increase the client and server timeouts, as the 1 min default caused HAproxy to close sockets before Cockroach completed requests. The exact hardware vendor I do not know; I assume Huawei with Intel CPUs. The flavour of each machine is 16 cores, 60 GB RAM, and 640 GB of local SSD storage accessed via a paravirtualised driver. The OS is the latest CoreOS stable, and Cockroach runs inside a Docker container.
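For context, a run of this kind is typically driven through the load balancer roughly as sketched below; the address, warehouse count, and duration are placeholders rather than Heiko's actual invocation.

```
# Sketch only: drive TPC-C through HAproxy with the standalone workload tool.
# <haproxy-address> and the flag values are illustrative.
./workload run tpcc \
  --warehouses=1000 \
  --duration=90m \
  'postgresql://root@<haproxy-address>:26257?sslmode=disable'
```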
Hey @HeikoOnnebrink - thanks for your patience here. I ran a couple tests to see if I could reproduce your issues:
In each case, I ran the workload for 90 minutes. The results were:

Case 1:

```
_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
```

Case 2:
One final thing I should mention: I was running HAproxy and workload on a separate machine, not on the cluster itself. Here's the haproxy.cfg:
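The attached haproxy.cfg isn't preserved in this thread. As both sides note, the file is generated by Cockroach itself; the sketch below shows how such a config is produced (the node address is a placeholder, and the security flag must match the cluster's mode).

```
# Regenerate a load-balancer config of the same shape from any node.
# <any-node-address> is a placeholder; use --certs-dir instead of --insecure
# on a secure cluster.
cockroach gen haproxy --host=<any-node-address> --insecure
# The resulting haproxy.cfg fronts the cluster on port 26257 in TCP mode with
# round-robin balancing and HTTP health checks against each node.
```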
I've also seen transaction deadline exceeded errors from our workload tpcc implementation. They happen when the cockroach cluster isn't even close to being able to keep up with the load, which you can also tell from the very high query latencies. Given that you're only running with 1000 warehouses on 3 nodes, I'm with @tim-o on being suspicious about the hardware or system configuration. I don't have any particular suggestions, though.
HAproxy and workload run on separate dedicated machines. The HAproxy config is the same as yours (both seem to be generated by Cockroach). The machines are on a production OpenStack cloud next to tons of other machines where we do not see complaints; maybe our workload runs a bit closer to the edge. Right now I am building up a fresh 30-node cluster with the same 16-core node type and local SSD on our latest OpenStack cloud. It is not yet open for user access, so I am alone there and can rule out the impact of (overprovisioned) concurrent machines. I will update once I have some stats; first I need to populate 2.2 TB of data for the 10,000 warehouse test.
Here is the next feedback, from the 30-node cluster test. Again I used 16-core nodes with the same OS/spec as above. I kicked off a workload --init last Friday to load the 10,000 warehouse data.
What I observed is that the logs filled up with hundreds of thousands of these errors (several errors per second):
The logs of each node have grown to 500 MB, filled with just the above errors. Here is a screenshot of the metrics during the workload run. The first hours of metrics can be ignored; they are from a first run where I had an issue because NTP settings were not properly applied to all nodes.
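For reference, the 10,000 warehouse load mentioned above corresponds roughly to the command below; the syntax follows the current workload tooling and the address is a placeholder, so treat it as a sketch rather than the exact invocation.

```
# Sketch: bulk-load the TPC-C dataset for 10,000 warehouses (roughly 2.2 TB)
# before running the benchmark. The address is a placeholder.
./workload init tpcc \
  --warehouses=10000 \
  'postgresql://root@<haproxy-address>:26257?sslmode=disable'
```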
Next update: performance looked good for the first 20 minutes, then P99 latency started to grow and SQL throughput dropped by 50%. After 1 h 6 min the test failed with deadline exceeded. P50 latency stayed between 10-20 ms most of the time. Server logs will get uploaded tomorrow. Still no clue why performance stays stable for only 20 minutes.
Hmm, that doesn't look good. What mount options do you have on your disks, if any? Have you tried using the … option? Can you attach the log commit and command commit latency graphs from the Storage dashboard?
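For anyone following along, one way to answer the mount-options question is to inspect the filesystem backing the CockroachDB store directory; the path below is a placeholder for wherever the store actually lives.

```
# Show the device, filesystem type, and mount options for the mount that
# contains the store directory (placeholder path).
findmnt --target /mnt/cockroach-data
# Full list of active mounts and their options:
cat /proc/mounts
```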
I repeated the tests on the same Cockroach cluster, this time with 800 warehouses (instead of 1000). This time things look much better (even though latency is higher than it should be), and the response was stable. I see two topics now. On my end I need to investigate the I/O stack to check and improve performance.
Thanks Heiko! I've made product aware of the performance degradation; we're getting an issue together for tracking. I'll let you know once that's been created. One thought: it might help us resolve this faster if we could get access to your test cluster to pull stats and observe the system while it's under load. Would that be possible?
Tobias and Nathan have had access to our previous test cluster in the past via a WebEx session. For sure you can have access to the new test systems as well. The only problem is that I am leaving tomorrow for holiday. After the holiday I am at VMware HQ in London for 2 days, and then back in the office from the 28th of June. I will reserve all the time from the 28th onwards to continue testing and give you access to the systems.
As a first piece of information about I/O: the SSD storage used by the VMs is configured as RAID-6 via a RAID controller. The OpenStack team has been asked to check the I/O stack from their end; here too I will work with them once I am back.
From your performance white paper I gathered that you run on Google n1-16 nodes with local SSD (SSD scratch disks). From what I read at Google, those seem to be plain, non-mirrored SSD modules, which would explain the much better write performance and lower latency. And from what I have learned, latency is poison for RocksDB.
Thanks @HeikoOnnebrink - have a great holiday. We'll keep an eye out for similar reports, and can catch up on the 28th. It sounds like you have a good lead with OpenStack's disk performance. Hopefully their support can make some recommendations about how to improve IOPS and latency; if there's a configuration that works, we'll happily update docs so the next user has an easier go of it. If there's anything else we can do in the meantime, let me know.
Hey @HeikoOnnebrink - if you have some cycles before our meeting tomorrow, it'd be helpful to get sysstat up and logging values on the nodes we'll be testing. You can set up a crontab to log values once per minute (just substitute yum for apt-get if running CentOS):
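The exact commands weren't preserved above. A minimal version of that setup, assuming a Debian/Ubuntu host (the sa1 path differs on CentOS, e.g. /usr/lib64/sa/sa1), looks like this:

```
# Install sysstat (substitute yum for apt-get on CentOS).
sudo apt-get update && sudo apt-get install -y sysstat
# On Debian/Ubuntu, enable data collection.
sudo sed -i 's/ENABLED="false"/ENABLED="true"/' /etc/default/sysstat
# Collect one sample per minute via cron.
echo '* * * * * root /usr/lib/sysstat/sa1 1 1' | sudo tee /etc/cron.d/sysstat-minute
```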
Once that's done, the values can be read back with sar. Here are the results for disk I/O, memory, and CPU utilization on a cluster running a 1000 warehouse test, on three GCE n1-16 nodes.
@tim-o Can you also pull the network data with sar?
Sure - here's the (raw) output of sar:
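The exact flags aren't preserved in the comments above; the standard sar options below cover the values discussed in this thread (an assumption about the invocation, not a transcript).

```
sar -u          # CPU utilization
sar -r          # memory utilization
sar -d -p       # per-device disk I/O, including latency and queue size
sar -n DEV      # per-interface network throughput
```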
I am running inside Docker on CoreOS, so things are a bit different and restricted. I spent some time and managed to get the stats you need from inside the Docker toolbox container (Fedora based); see https://coreos.com/os/docs/latest/install-debugging-tools.html
But a … Hope I have prepared all you will need; see you later at 7 pm CET.
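A sketch of that CoreOS approach, assuming the stock toolbox command (which drops into a Fedora-based container) and its dnf package manager:

```
# On the CoreOS host: launch the Fedora-based debugging container.
toolbox
# Inside the toolbox container: install sysstat and take a quick sample.
dnf install -y sysstat
sar -u 1 10     # CPU utilization, one sample per second for 10 seconds
```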
Thanks Heiko. If you get a chance, it'd be good to review the sar output before the meeting - otherwise we'll talk at 7.
Tim's 3 Google n1-16 nodes show very good network and I/O performance. @HeikoOnnebrink, we should check the same on your nodes and compare the latency and queue sizes.
These are all extremely valuable pieces of information. What I need to know is how to benchmark my (OpenStack) VMs against your Google tests. Then I can adjust my expectations of which tpmC values are realistic targets for my tests. In the end I need to understand which options I can use to optimise the whole setup, as was done for the SSD RAID config, and I want to learn which resources most affect Cockroach performance and where it is tolerant like a cockroach.. ;-)
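One way to get a like-for-like number for the raw disks (not something suggested in the thread, just a common approach, with tool choice and parameters as assumptions) is to run the same small random-write test in both environments and compare IOPS and latency.

```
# Hypothetical disk microbenchmark: run identically on an OpenStack VM and a
# GCE local-SSD node, then compare IOPS, latency, and their variance.
fio --name=randwrite --ioengine=libaio --direct=1 \
    --rw=randwrite --bs=4k --iodepth=32 --numjobs=4 \
    --size=2G --runtime=60 --time_based --group_reporting
```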
@nvanbenschoten is this closable for one of the other issues referenced here? |
Reported by @HeikoOnnebrink. Related to #18684, though it might not be identical since that error is expected to be infrequent, and this can be reproduced.
Heiko is attempting to reproduce our TPC-C benchmark using the 1000 warehouse test on a 3-node cluster. Each node has 16 cores, and the cluster is behind HAproxy.
After roughly 1 hour, the cluster becomes unstable and queries begin failing with transaction deadline exceeded errors.
Heiko, could you provide us with logs from your most recent test? Did you notice anything unusual in the admin UI prior to the start of the errors (e.g., steadily increasing SQL execution latency), or was the onset sudden?