Consul 1.4.x high CPU #5295
Comments
Thanks for the report @mnuic, we'll take a look through that data. Can you give us a rough idea of the workload your servers were handling at the time: number of client agents and rough query load like DNS qps, API qps, etc.? Even a best guess is helpful to make sure we can reproduce. The debug data has some metrics in it, but the more context the better, as it can't currently capture the external workload in detail.
Also is |
Never mind, we figured out they were split zip files in the end - thanks. For others:
cat consul-debug-1548835619.tar.gz.zip.zip.001.TXT consul-debug-1548835619.tar.gz.zip.zip.002.TXT > consul-debug-1548835619.tar.gz.zip.zip
Extracts fine on macOS (all 4 layers of zip/tar!)
Hi @banks, it's a small cluster, one server and one client, but it has around 40 services. It's a small version of our production, but it's good for testing. As a rough idea, and with the help of TRACE logging, it is about 1000 DNS requests per second on v1.4.2. On v1.2.3 it's about ~100 (checked on the production and dev environments), so that is strange. I don't know if something changed in the last version. The TXT extension was added so that I could upload the files. Remove TXT, unzip and untar. Yes, I had to find a way to deliver the logs to you :)
From the logs I can see that your server is answering the same DNS request for MySQL ~1000 times a second:
Is that expected? Did you see the same on older versions? While it's not crazy, all the profiles point to DNS being where all the CPU is spent.
Aha, just saw your update. OK, so the issue is that for some reason the same clients are making a lot more DNS requests now. I see from the config you have DNS caching enabled, however the default TTLs seem to be empty. Can you check with
But with so few clients 1000/s seems reasonably high still, so it could be that this is a symptom of something else - can you manually make DNS queries and check what's being returned? For example, maybe something changed in health checks and so now health is failing and is not returning anything and your clients are just spinning trying to resolve DNS? Or maybe we introduced a regression into DNS where the results are now corrupt and so causing clients to retry a lot?
Still mostly guessing, as you can tell, but if you have an easy way to repro that it could be very helpful! FYI we run "smoke" tests before every release where we have a few simulated workloads that run for a few hours and we check the graphs of CPU/memory etc. against previous versions and didn't spot anything, so it seems at least somewhat workload dependent!
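For anyone following along without dig at hand, the manual check suggested above can be sketched in Python using only the standard library. This is a minimal illustration, not part of Consul: the helper names (build_query, summarize, query) are hypothetical, and it assumes the agent serves DNS over UDP on 127.0.0.1 port 8600 (Consul's default DNS port) and that the service is named mssql, as it is later in this thread. It only reads the response header, which is enough to tell an empty NXDOMAIN answer from a healthy one.

```python
import socket
import struct
from typing import Tuple

# Response codes from the low 4 bits of the DNS header flags.
RCODES = {0: "NOERROR", 2: "SERVFAIL", 3: "NXDOMAIN", 5: "REFUSED"}


def build_query(name: str, qtype: int = 1, query_id: int = 0x1234) -> bytes:
    """Encode a single-question DNS query (qtype 1 = A record, class IN)."""
    # Header: id, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Each label is written as a length byte followed by its ASCII text.
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    )
    return header + labels + b"\x00" + struct.pack("!HH", qtype, 1)


def summarize(response: bytes) -> Tuple[str, int]:
    """Return (rcode name, answer count) read from a raw DNS response header."""
    _id, flags, _qd, ancount, _ns, _ar = struct.unpack("!HHHHHH", response[:12])
    rcode = flags & 0x000F
    return RCODES.get(rcode, str(rcode)), ancount


def query(name: str, server: str = "127.0.0.1", port: int = 8600,
          timeout: float = 2.0) -> Tuple[str, int]:
    """Send one UDP query to the (assumed) Consul DNS endpoint and summarize it."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name), (server, port))
        response, _ = sock.recvfrom(4096)
    return summarize(response)

# Example (needs a reachable agent): print(query("mssql.service.consul"))
```

A result of ("NXDOMAIN", 0) for a service that should be registered would point at registration or health rather than at the DNS server itself.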
Ah more interesting:
Looks like that DNS is hitting "service not found" 1000 times per second. We had a change in 1.4.1 #4810 related to that but I can't think immediately how it would affect DNS and you said you see this on 1.4.0. There were also changes in 1.4.0 to allow DNS prefix queries: #4605 which should have been backwards compatible but could maybe have introduced a regression. If you are able to try 1.4.0 again and show output of dig for Also, did you notice that the services were actually registering and up? Logs include things like:
But not
Even if the root cause is that the service wasn't there, it still seems to be a bug that DNS turns into a hot loop in that case, but we can maybe reproduce in that case.
I also checked if the DNS requests were all coming from a few clients:
cat 1548835652/consul.log | rg -o 'from client 172\.\d+\.\d+\.\d+' | sort | uniq -c | sort -k 1 -h
6 from client 172.17.0.2
34124 from client 172.17.0.27
So it looks like just one node in the cluster is sending all that DNS traffic. Does that IP address help you narrow it down, @mnuic?
One more thing to try if at all possible: it's not clear if the regression is in the DNS server or if it's a secondary effect caused by something else failing to register. For example, let's assume the regression was in how registration works and now your
You could verify that even without switching back to 1.4.x by temporarily marking all the mssql instances as unhealthy in Consul or deregistering them. Do you see the same high DNS load when you do? If yes, then we know to focus on the registration/health issue of
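If ripgrep isn't available, the per-client count from the shell pipeline earlier in this comment can be approximated in Python. This is only a sketch: the function names are hypothetical, and it assumes nothing about the log format beyond the "from client <IPv4>" fragment that the rg pattern in this comment matches (it does not restrict to 172.x addresses).

```python
import re
from collections import Counter

# Same fragment the rg pattern looks for, with the address captured.
CLIENT_RE = re.compile(r"from client (\d+\.\d+\.\d+\.\d+)")


def count_clients(lines):
    """Count how many times each client IP appears in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        # findall mirrors `rg -o`: every match on the line is counted.
        counts.update(CLIENT_RE.findall(line))
    return counts


def count_clients_in_file(path):
    """Convenience wrapper that streams a log file from disk."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return count_clients(handle)

# Example: count_clients_in_file("1548835652/consul.log").most_common()
```

most_common() lists the noisiest client first, which is the opposite order from the sort in the pipeline but shows the same skew.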
Hi, OK, you were right: mssql was missing from Consul. It's manually registered via the API. I rebuilt the whole cluster from scratch again and tried upgrading from v1.2.3 to v1.4.2, and it passed with no problems. CPU is normal. @banks thanks for the help, I obviously missed something before. Sorry for the trouble. Just a proposal: if Consul hits "service not found" 1000 times per second, maybe it would be a good idea for it not to check 1000 more times the next second, but only a few times to see whether the service has been found. Or something similar. That would reduce the high CPU. You can close the issue.
@mnuic This seems to have been caused by repeatedly requesting a service that didn't exist. Even in this scenario we could do better. If you were to set the
The documentation for the configuration is here: https://www.consul.io/docs/agent/options.html#soa_min_ttl
However, I just noticed some formatting issues with that, so it may be a little hard to read right now.
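The general idea behind both the proposal and the linked option, as far as its anchor name suggests, is negative caching: a resolver remembers a "not found" answer for a short TTL instead of asking again immediately. The sketch below illustrates that idea on the client side only. It is not Consul's implementation, and NegativeCache, resolve and the lookup callback are hypothetical names introduced for illustration.

```python
import time


class NegativeCache:
    """Remembers names that recently resolved to nothing, for ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so the behaviour can be tested
        self._expires = {}       # name -> time at which the entry expires

    def is_known_missing(self, name):
        expires = self._expires.get(name)
        if expires is None:
            return False
        if self._clock() >= expires:
            # Entry is stale: forget it so the next lookup goes upstream.
            del self._expires[name]
            return False
        return True

    def mark_missing(self, name):
        self._expires[name] = self._clock() + self._ttl


def resolve(name, lookup, cache):
    """Resolve name via lookup(name), which returns a list (empty if not found)."""
    if cache.is_known_missing(name):
        return []                # answered locally, no upstream query
    result = lookup(name)
    if not result:
        cache.mark_missing(name)
    return result
```

With a TTL of a few seconds, a client stuck in a retry loop for a missing service would reach the server once per TTL window rather than on every attempt.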
We have a dev Consul cluster that we use to test new versions of Consul before production. Today I upgraded Consul from v1.2.3 to v1.4.2 and experienced high CPU. I also downgraded to v1.4.0 and the behavior is the same.
We didn't experience this on v1.2.3. My first thought was that this was happening because of the new ACL migration, but that was OK. As a sanity check I rebuilt the cluster from scratch with all services and Consul v1.4.2, but the same thing happens.
Reproduction Steps
Steps to reproduce this issue, eg:
Consul info for both Client and Server
Operating system and Environment details
Ubuntu 16.04 LTS, official Consul Docker image
Log
These are the only logs that I see, but I don't think they are relevant to this:
pprof:
pprof001.svg.zip
consul-debug:
consul-debug-1548835619.tar.gz.zip.zip.001.TXT
consul-debug-1548835619.tar.gz.zip.zip.002.TXT