
Consul 1.4.x high CPU #5295

Closed

mnuic opened this issue Jan 30, 2019 · 9 comments
Labels
needs-investigation - The issue described is detailed and complex.
theme/performance - Performance benchmarking or potential improvement

Comments

mnuic commented Jan 30, 2019

We have a dev Consul cluster that we use to test new versions of Consul before production. Today I upgraded Consul from v1.2.3 to v1.4.2 and am experiencing high CPU usage. I also downgraded to v1.4.0 and the behavior is the same.

We didn't experience this on v1.2.3. My first thought was that this was happening because of the new ACL migration, but that checked out OK. As a sanity check I rebuilt the cluster from scratch with all services and Consul v1.4.2, but the same thing happens.

Reproduction Steps

Steps to reproduce this issue:

  1. Upgrade consul to v1.4.2 (or v1.4.0) from v1.2.3

Consul info for both Client and Server

# consul info
agent:
	check_monitors = 0
	check_ttls = 0
	checks = 31
	services = 42
build:
	prerelease =
	revision = c97c712e
	version = 1.4.2
consul:
	acl = disabled
	bootstrap = true
	known_datacenters = 4
	leader = true
	leader_addr = 10.0.66.182:8300
	server = true
raft:
	applied_index = 561756
	commit_index = 561756
	fsm_pending = 0
	last_contact = 0
	last_log_index = 561756
	last_log_term = 7
	last_snapshot_index = 560326
	last_snapshot_term = 7
	latest_configuration = [{Suffrage:Voter ID:327d6f78-a871-47bc-99af-f32959834b5d Address:10.0.66.182:8300}]
	latest_configuration_index = 1
	num_peers = 0
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Leader
	term = 7
runtime:
	arch = amd64
	cpu_count = 8
	goroutines = 354
	max_procs = 8
	os = linux
	version = go1.11.4
serf_lan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 1
	event_time = 7
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 1
	members = 1
	query_queue = 0
	query_time = 1
serf_wan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 1870
	members = 10
	query_queue = 0
	query_time = 1

Operating system and Environment details

Ubuntu 16.04 LTS, official Consul Docker image

Log

These are the only logs I see, though I don't think they are relevant to this:

[WARN] consul: error getting server health from "XXX": context deadline exceeded
[WARN] consul: error getting server health from "XXX": last request still outstanding

pprof:

pprof001.svg.zip

consul-debug:

consul-debug-1548835619.tar.gz.zip.zip.001.TXT
consul-debug-1548835619.tar.gz.zip.zip.002.TXT

banks (Member) commented Jan 30, 2019

Thanks for the report @mnuic, we'll take a look through that data.

Can you give us a rough idea of the workload your servers were handling at the time - number of client agents and rough query load like DNS qps / API qps, etc.? Even a best guess is helpful to make sure we can reproduce. The debug data has some metrics in it, but the more context the better, as it can't currently capture the external workload in detail.

banks added the needs-investigation and theme/performance labels Jan 30, 2019
banks (Member) commented Jan 30, 2019

Also, is consul-debug-1548835619.tar.gz.zip.zip.001.TXT really the name of the file consul debug created, or was some extra processing done while transferring them? If so, that seems like a bug - sadly, I can't open those as zip or tar.gz files.

banks (Member) commented Jan 30, 2019

Never mind, we figured out they were split zip files in the end - thanks.

For others:

cat consul-debug-1548835619.tar.gz.zip.zip.001.TXT consul-debug-1548835619.tar.gz.zip.zip.002.TXT > consul-debug-1548835619.tar.gz.zip.zip

Extracts fine on macOS (all 4 layers of zip/tar!)

mnuic (Author) commented Jan 30, 2019

Hi @banks,

It's a small cluster, one server and one client, but it has around 40 services. It's a small version of our production, but it's good for testing.

As a rough idea, with the help of TRACE logging, it's about 1000 DNS requests per second on v1.4.2. On v1.2.3 it's about ~100 (checked on both the production and dev environments), so that is strange. I don't know if something changed in the latest version.

The TXT extension was added so I could upload the files. Remove the TXT extension, unzip, and untar. Yes, I had to find a way to deliver the logs to you :)

banks (Member) commented Jan 30, 2019

From the logs I can see that your server is answering the same DNS request for MySQL ~1000 times a second:

cat consul.log | rg '08:07:30 \[DEBUG\] dns:' | wc -l
    1001

Is that expected? Did you see the same on older versions? While it's not crazy, all the profiles point to DNS being where all the CPU is spent.


Aha, just saw your update. OK, so the issue is that for some reason the same clients are making a lot more DNS requests now.

I see from the config that you have DNS caching enabled; however, the default TTLs seem to be empty: "DNSServiceTTL": {}.

Can you check with dig whether the DNS server is now responding with a 0 TTL where it was higher before? It could be that your requests were cached before.
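
For example, something like this against the agent's DNS endpoint (a sketch - the default DNS port 8600 is assumed, as is this cluster's sd domain; the service name is illustrative):

# the second column of the ANSWER and SOA sections is the TTL in seconds;
# a 0 there means clients will not cache the response at all
dig @127.0.0.1 -p 8600 mssql.service.sd A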

But with so few clients, 1000/s still seems reasonably high, so this could be a symptom of something else - can you manually make DNS queries and check what's being returned? For example, maybe something changed in health checks, so that health is now failing and nothing is returned, and your clients are just spinning trying to resolve DNS? Or maybe we introduced a regression into DNS where the results are now corrupt, causing clients to retry a lot?

Still mostly guessing, as you can tell, but if you have an easy way to repro that, it could be very helpful!

FYI, we run "smoke" tests before every release, where a few simulated workloads run for a few hours and we check the graphs of CPU/memory etc. against previous versions; we didn't spot anything, so it seems at least somewhat workload-dependent!

banks (Member) commented Jan 30, 2019

Ah more interesting:

{
  "Name": "consul.health.service.not-found",
  "Count": 12790,
  "Sum": 12790,
  "Min": 1,
  "Max": 1,
  "Mean": 1,
  "Stddev": 0,
  "Labels": {
    "service": "mssql"
  }
},

It looks like DNS is hitting "service not found" 1000 times per second. We had a change in 1.4.1 (#4810) related to that, but I can't immediately think how it would affect DNS, and you said you see this on 1.4.0 too.

There were also changes in 1.4.0 to allow DNS prefix queries (#4605), which should have been backwards compatible but could maybe have introduced a regression.

If you are able to try 1.4.0 again and show the output of dig for A/AAAA mssql.service.sd, that would be super helpful.
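
Concretely (assuming the default DNS port 8600 - adjust the target address to your agent):

# check the status line (NOERROR vs NXDOMAIN) and what, if anything,
# comes back in the ANSWER section for each record type
dig @127.0.0.1 -p 8600 mssql.service.sd A
dig @127.0.0.1 -p 8600 mssql.service.sd AAAA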

Also, did you notice whether the services were actually registering and up? The logs include things like:

agent: Service "app1:statsd:8550:udp" in sync

But not mssql. However, we only have the logs for the one agent you ran the debug command on, so they don't show whether that mssql service is really there or not - could you check the logs on the host(s) that should have run it and see if they showed errors registering?

Even if the root cause is that the service wasn't there, it still seems to be a bug that DNS turns into a hot loop in that case - but if so, we can maybe reproduce it.

banks (Member) commented Jan 30, 2019

I also checked if the DNS requests were all coming from a few clients:

cat 1548835652/consul.log | rg -o 'from client 172\.\d+\.\d+\.\d+' | sort | uniq -c | sort -k 1 -h
   6 from client 172.17.0.2
34124 from client 172.17.0.27

So it looks like just one node in the cluster is sending all that DNS traffic - does that IP address help you narrow it down, @mnuic?

One more thing to try if at all possible: it's not clear whether the regression is in the DNS server or whether it's a secondary effect caused by something else failing to register mssql or keep it healthy.

For example, let's assume the regression is in how registration works, and your mssql service is now failing to register. Maybe this particular DNS client would always have caused this high load, even on the old version - you just happen never to have had all the mssql instances unavailable before?

You could verify that even without switching back to 1.4.x by temporarily marking all the mssql instances as unhealthy in Consul, or by deregistering them. If you see the same high DNS load when you do, then we know to focus on the registration/health issue of mssql and not on the DNS server.
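
Either would be a single call against the local agent's HTTP API - a sketch, assuming the default HTTP port 8500 and a service ID of mssql:

# mark the service critical via maintenance mode...
curl -X PUT 'http://127.0.0.1:8500/v1/agent/service/maintenance/mssql?enable=true'

# ...or deregister it outright
curl -X PUT http://127.0.0.1:8500/v1/agent/service/deregister/mssql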

mnuic (Author) commented Jan 30, 2019

Hi,

OK, you were right: mssql was missing from Consul. It's registered manually via the API.
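
For reference, that kind of manual registration is a single call to the agent's HTTP API - a minimal sketch, with the port and payload assumed:

# register an mssql service against the local agent; Name and Port are illustrative
curl -X PUT http://127.0.0.1:8500/v1/agent/service/register \
  -d '{"Name": "mssql", "Port": 1433}'

A registration made this way lives with the agent it was made against, so a cluster rebuilt from scratch won't have it until the call is repeated - which is presumably how it went missing.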

I rebuilt the whole cluster from scratch again and tried upgrading from v1.2.3 to v1.4.2, and this time it went fine with no problems. CPU is normal.

@banks thanks for the help, I obviously missed something before. Sorry for the trouble.

Just a proposal: if Consul hits "service not found" 1000 times per second, maybe it would be a good idea not to answer another 1000 such lookups the next second, but only a few, to see whether the service has appeared. Or something similar. That would reduce the high CPU.

You can close the issue.

mkeeler (Member) commented Jan 30, 2019

@mnuic This seems to have been caused by repeatedly requesting a service that didn't exist. Even in this scenario we could do better. If you were to set the dns_config.soa.min_ttl configuration to 1, it should cause DNS clients to wait at least 1 second before making the same DNS query again (you could go higher, but in your case even 1 would reduce requests to ~1/1000 of what they were originally). By default the dns_config.soa.min_ttl value is 0, which allows clients to immediately make the same query again. Also, a DNS client may choose not to respect the min TTL, in which case you will see the same results; most clients should respect it, though.

The documentation for the configuration is here: https://www.consul.io/docs/agent/options.html#soa_min_ttl

However, I just noticed some formatting issues on that page, so it may be a little hard to read right now.
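
Concretely, it's a one-line addition to the agent configuration - a minimal sketch, assuming a JSON config file:

{
  "dns_config": {
    "soa": {
      "min_ttl": 1
    }
  }
}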

mkeeler closed this as completed Jan 30, 2019