Too many DNS requests on cf-deployment, via loggregator 106.x #401

Closed
jyriok opened this issue Dec 27, 2019 · 19 comments
@jyriok

jyriok commented Dec 27, 2019

Hello,

We use loggregator via cf-deployment. Recently we updated our cluster from cf-deployment 12.1.0 to 12.20.0, which brought in loggregator 106.3.1, and we started to have a DNS issue: the log-api VMs are under heavy load and send far too many DNS requests (approx. 15k req/min).
Log:
[RequestLoggerHandler] 2019/12/27 14:30:15 INFO - handlers.DiscoveryHandler Request [33] [_grpclb._tcp.q-s0.doppler.default.cf.bosh.] 2 28000ns

I tried upgrading loggregator to 106.3.2 (the latest) but the issue remains.
I rolled back to an older loggregator release (105.6, keeping cf-deployment at 12.20.0) and the issue disappeared.

So it seems there is a bug in the 106.x loggregator series.

More info on the Cloud Foundry Slack:
https://cloudfoundry.slack.com/archives/C2U7KA7M4/p1576657695004400

Thanks for your help :)

@cf-gitbot

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/170456276

The labels on this github issue will be updated when the story is started.

@chentom88
Contributor

chentom88 commented Dec 27, 2019

@jyriok This issue is related to a fix we made to actually get the bosh-dns health check working for our Doppler instances. Can you confirm that the high CPU still happens after you have updated to bosh-dns 1.16? I'm sure you're aware, but in case you're not: bosh-dns is included in a runtime config in most environments, so changing your version of cf-deployment won't change it.

@jyriok
Author

jyriok commented Dec 29, 2019

@chentom88 Hello! Thanks for your answer. Yes, I'm aware of the issue fixed by bosh-dns 1.16 (using TLS 1.2 instead of 1.3), but I already use bosh-dns 1.16 :( so I think it's another issue.

Here is my bosh deployment with the releases I use:

Name:        cf
Stemcell(s): bosh-cloudstack-xen-ubuntu-xenial-go_agent/621.29
Team(s):     -
Release(s):  backup-and-restore-sdk/1.17.2, bosh-dns/1.16.0, bosh-dns-aliases/0.0.3, bpm/1.1.6, capi/1.89.0, cf-cli/1.23.0, cf-networking/2.27.0, cf-security-entitlement/1.0.5, cf-smoke-tests/40.0.123-ora, cf-syslog-drain/10.2.7, cflinuxfs3/0.151.0, credhub/2.5.9, diego/2.41.0, garden-runc/1.19.9, haproxy/9.8.0, log-cache/2.6.6, loggregator/106.3.2, loggregator-agent/5.3.1, minio/2018-11-17T01-23-48Z, minio-tools/1.0.3, nats/32, network-config/1.0.1, node-exporter/4.2.0, os-conf/21.0.0-ora, prometheus/26.1.0, pxc/0.21.0, routing/0.196.0, s3-volume/1.0.0, silk/2.27.0, statsd-injector/1.11.10, syslog/11.6.1, uaa/74.12.0

@tlwr

tlwr commented Dec 30, 2019

We're also seeing high load averages on log-api instances since upgrading to the latest cf-deployment, due to BOSH DNS:

  • Stemcell: 621.29
  • BOSH DNS release: 1.16.0
  • Loggregator release: 106.3.1

[Image: chart showing BOSH DNS CPU usage averaged by instance type]

The jump to 50% CPU usage for log-api instances seems to correspond to moving from Loggregator 105.6 to 106.2.1, or to upgrading past stemcell 621.23.

The jump for all other instances from ~0% CPU to ~2-5% CPU seems to correspond to upgrading BOSH-DNS from 1.8 to 1.16.

My suspicion is that this is caused by changes affecting DNS that landed in Go 1.13, or by recent bumps to the gRPC libraries. I tcpdumped DNS requests over a period of 1 minute, counting requests, and saw the following:

   8202 SRV?	_grpclb._tcp.q-s0.doppler.cf.tlwr.bosh.
   8201 TXT?	q-s0.doppler.cf.tlwr.bosh.

which seems like undesirable behaviour; the number of DNS requests also seems to scale with the number of Doppler instances deployed.
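
For reference, this query pattern is consistent with grpc-go's DNS resolver when a client is dialed with a dns:/// target. A minimal sketch, assuming a hypothetical bosh-dns name and port (not taken from our manifests), of a client that would generate similar SRV/TXT lookups:

    // Minimal sketch, not the Loggregator source: a grpc-go client dialed with a
    // dns:/// target. The resolver re-resolves the name periodically; resolver
    // versions from this era also issued an SRV lookup for _grpclb._tcp.<host>
    // and a TXT lookup for the gRPC service config on each re-resolution, which
    // matches the tcpdump output above.
    package main

    import (
        "log"

        "google.golang.org/grpc"
    )

    func main() {
        conn, err := grpc.Dial(
            "dns:///q-s0.doppler.cf.example.bosh:8082", // hypothetical bosh-dns group name
            grpc.WithInsecure(),
            // Disabling service-config lookup suppresses the TXT queries; whether
            // the SRV queries stop depends on the grpc-go version in use.
            grpc.WithDisableServiceConfig(),
        )
        if err != nil {
            log.Fatalf("dial: %v", err)
        }
        defer conn.Close()
        // ... open egress streams to Doppler here ...
    }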

@chentom88
Contributor

chentom88 commented Dec 30, 2019

@jyriok and @tlwr, I believe the spike in CPU on both the doppler and log-api instances is a result of this commit - 627e686. It seems having a working bosh-dns health check is causing high CPU due to bosh-dns spinning some sort of check on both VMs. @tlwr, your assertion that the problem scales with the number of doppler instances makes sense, in that the log-api will attempt to make a connection to each doppler instance available, thereby increasing the number of queries to bosh-dns. We will investigate; I have created this bug in our backlog - https://www.pivotaltracker.com/story/show/170474149.
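
To illustrate that scaling point, here is a rough sketch (hypothetical helper and per-instance addresses, not the trafficcontroller source) in which one gRPC connection is dialed per Doppler; each connection runs its own dns:/// resolver with its own re-resolution schedule, so the bosh-dns query rate grows with the instance count:

    package main

    import (
        "fmt"
        "log"

        "google.golang.org/grpc"
    )

    // dialDopplers opens one client connection per Doppler address. Every Dial
    // gets its own DNS resolver and re-resolution schedule, so each additional
    // Doppler adds its own stream of bosh-dns lookups.
    func dialDopplers(addrs []string) ([]*grpc.ClientConn, error) {
        conns := make([]*grpc.ClientConn, 0, len(addrs))
        for _, addr := range addrs {
            cc, err := grpc.Dial("dns:///"+addr, grpc.WithInsecure())
            if err != nil {
                return nil, fmt.Errorf("dial %s: %v", addr, err)
            }
            conns = append(conns, cc)
        }
        return conns, nil
    }

    func main() {
        // Hypothetical per-instance bosh-dns names, for illustration only.
        addrs := []string{
            "doppler-0.doppler.cf.example.bosh:8082",
            "doppler-1.doppler.cf.example.bosh:8082",
        }
        conns, err := dialDopplers(addrs)
        if err != nil {
            log.Fatalf("dial dopplers: %v", err)
        }
        log.Printf("opened %d connections", len(conns))
    }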

@TheMoves

TheMoves commented Jan 6, 2020

Also been seeing this issue, currently running cf-deployment 12.15.

We had to create iptables rules on the 'log-api' servers to block outbound port 53 traffic to our Infoblox controllers, as the SRV/TXT requests were burying them ... After enabling the iptables rules, the CPU load is still present.

Commenting out the external DNS resolvers and leaving the BOSH-DNS IP in /etc/resolv.conf did not fix the CPU load.

@tlwr

tlwr commented Jan 17, 2020

This looks like it will be fixed in the next loggregator-release

@TheMoves

Need bosh-dns 1.17.0 as well, I believe:
https://github.com/cloudfoundry/bosh-dns-release/releases/tag/v1.17.0

@andrejev

Hey @tlwr, we also see this happening in our environment.
Is the fixed loggregator-release you are referring to Loggregator 106.3.6?
I cannot seem to find any information about this issue in the changelog of that release.

@tlwr

tlwr commented Jan 22, 2020

I haven't tested 106.3.6 yet and I'm not sure that release will fix it; I'll deploy it to my development environment and see.

As I understand it, we want any release with this commit: 6d66cf7


EDIT

I have deployed 106.3.6 to my dev environment and see no noticeable change in CPU utilization in log-api instances.

However, deploying the bosh-dns v1.17.0 change last week did have an effect on CPU utilization.


From this, I can conclude that the latest bosh-dns release makes log-api use less CPU, but not as little as before the change which started this issue.

tlwr pushed a commit to alphagov/paas-cf that referenced this issue Jan 22, 2020
This release:

* https://github.com/cloudfoundry/bosh-dns-release/releases/tag/v1.17.0

Contains the following fixes:

* stop redundant health check retries between poll intervals
* stop redundant health check requests when querying a domain
* Fix compilation error on windows due to broken symlinks
* health server should shutdown cleanly on sigterm

The following fixes:

* stop redundant health check retries between poll intervals
* stop redundant health check requests when querying a domain

cause loggregator component trafficcontroller's very chatty dns
healthcheck to be cached by bosh-dns, which will reduce the currently
high cpu usage (there is another fix coming in a later loggregator
release)

See cloudfoundry/loggregator-release#401

Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk>
tlwr pushed a commit to alphagov/paas-cf that referenced this issue Jan 24, 2020
@tlwr

tlwr commented Jan 24, 2020

The CPU utilisation improvements have gone to production and have improved the situation; however, utilisation has not returned to prior levels.

@pianohacker
Contributor

@tlwr We still need to cut 106.3.7 with this fix. We will be doing so very soon.

@cf-gitbot

We have created an issue in Pivotal Tracker to manage this. Unfortunately, the Pivotal Tracker project is private so you may be unable to view the contents of the story.

The labels on this github issue will be updated when the story is started.

@friegger

@tlwr

tlwr commented Jan 29, 2020

No, I shouldn't have referenced the version number explicitly - my bad.

The loggregator team(s) seem to have a continuous release process which automatically bumps relevant modules, without necessarily taking all the changes from develop.

As you can see from the difference between the latest release and the develop branch, the commits which allegedly fix the issue are still not in a public final release.

@friegger

Ok, thanks. Do you have an estimate when this will be available?

@tlwr

tlwr commented Jan 31, 2020

I think we can close this issue, as the release notes for the latest release suggest this is now fixed

tlwr pushed a commit to alphagov/paas-cf that referenced this issue Jan 31, 2020
@pianohacker
Contributor

Confirming that 106.3.8 fixes this bug.

@Benjamintf1
Member

Just to be clear, there are two DNS issues here:
1) 7e772ac: a logging issue due to bosh-dns and service requests. Loggregator released a fix in 106.3.10; it looks like it did not get backported, but we aren't seeing this bug right now when we checked an environment on 2.8.
2) 6d66cf7: too many DNS requests caused by the resolver. This was fixed in 106.3.8, and was also backported to previous versions correctly.
