Too many DNS requests on cf-deployment, via loggregator 106.x #401

Closed
jyriok opened this issue Dec 27, 2019 · 19 comments
@jyriok

jyriok commented Dec 27, 2019

Hello,

We use loggregator via cf-deployment. Recently we updated our cluster from cf-deployment 12.1.0 to 12.20.0, which brought in loggregator 106.3.1, and we started to have a DNS issue: the log-api VMs are under heavy load and send far too many DNS requests (approx. 15k req/min).
Log:
[RequestLoggerHandler] 2019/12/27 14:30:15 INFO - handlers.DiscoveryHandler Request [33] [_grpclb._tcp.q-s0.doppler.default.cf.bosh.] 2 28000ns

I tried upgrading loggregator to 106.3.2 (the latest) but the issue remains.
I rolled back to an older loggregator release (105.6, keeping cf-deployment at 12.20.0) and the issue disappeared.

So it seems there is a bug in the 106.x loggregator series.

More info on the Cloud Foundry Slack:
https://cloudfoundry.slack.com/archives/C2U7KA7M4/p1576657695004400

Thanks for your help :)

@cf-gitbot

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/170456276

The labels on this github issue will be updated when the story is started.

@chentom88
Contributor

chentom88 commented Dec 27, 2019

@jyriok This issue is related to a fix we made to actually get the bosh-dns health check working for our Doppler instances. Can you confirm that the high CPU still happens after you have updated to bosh-dns 1.16? I'm sure you're aware, but in case you're not: bosh-dns is included in a runtime config in most environments, so changing your version of cf-deployment won't change it.

@jyriok
Author

jyriok commented Dec 29, 2019

@chentom88 Hello! Thanks for your answer. Yes, I'm aware of the issue fixed by bosh-dns 1.16 (using TLS 1.2 instead of 1.3), but I already use bosh-dns 1.16 :( so I think it's another issue.

Here is my bosh deployment with the releases I use:

Name:        cf
Stemcell(s): bosh-cloudstack-xen-ubuntu-xenial-go_agent/621.29
Team(s):     -
Release(s):  backup-and-restore-sdk/1.17.2, bosh-dns/1.16.0, bosh-dns-aliases/0.0.3, bpm/1.1.6, capi/1.89.0, cf-cli/1.23.0, cf-networking/2.27.0, cf-security-entitlement/1.0.5, cf-smoke-tests/40.0.123-ora, cf-syslog-drain/10.2.7, cflinuxfs3/0.151.0, credhub/2.5.9, diego/2.41.0, garden-runc/1.19.9, haproxy/9.8.0, log-cache/2.6.6, loggregator/106.3.2, loggregator-agent/5.3.1, minio/2018-11-17T01-23-48Z, minio-tools/1.0.3, nats/32, network-config/1.0.1, node-exporter/4.2.0, os-conf/21.0.0-ora, prometheus/26.1.0, pxc/0.21.0, routing/0.196.0, s3-volume/1.0.0, silk/2.27.0, statsd-injector/1.11.10, syslog/11.6.1, uaa/74.12.0

@tlwr

tlwr commented Dec 30, 2019

We're also seeing high load averages on log-api instances since upgrading to the latest cf-deployment, due to BOSH DNS:

  • Stemcell: 621.29
  • BOSH DNS release: 1.16.0
  • Loggregator release: 106.3.1

[Image: chart showing BOSH DNS CPU usage averaged by instance type]

The jump to 50% CPU usage for log-api instances seems to correspond to moving from Loggregator 105.6 to 106.2.1, or to upgrading past stemcell 621.23.

The jump for all other instances from ~0% CPU to ~2-5% CPU seems to correspond to upgrading BOSH-DNS from 1.8 to 1.16.

My suspicion is that this is caused by changes affecting DNS that landed in Go 1.13, or by recent bumps to the gRPC libraries. I tcpdumped DNS requests over a period of 1 minute, counting requests, and saw the following:

   8202 SRV?	_grpclb._tcp.q-s0.doppler.cf.tlwr.bosh.
   8201 TXT?	q-s0.doppler.cf.tlwr.bosh.

which seems like undesirable behaviour; the number of DNS requests also seems to scale with the number of Doppler instances deployed.
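
For reference, this query pattern is consistent with grpc-go's DNS resolver when a client is dialed with a dns:/// target. A minimal sketch, assuming a hypothetical bosh-dns name and port (not taken from our manifests), of a client that would generate similar SRV/TXT lookups:

    // Minimal sketch, not the Loggregator source: a grpc-go client dialed with a
    // dns:/// target. The resolver re-resolves the name periodically; resolver
    // versions from this era also issued an SRV lookup for _grpclb._tcp.<host>
    // and a TXT lookup for the gRPC service config on each re-resolution, which
    // matches the tcpdump output above.
    package main

    import (
        "log"

        "google.golang.org/grpc"
    )

    func main() {
        conn, err := grpc.Dial(
            "dns:///q-s0.doppler.cf.example.bosh:8082", // hypothetical bosh-dns group name
            grpc.WithInsecure(),
            // Disabling service-config lookup suppresses the TXT queries; whether
            // the SRV queries stop depends on the grpc-go version in use.
            grpc.WithDisableServiceConfig(),
        )
        if err != nil {
            log.Fatalf("dial: %v", err)
        }
        defer conn.Close()
        // ... open egress streams to Doppler here ...
    }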

@chentom88
Contributor

chentom88 commented Dec 30, 2019

@jyriok and @tlwr, I believe the spike in CPU on both the doppler and log-api instances is a result of this commit - 627e686. It seems having a working bosh-dns health check is causing high CPU due to bosh-dns spinning some sort of check on both VMs. @tlwr, your assertion that the problem scales with the number of doppler instances makes sense, in that the log-api will attempt to make a connection to each doppler instance available, thereby increasing the number of queries to bosh-dns. We will investigate; I have created this bug in our backlog - https://www.pivotaltracker.com/story/show/170474149.
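
To illustrate that scaling point, here is a rough sketch (hypothetical helper and per-instance addresses, not the trafficcontroller source) in which one gRPC connection is dialed per Doppler; each connection runs its own dns:/// resolver with its own re-resolution schedule, so the bosh-dns query rate grows with the instance count:

    package main

    import (
        "fmt"
        "log"

        "google.golang.org/grpc"
    )

    // dialDopplers opens one client connection per Doppler address. Every Dial
    // gets its own DNS resolver and re-resolution schedule, so each additional
    // Doppler adds its own stream of bosh-dns lookups.
    func dialDopplers(addrs []string) ([]*grpc.ClientConn, error) {
        conns := make([]*grpc.ClientConn, 0, len(addrs))
        for _, addr := range addrs {
            cc, err := grpc.Dial("dns:///"+addr, grpc.WithInsecure())
            if err != nil {
                return nil, fmt.Errorf("dial %s: %v", addr, err)
            }
            conns = append(conns, cc)
        }
        return conns, nil
    }

    func main() {
        // Hypothetical per-instance bosh-dns names, for illustration only.
        addrs := []string{
            "doppler-0.doppler.cf.example.bosh:8082",
            "doppler-1.doppler.cf.example.bosh:8082",
        }
        conns, err := dialDopplers(addrs)
        if err != nil {
            log.Fatalf("dial dopplers: %v", err)
        }
        log.Printf("opened %d connections", len(conns))
    }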

@TheMoves

TheMoves commented Jan 6, 2020

Also been seeing this issue, currently running cf-deployment 12.15.

We had to create iptables rules on the 'log-api' servers to block outbound port 53 traffic to our Infoblox controllers, as the SRV/TXT requests were burying them ... After enabling the iptables rules, the CPU load is still present.

Commenting out the external DNS resolvers and leaving the BOSH-DNS IP in /etc/resolv.conf did not fix the CPU load.

@tlwr

tlwr commented Jan 17, 2020

This looks like it will be fixed in the next loggregator-release

@TheMoves

Need bosh-dns 1.17.0 as well, I believe:
https://github.com/cloudfoundry/bosh-dns-release/releases/tag/v1.17.0

@andrejev

Hey @tlwr, we also see this happening in our environment.
Is the fixed loggregator-release you are referring to Loggregator 106.3.6?
I cannot seem to find any information about this issue in the changelog of that release.

@tlwr

tlwr commented Jan 22, 2020

I haven't tested 106.3.6 yet and I'm not sure that release will fix it; I'll deploy it to my development environment and see.

As I understand it, we want any release with this commit: 6d66cf7


EDIT

I have deployed 106.3.6 to my dev environment and see no noticeable change in CPU utilization in log-api instances.

However, deploying the bosh-dns v1.17.0 change last week did have an effect on CPU utilization.


From this, I can conclude that the latest bosh-dns release makes log-api use less CPU, but not as little as before the change which started this issue.

tlwr pushed a commit to alphagov/paas-cf that referenced this issue Jan 22, 2020
This release:

* https://github.com/cloudfoundry/bosh-dns-release/releases/tag/v1.17.0

Contains the following fixes:

* stop redundant health check retries between poll intervals
* stop redundant health check requests when querying a domain
* Fix compilation error on windows due to broken symlinks
* health server should shutdown cleanly on sigterm

The following fixes:

* stop redundant health check retries between poll intervals
* stop redundant health check requests when querying a domain

cause loggregator component trafficcontroller's very chatty dns
healthcheck to be cached by bosh-dns, which will reduce the currently
high cpu usage (there is another fix coming in a later loggregator
release)

See cloudfoundry/loggregator-release#401

Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk>
tlwr pushed a commit to alphagov/paas-cf that referenced this issue Jan 24, 2020
@tlwr

tlwr commented Jan 24, 2020

The CPU utilisation improvements have gone to production and have improved the situation; however, utilisation has not returned to prior levels.

@pianohacker
Contributor

@tlwr We still need to cut 106.3.7 with this fix. We will be doing so very soon.

@cf-gitbot

We have created an issue in Pivotal Tracker to manage this. Unfortunately, the Pivotal Tracker project is private so you may be unable to view the contents of the story.

The labels on this github issue will be updated when the story is started.

@friegger

@tlwr

tlwr commented Jan 29, 2020

No, I shouldn't have referenced the version number explicitly - my bad.

The loggregator team(s) seem to have a continuous release process which automatically bumps relevant modules, without necessarily taking all the changes from develop.

As you can see from the difference between the latest release and the develop branch, the commits which allegedly fix the issue are still not in a public final release.

@friegger

Ok, thanks. Do you have an estimate when this will be available?

@tlwr

tlwr commented Jan 31, 2020

I think we can close this issue, as the release notes for the latest release suggest this is now fixed

tlwr pushed a commit to alphagov/paas-cf that referenced this issue Jan 31, 2020
@pianohacker
Contributor

Confirming that 106.3.8 fixes this bug.

@Benjamintf1
Member

Just to be clear, there are two DNS issues here:
1) 7e772ac: a logging issue due to bosh-dns and service requests. Loggregator released a fix in 106.3.10; it looks like it did not get backported, but we aren't seeing this bug right now when we checked an environment on 2.8.
2) 6d66cf7: too many DNS requests caused by the resolver. This was fixed in 106.3.8, and was also backported to previous versions correctly.
