
Use of upstreams increases DNS Issues. #6812

Closed
vaibhavkhurana2018 opened this issue Feb 8, 2021 · 9 comments · Fixed by #7002

Labels: core/balancer, pending author feedback, stale

Comments

@vaibhavkhurana2018

Summary

2021/02/08 14:12:25 [error] 25#0: *49573 [lua] balancer.lua:921: execute(): DNS resolution failed: dns server error: 3 name error. Tried: ["(short)api-service:(na) - cache-miss","api-service.edge.svc.cluster.local:33 - cache-hit/stale/scheduled/dns server error: 3 name error","api-service.svc.cluster.local:33 - cache-hit/stale/scheduled/dns server error: 3 name error","api-service.cluster.local:33 - cache-hit/stale/scheduled/dns server error: 3 name error","api-service.ap-south-1.compute.internal:33 - cache-hit/stale/scheduled/dns server error: 3 name error","api-service:33 - cache-hit/stale/scheduled/dns server error: 3 name error","api-service.edge.svc.cluster.local:1 - cache-hit/stale/scheduled/dns server error: 3 name error","api-service.svc.cluster.local:1 - cache-hit/stale/scheduled/dns server error: 3 name error","api-service.cluster.local:1 - cache-hit/stale/scheduled/dns server error: 3 name error","api-service.ap-south-1.compute.internal:1 - cache-hit/stale/scheduled/dns server error: 3 name error","api-service:1 - cache-hit/stale/scheduled/dns server error: 3 name error","api-service.edge.svc.cluster.local:5 - cache-hit/stale/scheduled/dns server error: 3 name error","api-service.svc.cluster.local:5 - cache-hit/stale/scheduled/dns server error: 3 name error","api-service.cluster.local:5 - cache-hit/stale/scheduled/dns server error: 3 name error","api-service.ap-south-1.compute.internal:5 - cache-hit/stale/scheduled/dns server error: 3 name error","api-service:5 - cache-hit/stale/scheduled/dns server error: 3 name error"], client: 52.66.95.207, server: kong, request: "GET / HTTP/1.1", host: "<host>", referrer: "https://<host>"


Steps To Reproduce

  1. Create an upstream with targets.
  2. Create a service and add the upstream as its host (see the Admin API sketch below).
  3. Check the Kong error log.
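
For reference, a minimal reproduction sketch using the Admin API. It assumes the Admin API listens on localhost:8001 (the default); the entity names, IP and port are illustrative:

    # 1. Create an upstream and add a target to it
    curl -i -X POST http://localhost:8001/upstreams --data name=api-service
    curl -i -X POST http://localhost:8001/upstreams/api-service/targets --data target=10.0.0.10:8080

    # 2. Create a service whose host is the upstream name, plus a route to reach it
    curl -i -X POST http://localhost:8001/services --data name=api --data host=api-service --data port=8080
    curl -i -X POST http://localhost:8001/services/api/routes --data 'paths[]=/api'

    # 3. Send traffic through the proxy and watch the error log
    curl -i http://localhost:8000/api
    tail -f /usr/local/kong/logs/error.log   # path depends on your deployment; containers usually log to stderr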

Additional Details & Logs

  • Kong version ($ kong version): 2.0.3

Possible Solution:

  • Provide a way to restrict resolution of upstream names to Kong itself, or let a service declare that its host is an upstream, since an upstream is a virtual entity that exists only within Kong.
@vaibhavkhurana2018 (Author)

@Tieske Tagging you based on your responses on other DNS-related issues. Thanks!

@Tieske (Member) commented Feb 8, 2021

I think this was already resolved. It has nothing to do with the actual DNS resolution; it is a synchronisation issue: DNS resolution (including upstream names) is attempted before the upstream entity becomes available. That makes the upstream lookup fail, which falls through to the actual DNS client, which then starts querying the name servers.

@bungle @kikito might have a better idea of when exactly this was fixed.

@mlatimer-figure

We are seeing this issue as well. However we are on the latest version of Kong (2.3.1) and the Kong Ingress Controller (1.1.1). Issue opened here: #6807

@vaibhavkhurana2018 (Author)

@kikito Can you tell us whether this was fixed in some version? We are facing these issues in our environment and can't find a workaround for this use case. Thanks!

@hugoShaka commented Feb 12, 2021

We moved to upstreams this week and encountered several DNS performance issues, including this one. Thanks for raising this issue :)

How to reproduce

  • create Kong proxies (here we'll be on Kubernetes)
  • create a route/service/upstream/target: here both the service and the upstream are called fleet-system.fleet-kube-monitor-alertmanager-s2s-alerts.rule-0, and the target is an IP address
  • capture DNS traffic
  • restart the Kong instances (kubectl delete on all pods; a command sketch follows below)
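
A sketch of these capture-and-restart steps on Kubernetes. The namespace and label are assumptions; adjust them to your deployment:

    # list the Kong pods (assuming namespace "kong" and label app=kong)
    kubectl -n kong get pods -l app=kong

    # capture DNS traffic from inside one Kong pod (assumes tcpdump is available in the image)
    kubectl -n kong exec -it <kong-pod-name> -- tcpdump -ni any port 53

    # restart all Kong pods to trigger the startup resolution spike
    kubectl -n kong delete pod -l app=kong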

What we observed

  • the restart comes with a spike of resolutions from Kong (the spike height depends on how many services/upstreams you have)
    (screenshot)
  • the resolutions are full of Kong trying to resolve its upstream names
    (screenshot)
  • the impact is amplified by Kubernetes' default resolv.conf settings: because of ndots:5, each resolution is amplified by appending every search domain to the name being resolved, as in the screenshot (an example follows after this list)
  • once fully started, Kong stops trying to resolve its upstream names
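
To illustrate the amplification, here is a sketch of what the in-pod resolv.conf and the resulting query fan-out look like. The search domains below match the error log quoted in the issue summary; your cluster's values will differ:

    # /etc/resolv.conf inside the pod (illustrative)
    search edge.svc.cluster.local svc.cluster.local cluster.local ap-south-1.compute.internal
    options ndots:5

    # With ndots:5, a short name like "api-service" has fewer than 5 dots, so every search
    # domain is appended before the bare name is tried:
    #   api-service.edge.svc.cluster.local
    #   api-service.svc.cluster.local
    #   api-service.cluster.local
    #   api-service.ap-south-1.compute.internal
    #   api-service
    # Kong's DNS client also repeats the lookup per record type (SRV=33, A=1, CNAME=5 in the
    # log above), which is why a single unresolvable name produces 15 failed queries.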

Details

Kong version 2.1.2 running on GKE, not using the Kong Ingress Controller.

Other considerations and workarounds

  • We noticed a significant increase in DNS resolutions when using upstreams compared to putting bare hostnames directly in the services; this was not linked to this bug and seems to be the default behaviour. We did not investigate further and put IPs directly in the targets to get rid of the resolution step.
  • The impact of the startup resolution spike on Kubernetes can be mitigated by setting ndots to 0 on all Kong pods (a sketch follows below)
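
One way to set that, assuming the Kong pods are managed by a Deployment named "kong" in a namespace also named "kong" (both names are illustrative), is through the pod template's dnsConfig:

    kubectl -n kong patch deployment kong --type merge -p \
      '{"spec":{"template":{"spec":{"dnsConfig":{"options":[{"name":"ndots","value":"0"}]}}}}}'

With ndots at 0 the resolver tries names as given before any search-domain expansion, so it may then be safer to use fully qualified names (e.g. name.namespace.svc.cluster.local) in Kong services and targets rather than relying on the search list.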

@locao (Contributor) commented Mar 5, 2021

Hi @hugoShaka! Thanks for your report. This is a different problem from the one originally reported: it is related to DNS warm-up and should be fixed once #6891 is merged. We will still see DNS resolutions when starting Kong, but only for hosts that are not upstream names, which is the problem you are facing, right?

You may want to create a new issue for that if you want to follow the progress of that PR.

@locao (Contributor) commented Mar 5, 2021

Hello @vaibhavkhurana2018 @mlatimer-figure! Thanks for pointing that out. Today we released Kong 2.3.3, which includes #6833. In that PR we made some changes that address upstream name usage, which used to cause the reported problem at times when the balancer was under high load. Could you please test this version?

locao added the pending author feedback label on Mar 10, 2021
ghost pushed a commit that referenced this issue on Apr 5, 2021 (#5831): "Seems like we lost that by mistake during the workspace refactor work. Fix #6812"
locao pushed a commit that referenced this issue on Apr 5, 2021 (#5831): "Seems like we lost that by mistake during the workspace refactor work. Fix #6812"
stale bot commented Apr 9, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@mgupta0141

@vaibhavkhurana2018 @mlatimer-figure
[error] 48#0: *3147821 [lua] balancer.lua:921: execute(): DNS resolution failed: dns lookup pool exceeded retries (1): timeout.
[error] 48#0: *3148632 [lua] balancer.lua:921: execute(): DNS resolution failed: dns server error: 3 name error.
Is the issue fixed after upgrading Kong to 2.3.3?

We are on Kong version 2.0.3.
