
3.7.1 upgrade to 3.9.0 migrations dns resolution issues (Postgresql 14) #14233

dgresh1 opened this issue Jan 30, 2025 · 8 comments

Comments

dgresh1 commented Jan 30, 2025

Is there an existing issue for this?

  • I have searched the existing issues

Kong version ($ kong version)

kong 3.9.0

Current Behavior

When I use GitHub Actions to upgrade Kong, there are error messages related to:

kong migrations list
kong migrations up
kong migrations finish

Kong is running in hybrid mode and I am trying to upgrade our control plane.

This is not a bootstrap, as we are upgrading to 3.9.0.

If I exec into the pod running Kong and run the migration commands, I get:

Error: [PostgreSQL error] failed to retrieve PostgreSQL server_version_num: [cosocket] DNS resolution failed: DNS server error: failed to receive reply from UDP server 10.0.0.10:53: timeout, took 276 ms. Tried: [["psql-hcp-apim-dmz-cp-dev-centralus.postgres.database.azure.com:A","DNS server error: failed to receive reply from UDP server 10.0.0.10:53: timeout, took 276 ms"]]

If I do an nslookup from the pod, I do get resolution:

nslookup psql-hcp-apim-dmz-cp-dev-centralus.postgres.database.azure.com
;; Got recursion not available from 10.0.0.10
;; Got recursion not available from 10.0.0.10
;; Got recursion not available from 10.0.0.10
;; Got recursion not available from 10.0.0.10
Server: 10.0.0.10
Address: 10.0.0.10#53

Non-authoritative answer:
psql-hcp-apim-dmz-cp-dev-centralus.postgres.database.azure.com canonical name = psql-hcp-apim-dmz-cp-dev-centralus.privatelink.postgres.database.azure.com.
Name: psql-hcp-apim-dmz-cp-dev-centralus.privatelink.postgres.database.azure.com
Address: 10.15.34.69

So at the pod level it does resolve, but when running migrations it doesn't.

My /etc/resolv.conf file:

$ cat /etc/resolv.conf
search dmz-kong.svc.cluster.local svc.cluster.local cluster.local 13jqinnqegaetjxzt0guttm2sb.gx.internal.cloudapp.net
nameserver 10.0.0.10
options ndots:5
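An aside on the resolv.conf above: with ndots:5, a resolver treats any name with fewer than five dots as relative and tries each search suffix before the bare name. The Postgres hostname here has only four dots, so a single lookup can fan out into five queries, which makes UDP timeouts more likely under any resolver pressure. A minimal sketch of that expansion (not Kong's actual resolver code, just standard resolv.conf search-list semantics):

```python
# Sketch of resolv.conf ndots/search behavior, using the search list
# and ndots value from the /etc/resolv.conf shown above.

SEARCH = [
    "dmz-kong.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
    "13jqinnqegaetjxzt0guttm2sb.gx.internal.cloudapp.net",
]
NDOTS = 5

def candidate_names(name: str) -> list[str]:
    """Return the query names a resolver would try, in order."""
    if name.endswith("."):
        # Fully qualified (trailing dot): exactly one query.
        return [name]
    if name.count(".") >= NDOTS:
        # Enough dots: try the absolute name first, then the search list.
        return [name + "."] + [f"{name}.{suffix}." for suffix in SEARCH]
    # Fewer dots than ndots: search suffixes are tried BEFORE the bare name.
    return [f"{name}.{suffix}." for suffix in SEARCH] + [name + "."]

host = "psql-hcp-apim-dmz-cp-dev-centralus.postgres.database.azure.com"
print(host.count("."))             # 4 dots, below ndots:5
print(len(candidate_names(host)))  # 5 candidate queries for one lookup
```

With four dots against ndots:5, the four cluster search suffixes are all queried (and must fail) before the real name is tried, so one slow or dropped UDP reply anywhere in that chain surfaces as the timeout seen in the error.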

One time I did get the following output from kong migrations list:

$ kong migrations list
Executed migrations:
core: 000_base, 003_100_to_110, 004_110_to_120, 005_120_to_130, 006_130_to_140, 007_140_to_150, 008_150_to_200, 009_200_to_210, 010_210_to_211, 011_212_to_213, 012_213_to_220, 013_220_to_230, 014_230_to_270, 015_270_to_280, 016_280_to_300, 017_300_to_310, 018_310_to_320, 019_320_to_330, 020_330_to_340, 021_340_to_350, 022_350_to_360, 023_360_to_370, 024_380_to_390
acl: 000_base_acl, 002_130_to_140, 003_200_to_210, 004_212_to_213
acme: 000_base_acme, 001_280_to_300, 002_320_to_330, 003_350_to_360
ai-proxy: 001_360_to_370
basic-auth: 000_base_basic_auth, 002_130_to_140, 003_200_to_210
bot-detection: 001_200_to_210
hmac-auth: 000_base_hmac_auth, 002_130_to_140, 003_200_to_210
http-log: 001_280_to_300
ip-restriction: 001_200_to_210
jwt: 000_base_jwt, 002_130_to_140, 003_200_to_210
key-auth: 000_base_key_auth, 002_130_to_140, 003_200_to_210, 004_320_to_330
oauth2: 000_base_oauth2, 003_130_to_140, 004_200_to_210, 005_210_to_211, 006_320_to_330, 007_320_to_330
opentelemetry: 001_331_to_332
post-function: 001_280_to_300
pre-function: 001_280_to_300
rate-limiting: 000_base_rate_limiting, 003_10_to_112, 004_200_to_210, 005_320_to_330, 006_350_to_360
response-ratelimiting: 000_base_response_rate_limiting, 001_350_to_360
session: 000_base_session, 001_add_ttl_index, 002_320_to_330

I then ran it again and saw the DNS resolution issue again.

When I ran kong migrations --v list, I received additional info:

/usr/local/share/lua/5.1/kong/cmd/migrations.lua:101: [PostgreSQL error] failed to retrieve PostgreSQL server_version_num: [cosocket] DNS resolution failed: DNS server error: failed to receive reply from UDP server 10.0.0.10:53: timeout, took 409 ms. Tried: [["psql-hcp-apim-dmz-cp-dev-centralus.postgres.database.azure.com:A","DNS server error: failed to receive reply from UDP server 10.0.0.10:53: timeout, took 409 ms"]]

Expected Behavior

I expected the migrations to work.

Steps To Reproduce

I exec into the Kong pod, try running kong migrations list, and get the errors above.

Anything else?

No response

bungle (Member) commented Feb 5, 2025

@dgresh1 just checking, is there any difference if you run it with:

KONG_NEW_DNS_CLIENT=on kong migrations ...

dgresh1 (Author) commented Feb 5, 2025

@bungle would I put this in our primary Kong deployment file? We also have a k8s Job spec with containers for kong-bootstrap, kong-migrations-up, and kong-migrations-finish.
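For anyone following along, one way this could be wired up (a sketch only; the container names mirror the ones mentioned above, and the exact structure depends on your chart/manifests) is to add the KONG_NEW_DNS_CLIENT variable to both the main Kong Deployment and each migrations container in the Job spec:

```yaml
# Hypothetical fragment; adapt to your actual Deployment/Job specs.
# The same env entry would be repeated on the kong-bootstrap,
# kong-migrations-up, and kong-migrations-finish containers.
containers:
  - name: kong-migrations-up
    image: kong:3.9.0
    command: ["kong", "migrations", "up"]
    env:
      - name: KONG_NEW_DNS_CLIENT
        value: "on"
```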

dgresh1 (Author) commented Feb 5, 2025

Also, am I specifying on, or is there another value it needs to be for kong migrations?

dgresh1 (Author) commented Feb 5, 2025

I enabled the new DNS client and am still getting the same error message as before.

dgresh1 (Author) commented Feb 7, 2025

@bungle we are also seeing this issue with 3.8.0.

jeremyjpj0916 (Contributor):

The root cause of these issues is probably the same as the issues I detail in my GitHub issue. Something funky is going on with how Kong does DNS lookups.

lordgreg:

Hi,
(this feedback was also added to #14249)

We are having (maybe) the exact same issue. At some point, Kong just doesn't want to resolve the Postgres host anymore. What is odd is that even when Kong goes into a CrashLoop and starts again, it doesn't work. The migrations still reply with

failed to get create response: rpc error: code = Unavailable desc = connection error: desc = "error reading server preface: read unix @->/var/run/tw.runc.sock: use of closed network connection"
Error: [PostgreSQL error] failed to retrieve PostgreSQL server_version_num: timeout

and only start to work when we completely delete the pod and the deployment recreates it for us.

@jeremyjpj0916
Copy link
Contributor

jeremyjpj0916 commented Feb 13, 2025

@lordgreg I have not noticed that problem specifically just yet, but it does seem similar or related to the issues I have found. It feels like Kong isn't respecting DNS timeout settings, and the times where DNS fails in 0-1 ms in the logs make me think that, as a client, it's trying to reuse stale sockets or something it shouldn't be. Very strange behavior. I would have thought Kong's functional test suites would have caught something like this, but it may go deeper than that. Have you tried adding my DNS tuning and setting the attempts to 3? It seems to help somewhat right now with our stuff.
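If the "DNS tuning" above refers to resolv.conf options (an assumption on my part; the linked issue has the author's exact settings), those can be set per pod in Kubernetes via dnsConfig. A sketch; only attempts: 3 comes from the comment above, the timeout and ndots values are illustrative assumptions:

```yaml
# Hypothetical pod-spec fragment. Only "attempts: 3" is taken from the
# thread; "timeout" and "ndots" values here are assumed examples.
spec:
  dnsConfig:
    options:
      - name: attempts
        value: "3"
      - name: timeout
        value: "2"
      - name: ndots
        value: "2"
```

Lowering ndots also reduces the search-list fan-out discussed earlier in the thread, since the Azure Postgres hostname has only four dots.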
