
3.7.1 upgrade to 3.9.0 migrations dns resolution issues (Postgresql 14) #14233

dgresh1 opened this issue Jan 30, 2025 · 8 comments

Comments

dgresh1 commented Jan 30, 2025

Is there an existing issue for this?

  • I have searched the existing issues

Kong version ($ kong version)

kong 3.9.0

Current Behavior

When I use GitHub Actions to upgrade Kong, there are error messages related to:

kong migrations list
kong migrations up
kong migrations finish

Kong is running in hybrid mode and I am trying to upgrade our control plane.

This is not a bootstrap, as we are upgrading to 3.9.0.

If I exec into the pod running Kong and run the migration commands, I get:

Error: [PostgreSQL error] failed to retrieve PostgreSQL server_version_num: [cosocket] DNS resolution failed: DNS server error: failed to receive reply from UDP server 10.0.0.10:53: timeout, took 276 ms. Tried: [["psql-hcp-apim-dmz-cp-dev-centralus.postgres.database.azure.com:A","DNS server error: failed to receive reply from UDP server 10.0.0.10:53: timeout, took 276 ms"]]

If I do an nslookup from the pod, I do get resolution:

nslookup psql-hcp-apim-dmz-cp-dev-centralus.postgres.database.azure.com
;; Got recursion not available from 10.0.0.10
;; Got recursion not available from 10.0.0.10
;; Got recursion not available from 10.0.0.10
;; Got recursion not available from 10.0.0.10
Server: 10.0.0.10
Address: 10.0.0.10#53

Non-authoritative answer:
psql-hcp-apim-dmz-cp-dev-centralus.postgres.database.azure.com canonical name = psql-hcp-apim-dmz-cp-dev-centralus.privatelink.postgres.database.azure.com.
Name: psql-hcp-apim-dmz-cp-dev-centralus.privatelink.postgres.database.azure.com
Address: 10.15.34.69

So at the pod level it does resolve, but when running migrations it doesn't.

My /etc/resolv.conf file:

$ cat /etc/resolv.conf
search dmz-kong.svc.cluster.local svc.cluster.local cluster.local 13jqinnqegaetjxzt0guttm2sb.gx.internal.cloudapp.net
nameserver 10.0.0.10
options ndots:5
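An aside on the resolv.conf above: with ndots:5, a resolver treats any name with fewer than five dots as relative and tries each search suffix before the bare name. The Postgres hostname here has only four dots, so a single lookup can fan out into five queries, which makes UDP timeouts more likely under any resolver pressure. A minimal sketch of that expansion (not Kong's actual resolver code, just standard resolv.conf search-list semantics):

```python
# Sketch of resolv.conf ndots/search behavior, using the search list
# and ndots value from the /etc/resolv.conf shown above.

SEARCH = [
    "dmz-kong.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
    "13jqinnqegaetjxzt0guttm2sb.gx.internal.cloudapp.net",
]
NDOTS = 5

def candidate_names(name: str) -> list[str]:
    """Return the query names a resolver would try, in order."""
    if name.endswith("."):
        # Fully qualified (trailing dot): exactly one query.
        return [name]
    if name.count(".") >= NDOTS:
        # Enough dots: try the absolute name first, then the search list.
        return [name + "."] + [f"{name}.{suffix}." for suffix in SEARCH]
    # Fewer dots than ndots: search suffixes are tried BEFORE the bare name.
    return [f"{name}.{suffix}." for suffix in SEARCH] + [name + "."]

host = "psql-hcp-apim-dmz-cp-dev-centralus.postgres.database.azure.com"
print(host.count("."))             # 4 dots, below ndots:5
print(len(candidate_names(host)))  # 5 candidate queries for one lookup
```

With four dots against ndots:5, the four cluster search suffixes are all queried (and must fail) before the real name is tried, so one slow or dropped UDP reply anywhere in that chain surfaces as the timeout seen in the error.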

One time I did get the following output from kong migrations list:

$ kong migrations list
Executed migrations:
core: 000_base, 003_100_to_110, 004_110_to_120, 005_120_to_130, 006_130_to_140, 007_140_to_150, 008_150_to_200, 009_200_to_210, 010_210_to_211, 011_212_to_213, 012_213_to_220, 013_220_to_230, 014_230_to_270, 015_270_to_280, 016_280_to_300, 017_300_to_310, 018_310_to_320, 019_320_to_330, 020_330_to_340, 021_340_to_350, 022_350_to_360, 023_360_to_370, 024_380_to_390
acl: 000_base_acl, 002_130_to_140, 003_200_to_210, 004_212_to_213
acme: 000_base_acme, 001_280_to_300, 002_320_to_330, 003_350_to_360
ai-proxy: 001_360_to_370
basic-auth: 000_base_basic_auth, 002_130_to_140, 003_200_to_210
bot-detection: 001_200_to_210
hmac-auth: 000_base_hmac_auth, 002_130_to_140, 003_200_to_210
http-log: 001_280_to_300
ip-restriction: 001_200_to_210
jwt: 000_base_jwt, 002_130_to_140, 003_200_to_210
key-auth: 000_base_key_auth, 002_130_to_140, 003_200_to_210, 004_320_to_330
oauth2: 000_base_oauth2, 003_130_to_140, 004_200_to_210, 005_210_to_211, 006_320_to_330, 007_320_to_330
opentelemetry: 001_331_to_332
post-function: 001_280_to_300
pre-function: 001_280_to_300
rate-limiting: 000_base_rate_limiting, 003_10_to_112, 004_200_to_210, 005_320_to_330, 006_350_to_360
response-ratelimiting: 000_base_response_rate_limiting, 001_350_to_360
session: 000_base_session, 001_add_ttl_index, 002_320_to_330

I then ran it again and saw the DNS resolution issue again.

When I ran kong migrations --v list, I received additional info:

/usr/local/share/lua/5.1/kong/cmd/migrations.lua:101: [PostgreSQL error] failed to retrieve PostgreSQL server_version_num: [cosocket] DNS resolution failed: DNS server error: failed to receive reply from UDP server 10.0.0.10:53: timeout, took 409 ms. Tried: [["psql-hcp-apim-dmz-cp-dev-centralus.postgres.database.azure.com:A","DNS server error: failed to receive reply from UDP server 10.0.0.10:53: timeout, took 409 ms"]]

Expected Behavior

I expected the migrations to work.

Steps To Reproduce

I exec into the Kong pod, try running kong migrations list, and get the errors above.

Anything else?

No response

bungle (Member) commented Feb 5, 2025

@dgresh1 just checking, is there any difference if you run it with:

KONG_NEW_DNS_CLIENT=on kong migrations ...

dgresh1 (Author) commented Feb 5, 2025

@bungle would I put this in our primary Kong deployment file? We also have a k8s Job spec with containers for kong-bootstrap, kong-migrations-up, and kong-migrations-finish.
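For anyone following along, one way this could be wired up (a sketch only; the container names mirror the ones mentioned above, and the exact structure depends on your chart/manifests) is to add the KONG_NEW_DNS_CLIENT variable to both the main Kong Deployment and each migrations container in the Job spec:

```yaml
# Hypothetical fragment; adapt to your actual Deployment/Job specs.
# The same env entry would be repeated on the kong-bootstrap,
# kong-migrations-up, and kong-migrations-finish containers.
containers:
  - name: kong-migrations-up
    image: kong:3.9.0
    command: ["kong", "migrations", "up"]
    env:
      - name: KONG_NEW_DNS_CLIENT
        value: "on"
```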

dgresh1 (Author) commented Feb 5, 2025

Also, am I specifying on, or is there another value it needs to be for kong migrations?

dgresh1 (Author) commented Feb 5, 2025

I enabled the new DNS client and am still getting the same error message as before.

dgresh1 (Author) commented Feb 7, 2025

@bungle we are also seeing this issue with 3.8.0.

jeremyjpj0916 (Contributor):

The root cause of these issues is probably the same as the issues I detail in my GitHub issue. Something funky is going on with how Kong does DNS lookups.

lordgreg:

Hi,
(this feedback was also added to #14249)

We are having (maybe) the exact same issue. At some point, Kong just doesn't want to resolve the Postgres host anymore. What is odd is that even when Kong goes into a CrashLoop and starts again, it doesn't work. The migrations still reply with

failed to get create response: rpc error: code = Unavailable desc = connection error: desc = "error reading server preface: read unix @->/var/run/tw.runc.sock: use of closed network connection"
Error: [PostgreSQL error] failed to retrieve PostgreSQL server_version_num: timeout

and only start to work when we completely delete the pod and the deployment recreates it for us.

@jeremyjpj0916
Copy link
Contributor

jeremyjpj0916 commented Feb 13, 2025

@lordgreg I have not noticed that problem specifically just yet, but it does seem similar or related to the issues I have found. It feels like Kong isn't respecting DNS timeout settings, and the times where DNS fails in 0-1 ms in the logs make me think that, as a client, it's trying to reuse stale sockets or something it shouldn't be. Very strange behavior. I would have thought Kong's functional test suites would have caught something like this, but it may go deeper than that. Have you tried adding my DNS tuning and setting the attempts to 3? It seems to help somewhat right now with our stuff.
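If the "DNS tuning" above refers to resolv.conf options (an assumption on my part; the linked issue has the author's exact settings), those can be set per pod in Kubernetes via dnsConfig. A sketch; only attempts: 3 comes from the comment above, the timeout and ndots values are illustrative assumptions:

```yaml
# Hypothetical pod-spec fragment. Only "attempts: 3" is taken from the
# thread; "timeout" and "ndots" values here are assumed examples.
spec:
  dnsConfig:
    options:
      - name: attempts
        value: "3"
      - name: timeout
        value: "2"
      - name: ndots
        value: "2"
```

Lowering ndots also reduces the search-list fan-out discussed earlier in the thread, since the Azure Postgres hostname has only four dots.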
