
Kong 3.9.0 bugs, seems DNS/cosocket related issues #14249

jeremyjpj0916 opened this issue Feb 7, 2025 · 23 comments

@jeremyjpj0916 (Contributor) commented Feb 7, 2025

Is there an existing issue for this?

  • I have searched the existing issues

Kong version ($ kong version)

Kong 3.9.0

Current Behavior

I want to walk through some issues observed after cutting over from Kong 3.7.1 traditional mode to Kong 3.9.0 traditional mode. I have the new Kong DNS client turned on while also leaving the old Kong DNS env configs in place.

Ex:

- name: KONG_NEW_DNS_CLIENT
  value: "on"
- name: KONG_RESOLVER_ADDRESS
  value: 10.xx.xx.105,10.x.xx.103
- name: KONG_RESOLVER_HOSTS_FILE
  value: /etc/hosts
- name: KONG_RESOLVER_STALE_TTL
  value: "3600"
- name: KONG_RESOLVER_ERROR_TTL
  value: "30"
- name: KONG_DNS_HOSTSFILE
  value: /etc/hosts
- name: KONG_DNS_STALE_TTL
  value: "3600"
- name: KONG_DNS_NOT_FOUND_TTL
  value: "30"
- name: KONG_DNS_ERROR_TTL
  value: "30"
- name: KONG_DNS_RESOLVER
  value: 10.xx.xx.105,10.x.xx.103
- name: KONG_DNS_NO_SYNC
  value: "off"
- name: LOCALDOMAIN
  value: .
- name: RES_OPTIONS
  value: "ndots:1 attempts:1 timeout:2"
- name: KONG_DNS_ORDER
  value: LAST,A,SRV,CNAME

First observation: the migrations up command does make some changes, while the migrations finish command seems to find no changes. Assuming that's to be expected.

Some background: I spin up an HCC K8S Kong job on the new version against a cloned copy of the older Kong Postgres DB to perform the upgrades out of place from our live database. Then I cut over our runtime Kong nodes to the newer database once I see that migrations up and finish were successful.

Now that I have Kong 3.9.0 nodes running in place of the Kong 3.7.1 nodes, the following bugs can be observed.

On my pod running the newer version of Kong, a lot of CLI commands either fail or behave in ways they should not.

First, let's observe checking the Kong version:

Old 3.7.1:

ubuntu@kong-85ddfd88f-b7nsf:/$ kong version -vv
2025/02/07 19:49:20 [verbose] Kong: 3.7.1
2025/02/07 19:49:20 [debug] ngx_lua: 10026
2025/02/07 19:49:20 [debug] nginx: 1025003
2025/02/07 19:49:20 [debug] Lua: LuaJIT 2.1.0-20231117
3.7.1
ubuntu@kong-85ddfd88f-b7nsf:/$ 

New 3.9.0:

ubuntu@kong-55bb65cb5b-fv29k:/$ kong version -vv 
2025/02/07 19:48:43 [notice] 27780#0: using the "epoll" event method
2025/02/07 19:48:43 [notice] 27780#0: openresty/1.25.3.2
2025/02/07 19:48:43 [notice] 27780#0: OS: Linux 5.14.0-427.50.1.el9_4.x86_64
2025/02/07 19:48:43 [notice] 27780#0: getrlimit(RLIMIT_NOFILE): 1048576:1048576
2025/02/07 19:48:43 [debug] 27780#0: *2 [lua] globalpatches.lua:10: installing the globalpatches
2025/02/07 19:48:43 [debug] 27780#0: *2 [lua] client.lua:618: init(): [dns_client] (re)configuring dns client
2025/02/07 19:48:43 [notice] 27780#0: *2 [lua] client.lua:161: new(): [dns_client] supported types: srv ipv4 ipv6 , context: ngx.timer
2025/02/07 19:48:43 [debug] 27780#0: *2 [lua] globalpatches.lua:445: randomseed(): seeding PRNG from OpenSSL RAND_bytes()
2025/02/07 19:48:43 [verbose] Kong: 3.9.0
2025/02/07 19:48:43 [debug] ngx_lua: 10026
2025/02/07 19:48:43 [debug] nginx: 1025003
2025/02/07 19:48:43 [debug] Lua: LuaJIT 2.1.0-20231117.1
3.9.0
2025/02/07 19:48:43 [notice] 27780#0: gracefully shutting down
2025/02/07 19:48:43 [notice] 27780#0: exiting
2025/02/07 19:48:43 [notice] 27780#0: exit
ubuntu@kong-55bb65cb5b-fv29k:/$ 

It almost feels like the newer CLI is spinning up a totally separate nginx main/worker process, logging its init() output near the top, printing the values, and then exiting? That seems odd/wrong for a CLI command that should just print the relevant values, and understanding why it does that might help explain some of the other issues being observed.

Next issue:

On my runtime 3.9 pods it seems the CLI in a terminal session can't reach the DB properly the way the runtime Kong pod startup could. This is all running from the same pod, so the Kong pod's startup CMD has no issues, but the CLI commands run from a shell in that same pod lose the ability to reach the DB? Example below, where the DB has already been migrated and set up for each respective version of Kong we run right now, but the newer version bombs on connecting to the DB in the same environment setup.

3.7.1

ubuntu@kong-85ddfd88f-b7nsf:/$ kong migrations finish 
No pending migrations to finish

3.9.0

ubuntu@kong-55bb65cb5b-fv29k:/$ kong migrations finish
Error: [PostgreSQL error] failed to retrieve PostgreSQL server_version_num: [cosocket] DNS resolution failed: DNS server error: failed to receive reply from UDP server 10.xx.xx.105:53: timeout, took 170 ms. Tried: [["db_server.company.com:A","DNS server error: failed to receive reply from UDP server 10.xx.xx.105:53: timeout, took 170 ms"]]

  Run with --v (verbose) or --vv (debug) for more details
ubuntu@kong-55bb65cb5b-fv29k:/$ 

^ Which is interesting considering the pod I am running that CLI command from is up and healthy, talking to that exact Postgres DB reference using that exact DNS resolver nameserver. How does the CLI process lose the ability to communicate with it from the same active source host?

One thing I noticed is that 3.7.1 to 3.9.0 changes this events section; could this be playing a role? Or is the new DNS implementation somehow bugging things out?

3.7.1:

server {
    charset UTF-8;
    server_name kong_worker_events;
    listen unix:${{PREFIX}}/worker_events.sock;
    access_log off;
    location / {
        content_by_lua_block {
          require("resty.events.compat").run()
        }
    }
}

3.9.0:

server {
    charset UTF-8;
    server_name kong_worker_events;
    listen unix:${{SOCKET_PATH}}/${{WORKER_EVENTS_SOCK}};
    access_log off;
    location / {
        content_by_lua_block {
          require("resty.events.compat").run()
        }
    }
}

Is there some kind of socket path reference difference that other parts of Kong did not transition to, causing this issue? Not totally sure; just thinking aloud about what all changed between Kong 3.7.1 and 3.9.0.

cc @bungle Mr. Bungle to the rescue?

Expected Behavior

  1. The CLI commands work and behave much like 3.7.1 does.

Steps To Reproduce

  1. Read my original top part.

Anything else?

No response

@jeremyjpj0916 (Contributor Author) commented Feb 7, 2025

Okay, this is strange... I then left the 3.9.0 pod up and reran the exact same CLI command, just with more verbosity enabled, and got a response, meaning DB connectivity was established. Going to skip some of the output since it's really long and just keep the good stuff:

3.9.0 CLI ran a second time again in -vv mode.

ubuntu@kong-55bb65cb5b-fv29k:/$ kong migrations finish -vv
2025/02/07 19:55:03 [notice] 29563#0: using the "epoll" event method
2025/02/07 19:55:03 [notice] 29563#0: openresty/1.25.3.2
2025/02/07 19:55:03 [notice] 29563#0: OS: Linux 5.14.0-427.50.1.el9_4.x86_64
2025/02/07 19:55:03 [notice] 29563#0: getrlimit(RLIMIT_NOFILE): 1048576:1048576
2025/02/07 19:55:03 [debug] 29563#0: *2 [lua] globalpatches.lua:10: installing the globalpatches
2025/02/07 19:55:03 [debug] 29563#0: *2 [lua] client.lua:618: init(): [dns_client] (re)configuring dns client
2025/02/07 19:55:03 [notice] 29563#0: *2 [lua] client.lua:161: new(): [dns_client] supported types: srv ipv4 ipv6 , context: ngx.timer
2025/02/07 19:55:03 [debug] 29563#0: *2 [lua] globalpatches.lua:445: randomseed(): seeding PRNG from OpenSSL RAND_bytes()
2025/02/07 19:55:03 [verbose] Kong: 3.9.0
2025/02/07 19:55:03 [debug] ngx_lua: 10026
2025/02/07 19:55:03 [debug] nginx: 1025003
2025/02/07 19:55:03 [debug] Lua: LuaJIT 2.1.0-20231117.1
2025/02/07 19:55:03 [verbose] no config file found at /etc/kong/kong.conf
2025/02/07 19:55:03 [verbose] no config file found at /etc/kong.conf
2025/02/07 19:55:03 [verbose] no config file, skip loading
2025/02/07 19:55:03 [debug] reading environment variables
2025/02/07 19:55:03 [debug] 29563#0: *2 [lua] client.lua:618: init(): [dns_client] (re)configuring dns client
2025/02/07 19:55:03 [notice] 29563#0: *2 [lua] client.lua:161: new(): [dns_client] supported types: srv ipv4 , context: ngx.timer
2025/02/07 19:55:03 [verbose] preparing nginx prefix directory at /usr/local/kong
2025/02/07 19:55:03 [verbose] generating trusted certs combined file in /usr/local/kong/.ca_combined
2025/02/07 19:55:03 [info] 29563#0: *2 [lua] node.lua:303: new(): kong node-id: cba02678-0436-4a42-8871-001cf552e1b7, context: ngx.timer
2025/02/07 19:55:03 [debug] 29563#0: *2 [lua] client.lua:550: resolve_all(): [dns_client] resolve_all db_server.company.com:-1
2025/02/07 19:55:03 [debug] 29563#0: *2 [lua] client.lua:534: [dns_client] cache miss, try to query db_server.company.com:-1
2025/02/07 19:55:03 [debug] 29563#0: *2 [lua] client.lua:362: resolve_query(): [dns_client] r:query(db_server.company.com:1) ans:1 t:182 ms
2025/02/07 19:55:03 [debug] 29563#0: *2 [lua] client.lua:332: process_answers(): [dns_client] processed ans:1
2025/02/07 19:55:03 [debug] 29563#0: *2 [lua] client.lua:567: resolve_all(): [dns_client] cache lookup db_server.company.com:-1 ans:1 hlv:miss
2025/02/07 19:55:03 [debug] 29563#0: *5 [lua] super.lua:161: scaling_log(): [timer-ng] load_avg: 0, runable_jobs_avg: 0, alive_threads_avg: 24
2025/02/07 19:55:03 [debug] loading subsystems migrations...
2025/02/07 19:55:03 [verbose] retrieving database schema state...
2025/02/07 19:55:03 [debug] 29563#0: *2 [lua] client.lua:550: resolve_all(): [dns_client] resolve_all db_server.company.com:-1
2025/02/07 19:55:03 [debug] 29563#0: *2 [lua] client.lua:567: resolve_all(): [dns_client] cache lookup db_server.company.com:-1 ans:1 hlv:hit_lru
2025/02/07 19:55:03 [verbose] schema state retrieved
2025/02/07 19:55:03 [debug] 29563#0: *2 [lua] client.lua:550: resolve_all(): [dns_client] resolve_all db_server.company.com:-1
2025/02/07 19:55:03 [debug] 29563#0: *2 [lua] client.lua:567: resolve_all(): [dns_client] cache lookup db_server.company.com:-1 ans:1 hlv:hit_lru
2025/02/07 19:55:03 [debug] loading subsystems migrations...
2025/02/07 19:55:03 [verbose] retrieving database schema state...
2025/02/07 19:55:03 [debug] 29563#0: *2 [lua] client.lua:550: resolve_all(): [dns_client] resolve_all db_server.company.com:-1
2025/02/07 19:55:03 [debug] 29563#0: *2 [lua] client.lua:567: resolve_all(): [dns_client] cache lookup db_server.company.com:-1 ans:1 hlv:hit_lru
2025/02/07 19:55:03 [verbose] schema state retrieved
2025/02/07 19:55:03 [info] No pending migrations to finish
2025/02/07 19:55:03 [debug] 29563#0: *2 [lua] client.lua:550: resolve_all(): [dns_client] resolve_all db_server.company.com:-1
2025/02/07 19:55:03 [debug] 29563#0: *2 [lua] client.lua:567: resolve_all(): [dns_client] cache lookup db_server.company.com:-1 ans:1 hlv:hit_lru
2025/02/07 19:55:03 [notice] 29563#0: gracefully shutting down
2025/02/07 19:55:04 [notice] 29563#0: exiting
2025/02/07 19:55:04 [notice] 29563#0: exit

Why would running the same command a second time work when the first time failed in the same pod... Let me see if I can reproduce this when I do a fresh rolling redeploy and try the same CLI actions multiple times.

@jeremyjpj0916 (Contributor Author) commented Feb 7, 2025

Okay, yeah, something is really strange here.

From the same pod, executing migrations up on a DB that's already up to date:

ubuntu@kong-55bb65cb5b-fv29k:/$ kong migrations up -vv
2025/02/07 20:15:58 [notice] 35455#0: using the "epoll" event method
2025/02/07 20:15:58 [notice] 35455#0: openresty/1.25.3.2
2025/02/07 20:15:58 [notice] 35455#0: OS: Linux 5.14.0-427.50.1.el9_4.x86_64
2025/02/07 20:15:58 [notice] 35455#0: getrlimit(RLIMIT_NOFILE): 1048576:1048576
2025/02/07 20:15:58 [debug] 35455#0: *2 [lua] globalpatches.lua:10: installing the globalpatches
2025/02/07 20:15:58 [debug] 35455#0: *2 [lua] client.lua:618: init(): [dns_client] (re)configuring dns client
2025/02/07 20:15:58 [notice] 35455#0: *2 [lua] client.lua:161: new(): [dns_client] supported types: srv ipv4 ipv6 , context: ngx.timer
2025/02/07 20:15:58 [debug] 35455#0: *2 [lua] globalpatches.lua:445: randomseed(): seeding PRNG from OpenSSL RAND_bytes()
2025/02/07 20:15:58 [verbose] Kong: 3.9.0
2025/02/07 20:15:58 [debug] ngx_lua: 10026
2025/02/07 20:15:58 [debug] nginx: 1025003
2025/02/07 20:15:58 [debug] Lua: LuaJIT 2.1.0-20231117.1
2025/02/07 20:15:58 [verbose] no config file found at /etc/kong/kong.conf
2025/02/07 20:15:58 [verbose] no config file found at /etc/kong.conf
2025/02/07 20:15:58 [verbose] no config file, skip loading
2025/02/07 20:15:58 [debug] reading environment variables
.....
2025/02/07 20:15:58 [debug] 35455#0: *2 [lua] client.lua:618: init(): [dns_client] (re)configuring dns client
2025/02/07 20:15:58 [notice] 35455#0: *2 [lua] client.lua:161: new(): [dns_client] supported types: srv ipv4 , context: ngx.timer
2025/02/07 20:15:58 [verbose] preparing nginx prefix directory at /usr/local/kong
2025/02/07 20:15:58 [verbose] generating trusted certs combined file in /usr/local/kong/.ca_combined
2025/02/07 20:15:58 [info] 35455#0: *2 [lua] node.lua:303: new(): kong node-id: e3474263-dade-44de-a151-7cd3460a4532, context: ngx.timer
2025/02/07 20:15:58 [debug] 35455#0: *2 [lua] client.lua:550: resolve_all(): [dns_client] resolve_all db_server.company.com:-1
2025/02/07 20:15:58 [debug] 35455#0: *2 [lua] client.lua:534: [dns_client] cache miss, try to query db_server.company.com:-1
2025/02/07 20:15:58 [debug] 35455#0: *2 [lua] client.lua:362: resolve_query(): [dns_client] r:query(db_server.company.com:1) ans:- t:153 ms
2025/02/07 20:15:58 [debug] 35455#0: *2 [lua] client.lua:567: resolve_all(): [dns_client] cache lookup db_server.company.com:-1 ans:- hlv:fail
Error: 
/usr/local/share/lua/5.1/kong/cmd/migrations.lua:101: [PostgreSQL error] failed to retrieve PostgreSQL server_version_num: [cosocket] DNS resolution failed: DNS server error: failed to receive reply from UDP server 10.xx.xx.105:53: timeout, took 153 ms. Tried: [["db_server.company.com:A","DNS server error: failed to receive reply from UDP server 10.xx.xx.105:53: timeout, took 153 ms"]]
stack traceback:
        [C]: in function 'assert'
        /usr/local/share/lua/5.1/kong/cmd/migrations.lua:101: in function 'cmd_exec'
        /usr/local/share/lua/5.1/kong/cmd/init.lua:31: in function </usr/local/share/lua/5.1/kong/cmd/init.lua:31>
        [C]: in function 'xpcall'
        /usr/local/share/lua/5.1/kong/cmd/init.lua:31: in function </usr/local/share/lua/5.1/kong/cmd/init.lua:15>
        (command line -e):5: in function 'inline_gen'
        init_worker_by_lua(nginx.conf:274):44: in function <init_worker_by_lua(nginx.conf:274):43>
        [C]: in function 'xpcall'
        init_worker_by_lua(nginx.conf:274):52: in function <init_worker_by_lua(nginx.conf:274):50>

Same error executing the same command 4 times in a row, then the 5th time it worked on the same pod. Wild stuff:

ubuntu@kong-55bb65cb5b-fv29k:/$ kong migrations up -vv
2025/02/07 20:21:52 [notice] 37149#0: using the "epoll" event method
2025/02/07 20:21:52 [notice] 37149#0: openresty/1.25.3.2
2025/02/07 20:21:52 [notice] 37149#0: OS: Linux 5.14.0-427.50.1.el9_4.x86_64
2025/02/07 20:21:52 [notice] 37149#0: getrlimit(RLIMIT_NOFILE): 1048576:1048576
2025/02/07 20:21:52 [debug] 37149#0: *2 [lua] globalpatches.lua:10: installing the globalpatches
2025/02/07 20:21:52 [debug] 37149#0: *2 [lua] client.lua:618: init(): [dns_client] (re)configuring dns client
2025/02/07 20:21:52 [notice] 37149#0: *2 [lua] client.lua:161: new(): [dns_client] supported types: srv ipv4 ipv6 , context: ngx.timer
2025/02/07 20:21:52 [debug] 37149#0: *2 [lua] globalpatches.lua:445: randomseed(): seeding PRNG from OpenSSL RAND_bytes()
2025/02/07 20:21:52 [verbose] Kong: 3.9.0
2025/02/07 20:21:52 [debug] ngx_lua: 10026
2025/02/07 20:21:52 [debug] nginx: 1025003
2025/02/07 20:21:52 [debug] Lua: LuaJIT 2.1.0-20231117.1
2025/02/07 20:21:52 [verbose] no config file found at /etc/kong/kong.conf
2025/02/07 20:21:52 [verbose] no config file found at /etc/kong.conf
2025/02/07 20:21:52 [verbose] no config file, skip loading
2025/02/07 20:21:52 [debug] reading environment variables
.........
2025/02/07 20:21:52 [debug] 37149#0: *2 [lua] client.lua:618: init(): [dns_client] (re)configuring dns client
2025/02/07 20:21:52 [notice] 37149#0: *2 [lua] client.lua:161: new(): [dns_client] supported types: srv ipv4 , context: ngx.timer
2025/02/07 20:21:52 [verbose] preparing nginx prefix directory at /usr/local/kong
2025/02/07 20:21:52 [verbose] generating trusted certs combined file in /usr/local/kong/.ca_combined
2025/02/07 20:21:52 [info] 37149#0: *2 [lua] node.lua:303: new(): kong node-id: 9a281286-e3a9-43d6-80de-f540273b2dc0, context: ngx.timer
2025/02/07 20:21:52 [debug] 37149#0: *2 [lua] client.lua:550: resolve_all(): [dns_client] resolve_all db_server.company.com:-1
2025/02/07 20:21:52 [debug] 37149#0: *2 [lua] client.lua:534: [dns_client] cache miss, try to query db_server.company.com:-1
2025/02/07 20:21:53 [debug] 37149#0: *2 [lua] client.lua:362: resolve_query(): [dns_client] r:query(db_server.company.com:1) ans:1 t:183 ms
2025/02/07 20:21:53 [debug] 37149#0: *2 [lua] client.lua:332: process_answers(): [dns_client] processed ans:1
2025/02/07 20:21:53 [debug] 37149#0: *2 [lua] client.lua:567: resolve_all(): [dns_client] cache lookup db_server.company.com:-1 ans:1 hlv:miss
2025/02/07 20:21:53 [debug] 37149#0: *5 [lua] super.lua:161: scaling_log(): [timer-ng] load_avg: 0, runable_jobs_avg: 0, alive_threads_avg: 24
2025/02/07 20:21:53 [debug] loading subsystems migrations...
2025/02/07 20:21:53 [verbose] retrieving database schema state...
2025/02/07 20:21:53 [debug] 37149#0: *2 [lua] client.lua:550: resolve_all(): [dns_client] resolve_all db_server.company.com:-1
2025/02/07 20:21:53 [debug] 37149#0: *2 [lua] client.lua:567: resolve_all(): [dns_client] cache lookup db_server.company.com:-1 ans:1 hlv:hit_lru
2025/02/07 20:21:53 [verbose] schema state retrieved
2025/02/07 20:21:53 [debug] 37149#0: *2 [lua] client.lua:550: resolve_all(): [dns_client] resolve_all db_server.company.com:-1
2025/02/07 20:21:53 [debug] 37149#0: *2 [lua] client.lua:567: resolve_all(): [dns_client] cache lookup db_server.company.com:-1 ans:1 hlv:hit_lru
2025/02/07 20:21:53 [debug] loading subsystems migrations...
2025/02/07 20:21:53 [verbose] retrieving database schema state...
2025/02/07 20:21:53 [debug] 37149#0: *2 [lua] client.lua:550: resolve_all(): [dns_client] resolve_all db_server.company.com:-1
2025/02/07 20:21:53 [debug] 37149#0: *2 [lua] client.lua:567: resolve_all(): [dns_client] cache lookup db_server.company.com:-1 ans:1 hlv:hit_lru
2025/02/07 20:21:53 [verbose] schema state retrieved
2025/02/07 20:21:53 [info] Database is already up-to-date
2025/02/07 20:21:53 [debug] 37149#0: *2 [lua] client.lua:550: resolve_all(): [dns_client] resolve_all db_server.company.com:-1
2025/02/07 20:21:53 [debug] 37149#0: *2 [lua] client.lua:567: resolve_all(): [dns_client] cache lookup db_server.company.com:-1 ans:1 hlv:hit_lru
2025/02/07 20:21:53 [notice] 37149#0: gracefully shutting down
2025/02/07 20:21:54 [notice] 37149#0: exiting
2025/02/07 20:21:54 [notice] 37149#0: exit

Next up, let me test with the new DNS feature turned off.

@jeremyjpj0916 (Contributor Author) commented Feb 7, 2025

Testing these CLI commands now on 3.9.0 with:

- name: KONG_NEW_DNS_CLIENT
  value: "off"

I notice Kong startup is much slower again with the old DNS client vs the new one:

2025/02/07 20:44:17 [notice] 38#0: *1300 [lua] warmup.lua:80: finished warming up DNS entries' into the cache (in 5767ms), context: ngx.timer

Where the new DNS client's warmup time looks like:

2025/02/07 18:08:19 [notice] 39#0: *1298 [lua] warmup.lua:80: finished warming up DNS entries' into the cache (in 68ms), context: ngx.timer

Huge difference there, so bravo Kong on that improvement: 5767ms vs 68ms.

But... going back to "off" for the new DNS client still shows the same errant behavior with the CLI being flaky, and the same oddities around checking the Kong version, no difference there... So I am not totally sure the new DNS client toggle is the cause of the woes observed in this newer version of Kong.

@jeremyjpj0916 (Contributor Author) commented Feb 8, 2025

Okay... it seems the regular Kong runtime is impacted too. I am seeing way more DNS related errors than I ever have before with Kong.

Examples:

2025/02/07 18:07:53 [crit] 39#0: *8 [lua] upstreams.lua:98: could not obtain list of upstreams: [postgres] [cosocket] DNS resolution failed: DNS server error: failed to receive reply from UDP server 10.xxx.xx.xxx:53: timeout, took 4 ms. Tried: [["db_server.company.com:A","DNS server error: failed to receive reply from UDP server 10.xxx.xx.249:53: timeout, took 4 ms"]], context: ngx.timer
2025/02/07 18:07:53 [crit] 39#0: *8 [lua] init.lua:247: init(): failed loading list of upstreams: failed to get from node cache: [postgres] [cosocket] DNS resolution failed: DNS server error: failed to receive reply from UDP server 10.xxx.xx.249:53: timeout, took 4 ms. Tried: [["db_server.company.com:A","DNS server error: failed to receive reply from UDP server 10.xxx.xx.xxx:53: timeout, took 4 ms"]], context: ngx.timer
2025/02/07 18:07:53 [notice] 38#0: *1 [lua] warmup.lua:162: single_dao(): finished preloading 'consumers' into the cache (in 0ms), context: init_worker_by_lua*
2025/02/07 18:07:53 [notice] 38#0: *1 [lua] warmup.lua:113: single_dao(): Preloading 'acls' into the cache..., context: init_worker_by_lua*
2025/02/07 18:07:53 [notice] 38#0: *1 [lua] warmup.lua:162: single_dao(): finished preloading 'acls' into the cache (in 0ms), context: init_worker_by_lua*
2025/02/07 18:07:53 [notice] 38#0: *1 [lua] warmup.lua:113: single_dao(): Preloading 'certificates' into the core_cache..., context: init_worker_by_lua*
2025/02/07 18:07:53 [notice] 38#0: *1 [lua] warmup.lua:162: single_dao(): finished preloading 'certificates' into the core_cache (in 0ms), context: init_worker_by_lua*
2025/02/07 18:07:53 [notice] 38#0: *1 [lua] warmup.lua:113: single_dao(): Preloading 'jwt_secrets' into the cache..., context: init_worker_by_lua*
2025/02/07 18:07:53 [notice] 38#0: *1 [lua] warmup.lua:162: single_dao(): finished preloading 'jwt_secrets' into the cache (in 0ms), context: init_worker_by_lua*
2025/02/07 18:07:53 [notice] 38#0: *1 [lua] warmup.lua:113: single_dao(): Preloading 'oauth2_credentials' into the cache..., context: init_worker_by_lua*
2025/02/07 18:07:53 [notice] 38#0: *1 [lua] warmup.lua:162: single_dao(): finished preloading 'oauth2_credentials' into the cache (in 0ms), context: init_worker_by_lua*
2025/02/07 18:07:53 [notice] 38#0: *1299 [lua] warmup.lua:46: warming up DNS entries ..., context: ngx.timer
2025/02/07 18:07:53 [notice] 39#0: *5 [lua] worker.lua:304: communicate(): worker #1 is ready to accept events from unix:/usr/local/kong/sockets/we, context: ngx.timer

Notice how odd it is that the first upstreams fetch failed with DNS but subsequent requests seemed to work fine? Flaky Kong behavior here.

Other examples I am seeing:

2025/02/07 18:07:53 [notice] 38#0: *1299 [lua] warmup.lua:46: warming up DNS entries ..., context: ngx.timer
2025/02/07 18:07:53 [notice] 39#0: *5 [lua] worker.lua:304: communicate(): worker #1 is ready to accept events from unix:/usr/local/kong/sockets/we, context: ngx.timer
2025/02/07 18:07:53 [notice] 38#0: *1293 [lua] broker.lua:263: run(): worker #1 connected to events broker (worker pid: 39), client: unix:, server: kong_worker_events, request: "GET / HTTP/1.1", host: "localhost"
2025/02/07 18:07:53 [notice] 38#0: *1294 [lua] broker.lua:263: run(): worker #2 connected to events broker (worker pid: 40), client: unix:, server: kong_worker_events, request: "GET / HTTP/1.1", host: "localhost"
2025/02/07 18:07:53 [notice] 40#0: *649 [lua] worker.lua:304: communicate(): worker #2 is ready to accept events from unix:/usr/local/kong/sockets/we, context: ngx.timer
2025/02/07 18:07:53 [notice] 38#0: *1296 [lua] worker.lua:304: communicate(): worker #0 is ready to accept events from unix:/usr/local/kong/sockets/we, context: ngx.timer
2025/02/07 18:07:53 [notice] 38#0: *1939 [lua] broker.lua:263: run(): worker #0 connected to events broker (worker pid: 38), client: unix:, server: kong_worker_events, request: "GET / HTTP/1.1", host: "localhost"
2025/02/07 18:07:54 [notice] 38#0: *1299 [lua] warmup.lua:80: finished warming up DNS entries' into the cache (in 84ms), context: ngx.timer
2025/02/07 18:09:00 [error] 40#0: *2216 [kong] api_helpers.lua:387 [cosocket] DNS resolution failed: DNS server error: failed to receive reply from UDP server 10.xxx.xx.249:53: timeout, took 0 ms. Tried: [["db_server.company.com:A","DNS server error: failed to receive reply from UDP server 10.xxx.xx.249:53: timeout, took 0 ms"]], client: 127.0.0.1, server: kong_admin, request: "PUT /upstreams/healthcheck_db_put HTTP/1.1", host: "localhost:8001"
2025/02/07 18:10:31 [error] 38#0: *1657 [kong] configuration.lua:159 [NONAME] Request to fetch configuration failed: closed, context: ngx.timer
2025/02/07 18:10:31 [error] 38#0: *1644 [kong] prevention.lua:257 [NONAME] Request to fetch prevention rules failed: closed, context: ngx.timer

And this one deals with an OIDC plugin also trying to do a lookup and failing, just using the regular lua-resty-http libs:

2025/02/07 18:16:34 [error] 40#0: *4658 [lua] openidc.lua:511: openidc_discover(): Could not get JSON discovery from cache or lookup accessing url (https://identity_provider.company.com/.well-known/openid-configuration) failed: [cosocket] DNS resolution failed: DNS server error: failed to receive reply from UDP server 10.xxx.xx.249:53: timeout, took 1 ms. Tried: [["identity_provider.company.com:A","DNS server error: failed to receive reply from UDP server 10.xxx.xx.249:53: timeout, took 1 ms"]],

What's interesting to me are the DNS timeout amounts: why these 0 and 1 ms timeouts? It's totally unreasonable to expect networking to always be sub-1ms. Most DNS queries take a little longer than 1ms for us, sometimes 4-5ms or so. Why is the timeout seemingly set so low at 0-1ms? I do see the earlier DNS error against the DB waited 4ms before the timeout, but even that is aggressive. I would like at least a 1 second timeout for any DNS query to give things a chance.

And I did test nslookup within the same pod against the same .249 nameserver that Kong is using, with no issues or errors after 20 executions:

ubuntu@kong-574bd4fd6b-jfw29:/$ nslookup identity_provider.company.com 10.xxx.xx.249
Server:         10.xxx.xx.249
Address:        10.xxx.xx.249#53

identity_provider.company.com     canonical name = identity_provider-dev.company.uhg.com.
Name:  identity_provider.company.com
Address: 10.xxx.xxx.179

@jeremyjpj0916 (Contributor Author) commented:

Plus my DNS RES_OPTIONS were at least set to:

            - name: LOCALDOMAIN
              value: .
            - name: RES_OPTIONS
              value: ndots:1 attempts:1 timeout:2

That should be giving me a 2 second timeout, not these 0-1-4ms timeout responses. Is something causing cosocket clashing issues, or is something within the source client node failing to use its own cosockets correctly and bouncing things back quickly?

@jeremyjpj0916 (Contributor Author) commented Feb 8, 2025

I don't think this will be the issue now; edited this away.

@jeremyjpj0916 changed the title from "Kong 3.9.0 bugs" to "Kong 3.9.0 bugs, seems DNS/cosocket related issues" on Feb 8, 2025
@jeremyjpj0916 (Contributor Author) commented Feb 11, 2025

Other things we know did vs did not change between Kong 3.7.1 and 3.9.0:

2025/02/07 19:49:20 [verbose] Kong: 3.7.1
2025/02/07 19:49:20 [debug] ngx_lua: 10026
2025/02/07 19:49:20 [debug] nginx: 1025003
2025/02/07 19:49:20 [debug] Lua: LuaJIT 2.1.0-20231117

2025/02/07 19:48:43 [verbose] Kong: 3.9.0
2025/02/07 19:48:43 [debug] ngx_lua: 10026
2025/02/07 19:48:43 [debug] nginx: 1025003
2025/02/07 19:48:43 [debug] Lua: LuaJIT 2.1.0-20231117.1

So, the same versions of ngx_lua and the nginx webserver. LuaJIT got a *.1 minor patch it looks like; not sure if that would play any role, but I assume not. So the issue isn't in ngx_lua or nginx webserver land since those all stayed the same.

I wonder if this newer PR could help:
https://github.com/Kong/kong/pull/14227/files

@jeremyjpj0916 (Contributor Author) commented Feb 12, 2025

AH HA! I am getting closer to the real issue. Before, this environment variable setup always gave us the best DNS performance with Kong in any given environment:

- name: LOCALDOMAIN
  value: .
- name: RES_OPTIONS
  value: ndots:1 attempts:1 timeout:2

The above always worked fine on older Kongs, all the way up to 3.7.1. What changed from 3.7.1 to 3.8.0/3.9.0 that caused the differing behavior, hmm.

I have specifically isolated that the issues arise when this ENV variable from the above DNS optimization is present:

- name: RES_OPTIONS
  value: ndots:1 attempts:1 timeout:2

(Initially I removed both, then added the LOCALDOMAIN one back and things were still stable with DNS afterwards, so it's not that one.)

Edit: Worth noting for those unfamiliar with Kong and tuning DNS that you also have to pass those variables through to the NGINX conf: #13323
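
As a side note on why that NGINX conf pass-through matters: nginx normally clears inherited environment variables from its processes unless they are declared with an env directive, which is what nginx_main_env / KONG_NGINX_MAIN_ENV injects into the generated nginx.conf. A quick hypothetical check (not Kong code), run from any Lua chunk inside the Kong/OpenResty process, would show nil when the variables have not been whitelisted:

-- Hypothetical sanity check, not part of Kong: confirm whether the resolver
-- env vars are actually visible inside the nginx/OpenResty process.
ngx.log(ngx.NOTICE, "RES_OPTIONS = ", os.getenv("RES_OPTIONS") or "nil (not passed through)")
ngx.log(ngx.NOTICE, "LOCALDOMAIN = ", os.getenv("LOCALDOMAIN") or "nil (not passed through)")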

cc @chobits @bungle @Tieske

@jeremyjpj0916 (Contributor Author) commented Feb 12, 2025

Okay, updating to this is seemingly stable now on 3.9.0 too:

            - name: RES_OPTIONS
              value: ndots:1 attempts:3 timeout:5

Notice my increase in attempts and the higher timeout... Let me try changing just one of those from the original values above to better pinpoint which field newer Kong behaves poorly with.

Okay... it's even stable with:

            - name: RES_OPTIONS
              value: ndots:1 attempts:3 timeout:2

As well. So that's back to the old timeout value, but attempts increased from my original 1 to 3... Let me drop that down to 2 to be as close to the original optimization as I can.

Okay, 2 attempts is also unstable, yielding poor DNS results in 3.9.0.

Also, the results don't make sense to me based on the DNS tuning configs either. We say timeout 2 seconds above, yet when Kong fails it seems to give up waiting on a UDP reply well before 2 seconds?

2025/02/12 07:20:01 [debug] 240#0: *2 [lua] client.lua:618: init(): [dns_client] (re)configuring dns client
2025/02/12 07:20:01 [notice] 240#0: *2 [lua] client.lua:161: new(): [dns_client] supported types: srv ipv4 , context: ngx.timer
2025/02/12 07:20:01 [verbose] preparing nginx prefix directory at /usr/local/kong
2025/02/12 07:20:01 [verbose] generating trusted certs combined file in /usr/local/kong/.ca_combined
2025/02/12 07:20:01 [info] 240#0: *2 [lua] node.lua:303: new(): kong node-id: 7d737539-c9a1-4068-9ebc-e51ff3f89ea7, context: ngx.timer
2025/02/12 07:20:01 [debug] 240#0: *2 [lua] client.lua:550: resolve_all(): [dns_client] resolve_all database_host.co.com:-1
2025/02/12 07:20:01 [debug] 240#0: *2 [lua] client.lua:534: [dns_client] cache miss, try to query database_host.co.com:-1
2025/02/12 07:20:02 [debug] 240#0: *5 [lua] super.lua:161: scaling_log(): [timer-ng] load_avg: 0, runable_jobs_avg: 0, alive_threads_avg: 24
2025/02/12 07:20:02 [debug] 240#0: *2 [lua] client.lua:362: resolve_query(): [dns_client] r:query(database_host.co.com:1) ans:- t:186 ms
2025/02/12 07:20:02 [debug] 240#0: *2 [lua] client.lua:567: resolve_all(): [dns_client] cache lookup database_host.co.com:-1 ans:- hlv:fail
Error: 
/usr/local/share/lua/5.1/kong/cmd/migrations.lua:101: [PostgreSQL error] failed to retrieve PostgreSQL server_version_num: [cosocket] DNS resolution failed: DNS server error: failed to receive reply from UDP server 10.x.xxx.103:53: timeout, took 186 ms. Tried: [["database_host.co.com:A","DNS server error: failed to receive reply from UDP server 10.x.xxx.103:53: timeout, took 186 ms"]]
stack traceback:
        [C]: in function 'assert'
        /usr/local/share/lua/5.1/kong/cmd/migrations.lua:101: in function 'cmd_exec'
        /usr/local/share/lua/5.1/kong/cmd/init.lua:31: in function </usr/local/share/lua/5.1/kong/cmd/init.lua:31>
        [C]: in function 'xpcall'
        /usr/local/share/lua/5.1/kong/cmd/init.lua:31: in function </usr/local/share/lua/5.1/kong/cmd/init.lua:15>
        (command line -e):5: in function 'inline_gen'
        init_worker_by_lua(nginx.conf:274):44: in function <init_worker_by_lua(nginx.conf:274):43>
        [C]: in function 'xpcall'
        init_worker_by_lua(nginx.conf:274):52: in function <init_worker_by_lua(nginx.conf:274):50>

Big emphasis on the DNS server error: failed to receive reply from UDP server 10.x.xxx.103:53: timeout, took 186 ms. Tried: [["database_host.co.com:A","DNS server error: failed to receive reply from UDP server 10.x.xxx.103:53: timeout, took 186 ms"]]

Why is it timing out at 186ms when the DNS client should wait a full 2 seconds before giving up on the query? It seems the Kong DNS client is not respecting the ENV settings. cc @chobits @bungle @Tieske

Edit Edit: I also tried value: ndots:1 attempts:1 timeout:5 and it was super unstable as well. It seems the timeout field is not helping much (back to my thesis that the Kong DNS client isn't really respecting the timeout field anyway).

@bungle (Member) commented Feb 12, 2025

About the logging, it seems like this PR did it:
#13143

@windmgc seems like something got worse while getting something "better". Edit: I am looking at the logging now, as there is something else as well, like reducing the log flood, that I am working on.

@windmgc (Member) commented Feb 12, 2025

I don't feel this is a big deal, as the original kong version command is running with -vv, which means debug, and it is only a one-time output instead of recurring inside a log flow.

@jeremyjpj0916 (Contributor Author) commented Feb 12, 2025

Yeah, as long as the added webserver-looking log statements are not causing anything harmful or extra processing in the background, then that's not really a major concern to me, just a notable difference folks can expect. The extra log output only occurs on -v and -vv, so if it's just extra logging, that's fine.

But the seeming degradation in DNS behavior in Kong, to where I am having to tweak my DNS settings more to achieve resolution, is troubling/buggy given the above analysis and seeming root cause.

@lordgreg commented:

Hi,

we are having (maybe) the exact same issues. At some point, Kong just doesn't want to resolve the Postgres host anymore. What is funny is that even when Kong goes into the CrashLoop and starts again, it doesn't work. The migrations still reply with

failed to get create response: rpc error: code = Unavailable desc = connection error: desc = "error reading server preface: read unix @->/var/run/tw.runc.sock: use of closed network connection"
Error: [PostgreSQL error] failed to retrieve PostgreSQL server_version_num: timeout

and it only starts to work when we completely delete the pod and the deployment recreates it for us again.

@lordgreg commented Feb 13, 2025

Hi @jeremyjpj0916, I will reply on your comment from #14233 (comment) here

> @lordgreg have not noticed that problem specifically just yet but does seem similar or related to the issues I have found. Feels like Kong isn't respecting DNS timeout settings and also for times where dns fails in 0-1ms sometimes in the logs makes me think as a client its trying to reuse stale sockets or something that it shouldn't be. Very strange behavior. Would have thought Kong's functional test suites would have caught something like this but may go deeper than that. Have you tried adding my DNS tunning and setting the attempts to 3? Seems to help some rn with our stuff.

I just added the RES_OPTIONS with the value ndots:1 attempts:3 timeout:2. Keep in mind I don't use any other envars regarding DNS settings.

I will keep you posted. 👍🏻

@jeremyjpj0916 (Contributor Author) commented:

> Hi @jeremyjpj0916, I will reply on your comment from #14233 (comment) here
>
> > @lordgreg have not noticed that problem specifically just yet but does seem similar or related to the issues I have found. Feels like Kong isn't respecting DNS timeout settings and also for times where dns fails in 0-1ms sometimes in the logs makes me think as a client its trying to reuse stale sockets or something that it shouldn't be. Very strange behavior. Would have thought Kong's functional test suites would have caught something like this but may go deeper than that. Have you tried adding my DNS tunning and setting the attempts to 3? Seems to help some rn with our stuff.
>
> I just added the RES_OPTIONS with the value ndots:1 attempts:3 timeout:2. Keep in mind I don't use any other envars regarding DNS settings.
>
> I will keep you posted. 👍🏻

Make sure you check the other settings that have to be done: #13323

@chobits (Contributor) commented Feb 14, 2025

From your testing progress, it seems that this PR's suggestion will fix your problem? #13323

Kong currently ignores all envs. A tricky way is to add this into kong.conf: nginx_main_env = RES_OPTIONS

@chobits (Contributor) commented Feb 14, 2025

Also, for this error: /usr/local/share/lua/5.1/kong/cmd/migrations.lua:101: [PostgreSQL error] failed to retrieve PostgreSQL server_version_num: [cosocket] DNS resolution failed: DNS server error: failed to receive reply from UDP server 10.xx.xx.105:53: timeout, took 153 ms

I think it's unexpected that the timeout occurred after only 153ms, while the default value should be 2s.

See kong/dns/client.lua

local r_opts = {
  retrans = opts.retrans or resolv.options.attempts or 5,
  timeout = opts.timeout or resolv.options.timeout or 2000, -- ms. <<<<< here default value
  no_random = no_random,
  nameservers = nameservers,
}

Hi @jeremyjpj0916, since you can reproduce it with a simple kong command, could you modify Kong's source code and then rerun it, letting us see the DNS query options? I'll give you a patch:

diff --git a/kong/dns/client.lua b/kong/dns/client.lua
index 81a18efeb3..81c66b600b 100644
--- a/kong/dns/client.lua
+++ b/kong/dns/client.lua
@@ -393,6 +393,8 @@ local function resolve_query(self, name, qtype, tries)
     cache_hit = false,
   })

+  ngx.log(ngx.ERR, "xxx resolver r_opts: ", require("inspect")(self.r_opts))
+
   local r, err = resolver:new(self.r_opts)
   if not r then
     return nil, "failed to instantiate the resolver: " .. err

@jeremyjpj0916 (Contributor Author) commented Feb 14, 2025

> From your testing progress, it seems that this pr's suggestion will fix your problem? #13323
>
> Kong currently ignores all envs. a tricky way is to add this into kong.conf: nginx_main_env = RES_OPTIONS,

@chobits we investigated that prior, and yeah, I already ensure I set that up like so. I didn't include it in my original config above with the issue, but here is how I always set it:

    spec:
      containers:
      - env:
        - name: KONG_NGINX_MAIN_ENV
          value: "SPLUNK_HOST; env KONG_SSL_CERT_KEY; env KONG_SSL_CERT_DER; env RES_OPTIONS; env LOCALDOMAIN"

Good idea on the other suggestion though; let me patch my image build and drop in that additional logging. Not sure how you would want to go about tracing cosocket usage misbehavior if that somehow plays a role in some of this beyond just the DNS behavior, but let's try logging the observed DNS settings from the env in that spot of the code.

@jeremyjpj0916 (Contributor Author) commented Feb 14, 2025

@chobits Looks like those settings themselves are persisting as expected:

2025/02/14 07:36:49 [error] 38#0: *1541 [lua] client.lua:349: resolve_query(): DEBUGGING xxx resolver r_opts: {
  nameservers = { { "10.xx.xx.105", 53 }, { "10.xxxx.103", 53 } },
  no_random = true,
  retrans = 3,
  timeout = 2
}, context: ngx.timer

When I have retrans set to 3, Kong DNS seemingly becomes stable again on 3.9.0.

Output of the CLI command with your added logging statements:

ubuntu@kong-75dd5df66-mwwst:/$ kong migrations finish -vv
2025/02/14 07:37:41 [notice] 246#0: using the "epoll" event method
2025/02/14 07:37:41 [notice] 246#0: openresty/1.25.3.2
2025/02/14 07:37:41 [notice] 246#0: OS: Linux 5.14.0-427.50.1.el9_4.x86_64
2025/02/14 07:37:41 [notice] 246#0: getrlimit(RLIMIT_NOFILE): 1048576:1048576
2025/02/14 07:37:41 [debug] 246#0: *2 [lua] globalpatches.lua:10: installing the globalpatches
2025/02/14 07:37:41 [debug] 246#0: *2 [lua] client.lua:621: init(): [dns_client] (re)configuring dns client
2025/02/14 07:37:41 [notice] 246#0: *2 [lua] client.lua:161: new(): [dns_client] supported types: srv ipv4 ipv6 , context: ngx.timer
2025/02/14 07:37:41 [debug] 246#0: *2 [lua] globalpatches.lua:445: randomseed(): seeding PRNG from OpenSSL RAND_bytes()
2025/02/14 07:37:41 [verbose] Kong: 3.9.0
2025/02/14 07:37:41 [debug] ngx_lua: 10026
2025/02/14 07:37:41 [debug] nginx: 1025003
2025/02/14 07:37:41 [debug] Lua: LuaJIT 2.1.0-20231117.1
.....
2025/02/14 07:37:41 [debug] 246#0: *2 [lua] client.lua:621: init(): [dns_client] (re)configuring dns client
2025/02/14 07:37:41 [notice] 246#0: *2 [lua] client.lua:161: new(): [dns_client] supported types: srv ipv4 , context: ngx.timer
2025/02/14 07:37:41 [verbose] preparing nginx prefix directory at /usr/local/kong
2025/02/14 07:37:41 [verbose] generating trusted certs combined file in /usr/local/kong/.ca_combined
2025/02/14 07:37:41 [info] 246#0: *2 [lua] node.lua:303: new(): kong node-id: 8fc7000c-e34c-4fad-92c2-7fe60de9868d, context: ngx.timer
2025/02/14 07:37:41 [debug] 246#0: *2 [lua] client.lua:553: resolve_all(): [dns_client] resolve_all db_server.co.com:-1
2025/02/14 07:37:41 [debug] 246#0: *2 [lua] client.lua:537: [dns_client] cache miss, try to query db_server.co.com:-1
2025/02/14 07:37:41 [error] 246#0: *2 [lua] client.lua:349: resolve_query(): DEBUGGING xxx resolver r_opts: {
  nameservers = { { "10.xx.xx.105", 53 }, { "10.xxxx.103", 53 } },
  no_random = true,
  retrans = 3,
  timeout = 2
}, context: ngx.timer
2025/02/14 07:37:41 [debug] 246#0: *5 [lua] super.lua:161: scaling_log(): [timer-ng] load_avg: 0, runable_jobs_avg: 0, alive_threads_avg: 24
2025/02/14 07:37:41 [debug] 246#0: *2 [lua] client.lua:365: resolve_query(): [dns_client] r:query(db_server.co.com:1) ans:1 t:142 ms
2025/02/14 07:37:41 [debug] 246#0: *2 [lua] client.lua:332: process_answers(): [dns_client] processed ans:1
2025/02/14 07:37:41 [debug] 246#0: *2 [lua] client.lua:570: resolve_all(): [dns_client] cache lookup db_server.co.com:-1 ans:1 hlv:miss
2025/02/14 07:37:41 [debug] loading subsystems migrations...
2025/02/14 07:37:41 [verbose] retrieving database schema state...
2025/02/14 07:37:41 [debug] 246#0: *2 [lua] client.lua:553: resolve_all(): [dns_client] resolve_all db_server.co.com:-1
2025/02/14 07:37:41 [debug] 246#0: *2 [lua] client.lua:570: resolve_all(): [dns_client] cache lookup db_server.co.com:-1 ans:1 hlv:hit_lru
2025/02/14 07:37:41 [verbose] schema state retrieved
2025/02/14 07:37:41 [debug] 246#0: *2 [lua] client.lua:553: resolve_all(): [dns_client] resolve_all db_server.co.com:-1
2025/02/14 07:37:41 [debug] 246#0: *2 [lua] client.lua:570: resolve_all(): [dns_client] cache lookup db_server.co.com:-1 ans:1 hlv:hit_lru
2025/02/14 07:37:41 [debug] loading subsystems migrations...
2025/02/14 07:37:41 [verbose] retrieving database schema state...
2025/02/14 07:37:41 [debug] 246#0: *2 [lua] client.lua:553: resolve_all(): [dns_client] resolve_all db_server.co.com:-1
2025/02/14 07:37:41 [debug] 246#0: *2 [lua] client.lua:570: resolve_all(): [dns_client] cache lookup db_server.co.com:-1 ans:1 hlv:hit_lru
2025/02/14 07:37:41 [verbose] schema state retrieved
2025/02/14 07:37:41 [info] No pending migrations to finish
2025/02/14 07:37:41 [debug] 246#0: *2 [lua] client.lua:553: resolve_all(): [dns_client] resolve_all db_server.co.com:-1
2025/02/14 07:37:41 [debug] 246#0: *2 [lua] client.lua:570: resolve_all(): [dns_client] cache lookup db_server.co.com:-1 ans:1 hlv:hit_lru
2025/02/14 07:37:41 [notice] 246#0: gracefully shutting down
2025/02/14 07:37:42 [notice] 246#0: exiting
2025/02/14 07:37:42 [notice] 246#0: exit

Now the same output with me reducing the retransmits back to 1, the value the 3.7.1 Kong version used to run stable with for DNS, but on 3.9.0 below:

2025/02/14 07:43:35 [error] 38#0: *1298 [lua] client.lua:349: resolve_query(): DEBUGGING xxx resolver r_opts: {
  nameservers = { { "10.xx.xx.105", 53 }, { "10.xxxx.103", 53 } },
  no_random = true,
  retrans = 1,
  timeout = 2
}, context: ngx.timer
2025/02/14 07:43:35 [notice] 38#0: *1298 [lua] warmup.lua:80: finished warming up DNS entries' into the cache (in 62ms), context: ngx.timer

And now let's run the migrations CLI command and see the instability logs on 3.9.0 as well:

2025/02/14 07:44:53 [info] 200#0: *2 [lua] node.lua:303: new(): kong node-id: d7b8bb3d-6b7c-4d7b-ae3d-f2eca4476d9e, context: ngx.timer
2025/02/14 07:44:53 [debug] 200#0: *2 [lua] client.lua:553: resolve_all(): [dns_client] resolve_all db_server.co.com:-1
2025/02/14 07:44:53 [debug] 200#0: *2 [lua] client.lua:537: [dns_client] cache miss, try to query db_server.co.com:-1
2025/02/14 07:44:53 [error] 200#0: *2 [lua] client.lua:349: resolve_query(): DEBUGGING xxx resolver r_opts: {
  nameservers = { { "10.xx.xx.105", 53 }, { "10.xxxx.103", 53 } },
  no_random = true,
  retrans = 1,
  timeout = 2
}, context: ngx.timer
2025/02/14 07:44:53 [debug] 200#0: *2 [lua] client.lua:365: resolve_query(): [dns_client] r:query(db_server.co.com:1) ans:- t:144 ms
2025/02/14 07:44:53 [debug] 200#0: *2 [lua] client.lua:570: resolve_all(): [dns_client] cache lookup db_server.co.com:-1 ans:- hlv:fail
Error: 
/usr/local/share/lua/5.1/kong/cmd/migrations.lua:101: [PostgreSQL error] failed to retrieve PostgreSQL server_version_num: [cosocket] DNS resolution failed: DNS server error: failed to receive reply from UDP server 10.xx.xx.105:53: timeout, took 144 ms. Tried: [["db_server.co.com:A","DNS server error: failed to receive reply from UDP server 10.xx.xx.105:53: timeout, took 144 ms"]]
stack traceback:
        [C]: in function 'assert'
        /usr/local/share/lua/5.1/kong/cmd/migrations.lua:101: in function 'cmd_exec'
        /usr/local/share/lua/5.1/kong/cmd/init.lua:31: in function </usr/local/share/lua/5.1/kong/cmd/init.lua:31>
        [C]: in function 'xpcall'
        /usr/local/share/lua/5.1/kong/cmd/init.lua:31: in function </usr/local/share/lua/5.1/kong/cmd/init.lua:15>
        (command line -e):5: in function 'inline_gen'
        init_worker_by_lua(nginx.conf:274):44: in function <init_worker_by_lua(nginx.conf:274):43>
        [C]: in function 'xpcall'
        init_worker_by_lua(nginx.conf:274):52: in function <init_worker_by_lua(nginx.conf:274):50>
ubuntu@kong-6c679d8d44-2f5fs:/$ 

Can see the "timeout" took 144ms and supposedly was a timeout with respect to DNS. I think while its weird its not respecting the timeout for dns lookup supposedly a bigger issue Kong users OSS/Enterprise may see if on version > 3.7.1 is that they have to configure their DNS attempts > 1 to like 3 in our case to get stability, and understanding why from 3.7.1 to 3.8-3.9 versions that is happening will be critical to squashing what feels like some kinda bug to me. The timeout bit not behaving to the config is interesting too though but note the only difference above as I redeployed as adjusting my dns attempts param.

Edit: it is kind of weird that Kong waits a seemingly arbitrary 144ms in the flaky attempts=1 runtime pod, while in the attempts=3 runtime pod we see it resolve in 142ms:

2025/02/14 07:37:41 [debug] 246#0: *2 [lua] client.lua:365: resolve_query(): [dns_client] r:query(db_server.co.com:1) ans:1 t:142 ms

Why is the 144ms so arbitrary? What is tripping Kong to stop waiting at that point in time... weird.

And for further insight, I ran nslookup against that DB host in the same flaky Kong pod with attempts left at 1, and I see no stability issues with lookups using the DNS nameserver Kong uses; I did it 20 times with the same result:

ubuntu@kong-6c679d8d44-2f5fs:/$ nslookup db_server.co.com 10.xx.xx.105
Server:         10.xx.xx.105
Address:        10.xx.xx.105#53

Name:   db_server.co.com
Address: 10.xxx.xxx.19

Does Kong perhaps have a pretty robust functional sandbox with lots of DNS records and backend APIs where you could set up an attempts = 1 override and see if you can reproduce oddities in talking to your own networked Kong Postgres via the Kong migrations CLI commands, and Kong plugins failing to resolve DNS records? Then you could test Kong 3.7.1 in the same kind of environment and should see things stable even at attempts 1.

Noting my earlier examples further up where even an OIDC plugin of ours failed to resolve a hostname at runtime using the common lua-resty-http libs, like we all typically do. I suppose that does not even use your specific Kong DNS library to resolve the given host, so could the issue also be elsewhere in Kong, such that plugins that just do HTTP networking requiring a DNS lookup can get hit too? (A rough sketch of that plugin-side call pattern follows the log line below.)

2025/02/07 18:16:34 [error] 40#0: *4658 [lua] openidc.lua:511: openidc_discover(): Could not get JSON discovery from cache or lookup accessing url (https://identity_provider.company.com/.well-known/openid-configuration) failed: [cosocket] DNS resolution failed: DNS server error: failed to receive reply from UDP server 10.xxx.xx.249:53: timeout, took 1 ms. Tried: [["identity_provider.company.com:A","DNS server error: failed to receive reply from UDP server 10.xxx.xx.249:53: timeout, took 1 ms"]],
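
For context on how a plugin ends up on this path, below is a rough standalone sketch (not the plugin's actual code) of the kind of lua-resty-http call the OIDC plugin makes; the discovery URL mirrors the log line above. The key point is that the hostname lookup for the underlying cosocket connect appears, per the "[cosocket]" prefix in the error, to go through Kong's patched DNS resolution rather than the system resolver, so plugins see the same timeouts as the core:

local http = require "resty.http"

local httpc = http.new()
httpc:set_timeout(5000)  -- overall socket timeout in ms, separate from the DNS timeout

-- The hostname here is resolved when the cosocket behind request_uri() connects,
-- which is where the "[cosocket] DNS resolution failed" error above originates.
local res, err = httpc:request_uri(
  "https://identity_provider.company.com/.well-known/openid-configuration",
  { method = "GET", ssl_verify = true }
)
if not res then
  ngx.log(ngx.ERR, "discovery request failed: ", err)
end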

@jeremyjpj0916 (Contributor Author) commented Feb 14, 2025

Actually @chobits, is there a problem in your settings table? I see you note this here:

timeout = opts.timeout or resolv.options.timeout or 2000, -- ms

Yet the table has:

2025/02/14 07:36:49 [error] 38#0: *1541 [lua] client.lua:349: resolve_query(): DEBUGGING xxx resolver r_opts: {
  nameservers = { { "10.xx.xx.105", 53 }, { "10.xxxx.103", 53 } },
  no_random = true,
  retrans = 3,
  timeout = 2
}, context: ngx.timer

Does that mean your timeout is only waiting 2ms based on the Lua data, or is that -- ms comment invalid and should say -- seconds? Because RES_OPTIONS is actually in seconds per the RFC and its design, not ms, so when you put 2 there it should translate to 2 seconds, not 2 milliseconds. So maybe that's the big issue here: y'all are not multiplying the RES_OPTIONS value by 1000?

Edit: let me hardcode that value to 2000, just for testing in my env, and see if that helps. Gonna try:

timeout = 2000, -- ms

In the dns client

[image]
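
To make the suspected unit mismatch concrete, here is a minimal standalone sketch (not Kong's actual code; the nameserver and hostname are placeholders) of how lua-resty-dns interprets the timeout option; it has to run inside an OpenResty request or timer context:

local resolver = require "resty.dns.resolver"

local resolv_timeout_s = 2  -- value parsed from RES_OPTIONS / resolv.conf, in seconds

-- lua-resty-dns expects `timeout` in milliseconds, so passing the
-- resolv.conf seconds value straight through yields a ~2 ms timeout.
local r, err = resolver:new{
  nameservers = { "10.0.0.53" },          -- placeholder nameserver
  retrans     = 1,
  timeout     = resolv_timeout_s,         -- bug: treated as 2 ms
  -- timeout  = resolv_timeout_s * 1000,  -- fix: convert seconds to milliseconds
}
if not r then
  error("failed to instantiate the resolver: " .. err)
end

local answers, qerr = r:query("db.example.internal", { qtype = r.TYPE_A })

With the seconds value passed straight through, any response slower than a couple of milliseconds gets reported as a timeout, which lines up with the 0-4ms "timeout, took X ms" errors earlier in this thread.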

@jeremyjpj0916 (Contributor Author) commented Feb 14, 2025

@chobits @bungle @Tieske AYYYY, that's it!! Retransmits at 1 AND hardcoding the right timeout in the code, to match what I wanted with RES_OPTIONS timeout 2 = 2 seconds = 2000ms, fixed it!

2025/02/14 08:32:42 [error] 172#0: *2 [lua] client.lua:349: resolve_query(): DEBUGGING xxx resolver r_opts: {
  nameservers = { { "10.xx.xx.xxx", 53 }, { "10.x.xxx.xxx", 53 } },
  no_random = true,
  retrans = 1,
  timeout = 2000
}, context: ngx.timer

You need to update the Kong code to something like this for 3.8 and 3.9 as a hotfix backport to help folks out:

timeout = (opts.timeout and opts.timeout * 1000) or (resolv.options.timeout and resolv.options.timeout * 1000) or 2000, -- ms

Otherwise your DNS client is going to continue to be very impatient with nameservers with 1-2-3-4-5 ms waits haha.

Btw, is Kong hiring for fun gateway debugging and analysis roles like this, or support, or just enhancing core+plugins? It would be fun to get to play around with stuff like this regularly and solve problems. A $210k base, remote work from my state of South Carolina, flexible hours, and good work-life balance and I think I would be sold, haha.

Edit: Kind of interesting, looking back at the 144ms timeouts when the code was only allowing 2ms for DNS... Is that just the rest of the time Kong spends doing things before it wraps up and concludes 144ms passed, when really only 2ms was allowed for that DNS lookup? Not totally sure, but regardless I am happy to have found the root cause!

@jeremyjpj0916 (Contributor Author) commented Feb 18, 2025

Didn't want to raise a new ticket, so wanted to give an update after rolling out said fix to our larger customer-facing environments and seeing what seem to be positive results. I did want to bring up something I notice from hot-path traffic logs in some scenarios, though; it's just an odd way to log it, with an empty table like that:

2025/02/18 08:06:46 [error] 39#0: *408892 [lua] init.lua:381: execute(): DNS resolution failed: dns server error: 3 name error. Tried: [["some.domain.com:A","dns server error: 3 name error"]], client: 10.29.169.210, server: kong, request: "POST /api/dev/ext/bpa.aai/service/v1 HTTP/1.1", host: "gateway_stage.com", request_id: "44bdef22cb28eb0d37ae287b2627942f"
2025/02/18 08:06:53 [error] 39#0: *409472 [lua] init.lua:381: execute(): DNS resolution failed: dns server error: 3 name error. Tried: {}, client: 10.29.169.204, server: kong, request: "POST/api/dev/ext/bpa.aai/service/v1 HTTP/1.1", host: "gateway_stage.com", request_id: "53442d451de23306657e6a6dfabdeb08"

I am going to go out on a limb and assume Tried: {} is empty because this is the same pod where, 7 seconds later on the same worker process, a transaction came through and the DNS resolution had already failed previously, so the lookup hit the cached DNS error; that's why the Tried: {} in the subsequent transaction was empty: it had a cache hit on the prior DNS lookup failure. Worth noting that some.domain.com does not exist in this example, so the first log makes sense to me, and the second log also somewhat makes sense by behavior, but I would prefer Kong's logging to differentiate a DNS cache hit on an error from an empty Tried table logged as a DNS server error.

This is the line that log stems from:

https://github.com/Kong/kong/blob/3.9.0/kong/runloop/balancer/init.lua#L381

How about changing it to something like this, to make it clearer that the second attempt failed due to a cached failure, and, instead of printing an empty Tried block, at least give a hostname reference to fall back on?

    hostname = balancer_data.host
    if not ip then
      if next(try_list) == nil then
        log(ERR, "DNS cache indicates a previous resolution failure: ", port, ". Hostname: ", hostname) --empty table scenario, resolution fail cache hit
      else
        log(ERR, "DNS resolution failed: ", port, ". Tried: ", tostring(try_list))
      end
      if port == "dns server error: 3 name error" or
         port == "dns client error: 101 empty record received" then
        return nil, "name resolution failed", 503
      end
    end

I would say think on it; it would be a bit cleaner and clearer.

Best,
Jeremy

@jeremyjpj0916 (Contributor Author) commented Feb 18, 2025

Other positives I would like to note from the DNS improvements, with the fixes in mind, can be seen very easily going from the older 3.7.1 to 3.9.0 Kong:

10x improvement in DNS throughput and a 70% reduction in DNS CPU utilization (the drop is post cutover to 3.9.0):

[image]

[image]

Big reduction in DNS response times due to the lower load:

[image]

Definitely worth a shout-out in y'all's releases to note the benefits customers will see when using the newer DNS settings in Kong. These are the kinds of things I like to see software prioritize: getting the underlying pieces done really well helps the whole ecosystem. Bravo on the majority of the DNS revamp, minus the one bug we just squashed in this thread.
