Kong 3.9.0 bugs, seemingly DNS/cosocket-related issues #14249
Okay, this is strange... I left the 3.9.0 pod up and reran the exact same CLI command, just with more verbosity enabled, and got a response, meaning DB connectivity was established. I'm going to skip some of the output since it's really long and keep just the good stuff: the 3.9.0 CLI run a second time, again in -vv mode.
Why would running the same command a second time work when the first time failed in the same pod? Let me see if I can reproduce it when I do a fresh rolling redeploy and try the same CLI actions multiple times.
Okay, yeah, something is really strange here. From the same pod, executing migrations up on a DB that's already up to date:
Same error executing the same command 4 times in a row, then the 5th time it worked on the same pod. Wild stuff:
Next up, let me test with the new DNS feature turned off.
Testing these CLI commands now on 3.9.0 with:
So I notice Kong startup is much slower again with the old DNS client vs the new:
Whereas the new DNS warmup times look like:
Huge difference there, so bravo Kong on that improvement: 5767 ms vs 68 ms. But... going back to the new DNS client toggled off still shows the same errant behavior, with the CLI being flaky and the oddities around checking the Kong version; no difference there... So I'm not totally sure the new DNS client toggle is the problem in this newer version of Kong for the woes observed.
Okay... it seems the regular Kong runtime is impacted too. I'm seeing way more DNS-related errors than I ever have before with Kong. Examples:
Notice how odd it is that the first upstreams fetch failed with DNS but subsequent requests seemed to work fine? Flaky Kong behavior here. Other examples I am seeing:
And this one involves an OIDC plugin also trying to do a lookup and failing, just using the regular lua-resty HTTP libs:
What's interesting to me is to look at the DNS timeout amounts too. Why these 0 and 1 ms timeouts? It's totally unreasonable to expect networking to always be sub-1 ms. Most DNS queries take a little longer than 1 ms for us, sometimes 4-5 ms or so. Why is the timeout seemingly set so low at 0-1 ms? I do see the earlier DNS error with the DB did wait 4 ms before timing out, but even that's aggressive. I would like at least a 1 second timeout for any DNS query to give things a chance. And I did test running nslookup within the same pod against the same .249 nameserver Kong is using, with never any issues or errors after 20 executions of it:
Plus my DNS RES_OPTIONS were at least set to:
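(The exact string isn't reproduced here; reconstructing it from the 2-second timeout and single-attempt values discussed below, it would have been along these lines:)

```
# resolv.conf(5) options passed via environment; note timeout is in SECONDS here
RES_OPTIONS="ndots:1 attempts:1 timeout:2"
```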
Which should be giving me a 2 second timeout, not these 0/1/4 ms timeout responses. Is something causing cosocket clashing issues, or is something within the source client node failing to use its own cosockets correctly and bouncing things back quickly?
This won't be the issue, I don't think now; edited this away.
Other things we know did vs. did not change between Kong 3.7.1 and 3.9.0:

2025/02/07 19:49:20 [verbose] Kong: 3.7.1
2025/02/07 19:48:43 [verbose] Kong: 3.9.0

So the same versions of ngx_lua and the nginx webserver. LuaJIT got a *.1 minor patch, it looks like; not sure if that would play any role, but I assume not. So the issue isn't in ngx_lua or nginx webserver land, since those all stayed the same. Wonder if this newer PR could help?:
AH HA! I am getting closer to the real issue. Before, this environment variable setup always gave us the best DNS performance with Kong in any given environment:
The above always worked fine on older Kongs, all the way up to 3.7.1, it seems. What changed from 3.7.1 to 3.8.0/3.9.0 that caused the differing behavior, hmm? I have specifically isolated that the issues arise when, of the above DNS optimizations, this ENV variable setup is present:
(I initially removed both, then added the LOCALDOMAIN one back, and things were still stable with DNS afterward, so it's not that one.) Edit: Worth noting for those unfamiliar with Kong and tuning DNS that you have to expose those variables in the NGINX conf too: #13323
Okay, updating to this is seemingly stable now on 3.9.0 too:
Notice my increase in attempts and higher timeout... let me try changing just one of those from the original values above to see if we can better pinpoint which field it is that newer Kong behaves poorly with. Okay... it's even stable with:
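(Reconstructing from the description that follows — back to the original 2-second timeout but with attempts raised to 3 — the stable variant would look something like this; the exact string is an assumption:)

```
RES_OPTIONS="ndots:1 attempts:3 timeout:2"
```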
That works as well, so that's back to the old timeout value but with attempts increased from 1 to 3 versus my original... let me drop that down to 2 to be as close to the original optimizations as I can. Okay, 2 attempts is also unstable and yields poor DNS results in 3.9.0. Also, the results don't make sense to me based on the DNS tuning configs either. We say timeout 2 seconds above, yet when Kong fails, it seems to give up waiting on a UDP reply well before 2 seconds?
Big emphasis on the DNS server error:

failed to receive reply from UDP server 10.x.xxx.103:53: timeout, took 186 ms. Tried: [["database_host.co.com:A","DNS server error: failed to receive reply from UDP server 10.x.xxx.103:53: timeout, took 186 ms"]]

Why is it timing out at 186 ms when the DNS client should wait a full 2 seconds before giving up on the query? It seems the Kong DNS client is not respecting the ENV settings. cc @chobits @bungle @Tieske

Edit: I also tried value: ndots:1 attempts:1 timeout:5 and it was super unstable as well. It seems the timeout field is not helping much (back on my thesis that Kong DNS isn't really respecting the timeout field anyway).
I don't feel like this is a big deal, as the original ...
Yeah, as long as the added webserver-looking logging statements are not causing anything harmful or extra processing in the background, that's not really a major concern to me, just a notable difference folks can expect; the extra log output only occurs on -v and -vv, if it is just extra logging. But the seeming degradation in DNS behavior with Kong, to where I am having to tweak my DNS settings more to achieve resolution, is troubling/buggy given the analysis above and the seeming root cause.
Hi, we are having (maybe) the exact same issues. At some point, Kong just doesn't want to resolve the Postgres host anymore. What is funny is that even when Kong goes into a CrashLoop and restarts, it doesn't work. The migrations still reply with:

failed to get create response: rpc error: code = Unavailable desc = connection error: desc = "error reading server preface: read unix @->/var/run/tw.runc.sock: use of closed network connection"

Error: [PostgreSQL error] failed to retrieve PostgreSQL server_version_num: timeout

and it only starts to work again when we completely delete the pod and the deployment recreates it for us.
Hi @jeremyjpj0916, I will reply to your comment from #14233 (comment) here.
I just added the RES_OPTIONS with that value. I will keep you posted. 👍🏻
Make sure you check the other settings that have to be done: #13323
From your testing progress, it seems that this PR's suggestion will fix your problem? #13323 Kong currently ignores all envs; a tricky way is to add this into kong.conf:
Also, for this error I think it's unexpected that the timeout occurred at only 153 ms, while the default value should be 2 s. See:
Hi @jeremyjpj0916, since you can reproduce it with a simple kong command, could you modify Kong's source code and then rerun it, letting us see the queried DNS options? I'll give you a patch:
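(The original patch isn't reproduced here; a minimal sketch of that kind of debug logging, with the placement and variable name assumed, would be:)

```lua
-- hypothetical placement: inside the DNS client's initialization, right after
-- the resolv.conf / RES_OPTIONS values have been parsed into `resolv` (name assumed)
ngx.log(ngx.NOTICE, "dns resolv options: ", require("cjson").encode(resolv.options))
```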
@chobits we investigated that previously, and yeah, I already make sure I set that up like so. I didn't include it in my original config above with the issue, but I will show you how I always set it here:
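(A minimal sketch of that kind of setup, assuming the standard injected-directive mechanism: injecting an `env` directive into nginx's main block so the variable set on the container actually reaches the workers. LOCALDOMAIN needs the same treatment; see #13323 for the full discussion.)

```
# kong.conf: renders `env RES_OPTIONS;` in nginx's main block so worker
# processes can see the resolver tuning variable set on the pod
nginx_main_env = RES_OPTIONS
```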
Good idea on the other suggestion, though; let me patch my image build and drop in that additional logging. Not sure how you would want to go about tracing cosocket usage misbehavior, if that somehow plays a role in some of this beyond just the DNS behavior, but let's try logging the observed DNS settings from the env in that spot of the code.
@chobits Looks like those settings themselves are persisting as expected:
When I have retrans set to 3, Kong DNS seemingly becomes stable again on 3.9.0. Output of the CLI command with your added logging statements:
Now the same output with the retransmits reduced back to 1 request, like the 3.7.1 Kong version used to run stably with, but on 3.9.0 below:
And now let's run the migrations CLI command and see the instability logs on 3.9.0 as well:
You can see the "timeout" took 144 ms and supposedly was a timeout with respect to DNS. While it's weird that it's not respecting the timeout for DNS lookups, I think a bigger issue Kong users (OSS/Enterprise) may see on versions > 3.7.1 is that they have to configure their DNS attempts > 1 (to 3 in our case) to get stability, and understanding why that happens going from 3.7.1 to the 3.8-3.9 versions will be critical to squashing what feels like some kind of bug to me. The timeout not behaving per the config is interesting too, but note the only difference above as I redeployed was adjusting my DNS attempts param. Edit: it is kind of weird that Kong waits a seemingly arbitrary 144 ms in the flaky attempts=1 runtime pod, while in the attempts=3 runtime pod we see it resolves in 142 ms:
Why is the 144 ms so arbitrary? What is tripping Kong to stop waiting at that point in time, eh... weird. And for further insight, I ran nslookup against that DB host too, in the same flaky Kong pod with attempts left at 1, and see no stability issues with lookups using the DNS nameserver Kong uses; I did it 20 times with the same result:
Does Kong have a pretty robust functional sandbox, perhaps with lots of DNS records and backend APIs, where you can set up an attempts = 1 override and see if you can reproduce the oddities in talking to your own networked Kong Postgres via the Kong migrations CLI commands, and Kong plugins failing to resolve DNS records? Then you can test Kong 3.7.1 in the same kind of environment and should see things stable even at attempts = 1. Note my earlier examples way up where even an OIDC plugin of ours failed to resolve a hostname at runtime using the common resty HTTP libs, like we all typically do, which I suppose does not even use your specific Kong DNS library to resolve that given host. So could the issue be elsewhere in Kong as well, such that plugins that just do HTTP networking requiring a DNS lookup can get hit too?:
Actually @chobits is there a problem in your settings table? I see you note this here:
Yet the table has:
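(Roughly, the table in question looked like this — reconstructed from the surrounding discussion, with the field layout assumed:)

```lua
options = {
  ndots = 1,
  attempts = 1,
  timeout = 2,   -- ms  <-- the suspicious comment; RES_OPTIONS specifies seconds
}
```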
Does that mean your timeout is only waiting 2 ms based on the Lua data, or is that -- ms comment invalid and should say -- seconds? Because RES_OPTIONS is actually in seconds per resolv.conf semantics and its design, not ms, so when you put 2 there it should translate to 2 seconds, not 2 milliseconds. So maybe that's the big issue here: y'all are not multiplying the RES_OPTIONS value by 1000? Edit: let me hardcode that value to 2000 just for testing in my env and see if that helps. Gonna try:
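(i.e., something along these lines; the exact variable and field names are assumptions on my part:)

```lua
-- temporary test hack: force the timeout that gets handed to
-- resty.dns.resolver to 2000 ms instead of the raw RES_OPTIONS value
r_opts.timeout = 2000
```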
In the DNS client.
@chobits @bungle @Tieske AYYYY, that's it!! Retransmits at 1 AND hardcoding the right timeout in the code to match what I wanted with RES_OPTIONS (timeout 2 = 2 seconds = 2000 ms in your code) fixed it!
You need to update the Kong code to this for 3.8 and 3.9 as a hotfix backport to help folks out:
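(A sketch of what that fix would look like — the surrounding variable names are assumed, but the point is the unit conversion before the value reaches resty.dns.resolver, which expects milliseconds:)

```lua
-- resolv.conf / RES_OPTIONS express `timeout` in seconds, while
-- resty.dns.resolver expects milliseconds, so convert before building
-- the resolver options
local r_opts = {
  retrans = resolv.options.attempts or 2,
  timeout = (resolv.options.timeout or 2) * 1000,  -- seconds -> ms
}
```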
Otherwise your DNS client is going to continue to be very impatient with nameservers, with 1-2-3-4-5 ms waits, haha. Btw, is Kong hiring for fun gateway debugging and analysis roles like this, or support, or just enhancing core+plugins? It would be fun to play around with stuff like this regularly and solve problems; $210k base and working remote in my state of South Carolina and I think I would be sold, if hours are flexible with good work-life balance, haha. Edit: kind of interesting, going back to the 144 ms timeouts when the code is applying a 2 ms DNS timeout... is that just the rest of the Kong time spent doing things before it actually wraps up and concludes 144 ms passed, when really only 2 ms was allowed for that DNS lookup? Not totally sure, but regardless I am happy to have found the root cause!
Didn't want to raise a new ticket, so I wanted to update after rolling out said fix to our larger customer-facing environments and seeing what seem to be positive results. I did want to bring up something I notice in hot-path traffic logs in some scenarios, though; it's just an odd way to log it, with an empty table like that:
I am going to go out on a limb and assume Tried: {} is empty because this is in the same pod where, 7 seconds later on the same worker process, a tx came through and the DNS resolution had already failed prior, so the DNS lookup error was found in cache; that's why the Tried: {} in the subsequent tx was empty, because it had a cache hit on the prior DNS lookup error? Worth noting that some.domain.com does not exist in this example, so the first log makes sense to me. The second log also somewhat makes sense to me by behavior, but I would prefer Kong's logging to differentiate a DNS cache hit on error from just an empty Tried table with a DNS server error. This line is where that log stems from: https://github.com/Kong/kong/blob/3.9.0/kong/runloop/balancer/init.lua#L381 How about changing it to something like the sketch below, to make it clear on the second go that it failed due to a cached failure, and instead of an empty Tried block at least give a hostname reference key to fall back on?
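(A rough sketch of what I mean — the helper names like log/ERR and the hostname variable are assumed from the surrounding balancer code:)

```lua
if tried and next(tried) then
  log(ERR, "DNS resolution failed: ", err, ". Tried: ", tostring(tried))
else
  -- an empty `tried` table means the failure came from a cached DNS error,
  -- so say so and include the hostname instead of printing an empty table
  log(ERR, "DNS resolution failed (cached failure) for host '",
      hostname, "': ", err)
end
```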
I would say think on it; it would be a bit cleaner and clearer. Best,
Other positives I would like to note from the DNS improvements (with the fixes in mind) can be seen very easily going from older 3.7.1 to 3.9.0 Kong: a 10x improvement in DNS throughput and a 70% reduction in DNS CPU utilization (the drop is post-cutover to 3.9.0), plus a big reduction in DNS response times due to the lower load. Definitely worth a shout-out in y'all's releases to note the benefits customers will see when using the newer DNS settings in Kong. These are the kinds of things I like to see software prioritize; getting the underlying pieces done really well helps the whole ecosystem. Bravo on the majority of the DNS revamp, minus the one bug we just squashed in this thread.
Is there an existing issue for this?
Kong version ($ kong version)
Kong 3.9.0
Current Behavior
I wanted to walk through some observed issues after cutting over from Kong 3.7.1 traditional mode to 3.9.0 traditional mode. I am also turning the new Kong DNS config on, as well as leaving the same old Kong DNS env configs in place.
Ex:
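(Roughly this shape — the exact values aren't reproduced here, and the new-client toggle name is my assumption of the 3.8+ kong.conf property:)

```
# new DNS client toggle (kong.conf property; name assumed to be new_dns_client)
new_dns_client = on

# pre-existing DNS tuning env vars left in place on the pod (values illustrative)
RES_OPTIONS="ndots:1 attempts:1 timeout:2"
LOCALDOMAIN=...   # value not shown here
```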
First observation: the migrations up does make some changes, while the migrations finish command seems to find no changes. Assuming that's to be expected.
Some background: I spin up an HCC K8S Kong job on the new version against a cloned copy of the older Kong PG database to perform the upgrades out of place from our live database. Then I cut over our runtime Kong nodes to the newer database when I see that the migrations up and finish were successful.
Now that I have Kong 3.9.0 nodes running in place of the Kong 3.7.1 nodes, the following bug observations can be made:
My pod running the newer version of Kong has a lot of CLI commands that either fail or behave in ways they should not:
First, let's observe checking the Kong version:
Old 3.7.1:
New 3.9.0:
It almost feels like the newer CLI is trying to spin up a totally separate nginx main worker process, with init() logging near the top, to print that stuff out and exit? That just seems odd/wrong for a CLI command to need to do, when it should just print out the relevant values. And it might be a helpful guide into some of the other issues being observed if we understood why it's doing that.
Next issue:
On my runtime 3.9 pods, it seems the terminal-session CLI can't reach the DB properly like the startup of my runtime Kong pod could. This is all running from the same pod too, so it's like the startup Kong pod CMD has no issues, but the terminal shelled-in CLI commands lose the ability to reach the DB? Example below, where the DB has already been migrated and set up for each respective version of Kong we run right now, but the newer version bombs out connecting to the DB in the same environment setup.
3.7.1
3.9.0
^ Which is interesting, considering the pod I am running that CLI command from is up and healthily talking to that exact Postgres DB reference using that exact DNS resolver nameserver, lol. How does the CLI process lose the ability to communicate with it from the same active source host?
One thing I noticed is that 3.7.1 to 3.9.0 changes this events section; could this be playing a role? Or is the new DNS implementation somehow bugging things out?
3.7.1:
3.9.0:
Some kind of socket path reference difference that other parts of Kong did not transition to, causing this issue? Not totally sure; just trying to think aloud about what all changed between the minor versions of Kong here, from 3.7.1 to 3.9.0.
cc @bungle Mr. Bungle to the rescue?
Expected Behavior
Steps To Reproduce
Anything else?
No response