Docker image hanging on Cassandra access for 30 minutes (Kubernetes only) #4559
Comments
Hi again, here are the complementary investigations we ran on our side: it looks like the problem is linked to the new DNS query implementation (I'm not saying it is caused by it, just linked to it so far 😋). Indeed, we performed many tests, and our current hypothesis is that there might be an issue in the "network layer setup/establishment" that makes the Kong DNS query to Cassandra block in the new implementation while it did not in the previous one. Here are the tests we ran (we have Wireshark screenshots available if you want):
So, here is our current status: we think there is a problem around the "network layer setup/establishment". This problem was most probably not visible with the former implementation of the DNS query in Kong, but the new implementation done through #4296 and #4378 is now revealing the bug. Where is the bug?

- In Kong? (Well, we don't think it is, but at least the new implementation is revealing the issue as a side effect.)
- In Alpine?
- In Kubernetes? (We have not seen the issue in raw Docker.)
- In the network overlay deployed on Kubernetes? (We have a cluster with Calico, but the other one does not have it.)

For the moment, we don't know… but any idea or any help would be welcome! We are still not comfortable with:
If you have an easy debugging capability, could you try to see on which call Kong is hanging?
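For instance, if the `resty` CLI happens to be available in the image, a quick standalone check with OpenResty's stock `resty.dns.resolver` might tell whether a plain DNS query to the contact point hangs on its own. This is only a sketch: the nameserver address and the `cassandra-0` name below are placeholders to adapt from the container's `/etc/resolv.conf` and your deployment.

```lua
-- dns_check.lua -- run inside the container with: resty dns_check.lua
-- Minimal sketch, not Kong code: it only checks whether a raw DNS query
-- to the Cassandra contact point returns within the expected 2 s / 5 retries.
local resolver = require "resty.dns.resolver"

local r, err = resolver:new{
  nameservers = { "10.96.0.10" },  -- placeholder: use the IP from /etc/resolv.conf
  retrans = 5,                     -- same retry count as the Resolver configuration
  timeout = 2000,                  -- 2000 ms (cosocket semantics)
}
if not r then
  print("failed to create resolver: ", err)
  return
end

local answers, qerr = r:query("cassandra-0")  -- placeholder contact point name
if not answers then
  print("query failed: ", qerr)
  return
end

for _, ans in ipairs(answers) do
  print(ans.name, ans.address or ans.cname, "ttl: " .. tostring(ans.ttl))
end
```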
Episode III… I think we got it!
We have reproduced the exact same behavior explained in the previous posts, but using an “aggressive” Pod which only runs
To sum up these tests:
That's quite a read, thanks for the very thorough investigation / write-up 👍 Can you share the k8s yaml (or Helm chart, etc.) that you're using? (Is it https://github.com/Kong/kong-dist-kubernetes or closely related?)
Hi @hutchic, you are welcome! I confirm what you said in Kong/docker-kong#258 (comment): this is a Cassandra issue only, and on Kubernetes only. Here is my test yaml file:
I made further investigations to locate where the issue could be. It is quite difficult, as the logs are not displayed: the freeze of Kong blocks the display of the logs in the middle of the configuration parameter output, and most probably stdout is not flushed, so none of the logs you could add to check the DNS processing is displayed. I managed it by inserting … In the end, the configuration that is provided to the Resolver looks correct, with the number of retries set to 5 and the timeout set to 2000 ms:
What is strange is that the socket created by the Resolver is supposed to be configured with the right data, 5 retries and a timeout of 2000 milliseconds... but when executed, the timeout is indeed 2000... seconds!!
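As an aside, one way to get traces out despite the missing stdout flush is to append them to a file and close it on each call. This is just a sketch of that idea (the path and messages are made up), not necessarily what was done here:

```lua
-- Minimal file-based trace helper (illustrative only): opening, appending and
-- closing on every call means messages survive even if the process hangs later.
local function trace(...)
  local f = assert(io.open("/tmp/kong-debug.log", "a"))  -- example path
  f:write(os.date("!%Y-%m-%dT%H:%M:%SZ "), table.concat({ ... }, " "), "\n")
  f:close()
end

trace("resolver options:", "retrans=5", "timeout=2000")
```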
Sorry for only seeing this now. I was about to point to the Cassandra connector's override of the TCP/UDP sockets, but I see that your investigation already led you there @pamiel :) You will notice that the DNS resolver instantiated there specifies … Here is the issue though: with cosockets (the OpenResty sockets), the timeout is specified in milliseconds (hence, 2 seconds), but with LuaSocket (since we overrode the sockets to resolve the hostnames in …
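To make the unit mismatch concrete, here is a small illustration (not Kong's actual code) of how the same numeric value is interpreted by the two socket APIs:

```lua
-- LuaSocket is what the CLI path ends up using once the sockets are overridden.
local socket = require "socket"

local udp = socket.udp()
udp:settimeout(2000)        -- LuaSocket: 2000 *seconds*, i.e. roughly 33 minutes

-- In an nginx/OpenResty request context, the equivalent cosocket call would be:
--   local cosock = ngx.socket.udp()
--   cosock:settimeout(2000)  -- cosockets: 2000 *milliseconds*, i.e. 2 seconds

-- Hence a timeout meant as milliseconds must be divided by 1000 before being
-- handed to LuaSocket:
udp:settimeout(2000 / 1000) -- 2 seconds
```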
Note: lua-resty-dns-client may benefit from something like https://github.com/thibaultcha/lua-resty-socket (which is used by lua-cassandra and other driver modules out there). Such a module acts as a compatibility layer between LuaSocket and cosockets, and allows for transparent interoperability (i.e. timeout arguments can be specified as milliseconds, and will be converted to seconds before invoking the underlying LuaSocket …
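The shape of such a compatibility layer could look roughly like the sketch below (a simplification, not lua-resty-socket's actual code): pick cosockets when they are available, otherwise wrap LuaSocket and convert millisecond timeouts to seconds.

```lua
-- Rough sketch of a LuaSocket/cosocket compatibility wrapper for UDP.
-- The phase check is simplified; a real implementation is more careful about
-- which nginx phases actually support cosockets.
local COSOCKET_PHASES = {
  rewrite = true, access = true, content = true, timer = true,
}

local function new_udp()
  if ngx and ngx.get_phase and COSOCKET_PHASES[ngx.get_phase()] then
    return ngx.socket.udp()            -- cosocket: settimeout() takes milliseconds
  end

  local luasocket = require "socket"   -- LuaSocket fallback (e.g. on the CLI)
  local sock = luasocket.udp()

  -- Proxy every method to the LuaSocket object, converting ms -> s for settimeout.
  return setmetatable({}, {
    __index = function(_, method)
      if method == "settimeout" then
        return function(_, ms) return sock:settimeout(ms / 1000) end
      end
      return function(_, ...) return sock[method](sock, ...) end
    end,
  })
end
```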
@thibaultcha, indeed by default my …
Yes, I believe (without testing it myself) specifying this value in …
Hello @pamiel, are you still experiencing this problem, or can we close this issue?
I am closing this due to lack of activity. Please reopen if necessary.
Summary
I’m facing a very strange situation: command lines in my (home-made) Docker image of Kong version 1.0.3 work correctly, but since version 1.1 (tested on 1.1.1 and 1.1.2), the CLI commands are “hanging” at the very beginning of the process for about 30 minutes before moving forward!
This is the case for the `kong start` command as well as for much simpler commands such as `kong migrations list`. This happens when running Kong with Cassandra, but not with Postgres.
I can reproduce it when running on Kubernetes, but not on plain Docker.
I suspect this could be a side effect of the fixes for #4296 and #4378... introduced in Kong 1.1...
Steps To Reproduce
`kong migrations list`
This happens with my own Docker image... not tested so far with the "official" Docker image of Kong produced through docker-kong.
Additional Details & Logs
With logs set to “debug”, hanging usually happens at the end of the display of the Kong configuration… but not exactly at the same location every time.
Then, the process is “hanging” for around 30 minutes (maybe a bit more: 32 to 33 minutes)… and it is finally “unlocked” and continues its processing until the end!! For `kong migrations list`, it just displays the remaining logs:

Looking at the timestamps of the logs, it looks like `resolved Cassandra contact point 'cassandra-0'` is the first log displayed after the hang.

During the blocking situation, jumping into the container, a `ps` shows:

It looks like there is absolutely no activity on process 8!
No log appears inside `/tmp/resty_xOpJBTuZMW/logs/error.log`.
At first, trying to understand the root cause of the blocking situation, I had a look at the file descriptors by running `ls -l /proc/8/fd` (in case this could be a problem with file access):

I honestly don’t know whether the 2 anonymous inodes could be the root cause of the issue!
Then, I looked at the network by running `netstat -a`:
Even more strange: still inside the container, I tried to run the exact same command manually (e.g. `kong migrations list`)… and it succeeds!!! I can run it as many times as I want, run `kong start`, etc., everything is OK… except that the one I really expect to run (i.e. the one automatically launched at Pod start time) is still blocked!

Where does this happen?
Differences with the “official” Docker image made available through the docker-kong project:

- I use `kong start` instead, as I have been doing that for ages :P and I did not encounter any issue with it so far. I however don’t think the current problem is linked to that, as the issue happens with much simpler commands such as `kong migrations list`, which does not start Kong… unless a `kong prepare` shall also be done, even for the simple call of `kong migrations`?
- The prefix there is `/usr/local/kong`, but my prefix is not: it is `/opt/kong/`.