Docker image hanging on Cassandra access for 30 minutes (Kubernetes only) #4559

Closed
pamiel opened this issue Apr 30, 2019 · 10 comments

pamiel commented Apr 30, 2019

Summary

I’m facing a very strange situation: CLI commands in my (home-made) Docker image of Kong 1.0.3 work correctly, but since version 1.1 (tested on 1.1.1 and 1.1.2) they are “hanging” at the very beginning of the process for about 30 minutes before moving forward!
This is the case for kong start as well as for much simpler commands such as kong migrations list.

This happens when running Kong with Cassandra, but not with Postgres.
I can reproduce it when running on Kubernetes, but not on plain Docker.

I suspect this could be a side effect of resolutions of #4296 and #4378... introduced in Kong 1.1...

Steps To Reproduce

  1. On a Kubernetes cluster, run Cassandra version 2.2 or 3.10
  2. Run a Kubernetes Pod of (my!) Docker image of Kong, with a command as simple as kong migrations list
  3. The Pod hangs for the next 30 minutes!

This happens on my own Docker image... not tested so far with the "official" Docker image of Kong produced through docker-kong.

Additional Details & Logs

With logs set to “debug”, the hang usually happens towards the end of the display of the Kong configuration… though not at exactly the same location every time.

2019/04/30 14:31:10 [warn] You are using Cassandra but your 'db_update_propagation' setting is set to '0' (default). Due to the distributed nature of Cassandra, you should increase this value.
2019/04/30 14:31:10 [verbose] Kong: 1.1.2
2019/04/30 14:31:10 [debug] ngx_lua: 10013
2019/04/30 14:31:10 [debug] nginx: 1013006
2019/04/30 14:31:10 [debug] Lua: LuaJIT 2.1.0-beta3
2019/04/30 14:31:10 [verbose] reading config file at /opt/kong/etc/kong.conf
2019/04/30 14:31:10 [debug] reading environment variables
2019/04/30 14:31:10 [debug] KONG_PREFIX ENV found with "/opt/kong/var"
2019/04/30 14:31:10 [debug] KONG_DATABASE ENV found with "cassandra"
2019/04/30 14:31:10 [debug] KONG_CASSANDRA_CONTACT_POINTS ENV found with "cassandra-0"
2019/04/30 14:31:10 [debug] KONG_LOG_LEVEL ENV found with "info"
2019/04/30 14:31:10 [debug] KONG_CASSANDRA_KEYSPACE ENV found with "kong"
2019/04/30 14:31:10 [debug] KONG_NGINX_DAEMON ENV found with "off"
2019/04/30 14:31:10 [debug] admin_access_log = "logs/admin_access.log"
2019/04/30 14:31:10 [debug] admin_error_log = "logs/error.log"
2019/04/30 14:31:10 [debug] admin_listen = {"127.0.0.1:8001","0.0.0.0:8444 ssl"}
2019/04/30 14:31:10 [debug] admin_ssl_cert = "/opt/kong/etc/certs/changeme-kong_kong_com-super-bundle.pem"
2019/04/30 14:31:10 [debug] admin_ssl_cert_key = "/opt/kong/etc/certs/changeme-kong_kong_com-key.pem"
2019/04/30 14:31:10 [debug] anonymous_reports = false
2019/04/30 14:31:10 [debug] cassandra_consistency = "LOCAL_QUORUM"
2019/04/30 14:31:10 [debug] cassandra_contact_points = {"cassandra-0"}
2019/04/30 14:31:10 [debug] cassandra_data_centers = {"dc1:1"}
2019/04/30 14:31:10 [debug] cassandra_keyspace = "kong"
2019/04/30 14:31:10 [debug] cassandra_lb_policy = "RequestRoundRobin"
2019/04/30 14:31:10 [debug] cassandra_local_datacenter = "dc1"
2019/04/30 14:31:10 [debug] cassandra_port = 9042
2019/04/30 14:31:10 [debug] cassandra_repl_factor = 1
2019/04/30 14:31:10 [debug] cassandra_repl_strategy = "NetworkTopologyStrategy"
2019/04/30 14:31:10 [debug] cassandra_schema_consensus_timeout = 120000
2019/04/30 14:31:10 [debug] cassandra_ssl = false
2019/04/30 14:31:10 [debug] cassandra_ssl_verify = false
2019/04/30 14:31:10 [debug] cassandra_timeout = 5000
2019/04/30 14:31:10 [debug] cassandra_username = "kong"
2019/04/30 14:31:10 [debug] client_body_buffer_size = "8k"
2019/04/30 14:31:10 [debug] client_max_body_size = "0"
2019/04/30 14:31:10 [debug] client_ssl = false
2019/04/30 14:31:10 [debug] database = "cassandra"
2019/04/30 14:31:10 [debug] db_cache_ttl = 0
2019/04/30 14:31:10 [debug] db_resurrect_ttl = 30
2019/04/30 14:31:10 [debug] db_update_frequency = 5
2019/04/30 14:31:10 [debug] db_update_propagation = 0
2019/04/30 14:31:10 [debug] declarative_config = "/opt/kong/etc/endpoints/kong-endpoints.yaml"
2019/04/30 14:31:10 [debug] dns_error_ttl = 1
2019/04/30 14:31:10 [debug] dns_hostsfile = "/etc/hosts"
2019/04/30 14:31:10 [debug] dns_no_sync = false
2019/04/30 14:31:10 [debug] dns_not_found_ttl = 30
2019/04/30 14:31:10 [debug] dns_order = {"LAST","SRV","A","CNAME"}
2019/04/30 14:31:10 [debug] dns_resolver = {}
2019/04/30 14:31:10 [debug] dns_stale_ttl = 4
2019/04/30 14:31:10 [debug] error_default_type = "application/json"
2019/04/30 14:31:10 [debug] headers = {"server_tokens","latency_tokens"}
2019/04/30 14:31:10 [debug] log_level = "info"
2019/04/30 14:31:10 [debug] lua_package_cpath = ""
2019/04/30 14:31:10 [debug] lua_package_path = "/opt/kong/lib/?.lua;/opt/kong/lib/?/init.lua;"
2019/04/30 14:31:10 [debug] lua_socket_pool_size = 30
2019/04/30 14:31:10 [debug] lua_ssl_verify_depth = 1
2019/04/30 14:31:10 [debug] mem_cache_size = "128m"
2019/04/30 14:31:10 [debug] nginx_admin_directives = {{value="/opt/kong/lib/my_nginx_admin.conf",name="include"}}
2019/04/30 14:31:10 [debug] nginx_admin_include = "/opt/kong/lib/my_nginx_admin.conf"
2019/04/30 14:31:10 [debug] nginx_daemon = "off"
2019/04/30 14:31:10 [debug] nginx_http_directives = {{value="/opt/kong/lib/my_nginx_http.conf",name="include"}}
2019/04/30 14:31:10 [debug] nginx_http_include = "/opt/kong/lib/my_nginx_http.conf"
2019/04/30 14:31:10 [debug] nginx_optimizations = true
2019/04/30 14:31:10 [debug] nginx_proxy_directives = {{value="/opt/kong/lib/my_nginx_proxy.conf",name="include"}}
2019/04/30 14:31:10 [debug] nginx_proxy_include = "/opt/kong/lib/my_nginx_proxy.conf"
2019/04/30 14:31:10 [debug] nginx_sproxy_directives = {}
2019/04/30 14:31:10 [debug] nginx_stream_directives = {}
2019/04/30 14:31:10 [debug] nginx_user = "bob bob"
2019/04/30 14:31:10 [debug] nginx_worker_processes = "auto"
2019/04/30 14:31:10 [debug] origins = {}
2019/04/30 14:31:10 [debug] pg_database = "kong"
2019/04/30 14:31:10 [debug] pg_host = "127.0.0.1"
2019/04/30 14:31:10 [debug] pg_port = 5432
2019/04/30 14:31:10 [debug] pg_ssl = false
2019/04/30 14:31:10 [debug] pg_ssl_verify = false
2019/04/30 14:31:10 [debug] pg_timeout = 5000
2019/04/30 14:31:10 [debug] pg_user = "kong"
2019/04/30 14:31:10 [debug] plugins = {"bundled","my-verify-jwt","my-verify-roles"}
2019/04/30 14:31:10 [debug] prefix = "/opt/kong/var"
2019/04/30 14:31:10 [debug] proxy_access_log = "logs/access.log"

Then the process “hangs” for around 30 minutes (maybe a bit more: 32 to 33 minutes)… until it is finally “unlocked” and continues processing to the end!! For kong migrations list, it just displays the remaining logs:

2019/04/30 14:31:10 [debug] proxy_listen = {"127.0.0.1:8000","0.0.0.0:8443 ssl"}
2019/04/30 14:31:10 [debug] real_ip_header = "X-Real-IP"
2019/04/30 14:31:10 [debug] real_ip_recursive = "off"
2019/04/30 14:31:10 [debug] ssl_cert = "/opt/kong/etc/certs/changeme-kong_kong_com-super-bundle.pem"
2019/04/30 14:31:10 [debug] ssl_cert_key = "/opt/kong/etc/certs/changeme-kong_kong_com-key.pem"
2019/04/30 14:31:10 [debug] ssl_cipher_suite = "modern"
2019/04/30 14:31:10 [debug] ssl_ciphers = "ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA256"
2019/04/30 14:31:10 [debug] stream_listen = {"off"}
2019/04/30 14:31:10 [debug] trusted_ips = {}
2019/04/30 14:31:10 [debug] upstream_keepalive = 60
2019/04/30 14:31:10 [verbose] prefix in use: /opt/kong/var
2019/04/30 15:04:30 [debug] resolved Cassandra contact point 'cassandra-0' to: 172.20.139.20
2019/04/30 15:04:30 [debug] loading subsystems migrations...
2019/04/30 15:04:30 [verbose] retrieving keyspace schema state...
2019/04/30 15:04:30 [verbose] schema state retrieved
2019/04/30 15:04:30 [info] executed migrations:
core: 000_base, 001_14_to_15, 002_15_to_1, 003_100_to_110
rate-limiting: 000_base_rate_limiting, 001_14_to_15, 002_15_to_10
hmac-auth: 000_base_hmac_auth, 001_14_to_15
oauth2: 000_base_oauth2, 001_14_to_15, 002_15_to_10
jwt: 000_base_jwt, 001_14_to_15
basic-auth: 000_base_basic_auth, 001_14_to_15
key-auth: 000_base_key_auth, 001_14_to_15
acl: 000_base_acl, 001_14_to_15
response-ratelimiting: 000_base_response_rate_limiting, 001_14_to_15, 002_15_to_10
2019/04/30 15:04:30 [info] database is up-to-date

Looking at the timestamps in the logs, the resolved Cassandra contact point 'cassandra-0' line is the first log displayed after the hang.

Jumping into the container during the hang, ps shows:

PID   USER     TIME  COMMAND
    1 bob       0:00 perl /usr/local/openresty/bin/resty /usr/local/bin/kong migrations list --vv -c /opt/kong/etc/kong.conf
    8 bob       0:00 /usr/local/openresty/nginx/sbin/nginx -p /tmp/resty_xOpJBTuZMW/ -c conf/nginx.conf

It looks like there is absolutely no activity on process 8!
There are no logs inside /tmp/resty_xOpJBTuZMW/logs/error.log.

At first, trying to understand the root cause of the hang, I had a look at the file descriptors by running ls -l /proc/8/fd (in case this was a problem with file access):

lr-x------    1 bob      bob         64 Apr 30 09:23 0 -> pipe:[1708382985]
l-wx------    1 bob      bob         64 Apr 30 09:23 1 -> pipe:[1708382986]
l-wx------    1 bob      bob         64 Apr 30 09:23 2 -> pipe:[1708382987]
lrwx------    1 bob      bob         64 Apr 30 09:23 3 -> anon_inode:[eventpoll]
lrwx------    1 bob      bob         64 Apr 30 09:23 4 -> anon_inode:[eventfd]
lrwx------    1 bob      bob         64 Apr 30 09:23 5 -> socket:[1708412118]
lrwx------    1 bob      bob         64 Apr 30 09:23 6 -> socket:[1708412119]

I honestly don’t know if the 2 anonymous inodes could be the root cause of the issue!

Then I looked at the network by running netstat -a:

Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
udp        0      0 kong-test:38959         kube-dns.kube-system.svc.cluster.local:domain ESTABLISHED
Active UNIX domain sockets (servers and established)
Proto RefCnt Flags       Type       State         I-Node Path

Even stranger: still inside the container, I tried to run the exact same command manually (e.g. kong migrations list)… and it succeeds!!! I can run it as many times as I want, run kong start, etc., and everything is OK… except that the one I actually need (i.e. the one automatically launched at Pod start time) is still blocked!

Where does this happen ?

  • This happens when running the Docker image under Kubernetes. However, I did not succeed in reproducing the issue in a plain Docker deployment!
  • I reproduced the issue on different Kubernetes clusters: a self-deployed Kubernetes and an AWS EKS cluster. On these same clusters, exactly the same test with the “equivalent” Docker image in version 1.0.3 and the same Kong configuration works well every time.
  • I reproduced the issue on Cassandra 2.2 and Cassandra 3.10... but not on Postgres. My Cassandra instance is a single-node Dockerized Cassandra.

Differences with the “official” Docker image made available through the docker-kong project:

  • I’m not preparing the Kong prefix and then running nginx; I’m “still” using kong start instead, as I have done for ages :P and have not encountered any issue with it so far. I don’t think the current problem is linked to that, however, as the issue also happens with much simpler commands such as kong migrations list which does not start Kong… unless a kong prepare should also be done, even for a simple call to kong migrations?
  • Kong is installed in the regular /usr/local/kong, but my prefix is not: it is /opt/kong/

pamiel commented May 3, 2019

Hi again,
Just tested with the official “docker.io/kong:1.1.2-alpine” image… and the issue is there as well (so it is not only on my own Docker image!).

Here are complementary investigations we ran on our side: it looks like the problem is linked to the new DNS query implementation (I’m not saying it is caused by it, just linked to it so far 😋). Indeed, we performed many tests, and our current hypothesis is that there is an issue in the “network layer setup/establishment” that leaves the Kong DNS query to Cassandra blocked in the new implementation while it was not in the previous one.

Here are the tests we ran (we have Wireshark screenshots available if you want):

  • Test 1: With Kong 1.0.3, the DNS traffic monitored with Wireshark shows 2 “A” DNS queries: one from the Kong container to the CoreDNS Kubernetes Service IP address (i.e. the Cluster IP of the CoreDNS Service), immediately followed by a second identical query to the IP address of the Pod implementing CoreDNS. This is, I guess, the local relay done by kube-proxy to the Pod. The answers (again, 2 of them are tracked by Wireshark) immediately provide the IP address of Cassandra. Everything works fine.
    Note: dns_order was set to the default value, LAST,SRV,A,CNAME, but the “A” query came first. Is “SRV” implemented in version 1.0.3?…

  • Test 2: With Kong 1.1.2, Wireshark shows only a single “SRV” DNS query… and nothing else! The second one (the relay done by kube-proxy) does not appear, nor does any answer from the DNS. Kong looks blocked on sending this very first UDP packet (for the next 30 minutes)!
    Why do we consider that there is an issue in the “network layer setup/establishment”? Because the 2nd query (i.e. the relay of the first query, made by kube-proxy) is not caught by Wireshark. The original packet is most probably lost at send time (which explains why we don’t get any response), but after Wireshark has tracked it; and the client (i.e. Kong) is most probably blocked on the send call as well (otherwise we would enter a retry mechanism that would be faster than 30 minutes).
    Note 1: dns_order was set to the default value, LAST,SRV,A,CNAME, so the “SRV” query came first. To check whether the issue was coming from the “SRV” implementation, we set dns_order to “A” only, and we saw an “A” query instead of “SRV”… but with exactly the same issue: only one query, Kong immediately blocked.
    Note 2: the “A” query is strictly identical to the one obtained during test 1 => this is consequently not an issue in the formatting of the DNS query.

  • Test 3: What brings us to the hypothesis of a “network layer setup/establishment” issue is that we ran exactly the same test as test 2, but inserted a sleep 5 at the beginning of docker-entrypoint.sh, and this time everything works fine!! We gave the process/environment 5 seconds to properly set up all layers, and then the DNS queries are made correctly!
    This also explains why, when Kong is stuck executing kong migrations list as part of its docker-entrypoint.sh, we can jump into the container and manually run the same kong migrations list and everything goes fine: the container has been up for several seconds => all network layers are up and running.
    Note 1: We tried with 1 second and 2 seconds instead of 5… and that is not sufficient!
    Note 2: We surprisingly see not just 2 DNS queries… but 8: the first 2 are the ones looking for cassandra.default.svc.cluster.local, and the response immediately provides the corresponding IP address of Cassandra. We would expect the process to stop there… but 2 more SRV queries are then tracked by Wireshark for cassandra.svc.cluster.local, with 2 answers “No such name SRV”, then 2 more for cassandra.cluster.local, and then 2 for cassandra, all of them answered with “No such name SRV”. This is most probably not linked to the issue we are studying in this thread, but isn’t there a bug in the “SRV” implementation if it queries all possible options (according to resolv.conf) even though the first request has already returned a valid response?

So, here is our current status: we think there is a problem around the “network layer setup/establishment”. This problem was most probably not visible with the former implementation of the DNS query in Kong, but the new implementation done through #4296 and #4378 is now revealing the bug.

Where is the bug? In Kong (well, we don't think it is, but the new implementation at least reveals the issue as a side effect)? In Alpine? In Kubernetes (we haven't seen the issue in raw Docker)? In the network overlay deployed on Kubernetes (one of our clusters uses Calico, but the other one does not)? For the moment, we don’t know… but any idea or help would be welcome!

We are still not comfortable with:

  • The fact that a sleep 5 solves the issue… but a sleep 1 does not! (I've seen it work once with sleep 1, but most of the time it does not...)
  • Why does the blocking situation last 30 minutes? To be more precise, going by the logs it appears to be 33 min and around 20 sec… so maybe a total of 2000 seconds! Where is this 30 min / 2000 sec configured? We tried changing the dns_not_found_ttl parameter (whose default value is… 30!) to 1 (in case the unit was minutes instead of seconds 😋)… but no: still 30 minutes of waiting.

If you have an easy way to debug this, could you try to see which call Kong is hanging on?
We are going to try to create a very simple process that immediately queries the DNS (outside of any Kong context), to check whether an Alpine or Kubernetes issue can be highlighted; see the sketch below for the kind of probe we have in mind.
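
For reference, here is a minimal sketch (in Lua) of such a probe. It assumes LuaSocket is available in the image; the nameserver IP and the hostname are placeholders, and the hand-built packet is just a way to send a single “A” query with no retry, mimicking what Wireshark showed for Kong:

-- Minimal DNS probe sketch (assumptions: LuaSocket available, placeholder
-- nameserver IP and hostname). It sends ONE hand-built "A" query and waits
-- for a single answer, with no retry, to mimic the behaviour observed above.
local socket = require "socket"

local udp = socket.udp()
udp:settimeout(2)                         -- LuaSocket: value is in SECONDS
udp:setpeername("10.96.0.10", 53)         -- placeholder cluster DNS IP

-- 12-byte header (id, flags with RD set, QDCOUNT=1, AN=0, NS=0, AR=0),
-- then QNAME "cassandra-0", QTYPE=A (1), QCLASS=IN (1)
local query = "\1\2" .. "\1\0" .. "\0\1" .. "\0\0" .. "\0\0" .. "\0\0"
           .. "\11cassandra-0" .. "\0" .. "\0\1" .. "\0\1"

udp:send(query)
local answer, err = udp:receive()
print(answer and ("answer received: " .. #answer .. " bytes")
             or ("no answer: " .. tostring(err)))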

pamiel commented May 6, 2019

Episode III

I think we got it!
There are most likely 2 issues:

  • One on the Docker or Kubernetes side (Alpine is out of the game); still investigating to find exactly where it is
  • And one on the Kong side, in the new implementation of the DNS query for Cassandra

We have reproduced the exact same behavior explained in the previous posts, but using an “aggressive” Pod which only runs nslookup or dig:

  • Test 1: Creating a simple Alpine pod that executes the command time dig <ip address of DNS> cassandra-0, we see in the logs that it took 2.5 seconds to get the answer from the DNS. Looking at the traces in Wireshark, we can see that a first “A” query is sent and gets no response, followed 2.5 sec later by a second “A” query that succeeds (i.e. gets a response).
    Thanks to that, we identified:
    -- The Docker or Kubernetes bug: still to be characterized in more detail, but we think that either Kubernetes tells Docker that the network layer is OK before it really is, or Docker tells the Kubernetes Pod that the network layer is OK before it really is.
    -- The fact that tools such as dig and nslookup implement a timeout on the DNS query (2.5 seconds) and a retry mechanism (so after 2.5 seconds without receiving any answer, a retry is done and works well).

  • Test 2: we added the options +time=1 +retry=0 to dig, which set the timeout to 1 second and the retries to 0. Now dig exits after 1 second (the timeout value) saying that the server name is not found (DNS resolution failed). In Wireshark, we only see the first “A” query… and nothing else (exactly what we saw in the tests with the Kong images presented in my previous posts: only one request without any answer). As there is no retry, dig can only exit with "DNS resolution failed".

  • Test 3: introducing a sleep 5 before the call to dig… well, things go well whatever the value of the +retry parameter, as the network layer has had enough time to be set up before the first attempt to send the “A” query => the retry mechanism is not used in this test.

  • Test 4: we reproduced the exact same behavior and traces with the official Docker image of Kong (using nslookup only, as dig is not available in this image):
    -- If just running kong docker-start, Kong starts and is blocked
    -- If running the command time nslookup cassandra-0 <ip address of DNS>; kong docker-start, then Kong is not blocked, but the traces show that nslookup first takes 2.5 seconds (as it internally implements the retry mechanism until the network layer is established and the name resolution is done), and after that Kong starts normally

  • Test 5: Trying the same tests again in a plain Docker environment, nslookup and dig take only 10 ms: very far from the second or more seen in a Kubernetes environment => the network layers are established very quickly => the retry mechanism in the client application is never used => Kong works.

As a sum-up of these tests:

  • The official Kong Docker image, as is, cannot be started on Kubernetes (encountered on 3 different clusters using different CNIs). This is quite a blocking issue.
  • This is due to an issue in the establishment of the network layer, which takes a very long time on Kubernetes. We are still investigating the root cause.
  • However, contrary to what is implemented in nslookup or dig, the Kong implementation of the DNS query (for Cassandra) is not resilient: no timeout-and-retry mechanism is implemented so far (see the sketch after this list). This is not related to the TTL of the query, for which there are already many configuration parameters, but to a timeout for when no answer at all is received for a request.
    We have not investigated whether the former implementation of the DNS query for Cassandra had such a retry mechanism (or whether the DNS queries for Postgres or for the upstream micro-services have one), but the answer is most likely “no”.
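
As an illustration of what is meant by a timeout-and-retry mechanism here, a minimal sketch in Lua; the names and values are purely illustrative, and send_query is a hypothetical function, not Kong's actual API:

-- Hedged sketch of a dig/nslookup-style retry loop: each attempt gets its own
-- short timeout, so a lost packet costs one attempt instead of blocking forever.
-- `send_query` is hypothetical: it returns the answer, or nil plus "timeout"
-- when nothing comes back within `timeout_s` seconds.
local function resolve_with_retry(send_query, timeout_s, retries)
  local last_err
  for attempt = 1, retries + 1 do
    local answer, err = send_query(timeout_s)
    if answer then
      return answer
    end
    last_err = err              -- typically "timeout"; loop and send a fresh query
  end
  return nil, "DNS resolution failed after " .. (retries + 1)
           .. " attempts (" .. tostring(last_err) .. ")"
end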

hutchic commented May 9, 2019

That's quite a read; thanks for the very thorough investigation / write-up 👍

Can you share the k8s yaml (or helm, etc.) that you're using? (Is it https://github.com/Kong/kong-dist-kubernetes or closely related?)

pamiel commented May 10, 2019

Hi @hutchic, you are welcome !!

I confirm what you said in Kong/docker-kong#258 (comment): this is a Cassandra issue only, and on Kubernetes only.

Here is my test yaml file:

kind: Pod
apiVersion: v1
metadata:
  name: "kong-dns-test"
spec:
  containers:
    -
      name: "kong-dns-test"
      image: "docker.io/kong:1.1.2-alpine"
      command: ["kong"]
      args: ["migrations", "list", "--vv"]
      env:
        - name: KONG_LOG_LEVEL
          value: "debug"
        - name: KONG_DATABASE
          value: "cassandra"
        - name: KONG_CASSANDRA_CONTACT_POINTS
          value: "kong-database-2"
  restartPolicy: Never

I made further investigations to locate where the issue could be. It is quite difficult because the logs are not displayed (the Kong freeze blocks the display of the logs in the middle of the configuration parameter output: most probably stdout is not flushed => none of the logs you could add to trace the DNS processing are displayed). I worked around it by inserting os.execute("echo xyz") commands...

In the end, the configuration provided to the resolver looks correct, with the number of retries set to 5 and the timeout set to 2000 ms:

{
  search = {
    "default.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local"
  },
  nameservers = {
    "192.168.96.84"
  },
  badTtl = 30,
  order = {
    "LAST",
    "SRV",
    "A",
    "CNAME"
  },
  noSynchronisation = true,
  emptyTtl = 1,
  ndots = 5,
  timeout = 2000,
  hosts = "/etc/hosts",
  enable_ipv6 = true,
  staleTtl = 4,
  retrans = 5
}

What is strange is that the socket created by the resolver is supposed to be configured with the right data: 5 retries and a timeout of 2000 milliseconds... but when executed, the timeout is actually 2000... seconds!!
I fear that the overwrite of ngx.socket.udp done here https://github.com/Kong/kong/blob/master/kong/db/strategies/cassandra/connector.lua#L66 is either ignoring the timeout (and taking a very long default value), or has a different timeout behavior, expecting the value to be set in seconds instead of milliseconds!
Indeed, I don't know where this "socket" package comes from (local udp = require("socket").udp(...)).
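
For readers following along, here is a minimal sketch (assumed, not the actual connector code) of the kind of override being referred to: in the init/CLI phases cosockets are not available, so something of this shape substitutes LuaSocket for the OpenResty UDP socket:

-- Rough sketch only, meant to run inside an OpenResty init context; this is
-- NOT the actual Kong code. The catch: LuaSocket's settimeout() expects
-- SECONDS, while the cosocket version expects MILLISECONDS, so a caller
-- passing 2000 now waits ~33 minutes instead of 2 seconds.
local luasocket = require "socket"        -- LuaSocket

ngx.socket.udp = function()
  return luasocket.udp()                  -- settimeout(2000) => 2000 seconds
end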

thibaultcha commented May 15, 2019

Sorry for only seeing this now. I was about to point to the Cassandra connector's override of the TCP/UDP sockets, but I see that your investigation already led you there @pamiel :)

You will notice that the DNS resolver instantiated there specifies timeout = nil (source). This means that the resolver will use the timeout option specified in resolv.conf (see man 5 resolv.conf). What does this value look like in your pod? It is supposed to be a value in seconds. Otherwise, the default is set here.

Here is the issue though: with cosockets (the OpenResty sockets), the timeout is specified in milliseconds (hence, 2 seconds), but with LuaSocket (since we overrode the sockets to resolve the hostnames in init_by_lua in the Cassandra connector), the argument is specified in seconds, hence, 2000 seconds. cc @Tieske
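
To make the unit mismatch concrete, a small illustration using the value from the resolver configuration dumped earlier in this thread:

-- The same numeric value handed to the two socket implementations:
local timeout = 2000                -- lua-resty-dns-client value, in milliseconds

-- cosocket (ngx.socket.udp): settimeout() takes milliseconds
--   sock:settimeout(timeout)      --> waits up to 2 seconds

-- LuaSocket (require("socket").udp()): settimeout() takes seconds
--   sock:settimeout(timeout)      --> waits up to 2000 seconds = 33 min 20 s,
--                                     matching the hang observed above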

thibaultcha commented May 15, 2019

Note: lua-resty-dns-client may benefit from something like https://github.com/thibaultcha/lua-resty-socket (which is used by lua-cassandra and other driver modules out there). Such a module acts as a compatibility layer between LuaSocket and cosockets, and allows for transparent interoperability (i.e. timeout arguments can be specified as milliseconds, and will be converted to seconds before invoking the underlying LuaSocket settimeout()).
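
A hedged sketch of the unit-conversion idea behind such a compatibility layer (illustrative only, not the actual lua-resty-socket code, which also selects the socket type per phase):

-- Illustrative only: accept milliseconds everywhere, convert to seconds only
-- when the underlying socket is LuaSocket.
local function settimeout_ms(sock, ms, is_luasocket)
  if is_luasocket then
    sock:settimeout(ms / 1000)      -- LuaSocket expects seconds
  else
    sock:settimeout(ms)             -- cosockets expect milliseconds
  end
end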

pamiel commented May 15, 2019

@thibaultcha, indeed by default my resolv.conf has no options => the default value is the one used.
During my numerous tests, I did however try setting the timeout in resolv.conf: as far as I remember, I tried both 2 and 2000... and Kong was hanging in both cases.
Does that mean this value is not taken into account at all? I don't know: maybe the value 2 was multiplied by 1000 and treated as seconds instead of milliseconds by LuaSocket => 30 min of waiting; and 2000 meant waiting 2,000,000 seconds!... but I don't remember waiting for the Pod to unlock in those tests; sorry for that.

thibaultcha commented:

> maybe the value 2 was multiplied by 1000 and considered as seconds instead of milliseconds by LuaSocket

Yes, I believe (without testing it myself) specifying this value in resolv.conf will have the same effect with the current implementation of the C* connector: you'll still have a 20 or 30 min long timeout because lua-resty-dns-client will still multiply it by 1000.

kikito added the pending author feedback label on Oct 28, 2020

kikito commented Oct 28, 2020

Hello @pamiel, are you still experiencing this problem, or can we close this issue?

kikito commented Nov 24, 2020

I am closing this due to lack of activity. Please reopen if necessary.

kikito closed this as completed on Nov 24, 2020