/cluster API only lists self #950

mars · 2016-02-04T19:06:00Z

I have a two-node Kong cluster where queries to the /cluster Admin API return only a single member: the node servicing the API request.

Other than this error, Kong appears to be functioning correctly: it starts up, connects to Cassandra via SSL, and services both proxy & admin requests.

Here's what the logs look like during one of these /cluster Admin API requests:

app[web.1]: 2016/02/04 17:49:57 [info] 72#0: *48 [lua] log.lua:22: info(): Host at 10.1.13.186:9042 required authentication, client: x.x.x.x, server: _, request: "GET /kong-admin/cluster HTTP/1.1", host: "kong-proxy.example.com"
app[web.1]: 2016/02/04 17:49:58 [info] 72#0: *48 [lua] log.lua:22: info(): Host at 10.1.60.153 required authentication, client: x.x.x.x, server: _, request: "GET /kong-admin/cluster HTTP/1.1", host: "kong-proxy.example.com"
app[web.1]: 2016/02/04 17:49:58 [info] 72#0: *48 [lua] log.lua:22: info(): Host at 10.1.16.105 required authentication, client: x.x.x.x, server: _, request: "GET /kong-admin/cluster HTTP/1.1", host: "kong-proxy.example.com"
app[web.1]: 2016/02/04 17:49:58 [info] 72#0: *48 [lua] log.lua:22: info(): Host at 10.1.13.186:9042 required authentication, client: x.x.x.x, server: _, request: "GET /kong-admin/cluster HTTP/1.1", host: "kong-proxy.example.com"
app[web.1]: 2016/02/04 17:49:58 [notice] 72#0: signal 17 (SIGCHLD) received
app[web.1]: 2016/02/04 17:49:58 [info] 72#0: waitpid() failed (10: No child processes)
app[web.1]: 2016/02/04 17:49:58 [notice] 72#0: signal 17 (SIGCHLD) received
app[web.1]: 2016/02/04 17:49:58 [info] 72#0: waitpid() failed (10: No child processes)
app[web.1]: 2016/02/04 17:49:58 [notice] 72#0: signal 17 (SIGCHLD) received
app[web.1]: 2016/02/04 17:49:58 [info] 72#0: waitpid() failed (10: No child processes)
heroku[router]: at=info method=GET path="/kong-admin/cluster" host=kong-proxy.example.com request_id=e01eb6d7-917b-4ca4-b155-8cf7c258b5d8 dyno=web.1 connect=0ms service=500ms status=200 bytes=471
app[web.2]: 2016/02/04 17:50:12 [notice] 73#0: signal 17 (SIGCHLD) received
app[web.2]: 2016/02/04 17:50:12 [notice] 73#0: unknown process 211 exited with code 0
app[web.2]: 2016/02/04 17:50:12 [error] 73#0: [lua] cluster.lua:84: Cassandra error: 10.1.60.153, context: ngx.timer
app[web.1]: 2016/02/04 17:50:13 [notice] 71#0: signal 17 (SIGCHLD) received
app[web.1]: 2016/02/04 17:50:13 [notice] 71#0: unknown process 218 exited with code 0
app[web.1]: 2016/02/04 17:50:13 [error] 71#0: [lua] cluster.lua:84: Cassandra error: 10.1.60.153, context: ngx.timer
app[web.2]: 2016/02/04 17:50:42 [notice] 73#0: signal 17 (SIGCHLD) received
app[web.2]: 2016/02/04 17:50:42 [notice] 73#0: unknown process 213 exited with code 0
app[web.2]: 2016/02/04 17:50:42 [error] 73#0: [lua] cluster.lua:84: Cassandra error: 10.1.16.105, context: ngx.timer
app[web.1]: 2016/02/04 17:50:43 [notice] 71#0: signal 17 (SIGCHLD) received
app[web.1]: 2016/02/04 17:50:43 [notice] 71#0: unknown process 220 exited with code 0
app[web.1]: 2016/02/04 17:50:43 [error] 71#0: [lua] cluster.lua:84: Cassandra error: 10.1.16.105, context: ngx.timer
app[web.2]: 2016/02/04 17:51:12 [notice] 73#0: signal 17 (SIGCHLD) received
app[web.2]: 2016/02/04 17:51:12 [notice] 73#0: unknown process 215 exited with code 0
app[web.2]: 2016/02/04 17:51:12 [error] 73#0: [lua] cluster.lua:84: Cassandra error: 10.1.13.186:9042, context: ngx.timer
app[web.1]: 2016/02/04 17:51:13 [notice] 71#0: signal 17 (SIGCHLD) received
app[web.1]: 2016/02/04 17:51:13 [notice] 71#0: unknown process 222 exited with code 0
app[web.1]: 2016/02/04 17:51:13 [error] 71#0: [lua] cluster.lua:84: Cassandra error: 10.1.13.186:9042, context: ngx.timer

I've verified that the serf agents are reachable, and can be manually joined together:

~ $ serf agent -bind $SERF_CLUSTER_LISTEN -rpc-addr $SERF_CLUSTER_LISTEN_RPC -encrypt $SERF_ENCRYPT -log-level err -profile wan -node mars-bash &
[1] 84
==> Starting Serf agent...
==> Starting Serf agent RPC...
==> Serf agent running!
         Node name: 'mars-bash'
         Bind addr: '10.0.132.112:7946'
          RPC addr: '127.0.0.1:7373'
         Encrypted: true
          Snapshot: false
           Profile: wan

==> Log data will now stream in as it occurs:

~ $ 
~ $ serf join 10.0.158.174:7946
Successfully joined cluster by contacting 1 nodes.
~ $ serf members               
mars-bash                                                    10.0.132.112:7946  alive  
dyno-2b8cf0ce-5bdd-40dd-8e41-ae54d85e7e06_10.0.158.174:7946  10.0.158.174:7946  alive
~ $ serf join 10.0.136.36:7946 
Successfully joined cluster by contacting 1 nodes.
~ $ serf members
mars-bash                                                    10.0.132.112:7946  alive  
dyno-2b8cf0ce-5bdd-40dd-8e41-ae54d85e7e06_10.0.158.174:7946  10.0.158.174:7946  alive  
dyno-22246360-ef5f-43e7-8376-f9493b119d2e_10.0.136.36:7946   10.0.136.36:7946   alive

…once I manually join them, the /cluster Admin API responds with those three members, although the log output looks the same.
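For anyone who wants to compare what Serf sees with what the /cluster Admin API reports, here is a minimal Python sketch that parses `serf members` output in the shape shown above. The helper names (`parse_serf_members`, `alive_addresses`) are made up for illustration, and the parser assumes the three whitespace-separated columns seen in this thread (node name, address, status); it is not part of Kong or Serf.

```python
from __future__ import annotations

from typing import NamedTuple


class Member(NamedTuple):
    name: str
    address: str
    status: str


def parse_serf_members(output: str) -> list[Member]:
    """Parse whitespace-separated `serf members` lines into Member tuples."""
    members = []
    for line in output.splitlines():
        fields = line.split()
        # Expect at least: node name, address, status. Extra columns are ignored.
        if len(fields) >= 3:
            members.append(Member(fields[0], fields[1], fields[2]))
    return members


def alive_addresses(output: str) -> set[str]:
    """Return the addresses of all members reported as alive."""
    return {m.address for m in parse_serf_members(output) if m.status == "alive"}


# Sample copied from the `serf members` output in this thread.
SAMPLE = """\
mars-bash                                                    10.0.132.112:7946  alive
dyno-2b8cf0ce-5bdd-40dd-8e41-ae54d85e7e06_10.0.158.174:7946  10.0.158.174:7946  alive
dyno-22246360-ef5f-43e7-8376-f9493b119d2e_10.0.136.36:7946   10.0.136.36:7946   alive
"""

if __name__ == "__main__":
    for address in sorted(alive_addresses(SAMPLE)):
        print(address)
```

The resulting set of addresses can then be compared by hand against the members returned by /cluster on each node.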

Those "Host at x.x.x.x required authentication" & "Cassandra error" log lines all seem suspect.

I am at a loss for finding a cause. Any ideas what might be going wrong?


mars · 2016-02-04T19:13:27Z

This issue is with Kong 0.6.1

mars · 2016-02-04T19:45:03Z

Problem solved. Err, well, at least the cause is found.

I am working on running Kong using an external supervisor for #928, and found that because of #934, Kong's serf self:_autojoin(node_name) is never being called.

Closing, as this is an issue in my fork.

subnetmarco · 2016-02-04T20:04:24Z

@mars do the Cassandra errors also disappear?

mars · 2016-02-04T20:32:59Z

Howdy @thefosk !

I just tried adding a conditional autojoin to Kong.init to be executed only when using an external process supervisor.

This fixed the /cluster Admin API listing only a single node.

But those Cassandra errors are still appearing.

subnetmarco · 2016-02-04T20:46:13Z

@mars so, we really really want everybody to use auto-join, but there is a hidden configuration property that you can use to disable auto-join. It was intended for extreme debugging/use cases, and if a feature is not documented it doesn't exist anyways :)

cluster:
  auto-join: false

Regarding the Cassandra errors, it fails when executing this request:

local nodes, err = dao.nodes:find_by_keys({name = node_name})

maybe @thibaultcha can give more insights on the error?

mars · 2016-02-04T21:06:52Z

With my auto-join addition in Kong.init, the cluster does populate with both nodes automatically. I don't think I need auto-join: false, but I'm glad to know it's there.

Yes, that failing query is mysterious to me, because the "error" message is a contact point, not an error description!

jykae · 2016-03-01T18:00:05Z

@mars @thefosk I just recently tried Kong and auto-join fails for me as well. Is your fix coming to Kong, or would you document the feature? Overall I am very happy with the documentation of the Admin API, and everything else.
I am using the latest 0.7.0 development installation on a Vagrant box for evaluation.

We have a frontend for another API proxy, and we plan to also support Kong in that frontend in the future (https://github.com/apinf/api-umbrella-dashboard) and allow users to select which API proxy they would like to use.

jykae · 2016-03-01T18:17:08Z

Also, how is auto-join basically supposed to work? How does Kong find the nodes?

I traced files https://github.com/Mashape/kong/blob/master/kong/cli/services/serf.lua#L98 and https://github.com/Mashape/kong/blob/master/kong/dao/cassandra/factory.lua#L45

I usually dive into applications source-first, so I only just found the clustering documentation here: https://getkong.org/docs/0.7.x/clustering/

I'll try to figure out more another day.

subnetmarco · 2016-03-01T19:37:25Z

By default, Kong advertises the first non-loopback IPv4 address to the datastore. The other nodes that point to the same datastore will then try to join each other using those advertised addresses.
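To make the "first non-loopback IPv4 address" rule concrete, here is a small Python sketch of that selection applied to a list of candidate addresses. It only illustrates the rule as described in this comment; it is not Kong's actual detection code, and the function name `first_non_loopback_ipv4` and the candidate list are assumptions for the example.

```python
from __future__ import annotations

import ipaddress


def first_non_loopback_ipv4(candidates: list[str]) -> str | None:
    """Return the first candidate that is a valid, non-loopback IPv4 address."""
    for raw in candidates:
        try:
            addr = ipaddress.ip_address(raw)
        except ValueError:
            # Skip anything that is not a valid IP address.
            continue
        if addr.version == 4 and not addr.is_loopback:
            return raw
    return None


if __name__ == "__main__":
    # Hypothetical interface addresses, in the order the host reports them.
    print(first_non_loopback_ipv4(["127.0.0.1", "::1", "10.0.132.112", "10.0.158.174"]))
```

On a host with several interfaces, the first match may not be the address other nodes can reach, which is when an explicit advertise address becomes necessary.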

If auto-join doesn't work, it's usually for one of two reasons:

1. Kong needs both TCP and UDP traffic allowed on port 7946 (https://getkong.org/docs/0.7.x/network/).
2. The automatically detected IP address is not correct. To manually set the IP address that the node should advertise, change the cluster.advertise property.
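The TCP half of the port requirement can be checked from another node with a short Python sketch like the one below. The function name `tcp_port_open` is made up for illustration, and the sketch only covers TCP: a plain connect cannot confirm that UDP on 7946 is allowed, so a passing result here does not rule out a blocked UDP path.

```python
import socket


def tcp_port_open(host: str, port: int = 7946, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Address taken from the `serf members` output earlier in this thread.
    print(tcp_port_open("10.0.158.174"))
```

Run it from each node against the advertised address of the other nodes; a False result points at a firewall rule or a wrong advertised address.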

mars closed this as completed Feb 4, 2016
