/cluster API only lists self #950

Closed
mars opened this issue Feb 4, 2016 · 9 comments

@mars
Contributor

mars commented Feb 4, 2016

I have a two-node Kong cluster, but queries to the /cluster Admin API return only a single member: the node servicing the API request.

Other than this error, Kong appears to be functioning correctly: it starts up, connects to Cassandra via SSL, and services both proxy and admin requests.

Here's what the logs look like during one of these /cluster Admin API requests:

app[web.1]: 2016/02/04 17:49:57 [info] 72#0: *48 [lua] log.lua:22: info(): Host at 10.1.13.186:9042 required authentication, client: x.x.x.x, server: _, request: "GET /kong-admin/cluster HTTP/1.1", host: "kong-proxy.example.com"
app[web.1]: 2016/02/04 17:49:58 [info] 72#0: *48 [lua] log.lua:22: info(): Host at 10.1.60.153 required authentication, client: x.x.x.x, server: _, request: "GET /kong-admin/cluster HTTP/1.1", host: "kong-proxy.example.com"
app[web.1]: 2016/02/04 17:49:58 [info] 72#0: *48 [lua] log.lua:22: info(): Host at 10.1.16.105 required authentication, client: x.x.x.x, server: _, request: "GET /kong-admin/cluster HTTP/1.1", host: "kong-proxy.example.com"
app[web.1]: 2016/02/04 17:49:58 [info] 72#0: *48 [lua] log.lua:22: info(): Host at 10.1.13.186:9042 required authentication, client: x.x.x.x, server: _, request: "GET /kong-admin/cluster HTTP/1.1", host: "kong-proxy.example.com"
app[web.1]: 2016/02/04 17:49:58 [notice] 72#0: signal 17 (SIGCHLD) received
app[web.1]: 2016/02/04 17:49:58 [info] 72#0: waitpid() failed (10: No child processes)
app[web.1]: 2016/02/04 17:49:58 [notice] 72#0: signal 17 (SIGCHLD) received
app[web.1]: 2016/02/04 17:49:58 [info] 72#0: waitpid() failed (10: No child processes)
app[web.1]: 2016/02/04 17:49:58 [notice] 72#0: signal 17 (SIGCHLD) received
app[web.1]: 2016/02/04 17:49:58 [info] 72#0: waitpid() failed (10: No child processes)
heroku[router]: at=info method=GET path="/kong-admin/cluster" host=kong-proxy.example.com request_id=e01eb6d7-917b-4ca4-b155-8cf7c258b5d8 dyno=web.1 connect=0ms service=500ms status=200 bytes=471
app[web.2]: 2016/02/04 17:50:12 [notice] 73#0: signal 17 (SIGCHLD) received
app[web.2]: 2016/02/04 17:50:12 [notice] 73#0: unknown process 211 exited with code 0
app[web.2]: 2016/02/04 17:50:12 [error] 73#0: [lua] cluster.lua:84: Cassandra error: 10.1.60.153, context: ngx.timer
app[web.1]: 2016/02/04 17:50:13 [notice] 71#0: signal 17 (SIGCHLD) received
app[web.1]: 2016/02/04 17:50:13 [notice] 71#0: unknown process 218 exited with code 0
app[web.1]: 2016/02/04 17:50:13 [error] 71#0: [lua] cluster.lua:84: Cassandra error: 10.1.60.153, context: ngx.timer
app[web.2]: 2016/02/04 17:50:42 [notice] 73#0: signal 17 (SIGCHLD) received
app[web.2]: 2016/02/04 17:50:42 [notice] 73#0: unknown process 213 exited with code 0
app[web.2]: 2016/02/04 17:50:42 [error] 73#0: [lua] cluster.lua:84: Cassandra error: 10.1.16.105, context: ngx.timer
app[web.1]: 2016/02/04 17:50:43 [notice] 71#0: signal 17 (SIGCHLD) received
app[web.1]: 2016/02/04 17:50:43 [notice] 71#0: unknown process 220 exited with code 0
app[web.1]: 2016/02/04 17:50:43 [error] 71#0: [lua] cluster.lua:84: Cassandra error: 10.1.16.105, context: ngx.timer
app[web.2]: 2016/02/04 17:51:12 [notice] 73#0: signal 17 (SIGCHLD) received
app[web.2]: 2016/02/04 17:51:12 [notice] 73#0: unknown process 215 exited with code 0
app[web.2]: 2016/02/04 17:51:12 [error] 73#0: [lua] cluster.lua:84: Cassandra error: 10.1.13.186:9042, context: ngx.timer
app[web.1]: 2016/02/04 17:51:13 [notice] 71#0: signal 17 (SIGCHLD) received
app[web.1]: 2016/02/04 17:51:13 [notice] 71#0: unknown process 222 exited with code 0
app[web.1]: 2016/02/04 17:51:13 [error] 71#0: [lua] cluster.lua:84: Cassandra error: 10.1.13.186:9042, context: ngx.timer

I've verified that the serf agents are reachable, and can be manually joined together:

~ $ serf agent -bind $SERF_CLUSTER_LISTEN -rpc-addr $SERF_CLUSTER_LISTEN_RPC -encrypt $SERF_ENCRYPT -log-level err -profile wan -node mars-bash &
[1] 84
==> Starting Serf agent...
==> Starting Serf agent RPC...
==> Serf agent running!
         Node name: 'mars-bash'
         Bind addr: '10.0.132.112:7946'
          RPC addr: '127.0.0.1:7373'
         Encrypted: true
          Snapshot: false
           Profile: wan

==> Log data will now stream in as it occurs:

~ $ 
~ $ serf join 10.0.158.174:7946
Successfully joined cluster by contacting 1 nodes.
~ $ serf members               
mars-bash                                                    10.0.132.112:7946  alive  
dyno-2b8cf0ce-5bdd-40dd-8e41-ae54d85e7e06_10.0.158.174:7946  10.0.158.174:7946  alive
~ $ serf join 10.0.136.36:7946 
Successfully joined cluster by contacting 1 nodes.
~ $ serf members
mars-bash                                                    10.0.132.112:7946  alive  
dyno-2b8cf0ce-5bdd-40dd-8e41-ae54d85e7e06_10.0.158.174:7946  10.0.158.174:7946  alive  
dyno-22246360-ef5f-43e7-8376-f9493b119d2e_10.0.136.36:7946   10.0.136.36:7946   alive

…once I manually join them, the /cluster Admin API responds with those three members, although the log output looks the same.
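
The manual membership check above can be scripted. The following is a minimal sketch, assuming the whitespace-separated name / address / status columns shown in the `serf members` output above; the function names `parse_serf_members` and `missing_or_dead` are hypothetical helpers and not part of Serf or Kong.

```python
def parse_serf_members(output):
    """Map node name -> (address, status) from `serf members` text output.

    Assumes the first three whitespace-separated columns are
    name, address, and status, as in the output pasted above.
    """
    members = {}
    for line in output.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name, address, status = fields[:3]
        members[name] = (address, status)
    return members


def missing_or_dead(output, expected_addresses):
    """Return the expected addresses that are absent or not reported alive."""
    alive = {
        address
        for address, status in parse_serf_members(output).values()
        if status == "alive"
    }
    return set(expected_addresses) - alive


# Sample taken from the `serf members` output in this issue.
SAMPLE = """\
mars-bash                                                    10.0.132.112:7946  alive
dyno-2b8cf0ce-5bdd-40dd-8e41-ae54d85e7e06_10.0.158.174:7946  10.0.158.174:7946  alive
dyno-22246360-ef5f-43e7-8376-f9493b119d2e_10.0.136.36:7946   10.0.136.36:7946   alive
"""
```

Calling `missing_or_dead(SAMPLE, {"10.0.158.174:7946", "10.0.136.36:7946"})` returns an empty set when both Kong nodes have joined; any address it returns is a node that Serf does not see as alive.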

Those "Host at x.x.x.x required authentication" and "Cassandra error" log lines all seem suspect.

I am at a loss for finding a cause. Any ideas what might be going wrong?

@mars
Contributor Author

mars commented Feb 4, 2016

This issue is with Kong 0.6.1

@mars
Contributor Author

mars commented Feb 4, 2016

Problem solved. Err, well, at least the cause is found.

I am working on running Kong using an external supervisor for #928, and found that because of #934, Kong's serf self:_autojoin(node_name) is never being called.

Closing, as this is an issue in my fork.

@mars mars closed this as completed Feb 4, 2016
@subnetmarco
Member

@mars do the Cassandra errors also disappear?

@mars
Contributor Author

mars commented Feb 4, 2016

Howdy @thefosk !

I just tried adding a conditional autojoin to Kong.init to be executed only when using an external process supervisor.

This fixed the /cluster Admin API listing only a single node.

But those Cassandra errors are still appearing.

@subnetmarco
Member

@mars so, we really, really want everybody to use auto-join, but there is a hidden configuration property that you can use to disable it. It was intended for extreme debugging and unusual use cases, and if a feature is not documented, it doesn't exist anyway :)

cluster:
  auto-join: false

Regarding the Cassandra errors, it fails when executing this request:

local nodes, err = dao.nodes:find_by_keys({name = node_name})

maybe @thibaultcha can give more insights on the error?

@mars
Contributor Author

mars commented Feb 4, 2016

With my auto-join addition in Kong.init, the cluster does populate with both nodes automatically. I don't think I need auto-join: false, but I'm glad to know it's there.

Yes, that failing query is mysterious to me, because the "error" message is a contact point, not an error description!

@jykae

jykae commented Mar 1, 2016

@mars @thefosk I recently tried Kong, and auto-join fails for me as well. Is your fix coming to Kong, or would you document the feature? Overall, I am very happy with the Admin API documentation and everything else.
I am using the latest 0.7.0 development installation on a Vagrant box for evaluation.

We have a frontend for another API proxy (https://github.com/apinf/api-umbrella-dashboard), and we plan to support Kong in that frontend as well, letting users select which API proxy they would like to use.

@jykae

jykae commented Mar 1, 2016

Also, how is auto-join supposed to work? How does Kong find the other nodes?

I traced files https://github.com/Mashape/kong/blob/master/kong/cli/services/serf.lua#L98 and https://github.com/Mashape/kong/blob/master/kong/dao/cassandra/factory.lua#L45

I usually dive into applications source-first, so I only just found the clustering documentation here: https://getkong.org/docs/0.7.x/clustering/

I'll try to figure out more another day.

@subnetmarco
Member

By default, Kong advertises the first non-loopback IPv4 address it finds by writing it to the datastore. The other nodes that point to the same datastore then try to join each other using those advertised addresses.
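
To illustrate the "first non-loopback IPv4 address" rule, here is a minimal sketch in Python. It is not Kong's actual implementation (Kong is written in Lua); the function name `first_non_loopback_ipv4` and the idea of passing in a candidate list are assumptions made for the example.

```python
import ipaddress


def first_non_loopback_ipv4(candidates):
    """Return the first candidate that is an IPv4, non-loopback address.

    `candidates` is an ordered list of address strings, for example the
    addresses of the host's network interfaces. Returns None if no
    candidate qualifies.
    """
    for raw in candidates:
        try:
            addr = ipaddress.ip_address(raw)
        except ValueError:
            continue  # not a valid IP address string
        if addr.version == 4 and not addr.is_loopback:
            return str(addr)
    return None
```

With the candidates `["127.0.0.1", "::1", "10.0.158.174", "192.168.1.5"]`, this picks `10.0.158.174`. On a host with several interfaces, the first match is not necessarily the address that other nodes can reach, which is one way auto-detection can go wrong.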

If auto-join doesn't work, it's usually for two reasons:

  • Kong needs both TCP and UDP traffic allowed on port 7946 (https://getkong.org/docs/0.7.x/network/).
  • The automatically detected IP address is not correct. To manually set the address that the node should advertise, change the cluster.advertise property.
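
As a rough way to check the first point, the sketch below tests whether a TCP connection to a node's cluster port succeeds. It covers only the TCP half: UDP is connectionless, so a simple connect cannot confirm that UDP traffic on port 7946 is allowed. The function name `tcp_port_open` is a hypothetical helper, not part of Kong or Serf.

```python
import socket


def tcp_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Example (run from one Kong node against another node's advertised address):
#   tcp_port_open("10.0.158.174", 7946)
```

A `False` result from one node to another's advertised address points at a firewall rule or a wrong advertised IP; a `True` result still leaves UDP to be verified separately.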
