
Service resynced every ~2mins, causes consul index to grow #4960

Closed
Tracked by #115
snownd opened this issue Nov 15, 2018 · 15 comments

@snownd

snownd commented Nov 15, 2018

We are running the agent in a k8s pod. After registering a service, the agent synced the service continuously:

    2018/11/15 02:04:41 [INFO] agent: Synced node info
    2018/11/15 02:04:46 [INFO] agent: Synced service "54ebba0c-841d-4d6a-8f09-734bab38d390"
    2018/11/15 02:04:46 [INFO] agent: Synced check "service:54ebba0c-841d-4d6a-8f09-734bab38d390"
==> Newer Consul version available: 1.4.0 (currently running: 1.2.3)
    2018/11/15 02:06:08 [INFO] agent: Synced service "54ebba0c-841d-4d6a-8f09-734bab38d390"
    2018/11/15 02:07:46 [INFO] agent: Synced service "54ebba0c-841d-4d6a-8f09-734bab38d390"
    2018/11/15 02:09:03 [INFO] agent: Synced service "54ebba0c-841d-4d6a-8f09-734bab38d390"
    2018/11/15 02:10:34 [INFO] agent: Synced service "54ebba0c-841d-4d6a-8f09-734bab38d390"
    2018/11/15 02:11:46 [INFO] agent: Synced service "54ebba0c-841d-4d6a-8f09-734bab38d390"
    2018/11/15 02:13:34 [INFO] agent: Synced service "54ebba0c-841d-4d6a-8f09-734bab38d390"
    2018/11/15 02:14:52 [INFO] agent: Synced service "54ebba0c-841d-4d6a-8f09-734bab38d390"
    2018/11/15 02:16:04 [INFO] agent: Synced service "54ebba0c-841d-4d6a-8f09-734bab38d390"

The k8s deployment file looks like:

      - image: consul:1.2.3
        imagePullPolicy: IfNotPresent
        name: consul-agent
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "sleep 5; consul leave -http-addr=http://$BIND_IP:8500"]
        args:
        - "agent"
        - "-bind=0.0.0.0"
        - "-retry-join=consul-server.base"
        - "-datacenter=dc"
@pierresouchay
Contributor

This is by design I think.

@snownd
Author

snownd commented Nov 16, 2018

This is by design I think.

But the growth of the index makes blocking queries unusable. And this situation did not show up when I ran an agent outside of k8s.

@pierresouchay
Contributor

Blocking queries do work.
What is registered within your K8S service exactly? Are the healthchecks working properly?

@banks
Member

banks commented Nov 16, 2018

@duanxuelin as Pierre said the log output is normal - that is the agent performing periodic anti-entropy sync with servers.

I assume by "But the growth of index makes blocking queries unusable." you mean that every time it syncs, the service's ModifyIndex grows, which means it's constantly sending updates to blocking query watchers, right?

Can you show us the check status? If checks are failing then their output gets updated on the servers every anti-entropy sync, which is likely what you are seeing, but that shouldn't update the service index used to watch the whole service unless the check is flapping between passing and critical, so that the list of available services changes every time.
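To make the watcher side of this concrete, here is a minimal sketch (function names are made up, not from any particular client library) of the index-handling rules a blocking-query client follows: the watcher re-fires whenever the returned X-Consul-Index moves past the index it last waited on, which is why anything that bumps the service's ModifyIndex wakes every watcher even if the payload looks identical.

```javascript
// Hypothetical sketch of the client side of a Consul blocking query.
// Per Consul's documented rules, a missing, zero, or backwards index
// means the client must reset to a plain (non-blocking) query.
function nextWaitIndex(prevIndex, returnedIndex) {
  if (!Number.isInteger(returnedIndex) || returnedIndex < 1) return 0; // reset
  if (returnedIndex < prevIndex) return 0; // index went backwards: reset
  return returnedIndex;
}

// A watcher only has "new data" when the index strictly advanced.
function indexAdvanced(prevIndex, returnedIndex) {
  return returnedIndex > prevIndex;
}
```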

@snownd
Author

snownd commented Nov 19, 2018

@banks yes, it is as you said: the change of ModifyIndex makes my service get update messages constantly. And the check status is passing, at least the dashboard shows that. Meanwhile I do get a connection event every second from the TCP check.

Here is the Node.js code:

const ret = await consul.agent.service.register({
  name: serviceName,
  id: serviceId,
  address: serviceIP,
  port: servicePort,
  check: {
    deregistercriticalserviceafter: '1m',
    tcp: `127.0.0.1:${checkPort}`,
    timeout: '3s',
    interval: '1s'
  }
});
const watch = consul.watch({
  method: consul.health.service,
  options: {service: serviceName, passing: true}
});
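For reference, a `register()` call like the one above maps to a PUT against the agent's /v1/agent/service/register endpoint. A minimal sketch of the equivalent JSON payload, assuming the same variables (the helper function name is made up; the field names follow the documented Consul HTTP API schema):

```javascript
// Hypothetical helper that builds the JSON body the client library
// would send to PUT /v1/agent/service/register for the call above.
function buildRegisterPayload({serviceName, serviceId, serviceIP, servicePort, checkPort}) {
  return {
    Name: serviceName,
    ID: serviceId,
    Address: serviceIP,
    Port: servicePort,
    Check: {
      DeregisterCriticalServiceAfter: '1m',
      TCP: `127.0.0.1:${checkPort}`,
      Timeout: '3s',
      Interval: '1s'
    }
  };
}
```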

@banks
Member

banks commented Nov 20, 2018

Which Node.js library is that? I'm unfamiliar with the JS libs that have been contributed, although it's unlikely to be the cause of the issue.

Running that check every 1 second is somewhat aggressive. It should work, but it wouldn't surprise me if you hit occasional errors with it, which might cause a health update even though when you reload a second later it looks fine. Consul doesn't keep any history of checks, so as soon as one check fails it updates the state in the cluster to failed.

But I doubt that's the real problem. Can you confirm exactly what the lifecycle of your pod and service is? Is that JS in your application? Does it register itself with that check just on startup or more often? Re-registering can sometimes cause an update to the index even if nothing changed.

Finally, could you try:

  • Configure consul agent with -log-level DEBUG (or via config file if you have one)
  • curl the health API on that agent (or another agent if easier) in a loop and paste a few of the outputs so we can see what changed. Something like (assuming you are local to the agent): curl -is localhost:8500/v1/health/service/:serviceName. Do it a few minutes apart so we can see the Index growing and the output each time. The logs from above may help us correlate whether the app is re-registering or failing a health check or similar in that time.

@pierresouchay
Contributor

@duanxuelin

Use this script named consul_check_services_changes.sh:

#!/bin/bash
if test ${1:-none} = "none"
then
  echo "USAGE: $0 Consul_URL"
  echo "       Example: localhost:8500/v1/health/service/MY_SUPER_SERVICE"
  exit 1
fi
url_to_check=$1

headers=$(mktemp)
color_diff=$(which colordiff||which cdiff||echo diff)
content=$(mktemp)
index=0
while true;
do
  url="${url_to_check}?wait=10m&index=${index}&pretty=true&stale"
  curl -fs --dump-header "$headers" -o "${content}.new" "${url}" || { echo "Failed to query ${url}"; exit 1; }
  if test $index -ne 0
  then
    ${color_diff} -u "$content" "$content.new" && echo " diff: No Differences found in service"
  fi
  index=$(grep "X-Consul-Index" "$headers" | sed 's/[^0-9]*\([0-9][0-9]*\)[^0-9]*/\1/g')
  if test "${index:-not_found}" = "not_found"
  then
    # We are in a part of Consul that does not output X-Consul-Index in headers, switch to poll
    sleep 5
    index=1
  fi
  if test ${CONSUL_SLEEP_DELAY:-0} -gt 0
  then
    sleep ${CONSUL_SLEEP_DELAY}
  fi
  mv "$content.new" "$content"
  printf "X-Consul-Index: %s at %s\n" "$index" "$(date)"
done

And launch it against your agent this way (if your agent is remote, you can replace localhost with the name of any Consul agent in your cluster):

consul_check_services_changes.sh localhost:8500/v1/health/service/:service_name

It will display the changes in real time and should help you a lot (we use it intensively).

@snownd
Author

snownd commented Nov 21, 2018

@banks @pierresouchay
Now I am sure the change of index happens when the agent syncs the service, even though the check is passing. The sync started at 02:24:19, and the script printed the new index and resent the request at the same time. But I don't know why.

    2018/11/21 02:24:16 [DEBUG] agent: Check "service:d55e9420-899f-4272-8b64-7773117babeb" is passing
    2018/11/21 02:24:17 [DEBUG] agent: Check "service:d55e9420-899f-4272-8b64-7773117babeb" is passing
    2018/11/21 02:24:18 [DEBUG] agent: Check "service:d55e9420-899f-4272-8b64-7773117babeb" is passing
    2018/11/21 02:24:19 [DEBUG] agent: Check "service:d55e9420-899f-4272-8b64-7773117babeb" is passing
    2018/11/21 02:24:19 [DEBUG] agent: Skipping remote check "serfHealth" since it is managed automatically
    2018/11/21 02:24:19 [INFO] agent: Synced service "d55e9420-899f-4272-8b64-7773117babeb"
    2018/11/21 02:24:19 [DEBUG] agent: Check "service:d55e9420-899f-4272-8b64-7773117babeb" in sync
    2018/11/21 02:24:19 [DEBUG] agent: Node info in sync
    2018/11/21 02:24:19 [DEBUG] http: Request GET /v1/health/service/network-server?wait=10m&index=295733&pretty=true&stale (1m31.939727148s) from=127.0.0.1:57946
    2018/11/21 02:24:20 [DEBUG] agent: Check "service:d55e9420-899f-4272-8b64-7773117babeb" is passing
    2018/11/21 02:24:21 [DEBUG] agent: Check "service:d55e9420-899f-4272-8b64-7773117babeb" is passing
    2018/11/21 02:24:22 [DEBUG] agent: Check "service:d55e9420-899f-4272-8b64-7773117babeb" is passing
    2018/11/21 02:24:23 [DEBUG] agent: Check "service:d55e9420-899f-4272-8b64-7773117babeb" is passing
    2018/11/21 02:24:24 [DEBUG] agent: Check "service:d55e9420-899f-4272-8b64-7773117babeb" is passing
    2018/11/21 02:24:25 [DEBUG] agent: Check "service:d55e9420-899f-4272-8b64-7773117babeb" is passing

@pierresouchay
Contributor

2018/11/21 02:24:19 [DEBUG] http: Request GET /v1/health/service/network-server?wait=10m&index=295733&pretty=true&stale (1m31.939727148s) from=127.0.0.1:57946

This line shows that the index did not change for at least 1m31s, and my script did block during that time.

It means the sync did occur and triggered the watch.

Did the script print another line after?
Please post the script's output here, but to me, it works as intended.

@snownd
Copy link
Author

snownd commented Nov 21, 2018

@pierresouchay yes, it did print a line:
_f93b2af6-1977-4b63-8b7b-ea1f178e917e

    2018/11/21 10:39:28 [DEBUG] agent: Check "service:d55e9420-899f-4272-8b64-7773117babeb" is passing
    2018/11/21 10:39:29 [DEBUG] agent: Check "service:d55e9420-899f-4272-8b64-7773117babeb" is passing
    2018/11/21 10:39:30 [DEBUG] agent: Skipping remote check "serfHealth" since it is managed automatically
    2018/11/21 10:39:30 [INFO] agent: Synced service "d55e9420-899f-4272-8b64-7773117babeb"
    2018/11/21 10:39:30 [DEBUG] agent: Check "service:d55e9420-899f-4272-8b64-7773117babeb" in sync
    2018/11/21 10:39:30 [DEBUG] agent: Node info in sync
    2018/11/21 10:39:30 [DEBUG] http: Request GET /v1/health/service/network-server?wait=10m&index=303063&pretty=true&stale (1m29.76440681s) from=127.0.0.1:40402
    2018/11/21 10:39:30 [DEBUG] agent: Check "service:d55e9420-899f-4272-8b64-7773117babeb" is passing
    2018/11/21 10:39:30 [DEBUG] memberlist: Initiating push/pull sync with: 10.246.3.91:8301
    2018/11/21 10:39:31 [DEBUG] agent: Check "service:d55e9420-899f-4272-8b64-7773117babeb" is passing
    2018/11/21 10:39:32 [DEBUG] agent: Check "service:d55e9420-899f-4272-8b64-7773117babeb" is passing
    2018/11/21 10:40:38 [DEBUG] agent: Check "service:d55e9420-899f-4272-8b64-7773117babeb" is passing
    2018/11/21 10:40:39 [DEBUG] agent: Skipping remote check "serfHealth" since it is managed automatically
    2018/11/21 10:40:39 [INFO] agent: Synced service "d55e9420-899f-4272-8b64-7773117babeb"
    2018/11/21 10:40:39 [DEBUG] agent: Check "service:d55e9420-899f-4272-8b64-7773117babeb" in sync
    2018/11/21 10:40:39 [DEBUG] agent: Node info in sync
    2018/11/21 10:40:39 [DEBUG] http: Request GET /v1/health/service/network-server?wait=10m&index=303089&pretty=true&stale (1m8.386090357s) from=127.0.0.1:41212
    2018/11/21 10:40:39 [DEBUG] agent: Check "service:d55e9420-899f-4272-8b64-7773117babeb" is passing
    2018/11/21 10:40:40 [DEBUG] agent: Check "service:d55e9420-899f-4272-8b64-7773117babeb" is passing
    2018/11/21 10:40:41 [DEBUG] agent: Check "service:d55e9420-899f-4272-8b64-7773117babeb" is passing
    2018/11/21 10:40:42 [DEBUG] agent: Check "service:d55e9420-899f-4272-8b64-7773117babeb" is passing
    2018/11/21 10:40:43 [DEBUG] agent: Check "service:d55e9420-899f-4272-8b64-7773117babeb" is passing
    2018/11/21 10:40:44 [DEBUG] agent: Check "service:d55e9420-899f-4272-8b64-7773117babeb" is passing

The index grows every time a sync happens. It doesn't make sense.

@banks
Member

banks commented Nov 21, 2018 via email

@snownd
Author

snownd commented Nov 22, 2018

@banks I use a TCP check. The agent starts a connection and I accept it, nothing more.

@banks
Member

banks commented Nov 22, 2018

Hmm, so do you have a Consul agent running inside your pod? If not, how does 127.0.0.1 in the TCP check resolve to the service in another pod? If so, I'm still confused, but it's worth noting that we don't generally recommend running in containers this way - it's preferable to deploy the agent as a DaemonSet, so one per node, and use hostIP to access it.

That said, I'm not sure why you are seeing this behaviour.

    2018/11/21 10:39:30 [INFO] agent: Synced service "d55e9420-899f-4272-8b64-7773117babeb"
    2018/11/21 10:39:30 [DEBUG] agent: Check "service:d55e9420-899f-4272-8b64-7773117babeb" in sync
    2018/11/21 10:39:30 [DEBUG] agent: Node info in sync

I should have looked closer; you can see there that the check is "in sync" - it's the service registration itself that is apparently not in sync.

I know you pasted code for registration already, but can you link to the JS API client you are using just in case it's doing something weird?

Also, can you paste the following output (similar to the ones I requested a few messages ago)? If the agent is in your pod then you'll need to run this inside the pod with kubectl exec.

curl -s localhost:8500/v1/agent/service/d55e9420-899f-4272-8b64-7773117babeb

(assuming your service still has that UUID as its name/ID; if not, show that for all the services actually running).

If you could paste the output of that twice - a few minutes apart (after you've seen a sync) - that will hopefully help us figure out what is changing about your service (or, if nothing, maybe help us reproduce).
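One quick way to compare the two snapshots is a small field-by-field diff. Sketched here as a hypothetical helper (the name and shape are assumptions, not part of any Consul tooling):

```javascript
// Hypothetical helper for comparing two snapshots of
// /v1/agent/service/<id> taken a few minutes apart: returns the
// sorted list of top-level fields whose values differ, to spot what
// the anti-entropy sync thinks changed.
function changedFields(before, after) {
  const keys = new Set([...Object.keys(before), ...Object.keys(after)]);
  const diffs = [];
  for (const key of keys) {
    // JSON comparison is a cheap structural equality check here.
    if (JSON.stringify(before[key]) !== JSON.stringify(after[key])) diffs.push(key);
  }
  return diffs.sort();
}
```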

@pierresouchay
Contributor

@banks I suspect @duanxuelin uses a Consul version lower than 1.3.0.

@duanxuelin In version 1.3.0, this PR is included: #4720 -> it avoids having the ModifyIndex continuously updated.

As we can see in the output of my script in the comment
#4960 (comment), the change occurs every minute, so basically each time the anti-entropy does its job (@duanxuelin I suspect it is a very small cluster, right?).

So, if my assumptions are correct, you have a small Consul cluster running a version lower than 1.3.0.

In that case, upgrading to Consul 1.3.0+ will fix your issue thanks to #4720.
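For intuition, the idea behind that fix can be sketched like this (an illustrative model only, not Consul's actual Go implementation): skip the write, and therefore the index bump, when the re-registered definition is identical to what is already stored.

```javascript
// Illustrative sketch (NOT Consul's real code) of the idea behind the
// 1.3.0 fix: only persist a service registration, and hence bump its
// ModifyIndex, when the submitted definition actually differs from
// what the servers already have.
function applyRegistration(store, registration, raftIndex) {
  const existing = store.get(registration.ID);
  if (existing && JSON.stringify(existing.def) === JSON.stringify(registration)) {
    return existing.modifyIndex; // no-op anti-entropy sync: index unchanged
  }
  store.set(registration.ID, {def: registration, modifyIndex: raftIndex});
  return raftIndex; // real change: index advances, watchers fire
}
```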

@snownd
Author

snownd commented Nov 23, 2018

@pierresouchay @banks
Problem solved after I upgraded to Consul 1.3.0.
Thanks for the help.
