
Service resynced every ~2mins, causes consul index to grow #4960

Closed
Tracked by #115
snownd opened this issue Nov 15, 2018 · 15 comments

@snownd

snownd commented Nov 15, 2018

We are running the agent in a k8s pod. After registering a service, the agent synced the service continuously:

    2018/11/15 02:04:41 [INFO] agent: Synced node info
    2018/11/15 02:04:46 [INFO] agent: Synced service "54ebba0c-841d-4d6a-8f09-734bab38d390"
    2018/11/15 02:04:46 [INFO] agent: Synced check "service:54ebba0c-841d-4d6a-8f09-734bab38d390"
==> Newer Consul version available: 1.4.0 (currently running: 1.2.3)
    2018/11/15 02:06:08 [INFO] agent: Synced service "54ebba0c-841d-4d6a-8f09-734bab38d390"
    2018/11/15 02:07:46 [INFO] agent: Synced service "54ebba0c-841d-4d6a-8f09-734bab38d390"
    2018/11/15 02:09:03 [INFO] agent: Synced service "54ebba0c-841d-4d6a-8f09-734bab38d390"
    2018/11/15 02:10:34 [INFO] agent: Synced service "54ebba0c-841d-4d6a-8f09-734bab38d390"
    2018/11/15 02:11:46 [INFO] agent: Synced service "54ebba0c-841d-4d6a-8f09-734bab38d390"
    2018/11/15 02:13:34 [INFO] agent: Synced service "54ebba0c-841d-4d6a-8f09-734bab38d390"
    2018/11/15 02:14:52 [INFO] agent: Synced service "54ebba0c-841d-4d6a-8f09-734bab38d390"
    2018/11/15 02:16:04 [INFO] agent: Synced service "54ebba0c-841d-4d6a-8f09-734bab38d390"

The k8s deployment file looks like:

      - image: consul:1.2.3
        imagePullPolicy: IfNotPresent
        name: consul-agent
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "sleep 5; consul leave -http-addr=http://$BIND_IP:8500"]
        args:
        - "agent"
        - "-bind=0.0.0.0"
        - "-retry-join=consul-server.base"
        - "-datacenter=dc"
@pierresouchay
Contributor

This is by design I think.

@snownd
Author

snownd commented Nov 16, 2018

This is by design I think.

But the growth of the index makes blocking queries unusable. And this situation did not show up when I ran an agent outside of k8s.

@pierresouchay
Contributor

Blocking queries do work.
What is registered within your K8S service exactly? Are the healthchecks working properly?

@banks
Member

banks commented Nov 16, 2018

@duanxuelin as Pierre said the log output is normal - that is the agent performing periodic anti-entropy sync with servers.

I assume by "But the growth of index makes blocking queries unusable." you mean that every time it syncs, the service's ModifyIndex grows, which means it's constantly sending updates to blocking query watchers, right?

Can you show us the check status? If checks are failing then their output gets updated on the servers every anti-entropy sync, which is likely what you are seeing, but that shouldn't update the service index used to watch the whole service unless the check is flapping between passing and critical, so that the list of available services changes every time.
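To make the watcher side of this concrete, here is a minimal sketch (function names are made up, not from any particular client library) of the index-handling rules a blocking-query client follows: the watcher re-fires whenever the returned X-Consul-Index moves past the index it last waited on, which is why anything that bumps the service's ModifyIndex wakes every watcher even if the payload looks identical.

```javascript
// Hypothetical sketch of the client side of a Consul blocking query.
// Per Consul's documented rules, a missing, zero, or backwards index
// means the client must reset to a plain (non-blocking) query.
function nextWaitIndex(prevIndex, returnedIndex) {
  if (!Number.isInteger(returnedIndex) || returnedIndex < 1) return 0; // reset
  if (returnedIndex < prevIndex) return 0; // index went backwards: reset
  return returnedIndex;
}

// A watcher only has "new data" when the index strictly advanced.
function indexAdvanced(prevIndex, returnedIndex) {
  return returnedIndex > prevIndex;
}
```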

@snownd
Author

snownd commented Nov 19, 2018

@banks yes, it is as you said: the change of ModifyIndex makes my service get update messages constantly. And the check status is passing, at least the dashboard shows that. Meanwhile I do get a connection event every second from the TCP check.

Here is the Node.js code:

const ret = await consul.agent.service.register({
  name: serviceName,
  id: serviceId,
  address: serviceIP,
  port: servicePort,
  check: {
    deregistercriticalserviceafter: '1m',
    tcp: `127.0.0.1:${checkPort}`,
    timeout: '3s',
    interval: '1s'
  }
});
const watch = consul.watch({
  method: consul.health.service,
  options: {service: serviceName, passing: true}
});
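For reference, a `register()` call like the one above maps to a PUT against the agent's /v1/agent/service/register endpoint. A minimal sketch of the equivalent JSON payload, assuming the same variables (the helper function name is made up; the field names follow the documented Consul HTTP API schema):

```javascript
// Hypothetical helper that builds the JSON body the client library
// would send to PUT /v1/agent/service/register for the call above.
function buildRegisterPayload({serviceName, serviceId, serviceIP, servicePort, checkPort}) {
  return {
    Name: serviceName,
    ID: serviceId,
    Address: serviceIP,
    Port: servicePort,
    Check: {
      DeregisterCriticalServiceAfter: '1m',
      TCP: `127.0.0.1:${checkPort}`,
      Timeout: '3s',
      Interval: '1s'
    }
  };
}
```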

@banks
Member

banks commented Nov 20, 2018

Which Node.js library is that? I'm unfamiliar with the JS libs that have been contributed, although it's unlikely to be the cause of the issue.

Running that check every 1 second is somewhat aggressive. It should work, but it wouldn't surprise me if you hit occasional errors with it, which might cause a health update even though when you reload a second later it looks fine. Consul doesn't keep any history of checks, so as soon as one check fails it updates the state in the cluster to failed.

But I doubt that's the real problem. Can you confirm exactly what the lifecycle of your pod and service is? Is that JS in your application? Does it register itself with that check just on startup or more often? Re-registering can sometimes cause an update to the index even if nothing changed.

Finally, could you try:

  • Configure consul agent with -log-level DEBUG (or via config file if you have one)
  • curl the health API on that agent (or another agent if easier) in a loop and paste a few of the outputs so we can see what changed. Something like (assuming you are local to the agent): curl -is localhost:8500/v1/health/service/:serviceName. Do it a few minutes apart so we can see the Index growing and the output each time. The logs from above may help us correlate whether the app is re-registering or failing a health check or similar in that time.

@pierresouchay
Contributor

@duanxuelin

Use this script named consul_check_services_changes.sh:

#!/bin/bash
if test ${1:-none} = "none"
then
  echo "USAGE: $0 Consul_URL"
  echo "       Example: localhost:8500/v1/health/service/MY_SUPER_SERVICE"
  exit 1
fi
url_to_check=$1

headers=$(mktemp)
color_diff=$(which colordiff||which cdiff||echo diff)
content=$(mktemp)
index=0
while true;
do
  url="${url_to_check}?wait=10m&index=${index}&pretty=true&stale"
  curl -fs --dump-header "$headers" -o "${content}.new" "${url}" || { echo "Failed to query ${url}"; exit 1; }
  if test $index -ne 0
  then
    ${color_diff} -u "$content" "$content.new" && echo " diff: No Differences found in service"
  fi
  index=$(grep "X-Consul-Index" "$headers" | sed 's/[^0-9]*\([0-9][0-9]*\)[^0-9]*/\1/g')
  if test "${index:-not_found}" = "not_found"
  then
    # We are in a part of Consul that does not output X-Consul-Index in headers, switch to poll
    sleep 5
    index=1
  fi
  if test ${CONSUL_SLEEP_DELAY:-0} -gt 0
  then
    sleep ${CONSUL_SLEEP_DELAY}
  fi
  mv "$content.new" "$content"
  printf "X-Consul-Index: %s at %s\n" "$index" "$(date)"
done

And launch it against your agent this way (if your agent is remote, you can replace localhost with the name of any Consul agent in your cluster):

consul_check_services_changes.sh localhost:8500/v1/health/service/:service_name

It will display the changes in real time and should help you a lot (we use it intensively).

@snownd
Author

snownd commented Nov 21, 2018

@banks @pierresouchay
Now I am sure the change of index happens when the agent syncs the service, even though the check is passing. The sync started at 02:24:19, and the script printed the new index and resent the request at the same time. But I don't know why.

    2018/11/21 02:24:16 [DEBUG] agent: Check "service:d55e9420-899f-4272-8b64-7773117babeb" is passing
    2018/11/21 02:24:17 [DEBUG] agent: Check "service:d55e9420-899f-4272-8b64-7773117babeb" is passing
    2018/11/21 02:24:18 [DEBUG] agent: Check "service:d55e9420-899f-4272-8b64-7773117babeb" is passing
    2018/11/21 02:24:19 [DEBUG] agent: Check "service:d55e9420-899f-4272-8b64-7773117babeb" is passing
    2018/11/21 02:24:19 [DEBUG] agent: Skipping remote check "serfHealth" since it is managed automatically
    2018/11/21 02:24:19 [INFO] agent: Synced service "d55e9420-899f-4272-8b64-7773117babeb"
    2018/11/21 02:24:19 [DEBUG] agent: Check "service:d55e9420-899f-4272-8b64-7773117babeb" in sync
    2018/11/21 02:24:19 [DEBUG] agent: Node info in sync
    2018/11/21 02:24:19 [DEBUG] http: Request GET /v1/health/service/network-server?wait=10m&index=295733&pretty=true&stale (1m31.939727148s) from=127.0.0.1:57946
    2018/11/21 02:24:20 [DEBUG] agent: Check "service:d55e9420-899f-4272-8b64-7773117babeb" is passing
    2018/11/21 02:24:21 [DEBUG] agent: Check "service:d55e9420-899f-4272-8b64-7773117babeb" is passing
    2018/11/21 02:24:22 [DEBUG] agent: Check "service:d55e9420-899f-4272-8b64-7773117babeb" is passing
    2018/11/21 02:24:23 [DEBUG] agent: Check "service:d55e9420-899f-4272-8b64-7773117babeb" is passing
    2018/11/21 02:24:24 [DEBUG] agent: Check "service:d55e9420-899f-4272-8b64-7773117babeb" is passing
    2018/11/21 02:24:25 [DEBUG] agent: Check "service:d55e9420-899f-4272-8b64-7773117babeb" is passing

@pierresouchay
Contributor

2018/11/21 02:24:19 [DEBUG] http: Request GET /v1/health/service/network-server?wait=10m&index=295733&pretty=true&stale (1m31.939727148s) from=127.0.0.1:57946

This line shows that the index did not change for at least 1m31s, and my script did block during that time.

It means the sync did occur and triggered the watch.

Did the script print another line after?
Please post the script's output here, but to me, it works as intended.

@snownd
Copy link
Author

snownd commented Nov 21, 2018

@pierresouchay yes, it did print a line:
_f93b2af6-1977-4b63-8b7b-ea1f178e917e

    2018/11/21 10:39:28 [DEBUG] agent: Check "service:d55e9420-899f-4272-8b64-7773117babeb" is passing
    2018/11/21 10:39:29 [DEBUG] agent: Check "service:d55e9420-899f-4272-8b64-7773117babeb" is passing
    2018/11/21 10:39:30 [DEBUG] agent: Skipping remote check "serfHealth" since it is managed automatically
    2018/11/21 10:39:30 [INFO] agent: Synced service "d55e9420-899f-4272-8b64-7773117babeb"
    2018/11/21 10:39:30 [DEBUG] agent: Check "service:d55e9420-899f-4272-8b64-7773117babeb" in sync
    2018/11/21 10:39:30 [DEBUG] agent: Node info in sync
    2018/11/21 10:39:30 [DEBUG] http: Request GET /v1/health/service/network-server?wait=10m&index=303063&pretty=true&stale (1m29.76440681s) from=127.0.0.1:40402
    2018/11/21 10:39:30 [DEBUG] agent: Check "service:d55e9420-899f-4272-8b64-7773117babeb" is passing
    2018/11/21 10:39:30 [DEBUG] memberlist: Initiating push/pull sync with: 10.246.3.91:8301
    2018/11/21 10:39:31 [DEBUG] agent: Check "service:d55e9420-899f-4272-8b64-7773117babeb" is passing
    2018/11/21 10:39:32 [DEBUG] agent: Check "service:d55e9420-899f-4272-8b64-7773117babeb" is passing
    2018/11/21 10:40:38 [DEBUG] agent: Check "service:d55e9420-899f-4272-8b64-7773117babeb" is passing
    2018/11/21 10:40:39 [DEBUG] agent: Skipping remote check "serfHealth" since it is managed automatically
    2018/11/21 10:40:39 [INFO] agent: Synced service "d55e9420-899f-4272-8b64-7773117babeb"
    2018/11/21 10:40:39 [DEBUG] agent: Check "service:d55e9420-899f-4272-8b64-7773117babeb" in sync
    2018/11/21 10:40:39 [DEBUG] agent: Node info in sync
    2018/11/21 10:40:39 [DEBUG] http: Request GET /v1/health/service/network-server?wait=10m&index=303089&pretty=true&stale (1m8.386090357s) from=127.0.0.1:41212
    2018/11/21 10:40:39 [DEBUG] agent: Check "service:d55e9420-899f-4272-8b64-7773117babeb" is passing
    2018/11/21 10:40:40 [DEBUG] agent: Check "service:d55e9420-899f-4272-8b64-7773117babeb" is passing
    2018/11/21 10:40:41 [DEBUG] agent: Check "service:d55e9420-899f-4272-8b64-7773117babeb" is passing
    2018/11/21 10:40:42 [DEBUG] agent: Check "service:d55e9420-899f-4272-8b64-7773117babeb" is passing
    2018/11/21 10:40:43 [DEBUG] agent: Check "service:d55e9420-899f-4272-8b64-7773117babeb" is passing
    2018/11/21 10:40:44 [DEBUG] agent: Check "service:d55e9420-899f-4272-8b64-7773117babeb" is passing

The index grows every time a sync happens. It doesn't make sense.

@banks
Member

banks commented Nov 21, 2018 via email

@snownd
Author

snownd commented Nov 22, 2018

@banks I use a TCP check. The agent starts a connection and I accept it, nothing more.

@banks
Member

banks commented Nov 22, 2018

Hmm, so do you have a Consul agent running inside your pod? If not, how does 127.0.0.1 in the TCP check resolve to the service in another pod? If so, I'm still confused, but it's worth noting that we don't generally recommend running in containers this way - it's preferable to deploy the agent as a DaemonSet, so one per node, and use hostIP to access it.

That said, I'm not sure why you are seeing this behaviour.

    2018/11/21 10:39:30 [INFO] agent: Synced service "d55e9420-899f-4272-8b64-7773117babeb"
    2018/11/21 10:39:30 [DEBUG] agent: Check "service:d55e9420-899f-4272-8b64-7773117babeb" in sync
    2018/11/21 10:39:30 [DEBUG] agent: Node info in sync

I should have looked closer; you can see there that the check is "in sync" - it's the service registration itself that is apparently not in sync.

I know you pasted code for registration already, but can you link to the JS API client you are using just in case it's doing something weird?

Also, can you paste the following output (similar to the ones I requested a few messages ago)? If the agent is in your pod then you'll need to run this inside the pod with kubectl exec.

curl -s localhost:8500/v1/agent/service/d55e9420-899f-4272-8b64-7773117babeb

(assuming your service still has that UUID as its name/ID; if not, show that for all the services actually running).

If you could paste the output of that twice - a few minutes apart (after you've seen a sync) - that will hopefully help us figure out what is changing about your service (or, if nothing, maybe help us reproduce).
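One quick way to compare the two snapshots is a small field-by-field diff. Sketched here as a hypothetical helper (the name and shape are assumptions, not part of any Consul tooling):

```javascript
// Hypothetical helper for comparing two snapshots of
// /v1/agent/service/<id> taken a few minutes apart: returns the
// sorted list of top-level fields whose values differ, to spot what
// the anti-entropy sync thinks changed.
function changedFields(before, after) {
  const keys = new Set([...Object.keys(before), ...Object.keys(after)]);
  const diffs = [];
  for (const key of keys) {
    // JSON comparison is a cheap structural equality check here.
    if (JSON.stringify(before[key]) !== JSON.stringify(after[key])) diffs.push(key);
  }
  return diffs.sort();
}
```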

@pierresouchay
Contributor

@banks I suspect @duanxuelin uses a Consul version lower than 1.3.0.

@duanxuelin In version 1.3.0, this PR is included: #4720 -> it avoids having the ModifyIndex continuously updated.

As we can see in the output of my script in the comment
#4960 (comment), the change occurs every minute, so basically each time the anti-entropy does its job (@duanxuelin I suspect it is a very small cluster, right?).

So, if my assumptions are correct, you have a small Consul cluster running a version lower than 1.3.0.

In that case, upgrading to Consul 1.3.0+ will fix your issue thanks to #4720.
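For intuition, the idea behind that fix can be sketched like this (an illustrative model only, not Consul's actual Go implementation): skip the write, and therefore the index bump, when the re-registered definition is identical to what is already stored.

```javascript
// Illustrative sketch (NOT Consul's real code) of the idea behind the
// 1.3.0 fix: only persist a service registration, and hence bump its
// ModifyIndex, when the submitted definition actually differs from
// what the servers already have.
function applyRegistration(store, registration, raftIndex) {
  const existing = store.get(registration.ID);
  if (existing && JSON.stringify(existing.def) === JSON.stringify(registration)) {
    return existing.modifyIndex; // no-op anti-entropy sync: index unchanged
  }
  store.set(registration.ID, {def: registration, modifyIndex: raftIndex});
  return raftIndex; // real change: index advances, watchers fire
}
```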

@snownd
Author

snownd commented Nov 23, 2018

@pierresouchay @banks
Problem solved after I upgraded to Consul 1.3.0.
Thanks for the help.
