Service resynced every ~2mins, causes consul index to grow #4960
This is by design I think. |
But the growth of the index makes blocking queries unusable. And this situation did not show up when I ran an agent outside k8s. |
Blocking queries do work. |
@duanxuelin as Pierre said, the log output is normal - that is the agent performing its periodic anti-entropy sync with the servers. I assume by "But the growth of index makes blocking queries unusable." you mean that every time it syncs, the service ModifiedIndex grows, which means it's constantly sending updates to blocking query watchers, right? Can you show us the check status? If checks are failing then their output gets updated on the servers on every anti-entropy sync, which is likely what you are seeing, but that shouldn't update the index used to watch the whole service unless the check is flapping between passing and critical, so that the list of available services changes every time. |
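For reference, the check status and the index that blocking-query watchers follow can be inspected directly through the HTTP API. A minimal sketch, assuming a local agent on the default port and a placeholder service name:

# Hypothetical commands, not from the thread; adjust the service name.
# Check status (CheckID, Status, Output) for one service:
curl -s "localhost:8500/v1/health/checks/my-service?pretty"
# Full health entries plus the X-Consul-Index header that blocking queries watch:
curl -s --dump-header - -o /dev/null "localhost:8500/v1/health/service/my-service" | grep X-Consul-Index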
@banks yes, it is as you said: the change of ModifiedIndex means my service gets update messages constantly. And the check status is passing, at least the dashboard shows that. Meanwhile I do get a connection event every second with the TCP check. Here is the code, using Node.js:
|
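As a rough, hypothetical equivalent of such a registration (the service name, ID and port below are placeholders, not taken from the Node.js code), the same thing can be expressed against the agent HTTP API:

# Hypothetical registration with a 1-second TCP check via the agent API.
curl -s -X PUT localhost:8500/v1/agent/service/register -d '{
  "ID": "my-service-1",
  "Name": "my-service",
  "Port": 8080,
  "Check": { "TCP": "127.0.0.1:8080", "Interval": "1s" }
}'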
Which Node.js library is that? I'm unfamiliar with the JS libs that have been contributed, although it's unlikely to be the cause of the issue. Running that check every 1 second is somewhat aggressive. It should work, but it wouldn't surprise me if you hit occasional errors with it which might cause the health update, even though when you reload 1 second later it looks like it is fine; Consul doesn't keep any history of checks, so as soon as one check fails it updates the state in the cluster to failed. But I doubt that's the real problem. Can you confirm exactly what the lifecycle of your pod and service is? Is that JS in your application? Does it register itself with that check just on startup or more often? Re-registering can sometimes cause an update to the index even if nothing changed. Finally, could you try:
|
@duanxuelin Use this script:

#!/bin/bash
if test ${1:-none} = "none"
then
  echo "USAGE: $0 Consul_URL"
  echo " Example: localhost:8500/v1/health/service/MY_SUPER_SERVICE"
  exit 1
fi
url_to_check=$1
headers=$(mktemp)
color_diff=$(which colordiff||which cdiff||echo diff)
content=$(mktemp)
index=0
while true;
do
  url="${url_to_check}?wait=10m&index=${index}&pretty=true&stale"
  curl -fs --dump-header "$headers" -o "${content}.new" "${url}" || { echo "Failed to query ${url}"; exit 1; }
  if test $index -ne 0
  then
    ${color_diff} -u "$content" "$content.new" && echo " diff: No Differences found in service"
  fi
  index=$(grep "X-Consul-Index" "$headers" | sed 's/[^0-9]*\([0-9][0-9]*\)[^0-9]*/\1/g')
  if test "${index:-not_found}" = "not_found"
  then
    # We are in a part of Consul that does not output X-Consul-Index in headers, switch to poll
    sleep 5
    index=1
  fi
  if test ${CONSUL_SLEEP_DELAY:-0} -gt 0
  then
    sleep ${CONSUL_SLEEP_DELAY}
  fi
  mv "$content.new" "$content"
  printf "X-Consul-Index: $index at $(date)\n"
done

And launch it against your agent this way (if your agent is remote, you can replace localhost:8500 with your agent's address):
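For example (the script filename here is only a placeholder; the URL format follows the USAGE line the script prints):

# Hypothetical invocation; replace MY_SUPER_SERVICE with your service name.
./watch-service.sh localhost:8500/v1/health/service/MY_SUPER_SERVICE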
It will display the changes in real time and help you a lot (we are using it intensively) |
@banks @pierresouchay
|
This line shows that the index did not change for at least 1m31s, and my script did block during that time. It means the sync did occur and triggered the watch. Did the script print a line again after that? |
@pierresouchay yes, it did print a line:
The index grows every time a sync happens. It doesn't make sense. |
@duanxuelin does your check output something different every time? If so, then on every sync we will update the output of the script on the server, and that causes the update. Typically you should aim for checks to output exactly the same thing (if they output anything) for the passing state, otherwise the cluster has to do lots of extra work keeping that output in sync (see the sketch after this message).
…On Wed, Nov 21, 2018 at 10:48 AM duanxuelin wrote:
@pierresouchay yes, it did print a line:
2018/11/21 10:39:28 [DEBUG] agent: Check "service:d55e9420-899f-4272-8b64-7773117babeb" is passing
2018/11/21 10:39:29 [DEBUG] agent: Check "service:d55e9420-899f-4272-8b64-7773117babeb" is passing
2018/11/21 10:39:30 [DEBUG] agent: Skipping remote check "serfHealth" since it is managed automatically
2018/11/21 10:39:30 [INFO] agent: Synced service "d55e9420-899f-4272-8b64-7773117babeb"
2018/11/21 10:39:30 [DEBUG] agent: Check "service:d55e9420-899f-4272-8b64-7773117babeb" in sync
2018/11/21 10:39:30 [DEBUG] agent: Node info in sync
2018/11/21 10:39:30 [DEBUG] http: Request GET /v1/health/service/network-server?wait=10m&index=303063&pretty=true&stale (1m29.76440681s) from=127.0.0.1:40402
2018/11/21 10:39:30 [DEBUG] agent: Check "service:d55e9420-899f-4272-8b64-7773117babeb" is passing
2018/11/21 10:39:30 [DEBUG] memberlist: Initiating push/pull sync with: 10.246.3.91:8301
2018/11/21 10:39:31 [DEBUG] agent: Check "service:d55e9420-899f-4272-8b64-7773117babeb" is passing
2018/11/21 10:39:32 [DEBUG] agent: Check "service:d55e9420-899f-4272-8b64-7773117babeb" is passing
2018/11/21 10:40:38 [DEBUG] agent: Check "service:d55e9420-899f-4272-8b64-7773117babeb" is passing
2018/11/21 10:40:39 [DEBUG] agent: Skipping remote check "serfHealth" since it is managed automatically
2018/11/21 10:40:39 [INFO] agent: Synced service "d55e9420-899f-4272-8b64-7773117babeb"
2018/11/21 10:40:39 [DEBUG] agent: Check "service:d55e9420-899f-4272-8b64-7773117babeb" in sync
2018/11/21 10:40:39 [DEBUG] agent: Node info in sync
2018/11/21 10:40:39 [DEBUG] http: Request GET /v1/health/service/network-server?wait=10m&index=303089&pretty=true&stale (1m8.386090357s) from=127.0.0.1:41212
2018/11/21 10:40:39 [DEBUG] agent: Check "service:d55e9420-899f-4272-8b64-7773117babeb" is passing
2018/11/21 10:40:40 [DEBUG] agent: Check "service:d55e9420-899f-4272-8b64-7773117babeb" is passing
2018/11/21 10:40:41 [DEBUG] agent: Check "service:d55e9420-899f-4272-8b64-7773117babeb" is passing
2018/11/21 10:40:42 [DEBUG] agent: Check "service:d55e9420-899f-4272-8b64-7773117babeb" is passing
2018/11/21 10:40:43 [DEBUG] agent: Check "service:d55e9420-899f-4272-8b64-7773117babeb" is passing
2018/11/21 10:40:44 [DEBUG] agent: Check "service:d55e9420-899f-4272-8b64-7773117babeb" is passing
The index grows every time when a sync happened. It doesn't make sense.
|
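To illustrate the point about stable check output above, a hypothetical script check (not the TCP check used in this thread):

#!/bin/bash
# Hypothetical script check. Output that changes on every run, e.g.
#   echo "OK at $(date)"
# forces every anti-entropy sync to push a new check output to the servers.
# Output that is identical for every passing run gives the sync nothing to update:
echo "OK"
exit 0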
@banks I use a TCP check. The agent starts a connection and I accept it, nothing more. |
Hmm, so do you have a Consul agent running inside your pod? If not, how does 127.0.0.1 in the TCP check resolve to the service in another pod? If so, I'm still confused, but it's worth noting that we don't generally recommend running in containers this way - it's preferable to deploy the agent as a DaemonSet, so one per node, and use hostIP to access it. That said, I'm not sure why you are seeing this behaviour.
I should have looked closer: you can see there that the check is "in sync"; it's the service registration itself that is apparently not in sync. I know you pasted code for the registration already, but can you link to the JS API client you are using, just in case it's doing something weird? Also, can you paste the following output (similar to the ones I requested a few messages ago)? If the agent is in your pod then you'll need to run this inside the pod with kubectl exec.
(assuming your service still has that UUID as its name/ID; if not, show that for all the services actually running). Could you paste the output of that twice, a few minutes apart (after you've seen a sync)? That will hopefully help us figure out what is changing about your service (or, if nothing, maybe help us reproduce). |
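The exact commands requested are not reproduced above; hypothetically, the agent- and catalog-side views that would let you diff two snapshots look like this (the service name is taken from the agent logs earlier in the thread; run these inside the pod if the agent lives there):

# Hypothetical commands; "network-server" is the service name seen in the logs above.
curl -s "localhost:8500/v1/agent/services?pretty"
curl -s "localhost:8500/v1/catalog/service/network-server?pretty"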
@banks I suspect @duanxuelin uses a Consul version lower than 1.3.0. @duanxuelin In version 1.3.0, this PR is included: #4720, and it actually avoids having the ModifiedIndex continuously updated, as we can see in the output of my script in the comment above. So, if my assumptions are correct, you have a small Consul cluster with a version lower than 1.3.0. In that case, upgrading to Consul 1.3.0+ will fix your issue thanks to #4720 |
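A quick way to confirm which version each agent and server is actually running (a hedged sketch, assuming the default HTTP address):

# Version of the local binary:
consul version
# Version reported by the running agent over HTTP:
curl -s "localhost:8500/v1/agent/self?pretty" | grep -m1 '"Version"'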
@pierresouchay @banks |
We are running the agent in a k8s pod. After registering a service, the agent synced the service continuously.
The k8s deployment file is like: