
Consul 1.5.3 changes check status behavior when doing a consul reload? #7318

Closed
lvets opened this issue Feb 18, 2020 · 9 comments

Comments

lvets commented Feb 18, 2020

Overview of the Issue

Between Consul 1.5.2 and Consul 1.5.3, the default behavior of node checks during a consul reload changed:
With Consul 1.5.2, checks for a specific node had "passing" as their status and stayed that way through a consul reload.
With Consul 1.5.3 and later, "passing" checks go to "critical" on a consul reload and only return to "passing" once the checks run again.
I would have expected check statuses not to change during a consul reload.

Additionally, because we're using Fabio, this also means that Fabio temporarily removes routes based on these checks during a consul reload, effectively causing an outage.

Reproduction Steps

Steps to reproduce this issue (an illustrative config sketch follows the list):

  1. Use Consul 1.5.2

  2. Run consul reload

  3. Check Fabio logs and/or curl -s localhost:8500/v1/health/node/node

  4. The check status doesn't change

  5. Use Consul 1.5.3

  6. Run consul reload

  7. Check Fabio logs and/or curl -s localhost:8500/v1/health/node/node

  8. Fabio routes are removed and check statuses change to "critical" until the checks run again.
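
For reference, a minimal sketch of the kind of setup used above (illustrative only: the service name, port, config path, and health endpoint are placeholders, and the node name is assumed to equal the hostname):

```sh
# Register one service with one HTTP check, then reload the agent.
cat > /etc/consul.d/web.json <<'EOF'
{
  "service": {
    "name": "web",
    "id": "web-1",
    "port": 8080,
    "check": {
      "id": "web-http",
      "name": "web HTTP health",
      "http": "http://localhost:8080/health",
      "interval": "10s"
    }
  }
}
EOF

consul reload

# On 1.5.2 the status stays "passing"; on 1.5.3+ it flips to "critical"
# until the next check run.
curl -s localhost:8500/v1/health/node/$(hostname) | grep -o '"Status":"[a-z]*"'
```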

Consul info for both Client and Server

Consul server: 1.6.0
Consul agent: 1.5.2 and 1.5.3.

Operating system and Environment details

OS: SLES 12 and Amazon Linux 2.

Log Fragments

I'm not 100% sure which logs to include: the Consul logs are the same between versions, and in Fabio I can see routes being removed and re-added with Consul 1.5.3, but nothing with 1.5.2 (i.e. the routes stay).

@pierresouchay
Contributor

I have the feeling it is linked to #6144...
@lvets Do you have a reproduction test case?
Do your checks have IDs?

@lvets
Author

lvets commented Feb 19, 2020

@pierresouchay, I'm currently testing in our development landscape; I'm not sure how I can easily translate that into a simple test case, but I'll try :)
We only have 2 checks; they each have a name and an ID, both of which are unique.

@pierresouchay
Contributor

@lvets Any news?


lvets commented Feb 25, 2020

@pierresouchay See https://github.com/lvets/legendary-octo-potato for a quick-and-dirty test scenario with Fabio, Consul servers, and a bunch of Consul agents. It took a bit of time to translate our production infrastructure into docker-compose.

@pierresouchay
Contributor

Yes, I confirm the behavior changed from 1.5.2 to 1.5.3+ (it is still present in 1.7.1).

Each time a reload is performed, the state becomes critical and the Output becomes empty.

pierresouchay added a commit to pierresouchay/consul that referenced this issue Feb 25, 2020
…ervices

This fixes issue hashicorp#7318

Between versions 1.5.2 and 1.5.3, a regression was introduced regarding the health
of services. A patch (hashicorp#6144) had been issued for the health checks of nodes, but not
for the health checks of services.

What happened on a reload was:

1. save all health check statuses
2. clean up everything
3. add the new services with their health checks

In step 3, the state of the health checks was only taken into account locally, but since
everything had been cleaned up in step 2, that state was lost.

This PR introduces the snap parameter, so step 3 can use the information saved in step 1.
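
To make the commit's three steps concrete, here is a minimal, self-contained Go sketch of the snapshot-and-restore pattern it describes. Every name here (Agent, CheckStatus, snapshot, addCheck, snap) is an illustrative stand-in, not Consul's actual internals:

```go
package main

import "fmt"

// CheckStatus is the health state of a single check.
type CheckStatus struct {
	Status string // "passing", "warning", or "critical"
	Output string
}

// Agent holds the local check state that a reload rebuilds.
type Agent struct {
	checks map[string]*CheckStatus // keyed by check ID
}

// snapshot saves the current status of every check (step 1).
func (a *Agent) snapshot() map[string]*CheckStatus {
	snap := make(map[string]*CheckStatus, len(a.checks))
	for id, c := range a.checks {
		cp := *c // copy so the snapshot survives the cleanup
		snap[id] = &cp
	}
	return snap
}

// addCheck re-registers a check after the cleanup (step 3). Without the
// snap parameter it would default to "critical" with an empty Output,
// which is exactly the behavior reported in this issue.
func (a *Agent) addCheck(id string, snap map[string]*CheckStatus) {
	status := &CheckStatus{Status: "critical"} // default for a brand-new check
	if prev, ok := snap[id]; ok {
		status = prev // restore the pre-reload state instead
	}
	a.checks[id] = status
}

func main() {
	a := &Agent{checks: map[string]*CheckStatus{
		"web-http": {Status: "passing", Output: "HTTP 200 OK"},
	}}

	snap := a.snapshot()                     // step 1: save all statuses
	a.checks = make(map[string]*CheckStatus) // step 2: clean up everything
	a.addCheck("web-http", snap)             // step 3: re-add, restoring state

	fmt.Println(a.checks["web-http"].Status) // prints "passing", not "critical"
}
```

In these terms, the regression amounts to step 3 ignoring the snapshot for service checks, so every reload reset them to the default critical state until the next check run.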
@pierresouchay
Contributor

@lvets Thank you for the reproduction; here is the fix: #7345


lvets commented Feb 25, 2020

@pierresouchay Thank you for your help with this! Do you have an idea when the fix might make it into a release?


pierresouchay commented Feb 25, 2020

@lvets Whenever HashiCorp reviews it. The change in #7345 is not that complicated; if we are lucky, it might be included in 1.7.2. Maybe @rboyer, who did the HealthCheck patch #6144, can review it?

hanshasselberg pushed a commit that referenced this issue Mar 9, 2020
…7345)
@pierresouchay
Contributor

@lvets Fixed by #7345

freddygv pushed a commit that referenced this issue Mar 12, 2020
…7345)
pierresouchay added a commit to pierresouchay/consul that referenced this issue Apr 3, 2020
This ensures no regression of hashicorp#7318,
and ensures that hashicorp#7446 cannot happen anymore.
hanshasselberg pushed a commit that referenced this issue May 20, 2020
…ul reload (#7449)

This ensures no regression of #7318,
and ensures that #7446 cannot happen anymore.
rboyer pushed a commit that referenced this issue Jun 1, 2020
…ul reload (#7449)