[BUGFIX] When a node level check is removed, ensure all services of node are notified #3970
Conversation
Bugfix for hashicorp#3899. When a node-level check is removed (example: maintenance), some watchers on services might have to recompute their state. If those watchers are performing blocking queries, they have to be notified. While their state was updated when a node-level check changed or was added, this was not the case when the check was removed. This fixes it.
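For context, the "watchers on services" here are typically clients running blocking queries in a loop against the health endpoint; they are only released when the service's X-Consul-Index moves. Below is a minimal sketch using the official Go API client — the service name "web" and the wait time are just example values, not anything from this PR:

package main

import (
	"fmt"
	"log"
	"time"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	var lastIndex uint64
	for {
		// Blocking query: only returns once the service's X-Consul-Index
		// moves past lastIndex, or the wait time elapses.
		entries, meta, err := client.Health().Service("web", "", true, &api.QueryOptions{
			WaitIndex: lastIndex,
			WaitTime:  5 * time.Minute,
		})
		if err != nil {
			log.Println("query failed:", err)
			time.Sleep(time.Second)
			continue
		}
		lastIndex = meta.LastIndex
		fmt.Printf("healthy instances of web: %d\n", len(entries))
		// If deleting a node-level check never bumps this index, a watcher
		// sitting between two calls keeps blocking here and never learns
		// the node is healthy again.
	}
}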
Good catch!
Couple of puzzlers in the code but probably minor fixes/clarifications.
More generally, I'm a little unsettled that bugs like this exist. One of the risks we talked about in your other PR was maintenance overhead: it is easy for things like this to slip in, because any change to state needs to reason about whether it might affect some other API result and therefore needs to bump that result's index value.
I'm wondering how to be confident there aren't other state changes that may occur that we've overlooked?
agent/consul/state/catalog.go (Outdated)

@@ -1086,6 +1086,43 @@ func (s *Store) EnsureCheck(idx uint64, hc *structs.HealthCheck) error {
	return nil
}

// This ensure all connected services to a check are updated properly
func (s *Store) alterServicesFromCheck(tx *memdb.Txn, idx uint64, hc *structs.HealthCheck, checkServiceExists bool) error {
Go style is to start the comment with the name of the method and write a full sentence:
// alterServicesFromCheck ensures ... are updated properly.
func (s *Store) alterServicesFromCheck(...)
Also the name confuses me: it's named "alter services", yet it actually updates the service index rather than the actual service record, right?
DONE in next patch
agent/consul/state/catalog.go (Outdated)

	// a registration for the service.
	if hc.ServiceID != "" {
		service, err := tx.First("services", "id", hc.Node, hc.ServiceID)
		// When deleting a service, service might have been removed already
I'm confused. I guess this is referring to the checkServiceExists flag passed in, which seems reasonable, but then in the case where we don't check existence and so skip the first error handler, the second handler will be triggered and return an error anyway?
The caller in deleteCheckTxn doesn't distinguish between those error types, so it will fail if this returns any error anyway. Can we just remove this case, as it seems to do nothing?
OK, I did it another way in the next patch: I renamed alterServicesFromCheck and it now only updates all service indexes for a given node.
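For readers following along, here is a trimmed-down sketch of the idea behind that rename: when a node-level check is deleted, walk every service registered on the node inside the same transaction and bump each service's index entry so blocking queries on those services wake up. The function name, the Service/IndexEntry structs and the table layout below are simplified stand-ins for illustration, not Consul's actual state schema or the exact code in this patch.

package main

import (
	"fmt"
	"log"

	memdb "github.com/hashicorp/go-memdb"
)

type Service struct {
	ID   string // node + service name, unique key
	Node string
	Name string
}

type IndexEntry struct {
	Key   string
	Value uint64
}

func updateAllServiceIndexesOfNode(txn *memdb.Txn, idx uint64, node string) error {
	services, err := txn.Get("services", "node", node)
	if err != nil {
		return err
	}
	for raw := services.Next(); raw != nil; raw = services.Next() {
		svc := raw.(*Service)
		// Bump the per-service index so watchers of this service are released.
		if err := txn.Insert("index", &IndexEntry{Key: "service." + svc.Name, Value: idx}); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	schema := &memdb.DBSchema{
		Tables: map[string]*memdb.TableSchema{
			"services": {
				Name: "services",
				Indexes: map[string]*memdb.IndexSchema{
					"id":   {Name: "id", Unique: true, Indexer: &memdb.StringFieldIndex{Field: "ID"}},
					"node": {Name: "node", Indexer: &memdb.StringFieldIndex{Field: "Node"}},
				},
			},
			"index": {
				Name: "index",
				Indexes: map[string]*memdb.IndexSchema{
					"id": {Name: "id", Unique: true, Indexer: &memdb.StringFieldIndex{Field: "Key"}},
				},
			},
		},
	}
	db, err := memdb.NewMemDB(schema)
	if err != nil {
		log.Fatal(err)
	}

	txn := db.Txn(true)
	for _, svc := range []*Service{
		{ID: "node1/web", Node: "node1", Name: "web"},
		{ID: "node1/api", Node: "node1", Name: "api"},
	} {
		if err := txn.Insert("services", svc); err != nil {
			log.Fatal(err)
		}
	}

	// Pretend a node-level check on node1 was just deleted at raft index 42.
	if err := updateAllServiceIndexesOfNode(txn, 42, "node1"); err != nil {
		log.Fatal(err)
	}
	txn.Commit()

	read := db.Txn(false)
	raw, _ := read.First("index", "id", "service.web")
	fmt.Printf("service.web index: %d\n", raw.(*IndexEntry).Value)
}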
@banks I see your concerns. While premature optimization is the root of all evil, we are currently facing very large issues in my company without optimizations on this side, but each time we add a feature, we add the corresponding test as well. It really changed the way our clusters and even our apps behave, and I think we are not the only ones with large deployments that would benefit greatly from this kind of change. Furthermore, if you look closely, this bug already existed in the original implementation: the "services" index was not updated in that case either, but it was less visible BECAUSE the index of services used to change a lot. So basically, we are fixing a bug that has existed for a while (see consul/agent/consul/state/catalog.go, line 1322 at 29feb66).
I found this bug this morning, but the good news is that among our many watch implementations (we have at least 5 implementations of watches across our systems), that's the only real bug we found.
We had issues for months with Consul performance, and it fixed almost all of our issues and put our cluster and the whole infrastructure in a sane state. I can privately send you some metrics regarding our apps / Consul state if you are interested.

Kind regards

UPDATE: I have been thinking about this a bit. I understand your concerns, so if you really have issues with this optimization, we might add a kind of safety net: a barrier index, such that even if a service has not been updated for 5 minutes or so (for instance, we could use max_wait_duration or something similar), we update the index for watches anyway. Thus, all systems watching it would never see more than 5 minutes of delay, even in the case of a catastrophic index bug. I don't find it very elegant, but it might be a compromise.
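Purely as a rough illustration of that "barrier index" compromise (it was never adopted), the following reuses the simplified IndexEntry type and "index" table from the previous sketch and requires the time and go-memdb imports; the function name and the period are assumptions, not anything that exists in Consul:

func runIndexBarrier(db *memdb.MemDB, nextIndex func() uint64, maxStaleness time.Duration) {
	ticker := time.NewTicker(maxStaleness)
	defer ticker.Stop()
	for range ticker.C {
		txn := db.Txn(true)
		// Collect the current per-service index entries first...
		var keys []string
		it, err := txn.Get("index", "id")
		if err != nil {
			txn.Abort()
			continue
		}
		for raw := it.Next(); raw != nil; raw = it.Next() {
			keys = append(keys, raw.(*IndexEntry).Key)
		}
		// ...then force them all forward, releasing any watcher that was left
		// behind by a missed notification within at most maxStaleness.
		idx := nextIndex()
		for _, k := range keys {
			if err := txn.Insert("index", &IndexEntry{Key: k, Value: idx}); err != nil {
				txn.Abort()
				continue
			}
		}
		txn.Commit()
	}
}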
@banks Tested in our cluster successfully after applying the patch on all Consul servers. When we set maintenance, all services are impacted by the change (all watchers are notified properly), and when we remove maintenance, all watches on services are notified as well. Full test:
$ consul maint --enable --reason Test
Node maintenance is now enabled
(This was already working)

$ consul maint --disable
Node maintenance is now disabled
Watch is notified immediately; maintenance has been effectively removed (this was not working before the patch)
Might fix unstable travis build
@banks it seems the Travis tests are unstable again
Can we please mark this bug as blocking for the 1.0.7 release?
@pierresouchay will update soon based on discussions. One thing I'm not clear on in the interim: why does #3899 make this so much worse? #3899 only optimises the index returned in between blocking calls. Once actually blocking, my understanding is that all watchers only return on direct changes to the objects in MemDB they are actually watching, and #3899 doesn't change that behaviour. If node check changes never touched the service at all, then surely this was already broken before #3899. If node check changes do currently trigger blocked service watchers to return with updated state but without updating the modified index, then it doesn't seem like this is really any worse than before either. I may be missing something here. Are you sure that #3899 actually made this worse for you? If so, can you explain the mechanism a little more so we can understand the real impact? I'm not trying to get out of the fix: if it's wrong it needs fixing for sure. I just want to understand if the urgency for 1.0.7 is valid.
To clarify the discussion a little here.
Given that, it would seem like:
I still think there is a valid issue here, since it's inconsistent that we do trigger health watches on a check change but do not update the X-Consul-Index, making it possible for watchers to "miss" the event with unfortunate timing. But I don't think this is any worse because of #3899, and in fact #3899 fixed the issue for service-specific watches where it would have been broken before. It's also less bad than it sounds (and probably why it's never been reported) in that blocked clients work just fine both before and now; it's only the subtle possibility of a missed update in between blocking calls that is incorrect. That's bad for sure, but likely not noticeable except in extreme cases. @pierresouchay does that match your understanding? If I'm right, it helps us understand if this PR is the right fix and whether it's super urgent for 1.0.7 (seems not, given it's no worse than before).
@banks I think we understand the same. Basically, updating/adding a check does notify watchers of service/<SERVICE_NAME>, but removing a check does not. This is an issue, for instance, with maintenance mode: when we set a node in maintenance, Consul adds a failing health check => everybody sees the change. However, when we remove maintenance mode (basically, remove the failing health check at node level), nobody sees the change => the node is still considered unhealthy by watchers and is not taken into account. This is the reason why this patch addresses ONLY checks being removed.
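To make that scenario concrete, here is a minimal sketch of toggling node maintenance with the official Go API client, which is exactly what adds and then removes the node-level check; the watcher loop from the earlier sketch would be blocking on a service of this node at the same time (the reason string and sleep are just placeholders):

package main

import (
	"log"
	"time"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}
	agent := client.Agent()

	// Enable maintenance: the agent registers a critical node-level check,
	// so watchers see the node's services become unhealthy.
	if err := agent.EnableNodeMaintenance("planned infra change"); err != nil {
		log.Fatal(err)
	}

	time.Sleep(30 * time.Second) // the actual maintenance work happens here

	// Disable maintenance: the critical node-level check is deleted.
	// Before this fix, that deletion did not bump the services' indexes,
	// so blocking watchers were not released.
	if err := agent.DisableNodeMaintenance(); err != nil {
		log.Fatal(err)
	}
}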
After discussion with @preetapan, what I said above is mostly correct, although I was slightly missing that #3899 already handles node-level check updates. That means the impact of not merging this seems to be even lower, given that node checks being added or failing will work correctly. I think my analysis that this is strictly better than before #3899 still holds too. That said, we now (finally) have a good understanding of the subtleties here and it does seem worth a fix, so will review it as such. It's likely it will make it in before 1.0.7, but I don't think it's a showstopper if it doesn't, since the overall behaviour will be no worse than before even without it.
Hmm no it doesn't - updating the
I think this is still being overstated (or I'm misunderstanding). The fact the index is not updated only matters for a watcher that happens to be between the last blocking call and the subsequent one at exactly the time the check is removed. That's why it's an edge case: the vast majority of watchers will not wait in between calls, so the window for "missing" the removal is typically very small. But to repeat, I still agree it's better to be correct now that we understand the scope :)
How we discovered the issue: when we perform some important changes to our infrastructure, we set the node in maintenance mode, so all watchers are notified: "Stop sending requests to this node". We perform the changes and disable maintenance if everything is fine. However, since the optimisation, the watchers are not notified that the node is back online, and they stop sending it traffic for longer than expected.
Will need some local soak testing before merging
For info, we have been running this patch on all of our clusters (prod and preprod) for almost 3 weeks. No performance impact, and all our services (which rely heavily on very long watches) are now behaving well, including during maintenance of nodes.
@banks Are you OK with the changes I made? For us, 1.0.7 cannot work without this patch, as it disables services on nodes that are in maintenance without enabling them back when we remove maintenance at node level. It is not a big deal for us, but all people relying on watches for provisioning might suffer from this as well.
@pierresouchay we'll certainly merge this for the next release. We had a tight deadline for 1.0.7 and hadn't gotten all the testing done on this that we wanted, although we're happy with the code as it is. We went ahead with the release for all the reasons I gave in #3970 (comment), but we will certainly get this in for the next release. Thanks!
I verified this locally with both. All seems to work. I'm going to land it in time to go through our pre-release soak test, but I expect there not to be further issues.
@banks Great, I think it is a very good fix to include. Regards