Fix bug in usage metrics when multiple service instances are changed in a single transaction #9440
What happens if the Txn api is used to delete multiple instances of a service name from two nodes within the same transaction?
There were a couple of instances where usage metrics would do the wrong thing and produce incorrect counts, causing the count to attempt to decrement below zero and return an error. The usage metrics did not account for the various places where a single transaction could delete/update/add multiple service instances at once. We also remove the error when attempting to decrement below zero, and instead just make sure we do not accidentally underflow the unsigned integer. This is a more graceful failure than returning an error and not allowing a transaction to commit.
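The "do not underflow the unsigned integer" behaviour described above can be sketched as a clamped update. This is a minimal illustration only: `addDelta` is a hypothetical helper name, not a function from the Consul codebase, and the real change may be structured differently.

```go
package main

// addDelta applies a signed delta to an unsigned counter.
// NOTE: hypothetical helper for illustration; not Consul's actual code.
// A decrement larger than the current count clamps the result to zero
// instead of wrapping around to a huge value or returning an error.
func addDelta(count uint64, delta int64) uint64 {
	if delta >= 0 {
		return count + uint64(delta)
	}
	// Negating first and converting afterwards yields the magnitude of
	// the decrement, including for the minimum int64 value.
	dec := uint64(-delta)
	if dec > count {
		return 0
	}
	return count - dec
}
```

With this shape, a miscounted decrement leaves the metric at zero rather than blocking the transaction from committing.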
Symptom
A log line like so:
Cause
When multiple service instances are registered/deregistered at once, usage metrics of service names would be computed incorrectly, eventually leading to state store errors like the above. This could happen on DeleteNode RPCs, and also when Consul does a restore.
To fix, we collect all of the memdb changes and then do a pass through the updated memdb state to reconcile, instead of dealing with the memdb changes one at a time.
Unfortunately, it might not error out immediately if you have enough services that the count does not go negative. Instead, the usage metrics will be subtly wrong until another service deletion forces the count negative.
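To make the "collect all changes, then reconcile" idea concrete, here is a simplified sketch. It assumes a hypothetical `change` record and a plain map standing in for the updated memdb state; none of these names come from Consul, and the actual implementation works against memdb change sets rather than these types.

```go
package main

// change is a hypothetical, simplified record of one service instance
// being added (Delta = +1) or removed (Delta = -1) in a transaction.
type change struct {
	Service string
	Delta   int
}

// serviceNameDelta returns how far the count of distinct service names
// should move for a whole transaction. finalInstances stands in for the
// updated state: instances per service name after the transaction.
func serviceNameDelta(changes []change, finalInstances map[string]int) int {
	// First pass: collapse per-instance changes into one net delta
	// per service name.
	net := make(map[string]int)
	for _, c := range changes {
		net[c.Service] += c.Delta
	}

	// Second pass: compare before/after instance counts once per name,
	// so several instances of one name are never counted twice.
	delta := 0
	for name, d := range net {
		after := finalInstances[name]
		before := after - d
		switch {
		case before == 0 && after > 0:
			delta++ // a service name appeared
		case before > 0 && after == 0:
			delta-- // a service name disappeared
		}
	}
	return delta
}
```

Deleting two instances of the same service name in one transaction yields a delta of -1 here, whereas handling each change on its own against the final state would see "no instances left" twice and decrement twice.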
Testing
I have tested out reproduction cases involving both
and
while watching:
and this PR does fix the issues and correctly computes usage metrics.