Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bug in usage metrics when multiple service instances are changed in a single transaction #9440

Merged
merged 2 commits into from
Jan 12, 2021

Conversation

crhino
Copy link
Contributor

@crhino crhino commented Dec 18, 2020

Symptom

A log line like so:

2020-12-18T20:39:41.784Z [WARN]  agent.fsm: DeleteNode failed: error="failed to insert usage entry for "service-names": delta will cause a negative count"

Cause

When multiple service instances are registered/deregistered at once, usage metrics of service names would be computed incorrectly, eventually leading to state store errors like the above. This could happen on DeleteNode RPCs, and also when Consul does a restore.

To fix, we collect all of the memdb changes and then do a pass through the updated memdb state to reconcile, instead of dealing with the memdb changes one at a time.

Unfortunately it might not error out immediately, if you have enough services that the count does not go negative. Instead, the usage metrics will be subtly wrong until another service deletion forces the count negative.

Testing

I have tested out reproduction cases involving both

curl -s -XPUT -H"X-Consul-Token: $CONSUL_HTTP_TOKEN" $CONSUL_HTTP_ADDR/v1/catalog/deregister -d '{"Node": "consul-dc1-client0"}'

and

$ consul snapshot save testing.snap
Saved and verified snapshot to index 161
$ consul snapshot restore testing.snap
Restored snapshot

while watching:

$ curl -s -H"X-Consul-Token: $CONSUL_HTTP_TOKEN" $CONSUL_HTTP_ADDR/v1/agent/metrics | jq '.Gauges[] | select(.Name | contains("state"))'

and this PR does fix the issues and correctly computes usage metrics.

Copy link
Member

@mkeeler mkeeler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if the Txn api is used to delete multiple instances of a service name from two nodes within the same transaction?

@crhino crhino force-pushed the b-usage-metrics-delete-node-fail branch from e0e72e6 to 1013a10 Compare January 12, 2021 17:54
There were a couple of instances were usage metrics would do the wrong
thing and result in incorrect counts, causing the count to attempt to
decrement below zero and return an error. The usage metrics did not
account for various places where a single transaction could
delete/update/add multiple service instances at once.

We also remove the error when attempting to decrement below zero, and
instead just make sure we do not accidentally underflow the unsigned
integer. This is a more graceful failure than returning an error and not
allowing a transaction to commit.
@crhino crhino force-pushed the b-usage-metrics-delete-node-fail branch from 1013a10 to a62e845 Compare January 12, 2021 20:37
@vercel
Copy link

vercel bot commented Jan 12, 2021

This pull request is being automatically deployed with Vercel (learn more).
To see the status of your deployment, click below or on the icon next to each commit.

🔍 Inspect: https://vercel.com/hashicorp/consul/b4yxpixg4
✅ Preview: Canceled

@crhino crhino changed the title Fix bug in usage metrics when DeleteNode is called Fix bug in usage metrics when multiple service instances are change in a single transaction Jan 12, 2021
@crhino crhino changed the title Fix bug in usage metrics when multiple service instances are change in a single transaction Fix bug in usage metrics when multiple service instances are changed in a single transaction Jan 12, 2021
@crhino crhino merged commit 0712e03 into master Jan 12, 2021
@crhino crhino deleted the b-usage-metrics-delete-node-fail branch January 12, 2021 21:31
@hashicorp-ci
Copy link
Contributor

🍒 If backport labels were added before merging, cherry-picking will start automatically.

To retroactively trigger a backport after merging, add backport labels and re-run https://circleci.com/gh/hashicorp/consul/309152.

@hashicorp-ci
Copy link
Contributor

🍒✅ Cherry pick of commit 0712e03 onto release/1.9.x succeeded!

hashicorp-ci pushed a commit that referenced this pull request Jan 12, 2021
…in a single transaction (#9440)

* Fix bug in usage metrics that caused a negative count to occur

There were a couple of instances were usage metrics would do the wrong
thing and result in incorrect counts, causing the count to attempt to
decrement below zero and return an error. The usage metrics did not
account for various places where a single transaction could
delete/update/add multiple service instances at once.

We also remove the error when attempting to decrement below zero, and
instead just make sure we do not accidentally underflow the unsigned
integer. This is a more graceful failure than returning an error and not
allowing a transaction to commit.

* Add changelog
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants