Fix bug in usage metrics when multiple service instances are changed in a single transaction #9440

crhino · 2020-12-18T21:24:11Z

Symptom

A log line like so:

2020-12-18T20:39:41.784Z [WARN]  agent.fsm: DeleteNode failed: error="failed to insert usage entry for "service-names": delta will cause a negative count"

Cause

When multiple service instances are registered/deregistered at once, usage metrics of service names would be computed incorrectly, eventually leading to state store errors like the above. This could happen on DeleteNode RPCs, and also when Consul does a restore.

To fix, we collect all of the memdb changes and then do a pass through the updated memdb state to reconcile, instead of dealing with the memdb changes one at a time.
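To illustrate the difference between handling changes one at a time and reconciling after collecting them, here is a minimal sketch. It is not the code from `agent/consul/state/usage.go`: the `instanceChange` type, both function names, and the `instancesAfter` map (standing in for the updated memdb state) are hypothetical, and the naive variant shows only one way per-change handling can go wrong.

```go
package main

import "fmt"

// instanceChange is a hypothetical stand-in for one memdb change to a
// service instance: Delta is +1 for a registration, -1 for a deregistration.
type instanceChange struct {
	Service string
	Delta   int
}

// naiveServiceNameDelta inspects the post-transaction state once per change.
// When one transaction removes several instances of the same service name,
// every change sees "zero instances left" and the name is decremented
// more than once.
func naiveServiceNameDelta(changes []instanceChange, instancesAfter map[string]int) int {
	delta := 0
	for _, c := range changes {
		if c.Delta < 0 && instancesAfter[c.Service] == 0 {
			delta--
		}
		if c.Delta > 0 && instancesAfter[c.Service] == 1 {
			delta++
		}
	}
	return delta
}

// serviceNameDelta first collects the net instance change per service name,
// then makes a single pass over the updated state to decide whether each
// name appeared, disappeared, or merely changed its instance count.
func serviceNameDelta(changes []instanceChange, instancesAfter map[string]int) int {
	net := make(map[string]int)
	for _, c := range changes {
		net[c.Service] += c.Delta
	}
	delta := 0
	for name, n := range net {
		after := instancesAfter[name]
		before := after - n
		switch {
		case before == 0 && after > 0:
			delta++ // service name is new
		case before > 0 && after == 0:
			delta-- // last instance of the name is gone
		}
	}
	return delta
}

func main() {
	// A node carrying the only two instances of "web" is deleted.
	changes := []instanceChange{{"web", -1}, {"web", -1}}
	instancesAfter := map[string]int{"api": 1}

	fmt.Println("naive:  ", naiveServiceNameDelta(changes, instancesAfter)) // -2
	fmt.Println("batched:", serviceNameDelta(changes, instancesAfter))      // -1
}
```

In this sketch the naive version removes the "web" service name twice, which is the kind of over-decrement that eventually drives the count negative.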

Unfortunately, the bug might not surface as an error immediately: if enough services are registered, the count does not go negative. Instead, the usage metrics are subtly wrong until a later service deletion forces the count negative.

Testing

I have tested out reproduction cases involving both

curl -s -XPUT -H"X-Consul-Token: $CONSUL_HTTP_TOKEN" $CONSUL_HTTP_ADDR/v1/catalog/deregister -d '{"Node": "consul-dc1-client0"}'

and

$ consul snapshot save testing.snap
Saved and verified snapshot to index 161
$ consul snapshot restore testing.snap
Restored snapshot

while watching:

$ curl -s -H"X-Consul-Token: $CONSUL_HTTP_TOKEN" $CONSUL_HTTP_ADDR/v1/agent/metrics | jq '.Gauges[] | select(.Name | contains("state"))'

and this PR does fix the issues and correctly computes usage metrics.

agent/consul/state/usage.go

mkeeler

What happens if the Txn api is used to delete multiple instances of a service name from two nodes within the same transaction?

There were a couple of instances where usage metrics would do the wrong thing and result in incorrect counts, causing the count to attempt to decrement below zero and return an error. The usage metrics did not account for various places where a single transaction could delete/update/add multiple service instances at once. We also remove the error when attempting to decrement below zero, and instead just make sure we do not accidentally underflow the unsigned integer. This is a more graceful failure than returning an error and not allowing a transaction to commit.
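The underflow guard described above could look roughly like the following. This is a sketch under assumptions, not the PR's actual code: the helper name `applyDelta` is hypothetical, and clamping to zero is one plausible reading of "do not accidentally underflow the unsigned integer".

```go
package main

import "fmt"

// applyDelta (hypothetical helper) adds a signed delta to an unsigned count.
// A decrement larger than the current count clamps the result to zero
// instead of wrapping around or returning an error.
func applyDelta(count uint64, delta int) uint64 {
	if delta >= 0 {
		return count + uint64(delta)
	}
	dec := uint64(-delta)
	if dec > count {
		return 0 // assumption: saturate at zero rather than fail the transaction
	}
	return count - dec
}

func main() {
	fmt.Println(applyDelta(3, -2)) // 1
	fmt.Println(applyDelta(3, -5)) // 0, not a huge wrapped-around value
	fmt.Println(applyDelta(3, 4))  // 7
}
```

The trade-off is that a miscount stays silent, but a metrics discrepancy no longer blocks the transaction from committing.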

hashicorp-ci · 2021-01-12T21:32:28Z

🍒 If backport labels were added before merging, cherry-picking will start automatically.

To retroactively trigger a backport after merging, add backport labels and re-run https://circleci.com/gh/hashicorp/consul/309152.

hashicorp-ci · 2021-01-12T21:32:33Z

🍒✅ Cherry pick of commit 0712e03 onto release/1.9.x succeeded!

…in a single transaction (#9440)

* Fix bug in usage metrics that caused a negative count to occur

There were a couple of instances where usage metrics would do the wrong thing and result in incorrect counts, causing the count to attempt to decrement below zero and return an error. The usage metrics did not account for various places where a single transaction could delete/update/add multiple service instances at once. We also remove the error when attempting to decrement below zero, and instead just make sure we do not accidentally underflow the unsigned integer. This is a more graceful failure than returning an error and not allowing a transaction to commit.

* Add changelog

crhino requested a review from a team December 18, 2020 21:24

crhino added the backport/1.9 label Dec 18, 2020

crhino mentioned this pull request Dec 18, 2020

Agent.fsm : DeleteNode failed - Failed to insert usage entry for 'service-names' - Delta will cause a negative count #9433

Closed

dnephin reviewed Jan 11, 2021

mkeeler reviewed Jan 11, 2021

crhino force-pushed the b-usage-metrics-delete-node-fail branch from e0e72e6 to 1013a10 Compare January 12, 2021 17:54

crhino added 2 commits January 12, 2021 14:29

Add changelog

a62e845

crhino force-pushed the b-usage-metrics-delete-node-fail branch from 1013a10 to a62e845 Compare January 12, 2021 20:37

vercel bot temporarily deployed to Preview January 12, 2021 20:37 Inactive

crhino changed the title ~~Fix bug in usage metrics when DeleteNode is called~~ Fix bug in usage metrics when multiple service instances are change in a single transaction Jan 12, 2021

mkeeler approved these changes Jan 12, 2021

crhino changed the title ~~Fix bug in usage metrics when multiple service instances are change in a single transaction~~ Fix bug in usage metrics when multiple service instances are changed in a single transaction Jan 12, 2021

crhino merged commit 0712e03 into master Jan 12, 2021

crhino deleted the b-usage-metrics-delete-node-fail branch January 12, 2021 21:31

hynek mentioned this pull request Jan 17, 2021

Unable to deregister orphan checks and service instances after accessorid token is lost and set to anonymous #9577

Closed

hc-github-team-consul-core assigned dduzgun-security Jun 4, 2024

hc-github-team-consul-core requested a review from dduzgun-security June 4, 2024 21:57
