
rpc: measure blocking queries #7224

Merged: 11 commits merged into hashicorp:master from telemetry/count-blocking-queries on Feb 10, 2020

Conversation

@mkcp (Contributor) commented Feb 5, 2020

Fixes #6846

We add an atomic counter, agent/consul.Server.queriesBlocking, and increment/decrement it in agent/consul's s.blockingQuery() to track active blocking queries. Docs are also added in telemetry.md.
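
For reference, a minimal sketch of the change as described above. This is not the PR's actual diff; the helper shape and the gauge emission are assumptions, and the metric name mirrors the consul.rpc.queries_blocking gauge documented in telemetry.md.

```go
package consul

import (
	"sync/atomic"

	metrics "github.com/armon/go-metrics"
)

// Server carries the new in-flight counter. queriesBlocking matches the
// field named in the PR description; everything else is illustrative.
type Server struct {
	queriesBlocking uint64
}

// blockingQuery increments the counter on entry and decrements it on
// exit, so the gauge reflects how many blocking queries are in flight.
func (s *Server) blockingQuery(run func() error) error {
	count := atomic.AddUint64(&s.queriesBlocking, 1)
	metrics.SetGauge([]string{"rpc", "queries_blocking"}, float32(count))

	// Decrement on return. With sync/atomic, subtracting 1 from a
	// uint64 is spelled as adding ^uint64(0), i.e. two's-complement -1.
	defer atomic.AddUint64(&s.queriesBlocking, ^uint64(0))

	return run()
}
```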

I deviate from #6846 a bit in where I mark the point of query start; the review question is here.

Before merge

  • Remove the REVIEW marker on agent/consul/rpc.go L546

Will squash before merge.

@mkcp mkcp added the theme/telemetry Anything related to telemetry or observability label Feb 5, 2020
@mkcp mkcp added this to the 1.7.0 milestone Feb 5, 2020
@mkcp mkcp self-assigned this Feb 5, 2020
@mkcp mkcp requested review from a team and banks February 5, 2020 19:10
@hanshasselberg (Member) left a comment

Left two comments regarding use of atomic.

Two review threads on agent/consul/rpc.go (resolved, outdated).
@mkcp (Contributor, Author) commented Feb 6, 2020

I'm considering porting this over to https://github.com/uber-go/atomic as a relatively simple proof of concept to try the library in production. Its API covers the inc/dec case we use here directly, avoiding the ^uint64(0) decrement gotcha, and provides additional types and operations that I believe would make atomics safer and easier to use. For example, T.Dec on a numeric type passes through to T.Sub, which handles the bitwise XOR shenanigans for subtraction; that's what bit me here, causing a panic in even this simple case.
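
For context, a small runnable comparison of the two APIs (illustrative, not from this PR):

```go
package main

import (
	"fmt"
	"sync/atomic"

	uatomic "go.uber.org/atomic"
)

func main() {
	// stdlib sync/atomic: there is no AddUint64(&n, -1); decrementing a
	// uint64 is spelled as adding ^uint64(0) (two's-complement -1).
	// Getting this wrong is the gotcha mentioned above.
	var n uint64
	atomic.AddUint64(&n, 1)          // inc
	atomic.AddUint64(&n, ^uint64(0)) // dec

	// go.uber.org/atomic exposes Inc/Dec/Sub as methods, so the
	// wraparound arithmetic lives inside the library.
	var m uatomic.Uint64
	m.Inc()
	m.Dec()

	fmt.Println(n, m.Load()) // 0 0
}
```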

Curious what folks think @hashicorp/consul-core

@mkcp (Contributor, Author) commented Feb 6, 2020

Adding go.uber.org/atomic as a team practice is a larger discussion. I'm going to hold off on using this PR as the proof of concept, because it's intended for the 1.7.0 release. Keeping it simple; I'll look for a different place to try out the library.

@hanshasselberg (Member) commented:

The tests didn't run at all. You might want to have a look; it seems related.

@mkcp (Contributor, Author) commented Feb 6, 2020

It sure is broken -- big shoutout to CI, and thank you for keeping an eye out! Working on a fix now.

@banks (Member) left a comment

Great job @mkcp!

This might seem like a lot of comments for a small diff, but this is super close. They are mostly suggestions, and about half are about cleaning up the old metric while we're in here. The semantics question is the biggest one, but thankfully it's simple to resolve either way you think best!

Two review threads on agent/consul/rpc.go (resolved, outdated). The diff context under discussion:

    // Run the query.
    // Run query

    // inc count of all queries run
    metrics.IncrCounter([]string{"rpc", "query"}, 1)
@banks (Member) commented on the diff:

Placement of this is slightly suspect IMO. I think your reasoning for doing it outside the "loop" is better. This would count spurious wakeups that never return to the client as a new query starting, which is not what we document, nor likely what the operator expects (it's not what I expected either).

I'm on the fence about moving it. It's technically a breaking change in the sense that the semantics are changing, and users will potentially see a drop in their queries/second afterwards. That's not a huge deal, though: we can note it in the upgrade notes, and I think it's a more meaningful metric if it's outside, actually measuring what we expected and documented.

I also think our current docs for this are misleading, and this metric alone is pretty useless right now (which is why this issue/PR exists), so moving it seems OK to me!

What do you think?
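
For illustration, a heavily simplified sketch of the placement question (the hypothetical run callback and loop shape are not Consul's actual code):

```go
package example

import metrics "github.com/armon/go-metrics"

// blockingQuery, boiled down to the part under discussion: the counter
// can increment once per client RPC (outside the retry loop, as
// proposed) or once per wakeup (inside it, as before).
func blockingQuery(run func() (changed bool, err error)) error {
	// Outside the loop: one increment per client RPC, regardless of how
	// many times the underlying watch fires.
	metrics.IncrCounter([]string{"rpc", "query"}, 1)

	for {
		changed, err := run()
		if err != nil {
			return err
		}
		if changed {
			return nil // result moved past the client's index; reply
		}
		// Spurious wakeup: the watch fired but the result is unchanged,
		// so we block again. If the counter lived here instead, each of
		// these iterations would register as a "new query" even though
		// nothing is returned to the client.
	}
}
```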

@mkcp (Contributor, Author) replied:

Yeah -- I think moving it outside the loop offers better semantics. I was hesitant to adjust it on the first go, not knowing if there were historical expectations I was missing.

For the purpose of 1.7, reworking it seems reasonable. I have half a mind to deprecate or remove the metric altogether, because counting all RPCs does seem pointless. Maybe in future releases we can take it out? In the meantime, making it more accurate now seems like a practical step.

@banks (Member) replied:

I think it's somewhat useful and would lean towards keeping it. Server load due to blocking queries is a function of two things: how many are in flight at once (the new metric we are adding) and how quickly things are churning and being transmitted back to clients (which is only captured by this old metric).

I.e., if you happened to know before this PR that you had 1000 clients all watching one thing, and you saw the rate of blocking queries increase, you could figure out that the underlying data is changing faster, and so the server CPU and network bandwidth are going to be put under more pressure.

Now that we have an actual measurement of how many are in flight too, this could become even more useful: seeing this counter increase while in-flight stays roughly equal means more churn; seeing it increase at about the same rate as in-flight just means more clients doing blocking queries.

Further resolved review threads: agent/consul/server.go (outdated), website/source/docs/agent/telemetry.html.md (two threads, one outdated).
@banks (Member) left a comment

Awesome! 🎉

@mkcp mkcp force-pushed the telemetry/count-blocking-queries branch from 1f5283d to bf1b605 on February 10, 2020 17:36
@mkcp mkcp merged commit 55f19a9 into hashicorp:master Feb 10, 2020
@mkcp mkcp deleted the telemetry/count-blocking-queries branch February 10, 2020 18:01
Labels: theme/telemetry (Anything related to telemetry or observability)
Linked issue: metrics: Add a gauge for the number of active blocking queries (#6846)
3 participants