rpc: measure blocking queries #7224
Conversation
Left two comments regarding use of atomic.
I'm considering porting this over to https://github.com/uber-go/atomic as a relatively simple proof of concept to try the library in production. Its API covers the inc/dec case we use here directly, avoiding the ^uint64(0) decrement gotcha, and provides additional types and operations that I believe would make atomics safer and easier to use. For example, … Curious what folks think @hashicorp/consul-core
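Since the inline example in the comment above didn't survive, here is a minimal, hypothetical sketch of the comparison being made (not code from this PR): the stdlib decrement gotcha on one side, the Inc/Dec API that uber-go/atomic exposes on the other.

```go
package main

import (
	"fmt"
	"sync/atomic"

	uatomic "go.uber.org/atomic"
)

func main() {
	// stdlib sync/atomic: decrementing a uint64 means adding the
	// two's-complement of 1, written ^uint64(0) -- easy to get wrong.
	var blocking uint64
	atomic.AddUint64(&blocking, 1)            // a blocking query starts
	atomic.AddUint64(&blocking, ^uint64(0))   // it finishes (the decrement gotcha)
	fmt.Println(atomic.LoadUint64(&blocking)) // 0

	// go.uber.org/atomic (uber-go/atomic) wraps the same primitive behind a
	// typed API, so the inc/dec intent is explicit.
	blocking2 := uatomic.NewUint64(0)
	blocking2.Inc()
	blocking2.Dec()
	fmt.Println(blocking2.Load()) // 0
}
```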
Adding …
The tests didn't run at all. You might want to have a look, seems related.
It sure is broken -- big shoutout to CI and thank you for keeping an eye out! Working on a fix now.
Great job @mkcp!
This might seem like a lot of comments for a small diff, but this is super close - they are mostly suggestions, and about half are about cleaning up the old metric while we're in here. The semantics question is the biggest one but thankfully simple to resolve either way you think best!
agent/consul/rpc.go (outdated)

    - // Run the query.
    + // Run query

      // inc count of all queries run
      metrics.IncrCounter([]string{"rpc", "query"}, 1)
Placement of this is slightly suspect IMO. I think your reasoning for doing it outside the "loop" is better. This would count spurious wakeups that never return to the client as a new query starting, which is not what we document or likely what the operator expects (it's not what I expected either).
I'm on the fence about moving it. It's technically a breaking change in the sense that the semantics are changing and users will potentially see a drop in their queries/second afterwards. That's not a huge deal, though; we can note it in the upgrade notes, and I think it's a more meaningful metric if it's outside and actually measuring what we expected/documented.
I also think our current docs for this are misleading and that this metric alone is pretty useless right now, which is why this issue/PR exists, so moving it seems OK to me!
What do you think?
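For readers following along, a rough sketch of the placement question. This is only an illustrative shape, not Consul's actual Server.blockingQuery; the spuriousWakeup helper and the run callback are made up for the example.

```go
package main

import (
	"github.com/armon/go-metrics"
)

// Server and spuriousWakeup are hypothetical stand-ins, not Consul's types.
type Server struct{}

func spuriousWakeup() bool { return false }

// blockingQuery sketches the two placements: A counts once per client query,
// B would also count every wakeup that loops back to RUN_QUERY.
func (s *Server) blockingQuery(run func() error) error {
	// Placement A: outside the retry loop, one increment per client query,
	// which matches what the telemetry docs describe.
	metrics.IncrCounter([]string{"rpc", "query"}, 1)

RUN_QUERY:
	if err := run(); err != nil {
		return err
	}
	if spuriousWakeup() {
		// Placement B: an increment here (inside the loop) would also fire
		// on spurious wakeups that never return anything to the client,
		// inflating the queries/second an operator sees.
		goto RUN_QUERY
	}
	return nil
}

func main() {
	s := &Server{}
	_ = s.blockingQuery(func() error { return nil })
}
```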
Yeah -- I think moving it outside the loop offers better semantics. I was hesitant to adjust it on the first go, not knowing if there were historical expectations I was missing.
For the purposes of 1.7, reworking it seems reasonable. I have half a mind to deprecate or remove the metric altogether, because counting all RPCs does seem pointless. Maybe in future releases we can take it out? In the meantime, making it more accurate now seems like a practical step.
I think it's somewhat useful and would lean towards keeping it - server load due to blocking queries is a function of two things: how many are in flight at once (the new metric we are adding) and how quickly things are churning/being transmitted back to clients (which is only captured by this old metric).
I.e. if you happened to know before this PR that you had 1000 clients all watching one thing and you saw the rate of blocking queries increase, you could figure out that the underlying data is changing faster, and so the server CPU and network bandwidth are going to be put under more pressure.
Now that we have an actual measurement of how many are in flight too, this could become even more useful - seeing this increase while in-flight stays roughly equal means more churn, while seeing this increase at about the same rate as in-flight just means more clients doing blocking queries.
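To make that concrete with made-up numbers: if the in-flight gauge sits at roughly 1,000 while the rate of this counter jumps from 10/s to 100/s, the same watchers are being woken far more often, i.e. the watched data is churning faster. If instead both the gauge and the counter rate grow about 10x together, it's simply more clients running blocking queries.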
Awesome! 🎉
Commits:
- … interacting with goto labels
- more precise comment on `Server.queriesBlocking` (Co-Authored-By: Paul Banks <banks@banksco.de>)
- improve queries_blocking description (Co-Authored-By: Paul Banks <banks@banksco.de>)

Force-pushed from 1f5283d to bf1b605
Fixes #6846
We add an atomic counter to agent/consul.Server.queriesBlocking and inc/dec it in agent/consul's s.blockingQuery() to track active blocking queries. Also added docs in telemetry.md. I deviate from #6846 a bit in where I mark the point of query start; the review question is here.
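A minimal sketch of the mechanism described above, assuming the armon/go-metrics API Consul already uses; the Server stand-in, the reporter function, and the exact gauge key here are illustrative, not copied from the diff.

```go
package main

import (
	"sync/atomic"
	"time"

	"github.com/armon/go-metrics"
)

// Server is a stand-in; only queriesBlocking mirrors the field the PR adds.
type Server struct {
	queriesBlocking uint64
}

// blockingQuery sketches the inc/dec bookkeeping: the counter goes up when a
// blocking query starts waiting and back down when it returns to the client.
func (s *Server) blockingQuery(run func() error) error {
	atomic.AddUint64(&s.queriesBlocking, 1)
	// Adding ^uint64(0) is the two's-complement way to subtract 1.
	defer atomic.AddUint64(&s.queriesBlocking, ^uint64(0))

	return run()
}

// emitBlockingGauge is a hypothetical reporter that publishes the current
// count as a gauge along the lines of the queries_blocking metric
// documented in telemetry.md.
func (s *Server) emitBlockingGauge() {
	metrics.SetGauge([]string{"rpc", "queries_blocking"},
		float32(atomic.LoadUint64(&s.queriesBlocking)))
}

func main() {
	s := &Server{}
	_ = s.blockingQuery(func() error {
		time.Sleep(10 * time.Millisecond) // pretend to block
		return nil
	})
	s.emitBlockingGauge()
}
```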
Before merge:
- agent/consul/rpc.go#L546
- Will squash before merge.