rpc: measure blocking queries #7224
Conversation
Left two comments regarding use of atomic.
I'm considering porting this over to https://github.com/uber-go/atomic as a relatively simple proof of concept to try the library in production. Its API covers the inc/dec case we use here directly, avoiding the ^uint64(0) decrement gotcha, and provides additional types and operations that I believe would make atomics safer and easier to use. For example, … Curious what folks think @hashicorp/consul-core
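Since the inline example in the comment above didn't survive, here is a minimal, hypothetical sketch of the comparison being made (not code from this PR): the stdlib decrement gotcha on one side, the Inc/Dec API that uber-go/atomic exposes on the other.

```go
package main

import (
	"fmt"
	"sync/atomic"

	uatomic "go.uber.org/atomic"
)

func main() {
	// stdlib sync/atomic: decrementing a uint64 means adding the
	// two's-complement of 1, written ^uint64(0) -- easy to get wrong.
	var blocking uint64
	atomic.AddUint64(&blocking, 1)            // a blocking query starts
	atomic.AddUint64(&blocking, ^uint64(0))   // it finishes (the decrement gotcha)
	fmt.Println(atomic.LoadUint64(&blocking)) // 0

	// go.uber.org/atomic (uber-go/atomic) wraps the same primitive behind a
	// typed API, so the inc/dec intent is explicit.
	blocking2 := uatomic.NewUint64(0)
	blocking2.Inc()
	blocking2.Dec()
	fmt.Println(blocking2.Load()) // 0
}
```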
Adding …
The tests didn't run at all. You might want to have a look, seems related.
It sure is broken -- big shoutout to CI and thank you for keeping an eye out! Working on a fix now.
Great job @mkcp!
This might seem like a lot of comments for a small diff, but this is super close - they are mostly suggestions, and about half are about cleaning up the old metric while we're in here. The semantics question is the biggest one but thankfully simple to resolve either way you think best!
agent/consul/rpc.go (outdated)

    - // Run the query.
    + // Run query

      // inc count of all queries run
      metrics.IncrCounter([]string{"rpc", "query"}, 1)
Placement of this is slightly suspect IMO. I think your reasoning for doing it outside the "loop" is better. This would count spurious wakeups that never return to the client as a new query starting, which is not what we document or likely what the operator expects (it's not what I expected either).
I'm on the fence about moving it. It's technically a breaking change in the sense that the semantics are changing and users will potentially see a drop in their queries/second afterwards. That's not a huge deal, though; we can note it in the upgrade notes, and I think it's a more meaningful metric if it's outside and actually measuring what we expected/documented.
I also think our current docs for this are misleading and that this metric alone is pretty useless right now, which is why this issue/PR exists, so moving it seems OK to me!
What do you think?
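For readers following along, a rough sketch of the placement question. This is only an illustrative shape, not Consul's actual Server.blockingQuery; the spuriousWakeup helper and the run callback are made up for the example.

```go
package main

import (
	"github.com/armon/go-metrics"
)

// Server and spuriousWakeup are hypothetical stand-ins, not Consul's types.
type Server struct{}

func spuriousWakeup() bool { return false }

// blockingQuery sketches the two placements: A counts once per client query,
// B would also count every wakeup that loops back to RUN_QUERY.
func (s *Server) blockingQuery(run func() error) error {
	// Placement A: outside the retry loop, one increment per client query,
	// which matches what the telemetry docs describe.
	metrics.IncrCounter([]string{"rpc", "query"}, 1)

RUN_QUERY:
	if err := run(); err != nil {
		return err
	}
	if spuriousWakeup() {
		// Placement B: an increment here (inside the loop) would also fire
		// on spurious wakeups that never return anything to the client,
		// inflating the queries/second an operator sees.
		goto RUN_QUERY
	}
	return nil
}

func main() {
	s := &Server{}
	_ = s.blockingQuery(func() error { return nil })
}
```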
Yeah -- I think moving it outside the loop offers better semantics. I was hesitant to adjust it on the first go, not knowing if there were historical expectations I was missing.
For the purposes of 1.7, reworking it seems reasonable. I have half a mind to deprecate or remove the metric altogether, because counting all RPCs does seem pointless. Maybe in future releases we can take it out? In the meantime, making it more accurate now seems like a practical step.
I think it's somewhat useful and would lean towards keeping it - server load due to blocking queries is a function of two things: how many are in flight at once (the new metric we are adding) and how quickly things are churning/being transmitted back to clients (which is only captured by this old metric).
I.e. if you happened to know before this PR that you had 1000 clients all watching one thing and you saw the rate of blocking queries increase, you could figure out that the underlying data is changing faster, and so the server CPU and network bandwidth are going to be put under more pressure.
Now that we have an actual measurement of how many are in flight too, this could become even more useful - seeing this increase while in-flight stays roughly equal means more churn, while seeing this increase at about the same rate as in-flight just means more clients doing blocking queries.
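To make that concrete with made-up numbers: if the in-flight gauge sits at roughly 1,000 while the rate of this counter jumps from 10/s to 100/s, the same watchers are being woken far more often, i.e. the watched data is churning faster. If instead both the gauge and the counter rate grow about 10x together, it's simply more clients running blocking queries.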
Awesome! 🎉
Commits:
- … interacting with goto labels
- more precise comment on `Server.queriesBlocking` (Co-Authored-By: Paul Banks <banks@banksco.de>)
- improve queries_blocking description (Co-Authored-By: Paul Banks <banks@banksco.de>)

Force-pushed from 1f5283d to bf1b605
Fixes #6846
We add an atomic counter to agent/consul.Server.queriesBlocking and inc/dec it in agent/consul's s.blockingQuery() to track active blocking queries. Also added docs in telemetry.md. I deviate from #6846 a bit in where I mark the point of query start; the review question is here.
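A minimal sketch of the mechanism described above, assuming the armon/go-metrics API Consul already uses; the Server stand-in, the reporter function, and the exact gauge key here are illustrative, not copied from the diff.

```go
package main

import (
	"sync/atomic"
	"time"

	"github.com/armon/go-metrics"
)

// Server is a stand-in; only queriesBlocking mirrors the field the PR adds.
type Server struct {
	queriesBlocking uint64
}

// blockingQuery sketches the inc/dec bookkeeping: the counter goes up when a
// blocking query starts waiting and back down when it returns to the client.
func (s *Server) blockingQuery(run func() error) error {
	atomic.AddUint64(&s.queriesBlocking, 1)
	// Adding ^uint64(0) is the two's-complement way to subtract 1.
	defer atomic.AddUint64(&s.queriesBlocking, ^uint64(0))

	return run()
}

// emitBlockingGauge is a hypothetical reporter that publishes the current
// count as a gauge along the lines of the queries_blocking metric
// documented in telemetry.md.
func (s *Server) emitBlockingGauge() {
	metrics.SetGauge([]string{"rpc", "queries_blocking"},
		float32(atomic.LoadUint64(&s.queriesBlocking)))
}

func main() {
	s := &Server{}
	_ = s.blockingQuery(func() error {
		time.Sleep(10 * time.Millisecond) // pretend to block
		return nil
	})
	s.emitBlockingGauge()
}
```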
Before merge:
- agent/consul/rpc.go#L546
- Will squash before merge.