Write buffer metrics #960
Conversation
Here are the metrics after the fixes. We can see how the behavior now matches what we expect.
From time to time we see big increases in the …
We have been running this for more than a month in our systems. Any comments on this PR?
Sorry about the delay; I expect to look into & merge a bunch of community PRs during the first two weeks of October.
Again, sorry for the delay. I hope to look into this soon.
One minor request to change.
Also, is it possible to avoid the new logic when `InstanceWriteBufferSize == 0` or `BufferInstanceWrites == false`, i.e. in systems (such as most systems except yours) where buffer writes are never enabled?
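A minimal sketch of the kind of guard being asked for, assuming config fields named `BufferInstanceWrites` and `InstanceWriteBufferSize`; everything else (function names, the direct-write path) is illustrative, not the PR's actual code:

```go
package main

import "fmt"

// config mirrors the two settings mentioned above; the rest is illustrative.
type config struct {
	BufferInstanceWrites    bool
	InstanceWriteBufferSize int
}

// writeOrEnqueue skips the buffering (and its metrics) entirely when
// write buffering is disabled, falling back to a direct write.
func writeOrEnqueue(cfg config, writeDirect func(), enqueue func()) {
	if !cfg.BufferInstanceWrites || cfg.InstanceWriteBufferSize == 0 {
		writeDirect() // buffering disabled: old direct-write path, no new logic
		return
	}
	enqueue() // buffering enabled: go through the write buffer
}

func main() {
	cfg := config{BufferInstanceWrites: false}
	writeOrEnqueue(cfg,
		func() { fmt.Println("direct write") },
		func() { fmt.Println("enqueued into write buffer") },
	)
}
```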
…rator into write-buffer-metrics
Related issues: #113
Related PRs: #676, #682
Description
This PR adds metrics related to the logic behind write buffer instances. The goal is to try to corroborate that what we assume is what is really happening, and to improve how we can tune config variables like `InstanceWriteBufferSize` and `InstanceFlushIntervalMilliseconds`.

Checklist:
- code is formatted with `gofmt` (please avoid `goimports`)
- code compiles via `./build.sh`
- code passes `go test ./go/...`
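For context, here is a minimal sketch of the kind of write-buffer metric this PR is about, using the `github.com/rcrowley/go-metrics` package; the metric name and surrounding code are illustrative, not the PR's actual additions:

```go
package main

import (
	"fmt"
	"time"

	"github.com/rcrowley/go-metrics"
)

// writeBufferFlushTimer tracks how long each buffer flush takes
// (illustrative metric; the real names in the PR may differ).
var writeBufferFlushTimer = metrics.NewTimer()

func init() {
	metrics.Register("instance.write_buffer.flush.latency", writeBufferFlushTimer)
}

func flushInstanceWriteBuffer() {
	start := time.Now()
	// ... drain the buffer and write the instances to the backend ...
	writeBufferFlushTimer.UpdateSince(start)
}

func main() {
	flushInstanceWriteBuffer()
	fmt.Println("flushes recorded:", writeBufferFlushTimer.Count())
}
```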
While testing this PR and looking at the metrics, some things came up: the buffer was being flushed while `InstanceWriteBufferSize` was not reached yet and without waiting at least `InstanceFlushIntervalMilliseconds`.

Here is a small sample of the metrics from `api/write-buffer-metrics-raw/60`: latency is shown in milliseconds, and our config is `InstanceFlushIntervalMilliseconds: 50` and `InstanceWriteBufferSize: 1200` for our primary cluster.

Here we can see that we are flushing almost instantly (wait latency is almost non-existent) and with just a few instances in the buffer, which is the opposite of what we expected: the buffer should be flushed if `InstanceFlushIntervalMilliseconds` has passed or if `InstanceWriteBufferSize` is reached.

After this, we added some logging, debugged deeper, and found the following:
https://github.com/github/orchestrator/blob/master/go/inst/instance_dao.go#L2487
There are `DiscoveryMaxConcurrency` discovery goroutines filling the buffer, and as soon as the buffer reaches the max size, the goroutines try to send a message to `forceFlushInstanceWriteBuffer` in order to signal the "flushing" goroutine that there's work (flush the buffer). But only one goroutine can send the message; the other goroutines are blocked (non-buffered channel), waiting to be able to send the signal. As soon as the flushing goroutine finishes, the next discovery goroutine is ready to send a message, and so on, which caused the weird behavior we saw in the metrics.

When the buffer reaches its size, the flushing goroutine gets called (which is good), but as soon as it finishes flushing, there are many more discovery goroutines waiting to signal a flush even when there is no need anymore (the buffer has just been flushed).
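As an aside, this is simply how sends on an unbuffered channel behave in Go: each receive releases exactly one blocked sender. A minimal, self-contained sketch of that pattern (names are illustrative, not the orchestrator code):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Unbuffered, like forceFlushInstanceWriteBuffer: a send blocks
	// until some goroutine receives.
	forceFlush := make(chan bool)

	// Several "discovery" goroutines all try to signal a flush at once.
	for i := 0; i < 3; i++ {
		go func(id int) {
			forceFlush <- true // blocks until the flushing side receives
			fmt.Printf("discovery goroutine %d delivered its signal\n", id)
		}(i)
	}

	// "Flushing" side: each receive releases exactly one blocked sender,
	// so we end up with three back-to-back flushes even though one
	// would have been enough.
	for i := 0; i < 3; i++ {
		<-forceFlush
		fmt.Println("flush triggered")
		time.Sleep(10 * time.Millisecond) // pretend to flush
	}
}
```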
This is due to how the buffer is being looped over:
https://github.com/github/orchestrator/blob/master/go/inst/instance_dao.go#L2501
On every loop iteration `len(instanceWriteBuffer)` is called and one instance is flushed out of the buffer, until we have flushed half of the buffer and `i` is bigger than the current length of the buffer (see the first sketch below). This is fixed by changing the for loop so that it keeps draining until the buffer is empty (second sketch below).
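A minimal, self-contained sketch of the pattern described above (a plain `chan int` stands in for `instanceWriteBuffer`; this is illustrative, not the orchestrator code). The counter-based loop stops after roughly half the elements, because every receive shrinks `len` by one while `i` grows by one:

```go
package main

import "fmt"

func main() {
	buf := make(chan int, 10)
	for i := 0; i < 10; i++ {
		buf <- i
	}

	flushed := 0
	// len(buf) is re-evaluated on every iteration, and each receive
	// makes it smaller, so the loop exits about halfway through.
	for i := 0; i < len(buf); i++ {
		<-buf
		flushed++
	}
	fmt.Printf("flushed %d of 10\n", flushed) // prints "flushed 5 of 10"
}
```

And the replacement loop, a drop-in for the loop above, which keeps draining until the channel is empty:

```go
	// Stop only when the buffer has no more instances in it.
	for len(buf) > 0 {
		<-buf
		flushed++
	}
```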
With that change, on every flush we stop only when there are no more instances in the buffer. But this also causes weird behavior: many discovery goroutines can enqueue instances into the buffer while we are flushing, so we could end up flushing more than the expected `InstanceWriteBufferSize`. In our case there were times where it flushed around 2x-3x `InstanceWriteBufferSize`.
Solutions:

This commit fixes both issues:
- If the `forceFlushInstanceWriteBuffer` channel is not ready to receive a signal, it is because the "flushing" goroutine is running at the moment (it was triggered already), so the discovery goroutine does not need to block waiting to send it.
- On every flush, stop once `InstanceWriteBufferSize` instances have been flushed (or the buffer is empty), so a single flush no longer drains far more than `InstanceWriteBufferSize`.
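A sketch of the pattern behind both fixes, with illustrative names and sizes (not the PR's exact code): a non-blocking send on the flush-signal channel via `select`/`default`, and a drain loop capped at the configured buffer size:

```go
package main

import "fmt"

const instanceWriteBufferSize = 5 // stand-in for the configured InstanceWriteBufferSize

var (
	instanceWriteBuffer           = make(chan int, 2*instanceWriteBufferSize)
	forceFlushInstanceWriteBuffer = make(chan bool)
)

// enqueue asks for a flush without ever blocking: if the channel is not
// ready, the flushing goroutine was already triggered, so just move on.
func enqueue(v int) {
	if len(instanceWriteBuffer) >= instanceWriteBufferSize {
		select {
		case forceFlushInstanceWriteBuffer <- true:
		default: // flush already signalled or in progress; don't block
		}
	}
	instanceWriteBuffer <- v
}

// flush drains at most instanceWriteBufferSize items per call, so one
// flush can no longer grow to 2x-3x the configured size.
func flush() {
	flushed := 0
	for i := 0; i < instanceWriteBufferSize && len(instanceWriteBuffer) > 0; i++ {
		<-instanceWriteBuffer
		flushed++
	}
	fmt.Println("flushed", flushed, "instances")
}

func main() {
	go func() {
		for range forceFlushInstanceWriteBuffer {
			flush()
		}
	}()
	for i := 0; i < 8; i++ {
		enqueue(i)
	}
	flush() // final flush of whatever is left
}
```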
Note: I am open to discussion about the solutions. If preferred, I can remove the solution from this PR and discuss the fixes in a separate issue.