Write buffer metrics #960
Conversation
Here are the metrics after the fixes. We can see how the behavior now matches what we expect.
From time to time we see big increases in the …
We have been running this for more than a month in our systems. Any comments on this PR?
Sorry about the delay; I expect to look into & merge a bunch of community PRs during the first two weeks of October.
Again, sorry for the delay. I hope to look into this soon.
One minor request to change.
Also, is it possible to avoid the new logic when `InstanceWriteBufferSize == 0` or `BufferInstanceWrites == false`, i.e. in systems (such as most systems except yours) where buffer writes are never enabled?
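A minimal sketch of the kind of guard being asked for, assuming config fields named `BufferInstanceWrites` and `InstanceWriteBufferSize`; everything else (function names, the direct-write path) is illustrative, not the PR's actual code:

```go
package main

import "fmt"

// config mirrors the two settings mentioned above; the rest is illustrative.
type config struct {
	BufferInstanceWrites    bool
	InstanceWriteBufferSize int
}

// writeOrEnqueue skips the buffering (and its metrics) entirely when
// write buffering is disabled, falling back to a direct write.
func writeOrEnqueue(cfg config, writeDirect func(), enqueue func()) {
	if !cfg.BufferInstanceWrites || cfg.InstanceWriteBufferSize == 0 {
		writeDirect() // buffering disabled: old direct-write path, no new logic
		return
	}
	enqueue() // buffering enabled: go through the write buffer
}

func main() {
	cfg := config{BufferInstanceWrites: false}
	writeOrEnqueue(cfg,
		func() { fmt.Println("direct write") },
		func() { fmt.Println("enqueued into write buffer") },
	)
}
```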
…rator into write-buffer-metrics
Related issues: #113
Related PRs: #676, #682
Description
This PR adds metrics related to the logic behind write buffer instances. The goal is to try to corroborate that what we assume is what is really happening, and to improve how we can tune config variables like `InstanceWriteBufferSize` and `InstanceFlushIntervalMilliseconds`.

Checklist:
- code is formatted with `gofmt` (please avoid `goimports`)
- code compiles via `./build.sh`
- code passes `go test ./go/...`
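For context, here is a minimal sketch of the kind of write-buffer metric this PR is about, using the `github.com/rcrowley/go-metrics` package; the metric name and surrounding code are illustrative, not the PR's actual additions:

```go
package main

import (
	"fmt"
	"time"

	"github.com/rcrowley/go-metrics"
)

// writeBufferFlushTimer tracks how long each buffer flush takes
// (illustrative metric; the real names in the PR may differ).
var writeBufferFlushTimer = metrics.NewTimer()

func init() {
	metrics.Register("instance.write_buffer.flush.latency", writeBufferFlushTimer)
}

func flushInstanceWriteBuffer() {
	start := time.Now()
	// ... drain the buffer and write the instances to the backend ...
	writeBufferFlushTimer.UpdateSince(start)
}

func main() {
	flushInstanceWriteBuffer()
	fmt.Println("flushes recorded:", writeBufferFlushTimer.Count())
}
```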
While testing this PR and looking at the metrics, some things came up: the buffer was being flushed while `InstanceWriteBufferSize` was not reached yet and without waiting at least `InstanceFlushIntervalMilliseconds`.

Here is a small sample of the metrics from `api/write-buffer-metrics-raw/60`: latency is shown in milliseconds, and our config is `InstanceFlushIntervalMilliseconds: 50` and `InstanceWriteBufferSize: 1200` for our primary cluster.

Here we can see that we are flushing almost instantly (wait latency is almost non-existent) and with just a few instances in the buffer, which is the opposite of what we expected: the buffer should be flushed if `InstanceFlushIntervalMilliseconds` has passed or if `InstanceWriteBufferSize` is reached.

After this, we added some logging, debugged deeper, and found the following:
https://github.com/github/orchestrator/blob/master/go/inst/instance_dao.go#L2487
There are `DiscoveryMaxConcurrency` discovery goroutines filling the buffer, and as soon as the buffer reaches the max size, the goroutines try to send a message to `forceFlushInstanceWriteBuffer` in order to signal the "flushing" goroutine that there's work (flush the buffer). But only one goroutine can send the message; the other goroutines are blocked (non-buffered channel), waiting to be able to send the signal. As soon as the flushing goroutine finishes, the next discovery goroutine is ready to send a message, and so on, which caused the weird behavior we saw in the metrics.

When the buffer reaches its size, the flushing goroutine gets called (which is good), but as soon as it finishes flushing, there are many more discovery goroutines waiting to signal a flush even when there is no need anymore (the buffer has just been flushed).
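As an aside, this is simply how sends on an unbuffered channel behave in Go: each receive releases exactly one blocked sender. A minimal, self-contained sketch of that pattern (names are illustrative, not the orchestrator code):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Unbuffered, like forceFlushInstanceWriteBuffer: a send blocks
	// until some goroutine receives.
	forceFlush := make(chan bool)

	// Several "discovery" goroutines all try to signal a flush at once.
	for i := 0; i < 3; i++ {
		go func(id int) {
			forceFlush <- true // blocks until the flushing side receives
			fmt.Printf("discovery goroutine %d delivered its signal\n", id)
		}(i)
	}

	// "Flushing" side: each receive releases exactly one blocked sender,
	// so we end up with three back-to-back flushes even though one
	// would have been enough.
	for i := 0; i < 3; i++ {
		<-forceFlush
		fmt.Println("flush triggered")
		time.Sleep(10 * time.Millisecond) // pretend to flush
	}
}
```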
This is due to how the buffer is being looped over:
https://github.com/github/orchestrator/blob/master/go/inst/instance_dao.go#L2501
On every loop iteration `len(instanceWriteBuffer)` is called and one instance is flushed out of the buffer, until we have flushed half of the buffer and `i` is bigger than the current length of the buffer (see the first sketch below). This is fixed by changing the for loop so that it keeps draining until the buffer is empty (second sketch below).
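A minimal, self-contained sketch of the pattern described above (a plain `chan int` stands in for `instanceWriteBuffer`; this is illustrative, not the orchestrator code). The counter-based loop stops after roughly half the elements, because every receive shrinks `len` by one while `i` grows by one:

```go
package main

import "fmt"

func main() {
	buf := make(chan int, 10)
	for i := 0; i < 10; i++ {
		buf <- i
	}

	flushed := 0
	// len(buf) is re-evaluated on every iteration, and each receive
	// makes it smaller, so the loop exits about halfway through.
	for i := 0; i < len(buf); i++ {
		<-buf
		flushed++
	}
	fmt.Printf("flushed %d of 10\n", flushed) // prints "flushed 5 of 10"
}
```

And the replacement loop, a drop-in for the loop above, which keeps draining until the channel is empty:

```go
	// Stop only when the buffer has no more instances in it.
	for len(buf) > 0 {
		<-buf
		flushed++
	}
```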
With that change, on every flush we stop only when there are no more instances in the buffer. But this also causes weird behavior: many discovery goroutines can enqueue instances into the buffer while we are flushing, so we could end up flushing more than the expected `InstanceWriteBufferSize`. In our case there were times where it flushed around 2x-3x `InstanceWriteBufferSize`.
Solutions:

This commit fixes both issues:
- If the `forceFlushInstanceWriteBuffer` channel is not ready to receive a signal, it is because the "flushing" goroutine is running at the moment (it was triggered already), so the discovery goroutine does not need to block waiting to send it.
- On every flush, stop once `InstanceWriteBufferSize` instances have been flushed (or the buffer is empty), so a single flush no longer drains far more than `InstanceWriteBufferSize`.
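A sketch of the pattern behind both fixes, with illustrative names and sizes (not the PR's exact code): a non-blocking send on the flush-signal channel via `select`/`default`, and a drain loop capped at the configured buffer size:

```go
package main

import "fmt"

const instanceWriteBufferSize = 5 // stand-in for the configured InstanceWriteBufferSize

var (
	instanceWriteBuffer           = make(chan int, 2*instanceWriteBufferSize)
	forceFlushInstanceWriteBuffer = make(chan bool)
)

// enqueue asks for a flush without ever blocking: if the channel is not
// ready, the flushing goroutine was already triggered, so just move on.
func enqueue(v int) {
	if len(instanceWriteBuffer) >= instanceWriteBufferSize {
		select {
		case forceFlushInstanceWriteBuffer <- true:
		default: // flush already signalled or in progress; don't block
		}
	}
	instanceWriteBuffer <- v
}

// flush drains at most instanceWriteBufferSize items per call, so one
// flush can no longer grow to 2x-3x the configured size.
func flush() {
	flushed := 0
	for i := 0; i < instanceWriteBufferSize && len(instanceWriteBuffer) > 0; i++ {
		<-instanceWriteBuffer
		flushed++
	}
	fmt.Println("flushed", flushed, "instances")
}

func main() {
	go func() {
		for range forceFlushInstanceWriteBuffer {
			flush()
		}
	}()
	for i := 0; i < 8; i++ {
		enqueue(i)
	}
	flush() // final flush of whatever is left
}
```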
Note: I am open to discussion about the solutions. If preferred, I can remove the solution from this PR and discuss the fixes in a separate issue.