
Write buffer metrics #960

Merged (20 commits) on Jan 8, 2020

Conversation

@luisyonaldo (Contributor) commented Aug 19, 2019

Related issues: #113
Related PRs: #676, #682

Description

This PR adds metrics around the logic behind write buffer instances. The goal is to verify that what we assume is what is actually happening, and to make it easier to tune config variables like InstanceWriteBufferSize and InstanceFlushIntervalMilliseconds.

  • contributed code is using same conventions as original code
  • code is formatted via gofmt (please avoid goimports)
  • code is built via ./build.sh
  • code is tested via go test ./go/...

While testing this PR and looking at the metrics, a few things came up.

  • The metrics showed that orchestrator was constantly flushing instances, even when InstanceWriteBufferSize had not been reached and without waiting at least InstanceFlushIntervalMilliseconds.

Here is a small sample from the metrics endpoint api/write-buffer-metrics-raw/60:

    {
        "Timestamp": "2019-08-16T17:11:58.394757488+02:00",
        "WaitLatency": 964,
        "FlushLatency": 1524,
        "EnqueuedInstances": 5
    },
    {
        "Timestamp": "2019-08-16T17:11:58.402999622+02:00",
        "WaitLatency": 1018,
        "FlushLatency": 1206,
        "EnqueuedInstances": 4
    },
    {
        "Timestamp": "2019-08-16T17:11:58.416573318+02:00",
        "WaitLatency": 1720,
        "FlushLatency": 1978,
        "EnqueuedInstances": 5
    },
...

Latency is shown in milliseconds. Our config for the primary cluster is InstanceFlushIntervalMilliseconds: 50 and InstanceWriteBufferSize: 1200.
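
For reference, each record in the sample above could be modeled roughly as the struct below. This is only a sketch: the field names mirror the JSON keys, but the struct name and the exact types are assumptions, not necessarily what the PR implements.

package inst

import "time"

// Hypothetical shape of one entry returned by api/write-buffer-metrics-raw/<seconds>.
// Field names mirror the JSON sample above; the struct name and types are assumptions.
type WriteBufferMetric struct {
    Timestamp         time.Time // when this flush happened
    WaitLatency       int64     // how long the flush waited before being triggered
    FlushLatency      int64     // how long the flush itself took
    EnqueuedInstances int       // how many instances were flushed in this batch
}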

Here we can see that we are flushing almost instantly (wait latency is almost non-existent) and with just a few instances in the buffer, which is the opposite of what we expected: the buffer should only be flushed once InstanceFlushIntervalMilliseconds has passed or InstanceWriteBufferSize is reached.

After this, we added some logging, debugged deeper, and found the following:

  • There were many discovery goroutines blocked, waiting to signal a flush

https://github.com/github/orchestrator/blob/master/go/inst/instance_dao.go#L2487

There are DiscoveryMaxConcurrency discovery goroutines filling the buffer, and as soon as the buffer reaches the max size, each of them tries to send a message on forceFlushInstanceWriteBuffer to signal the "flushing" goroutine that there is work to do (flush the buffer). But only one goroutine can actually send the message; the others block on the unbuffered channel, waiting for their turn to send the signal. As soon as the flushing goroutine finishes, the next discovery goroutine gets to send its message, and so on. This caused the weird behavior we saw in the metrics:
when the buffer reaches its size, the flushing goroutine gets triggered (which is good), but as soon as it finishes flushing there are many more discovery goroutines still waiting to signal a flush, even though there is no need anymore (the buffer has just been flushed).
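
To illustrate the blocking described above, here is a minimal, self-contained sketch (not orchestrator code; the names are made up) of several producer goroutines piling up behind an unbuffered signal channel while a single consumer flushes slowly:

package main

import (
    "fmt"
    "sync"
    "time"
)

func main() {
    forceFlush := make(chan bool) // unbuffered, like forceFlushInstanceWriteBuffer

    // The "flushing" goroutine: receive one signal, then simulate a slow flush.
    go func() {
        for range forceFlush {
            time.Sleep(100 * time.Millisecond) // pretend we are writing to the backend
            fmt.Println("flushed")
        }
    }()

    // Several "discovery" goroutines all try to signal a flush at the same time.
    var wg sync.WaitGroup
    for i := 0; i < 5; i++ {
        wg.Add(1)
        go func(id int) {
            defer wg.Done()
            start := time.Now()
            forceFlush <- true // blocks until the flusher comes back to receive
            fmt.Printf("goroutine %d unblocked after %v\n", id, time.Since(start))
        }(i)
    }
    wg.Wait()
    time.Sleep(150 * time.Millisecond) // let the last flush finish printing
}

Each sender unblocks only after the previous flush completes, so the flusher runs once per waiting goroutine, back to back, even though a single flush would have emptied the buffer. That is exactly the pattern of near-zero WaitLatency flushes shown in the metrics above.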

  • Also, every time we call the "flushing" goroutine, it flushes only half of the buffer

This is due to how the buffer is being looped over:
https://github.com/github/orchestrator/blob/master/go/inst/instance_dao.go#L2501

for i := 0; i < len(instanceWriteBuffer); i++ {
    upd := <-instanceWriteBuffer
    ...
}

On every loop iteration len(instanceWriteBuffer) is re-evaluated and one instance is flushed out of the buffer, until half of the buffer has been flushed and i is no longer smaller than the current length of the buffer, e.g.:

i=0, len(instanceWriteBuffer)=10
i=1, len(instanceWriteBuffer)=9
i=2, len(instanceWriteBuffer)=8
....
i=5, len(instanceWriteBuffer)=5
exit the loop: i < len(instanceWriteBuffer) no longer holds
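
The effect is easy to reproduce in isolation. Here is a minimal sketch (not the orchestrator code, just a buffered channel drained with the same loop shape):

package main

import "fmt"

func main() {
    buf := make(chan int, 10)
    for v := 0; v < 10; v++ {
        buf <- v
    }

    drained := 0
    // len(buf) shrinks on every iteration while i grows,
    // so the loop stops once it has consumed half of the channel.
    for i := 0; i < len(buf); i++ {
        <-buf
        drained++
    }
    fmt.Printf("drained %d of 10, %d left in the buffer\n", drained, len(buf))
    // prints: drained 5 of 10, 5 left in the buffer
}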

This is fixed by changing the for loop to:

for len(instanceWriteBuffer) > 0 {..}

so that on every flush pass we stop only when there are no more instances left in the buffer. But this also causes weird behavior: many discovery goroutines can enqueue instances into the buffer while we are flushing, so we could end up flushing more than the expected InstanceWriteBufferSize. In our case there were times where it flushed around 2x-3x InstanceWriteBufferSize.
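
The over-flushing is also easy to see with a standalone sketch (made-up names and numbers, not the orchestrator code): a producer keeps enqueuing while a single pass drains with for len(buf) > 0, so the pass ends up flushing far more than the threshold that triggered it.

package main

import (
    "fmt"
    "time"
)

func main() {
    const bufferSize = 10 // stand-in for InstanceWriteBufferSize
    buf := make(chan int, 10*bufferSize)

    // Pre-fill the buffer up to the flush threshold.
    for v := 0; v < bufferSize; v++ {
        buf <- v
    }

    // A producer keeps enqueuing while the flush pass is running,
    // like concurrent discovery goroutines would.
    go func() {
        for v := 0; v < 3*bufferSize; v++ {
            buf <- v
            time.Sleep(time.Millisecond)
        }
    }()

    flushed := 0
    for len(buf) > 0 {
        <-buf
        flushed++
        time.Sleep(2 * time.Millisecond) // simulate per-row write latency
    }
    // The exact count depends on scheduling, but it typically lands well above
    // bufferSize because the producer refills the buffer faster than it drains.
    fmt.Printf("flushed %d instances in one pass (threshold was %d)\n", flushed, bufferSize)
}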

Solutions:
This commit fixes both issues.

  • To avoid blocking a discovery goroutine on sending a flush signal when a flush is already ongoing, we do a non-blocking send: only one goroutine sends the signal, and the others continue instead of blocking there. If the forceFlushInstanceWriteBuffer channel is not ready to receive a signal, it is because the "flushing" goroutine is already running (it has been triggered already).
select {
    case forceFlushInstanceWriteBuffer <- true:
    default:
}
  • We flush the buffer until a maximum of InstanceWriteBufferSize entries per pass:
for i := 0; i < config.Config.InstanceWriteBufferSize && len(instanceWriteBuffer) > 0; i++ {..}
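
Putting both fixes together, here is a self-contained sketch of the intended behavior (simplified names and types modeled on the identifiers above, not the actual orchestrator code): discovery goroutines enqueue and signal with a non-blocking send, and the flushing goroutine drains at most InstanceWriteBufferSize entries per pass, either on the interval tick or on demand.

package main

import (
    "fmt"
    "time"
)

const (
    instanceWriteBufferSize       = 5  // stand-in for config.Config.InstanceWriteBufferSize
    instanceFlushIntervalMillisec = 50 // stand-in for InstanceFlushIntervalMilliseconds
)

var (
    instanceWriteBuffer           = make(chan int, 10*instanceWriteBufferSize)
    forceFlushInstanceWriteBuffer = make(chan bool)
)

// enqueue mimics a discovery goroutine adding an instance update to the buffer.
func enqueue(update int) {
    instanceWriteBuffer <- update
    if len(instanceWriteBuffer) >= instanceWriteBufferSize {
        // Non-blocking send: if a flush has already been signalled, just move on.
        select {
        case forceFlushInstanceWriteBuffer <- true:
        default:
        }
    }
}

// flushLoop mimics the flushing goroutine: flush on the interval or on demand,
// and never drain more than instanceWriteBufferSize entries per pass.
func flushLoop() {
    ticker := time.NewTicker(instanceFlushIntervalMillisec * time.Millisecond)
    defer ticker.Stop()
    for {
        select {
        case <-ticker.C:
        case <-forceFlushInstanceWriteBuffer:
        }
        flushed := 0
        for i := 0; i < instanceWriteBufferSize && len(instanceWriteBuffer) > 0; i++ {
            <-instanceWriteBuffer
            flushed++
        }
        if flushed > 0 {
            fmt.Printf("flushed %d instances\n", flushed)
        }
    }
}

func main() {
    go flushLoop()
    for i := 0; i < 23; i++ {
        enqueue(i)
    }
    time.Sleep(200 * time.Millisecond) // let the flush loop drain the remainder
}

This keeps each flush pass bounded by the configured size, while the interval timer still picks up anything left over from a previous pass.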

Note: I'm open to discussion about these solutions. If preferred, I can remove the fixes from this PR and discuss them in a separate issue.

@luisyonaldo (Contributor, Author)

Here are the metrics after the fixes.

We can see that the behavior now matches what we expect.

WaitLatency is on average around InstanceFlushIntervalMilliseconds: 50, except when we flush because the buffer has reached the max size InstanceWriteBufferSize: 1200.

From time to time we see big increases in the WriteLatency that produce an unusually high WaitLatency. We are investigating the reason behind this.

@luisyonaldo (Contributor, Author)

We have been running this for more than a month in our systems. Any comments on this PR?

@shlomi-noach (Collaborator)

Sorry about the delay; I expect to look into & merge a bunch of community PRs during the first two weeks of October.

@shlomi-noach (Collaborator)

again, sorry for the delay. Hope to look into this soon.

@shlomi-noach (Collaborator) left a comment


One minor request to change.

Also, is it possible to avoid the new logic when InstanceWriteBufferSize == 0 or BufferInstanceWrites == false, i.e. in systems (such as most systems other than yours) where buffer writes are never enabled?

@shlomi-noach shlomi-noach had a problem deploying to production/mysql_cluster=concertmaster January 7, 2020 10:54 Failure
@shlomi-noach shlomi-noach had a problem deploying to production/mysql_cluster=concertmaster January 7, 2020 10:55 Failure
@shlomi-noach shlomi-noach had a problem deploying to production/mysql_cluster=concertmaster January 7, 2020 10:59 Failure
@shlomi-noach shlomi-noach had a problem deploying to production/mysql_cluster=concertmaster January 7, 2020 11:30 Failure
@shlomi-noach shlomi-noach temporarily deployed to production/mysql_cluster=concertmaster January 7, 2020 12:11 Inactive
@gtowey gtowey temporarily deployed to production January 8, 2020 00:39 Inactive
@shlomi-noach shlomi-noach temporarily deployed to production/mysql_cluster=conductor January 8, 2020 07:49 Inactive
@shlomi-noach shlomi-noach merged commit c1dcb4a into openark:master Jan 8, 2020