Add cache memory limit for syncer of drainer #715

Closed
wants to merge 12 commits into from

Conversation

lichunzhu
Contributor

@lichunzhu lichunzhu commented Aug 15, 2019

What problem does this PR solve?

TOOL-1480
The count of cached binlogs doesn't accurately represent the memory usage of drainer. To compute the memory usage more precisely, and to avoid a single binlog item consuming a huge amount of memory and causing OOM, we need to measure the memory usage of every binlog item and limit the sum.

What is changed and how it works?

  • Add a new variable maxBinlogCacheSize to limit the maximum total size of cached binlogs. Users can modify this variable; the default is 4 GB.
  • Each binlog's memory usage is approximated as the combined length of binlog.PrewriteKey, PrewriteValue, etc. This estimate is computed by the newly added function binlogItem.Size.
  • Use sync.Cond to block when memory usage gets too high (see the sketch after this list). In binlogItemCache.Push, if adding a new binlogItem would exceed the memory limit, the caller blocks in cond.Wait(); otherwise it adds the item's size to cachedSize and sends the binlogItem into the binlogItemCache.cacheChan channel. In binlogItemCache.Pop, after a binlogItem is taken out, its size is subtracted from cachedSize and cond.Signal is called so that Push can recheck whether it is OK to cache a new binlogItem.
  • A unit test TestSyncerCachedSize is added to syncer_test.go. The test pushes 10 binlogItems, each with size 1, while maxBinlogCacheSize is temporarily set to 1; a goroutine checks every 0.05s whether cachedSize exceeds maxBinlogCacheSize.
  • NOTICE: For a special binlogItem whose memory usage is bigger than the maximum cache size, we wait for the syncer.input channel to drain all pending binlogItems and only then cache this item, so that it does not occupy memory for a long time.
  • NOTICE: After syncer stops receiving binlogItems from syncer.input, if the collector keeps calling syncer.Add to put binlogItems into the syncer.input channel, it may block once cachedSize exceeds maxBinlogCacheSize. So I added a variable quiting to indicate that syncer is quitting. If syncer.quiting == true, syncer.Add never calls cond.Wait, so it cannot block.
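
For reference, a minimal sketch of the blocking cache described above, reusing the names from this description (binlogItemCache, cacheChan, cachedSize, maxBinlogCacheSize, quiting); the concrete fields, signatures, and the Quit helper are illustrative and not necessarily identical to the code in this PR:

```go
package syncer

import "sync"

type binlogItem struct {
	prewriteKey   []byte
	prewriteValue []byte
}

// Size approximates the memory held by one binlog item as the length of its payload fields.
func (b *binlogItem) Size() int64 {
	return int64(len(b.prewriteKey) + len(b.prewriteValue))
}

type binlogItemCache struct {
	cacheChan          chan *binlogItem
	cachedSize         int64
	maxBinlogCacheSize int64
	quiting            bool

	mu   sync.Mutex
	cond *sync.Cond
}

func newBinlogItemCache(maxCacheSize int64) *binlogItemCache {
	c := &binlogItemCache{
		cacheChan:          make(chan *binlogItem, 65536),
		maxBinlogCacheSize: maxCacheSize,
	}
	c.cond = sync.NewCond(&c.mu)
	return c
}

// Push blocks while caching one more item would exceed the memory limit,
// unless the syncer is quitting.
func (c *binlogItemCache) Push(item *binlogItem) {
	size := item.Size()
	c.mu.Lock()
	// cachedSize > 0 lets an oversized item through once the cache has drained,
	// matching the first NOTICE above.
	for !c.quiting && c.cachedSize+size > c.maxBinlogCacheSize && c.cachedSize > 0 {
		c.cond.Wait()
	}
	c.cachedSize += size
	c.mu.Unlock()
	c.cacheChan <- item
}

// Pop removes one item, releases its size and wakes up a blocked Push.
func (c *binlogItemCache) Pop() *binlogItem {
	item := <-c.cacheChan
	c.mu.Lock()
	c.cachedSize -= item.Size()
	c.cond.Signal()
	c.mu.Unlock()
	return item
}

// Quit marks the syncer as quitting so that a blocked Push can return.
func (c *binlogItemCache) Quit() {
	c.mu.Lock()
	c.quiting = true
	c.cond.Broadcast()
	c.mu.Unlock()
}
```

Here Pop signals after subtracting the popped item's size, and Quit broadcasts so that a Push blocked in cond.Wait can return once the syncer is quitting, mirroring the two NOTICE items above.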

Check List

Tests

  • Unit test
  • Integration test

Code changes

Side effects

  • Possible performance regression
  • Increased code complexity

Related changes

  • Need to update the documentation

@lichunzhu
Contributor Author

/run-all-tests

@lichunzhu
Contributor Author

/run-all-tests

@lichunzhu
Contributor Author

@july2993 PTAL

Contributor

@july2993 july2993 left a comment


Find a simpler implementation; don't spawn a goroutine for every Add and Pop op.

@lichunzhu
Contributor Author

@july2993 All the extra goroutines have been removed.

@suzaku
Contributor

suzaku commented Aug 20, 2019

In what case is a buffer size as large as 65536 useful for the syncer? Maybe we can simply make it much smaller by default, for example 50 or 100. That way, even when the downstream database is severely out of sync with the upstream, it's less likely to cause OOM.

I think OOM is unavoidable in some cases. For example, the machine running drainer may have only 4 GB of memory while the average size of a sequence of binlogs is more than 0.4 GB; then even with this new configuration there wouldn't be enough memory, because there's a buffered channel of size 10 for each pump instance.

@july2993
Contributor

  • In what case is a buffer size as large as 65536 useful for the syncer? Maybe we can simply make it much smaller by default, for example 50 or 100. That way, even when the downstream database is severely out of sync with the upstream, it's less likely to cause OOM.

    • I agree, but ideally the default value should be larger than worker-count*batch-size.
  • I think OOM is unavoidable in some cases. For example, the machine running drainer may have only 4 GB of memory while the average size of a sequence of binlogs is more than 0.4 GB; then even with this new configuration there wouldn't be enough memory, because there's a buffered channel of size 10 for each pump instance.

    • We can change the default 4 GB to be some portion of the total memory of the machine.
    • Another problem is how to make the binlog cache memory limit as accurate as possible.

@lichunzhu
Contributor Author

lichunzhu commented Aug 20, 2019

  • In what case is a buffer size as large as 65536 useful for the syncer? Maybe we can simply make it much smaller by default, for example 50 or 100. That way, even when the downstream database is severely out of sync with the upstream, it's less likely to cause OOM.
    • How about setting the default buffer size to worker-count*batch-size, while still letting users override this variable in the config?
  • I think OOM is unavoidable in some cases. For example, the machine running drainer may have only 4 GB of memory while the average size of a sequence of binlogs is more than 0.4 GB; then even with this new configuration there wouldn't be enough memory, because there's a buffered channel of size 10 for each pump instance.
    • It doesn't seem easy in Go to count how much memory a variable occupies, so the current statistic is smaller than the real usage (see the sketch below).
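
For illustration only (with hypothetical field names): a tiny example of why counting only payload lengths undercounts the real footprint, since slice headers, the struct itself, and allocator overhead are not included.

```go
package main

import (
	"fmt"
	"unsafe"
)

// Hypothetical item with two payload fields, similar in shape to a binlog item.
type item struct {
	prewriteKey   []byte
	prewriteValue []byte
}

// sizeByLen is the kind of estimate used in this PR: only payload lengths are counted.
func sizeByLen(it *item) int64 {
	return int64(len(it.prewriteKey) + len(it.prewriteValue))
}

// sizeWithOverhead additionally counts the struct itself (two slice headers);
// it is still an approximation, because allocator overhead is not visible here.
func sizeWithOverhead(it *item) int64 {
	return int64(unsafe.Sizeof(*it)) + sizeByLen(it)
}

func main() {
	it := &item{prewriteKey: make([]byte, 32), prewriteValue: make([]byte, 1024)}
	fmt.Println(sizeByLen(it), "<", sizeWithOverhead(it)) // 1056 < 1104 on 64-bit
}
```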

@july2993
Contributor

@lichunzhu you can file another PR to change the default buffer size first.

@suzaku
Contributor

suzaku commented Aug 20, 2019

Note that the unit of worker-count * batch-size is a DML or DDL statement, which means one binlog in the buffer may correspond to multiple units, so there isn't a one-to-one relationship between the two?

@lichunzhu
Contributor Author

@suzaku IMO, worker-count * batch-size is the maximum number of cached DMLs in loader. If loader receives a DDL, all the cached DMLs are executed and the cache is cleared. So I think it's OK to use worker-count * batch-size.

@lichunzhu
Contributor Author

The current cfg.MaxCacheBinlogSize is an int64 variable, which is not easy for users to configure. If the other parts are OK, I will change it to HumanizeBytes, which is used for StopWriteAtAvailableSpace in pump/storage.go, so users can write 4 GB to express the memory usage limit they want.
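
As a rough sketch of what that parsing could look like, using github.com/dustin/go-humanize to turn a human-readable size into a byte count (the option name max-cache-binlog-size below is only illustrative, and the actual HumanizeBytes helper in pump/storage.go may differ):

```go
package main

import (
	"fmt"
	"log"

	"github.com/dustin/go-humanize"
)

func main() {
	// A user would write something like max-cache-binlog-size = "4 GB" in the
	// config file; ParseBytes converts that string into a byte count.
	n, err := humanize.ParseBytes("4 GB")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(n) // 4000000000 (use "4 GiB" for 4294967296)
}
```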

@lichunzhu
Contributor Author

@suzaku PTAL

@lichunzhu lichunzhu requested review from suzaku and zier-one and removed request for suzaku and zier-one August 26, 2019 08:19
@suzaku
Contributor

suzaku commented Aug 27, 2019

I think the problem with introducing this new configuration is that it can't effectively solve the OOM problem.

The "cache" is one of the largest buffers with the default configuration, but it's by no means the only data structure that may eat up a lot of memory.

It's "obviously" the bottleneck only because it's 65536 by default; with a big worker-count and batch-size, the bottleneck would be elsewhere.

This limitation adds complexity to both our code and the user interface without actually solving the problem, so I suggest we don't add a limitation that only applies to one single data structure.

@lichunzhu
Contributor Author

lichunzhu commented Aug 27, 2019

I think the problem with introducing this new configuration is that it can't effectively solve the OOM problem.

The "cache" is one of the largest buffers with the default configuration, but it's by no means the only data structure that may eat up a lot of memory.

It's "obviously" the bottleneck only because it's 65536 by default; with a big worker-count and batch-size, the bottleneck would be elsewhere.

This limitation adds complexity to both our code and the user interface without actually solving the problem, so I suggest we don't add a limitation that only applies to one single data structure.

@suzaku If a user's binlogs are all extremely huge, I think this kind of limitation can take effect. The number of cached binlogs doesn't accurately represent the memory usage of syncer.

@suzaku
Contributor

suzaku commented Aug 27, 2019

But if the binlogs are all extremely large, then other data structures would still eat up the memory and cause OOM.
