Add cache memory limit for syncer of drainer #715

Closed
lichunzhu wants to merge 12 commits

Conversation

@lichunzhu (Contributor) commented Aug 15, 2019

What problem does this PR solve?

TOOL-1480
The cached binlog count can't accurately represent the memory usage of drainer. To measure the memory usage more precisely and to prevent a single huge binlog item from consuming enough memory to cause an OOM, we need to measure every binlog item's memory usage and limit their sum.

What is changed and how it works?

  • Add a new variable maxBinlogCacheSize to limit the maximum cached binlog size. Users can modify this variable; the default is 4 GB.
  • Each binlog's memory usage is approximated as the total length of binlog.PrewriteKey, PrewriteValue, etc. This estimate is computed by binlogItem.Size, a newly added function.
  • Use sync.Cond to block when memory usage is too high. In binlogItemCache.Push, if adding a new binlogItem would exceed the memory limit, it calls cond.Wait() to block itself; otherwise it adds the item's size to cachedSize and sends the binlogItem to the binlogItemCache.cacheChan channel. In Pop, after taking a binlogItem, it subtracts the item's size from cachedSize and calls cond.Signal() so that Push re-checks whether a new binlogItem can be cached (a sketch follows after this list).
  • A unit test TestSyncerCachedSize is added to syncer_test.go. The test pushes 10 binlogItems, each of size 1, while maxBinlogCacheSize is temporarily set to 1; a goroutine checks every 0.05s that cachedSize never exceeds maxBinlogCacheSize.
  • NOTICE: For a special binlogItem whose memory usage alone is bigger than the maximum cache size, we wait for the syncer.input channel to drain all pending binlogItems and only then cache this item, to avoid it occupying the memory alongside other items for a long time.
  • NOTICE: After the syncer stops receiving binlogItems from syncer.input, if the collector keeps calling syncer.Add to push binlogItems into the syncer.input channel, it may get blocked once cachedSize exceeds maxBinlogCacheSize. So I added a variable quiting to signify that the syncer is quitting. If syncer.quiting == true, syncer.Add never calls cond.Wait, so it cannot be blocked forever.
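
Below is a minimal sketch of how such a size-limited cache could look, based only on the description above; the names (binlogItemCache, cacheChan, cachedSize, quiting) follow the PR text, but the actual fields and signatures in drainer may differ.

```go
package drainer

import "sync"

// binlogItem is a simplified stand-in for the item the syncer caches;
// the real struct wraps the binlog read from pump.
type binlogItem struct {
	prewriteKey   []byte
	prewriteValue []byte
}

// Size estimates the memory held by the item as the sum of the lengths
// of its byte fields. Headers and allocator overhead are ignored, so
// this is a lower bound of the real usage.
func (b *binlogItem) Size() int64 {
	return int64(len(b.prewriteKey) + len(b.prewriteValue))
}

// binlogItemCache caches binlog items and blocks Push while the total
// estimated size would exceed maxCacheSize.
type binlogItemCache struct {
	cacheChan    chan *binlogItem
	cachedSize   int64
	maxCacheSize int64
	cond         *sync.Cond
	quiting      bool
}

func newBinlogItemCache(chanSize int, maxCacheSize int64) *binlogItemCache {
	return &binlogItemCache{
		cacheChan:    make(chan *binlogItem, chanSize),
		maxCacheSize: maxCacheSize,
		cond:         sync.NewCond(new(sync.Mutex)),
	}
}

// Push blocks while adding the item would exceed the memory limit,
// unless the cache is quitting.
func (c *binlogItemCache) Push(item *binlogItem) {
	size := item.Size()
	c.cond.L.Lock()
	// The cachedSize > 0 clause lets an oversized item in once the cache
	// has drained, as described in the first NOTICE above.
	for !c.quiting && c.cachedSize > 0 && c.cachedSize+size > c.maxCacheSize {
		c.cond.Wait()
	}
	c.cachedSize += size
	c.cond.L.Unlock()
	c.cacheChan <- item
}

// Pop takes one item out of the cache, lowers cachedSize and wakes up
// a blocked Push.
func (c *binlogItemCache) Pop() *binlogItem {
	item := <-c.cacheChan
	c.cond.L.Lock()
	c.cachedSize -= item.Size()
	c.cond.L.Unlock()
	c.cond.Signal()
	return item
}

// Quit marks the cache as quitting so that Push never blocks again.
func (c *binlogItemCache) Quit() {
	c.cond.L.Lock()
	c.quiting = true
	c.cond.L.Unlock()
	c.cond.Broadcast()
}
```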

Check List

Tests

  • Unit test
  • Integration test

Code changes

Side effects

  • Possible performance regression
  • Increased code complexity

Related changes

  • Need to update the documentation

@lichunzhu (Contributor Author)

/run-all-tests

@lichunzhu (Contributor Author)

/run-all-tests

@lichunzhu (Contributor Author)

@july2993 PTAL

@july2993 (Contributor) left a comment

Find a simpler implementation; don't start a goroutine (`go`) for every Add and Pop op.

@lichunzhu force-pushed the czli/addCacheMemoryLimit branch from 38471d8 to df57764 on August 19, 2019 03:01
@lichunzhu (Contributor Author)

@july2993 All the `go func()` calls have been deleted.

@suzaku (Contributor) commented Aug 20, 2019

In what case is a buffer size as large as 65536 useful for the syncer? Maybe we can simply make it much smaller by default, for example, 50 or 100, so that even when the downstream database is severely out of sync with the upstream, it's less likely to cause OOM.

I think OOM is unavoidable. For example, the machine running drainer may have only 4 GB of memory while the average size of a sequence of binlogs is more than 0.4 GB; then even with this new configuration there wouldn't be enough memory, because there's a buffered channel of size 10 for each pump instance.

@july2993 (Contributor)

  • In what case is a buffer size as large as 65536 useful for the syncer? Maybe we can simply make it much smaller by default, for example, 50 or 100, so that even when the downstream database is severely out of sync with the upstream, it's less likely to cause OOM.

    • I agree, but the default value should preferably be larger than worker-count*batch-size.
  • I think OOM is unavoidable. For example, the machine running drainer may have only 4 GB of memory while the average size of a sequence of binlogs is more than 0.4 GB; then even with this new configuration there wouldn't be enough memory, because there's a buffered channel of size 10 for each pump instance.

    • We can change the default of 4 GB to be some portion of the total memory of the machine.
    • Another problem is how to make the binlog cache memory limit as accurate as possible.

@lichunzhu (Contributor Author) commented Aug 20, 2019

  • In what case is a buffer size as large as 65536 useful for the syncer? Maybe we can simply make it much smaller by default, for example, 50 or 100, so that even when the downstream database is severely out of sync with the upstream, it's less likely to cause OOM.
    • How about setting the default buffer size to worker-count*batch-size, while still letting the user reset this variable in the config?
  • I think OOM is unavoidable. For example, the machine running drainer may have only 4 GB of memory while the average size of a sequence of binlogs is more than 0.4 GB; then even with this new configuration there wouldn't be enough memory, because there's a buffered channel of size 10 for each pump instance.
    • It doesn't seem easy in Go to count how much memory a variable actually occupies; the current estimate is smaller than the real usage (see the small illustration below).
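
As a small illustration of why the length-based estimate is lower than the real footprint (hypothetical struct and field names), the payload lengths ignore slice headers, other struct fields, and allocator rounding:

```go
package main

import (
	"fmt"
	"unsafe"
)

type binlog struct {
	PrewriteKey   []byte
	PrewriteValue []byte
	CommitTS      int64
}

func main() {
	b := binlog{
		PrewriteKey:   make([]byte, 16),
		PrewriteValue: make([]byte, 1024),
	}

	// Length-based estimate, like binlogItem.Size: only the payload bytes.
	estimate := len(b.PrewriteKey) + len(b.PrewriteValue)
	fmt.Println("payload estimate:", estimate) // 1040

	// The struct itself also takes memory: two slice headers (24 bytes each
	// on 64-bit) plus the int64 field, and the backing arrays may be rounded
	// up by the allocator. None of this shows up in the estimate.
	fmt.Println("struct header size:", unsafe.Sizeof(b)) // 56 on 64-bit
}
```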

@july2993 (Contributor)

@lichunzhu you can file another PR to change the default buffer size first.

@suzaku (Contributor) commented Aug 20, 2019

Note that the unit of worker-count * batch-size is a DML or DDL, which means 1 binlog in the buffer may correspond to multiple units. So there isn't a one-to-one relationship between the two?

@lichunzhu (Contributor Author)

@suzaku IMO, worker-count * batch-size is the maximum number of DMLs cached in the loader. If the loader receives a DDL, the cached DMLs are all executed and the cache is cleared. So I think it's OK to use worker-count * batch-size.
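
For instance, with a hypothetical worker-count of 16 and batch-size of 20 (illustrative numbers, not necessarily the drainer defaults), the loader holds at most 16 × 20 = 320 DML events, so a default syncer buffer on that order of magnitude would already be enough to keep the loader busy.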

@lichunzhu (Contributor Author)

The current cfg.MaxCacheBinlogSize is an int64 variable, which is not easy for users to modify. If the other parts are OK, I will change it to a HumanizeBytes value like the one used in pump/storage.go for StopWriteAtAvailableSpace, so that users can write 4 GB to express the memory usage limit they want.
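
A minimal sketch of what the human-readable option could look like, assuming the dustin/go-humanize package for parsing; the PR may instead reuse whatever helper pump/storage.go already has for StopWriteAtAvailableSpace:

```go
package drainer

import "github.com/dustin/go-humanize"

// HumanizeBytes lets users write sizes such as "4 GB" or "512 MiB" in the
// config file and stores them as a plain byte count.
type HumanizeBytes uint64

// UnmarshalText implements encoding.TextUnmarshaler, so a TOML/JSON decoder
// that honors the interface can fill the field from a string value.
func (b *HumanizeBytes) UnmarshalText(text []byte) error {
	n, err := humanize.ParseBytes(string(text))
	if err != nil {
		return err
	}
	*b = HumanizeBytes(n)
	return nil
}
```

With this, the drainer config could accept something like max-cache-binlog-size = "4 GB" (option name hypothetical).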

@lichunzhu (Contributor Author)

@suzaku PTAL

@lichunzhu requested review from suzaku and zier-one and removed the review request for suzaku and zier-one on August 26, 2019 08:19
@suzaku (Contributor) commented Aug 27, 2019

I think the problem with introducing this new configuration is that it can't effectively solve the OOM problem.

The "cache" is one of the largest buffers with the default configuration, but it's by no means the only data structure that may eat up a lot of memory.

It's "obviously" the bottleneck only because it's 65536 by default; with a big worker-count and batch-size the bottleneck would be elsewhere.

This limitation adds complexity to both our code and the user interface and can't actually solve the problem, so I suggest we don't add a limitation that applies to only one single data structure.

@lichunzhu (Contributor Author) commented Aug 27, 2019

> I think the problem with introducing this new configuration is that it can't effectively solve the OOM problem.
>
> The "cache" is one of the largest buffers with the default configuration, but it's by no means the only data structure that may eat up a lot of memory.
>
> It's "obviously" the bottleneck only because it's 65536 by default; with a big worker-count and batch-size the bottleneck would be elsewhere.
>
> This limitation adds complexity to both our code and the user interface and can't actually solve the problem, so I suggest we don't add a limitation that applies to only one single data structure.

@suzaku If a client's binlogs are all extremely huge, I think this kind of limitation can take effect. The number of cached binlogs can't accurately represent the memory usage of the syncer.

@suzaku (Contributor) commented Aug 27, 2019

But if the binlogs are all extremely large, then other data structures would still eat up the memory and cause OOM.
