Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change default cache count from 65536 to 512 #721

Merged
merged 6 commits into from
Aug 23, 2019

Conversation

lichunzhu
Copy link
Contributor

@lichunzhu lichunzhu commented Aug 20, 2019

What problem does this PR solve?

A son PR of #715.
Current default cache count is too big which can easily cause OOM of drainer.

What is changed and how it works?

The default size is changed from 65536 to 512.

Check List

Tests

  • Unit test
  • Integration test

Code changes

Side effects

Related changes

@lichunzhu
Copy link
Contributor Author

/build

Copy link
Contributor

@suzaku suzaku left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my opinion, worker-count * batch-size is not suitable as a default value, for the following reasons:

  1. It may surprise the user. If a user finds that the downstream is quite far behind the upstream, he may try to make worker-count and batch-size larger and accidentally cause OOM.
  2. There's a unit mismatch here. The unit for the buffer of drainer is the whole binlog, while the unit for batch-size is individual mutation in binlog. One binlog may contain many row mutations.

What we have learned so far is that 65536 may be too large for quite a few users, so maybe just explicitly use a smaller default value would be a safer way to solve the problem.

@lichunzhu
Copy link
Contributor Author

lichunzhu commented Aug 22, 2019

@suzaku So what do you think of the right value? How about 512? I'm not so sure about this.
What's more, I think cache-binlog-count can avoid OOM in some degree, but a better way to avoid OOM is using cache-binlog-size of binlogItemCache since the count doesn't match the memory usage. Some binlog may very big which may cause OOM.

@suzaku
Copy link
Contributor

suzaku commented Aug 22, 2019

@lichunzhu Note that cache-binlog-size can not avoid OOM in all cases, because the buffer is not the only data structure consuming memory. For example, if the downstream is TiDB/MySQL, we'll be exeucting SQLs in bulk and the binlogs not executed yet are stored in a slice which might take up a lot of memory if worker-count and batch-size is set to large numbers.

What I am suggesting is that, since we are facing a complicated problem with no perfect solution right now, we'd better just use a simple solution that can make OOM less likely.

@lichunzhu
Copy link
Contributor Author

@suzaku I understand what you mean now. Then what do you think of the limit value? Is 512 OK?

@lichunzhu lichunzhu changed the title Change default cache count from 65536 to workcount*batchsize Change default cache count from 65536 to Aug 22, 2019
@july2993
Copy link
Contributor

@lichunzhu just change to 512 but don't add the config in SyncerConfig

@lichunzhu lichunzhu changed the title Change default cache count from 65536 to Change default cache count from 65536 to 512 Aug 23, 2019
@suzaku
Copy link
Contributor

suzaku commented Aug 23, 2019

512 or anything less than or equal to 1024 should be a safe choice, because in loader we also have a buffered channel which is hard coded to be 1024 in length.

@lichunzhu
Copy link
Contributor Author

@suzaku @july2993 I have modified defaultBinlogItemCount from 65536 to 512, PTAL.

Copy link
Contributor

@suzaku suzaku left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@suzaku
Copy link
Contributor

suzaku commented Aug 23, 2019

@lichunzhu Just writing 512 may be more readable than 1 >> 9?

@lichunzhu
Copy link
Contributor Author

@lichunzhu Just writing 512 may be more readable than 1 >> 9?

Revised.

Copy link
Contributor

@july2993 july2993 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@july2993 july2993 merged commit 423cef7 into pingcap:master Aug 23, 2019
@lichunzhu lichunzhu deleted the czli/changeCacheCount branch September 2, 2019 03:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants