
drainer/: Change unreasonable buff size and reduce memory usage #735

Merged: 11 commits into pingcap:release-2.1 from hjh/var on Sep 9, 2019

Conversation

@july2993 (Contributor) commented Aug 30, 2019

What problem does this PR solve?

The maxBinlogItemCount variable affects the job buffer size in an unreasonable way.

syncer.sync works roughly like this:

// worker-count goroutines each run this loop
for {
    job := <-jobChan // jobChan chan *job
    if reachSomeCondition(job) {
        // execute the pending jobs (run the SQL);
        // jobChan is not consumed while this runs
        executePendingJobs()
    }
}

Note that executing the SQL may take a long time, during which the job channel is not consumed. So the upstream may block when adding a job to the channel, and consequently cannot add jobs to the other syncers' job channels either.
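
Below is a minimal runnable sketch of this blocking behavior (the names job, jobChan, executePendingJobs, and the buffer size are illustrative, not the actual drainer code):

package main

import "time"

type job struct{ sql string }

func main() {
	// A small buffer stands in for the per-syncer job channel.
	jobChan := make(chan *job, 2)

	// Worker: consumes slowly because "executing SQL" takes time.
	go func() {
		for j := range jobChan {
			_ = j
			time.Sleep(time.Second) // slow SQL execution stalls consumption
		}
	}()

	// Upstream producer: blocks as soon as the buffer is full while the
	// worker is busy, and so cannot feed any other syncers' job channels.
	for i := 0; i < 10; i++ {
		jobChan <- &job{sql: "INSERT ..."}
	}
}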

What is changed and how it works?

Avoid buffering at the upstream, which prevents excessive memory usage without decreasing performance, and make the buffer size of the job channel no longer depend on maxBinlogItemCount.
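
Conceptually, the sizing change looks like the following sketch (the new buffer size shown is illustrative; the point is only the removal of the maxBinlogItemCount dependency):

package main

type job struct{}

const maxBinlogItemCount = 64 * 1024 // the previous default, per the description above

func main() {
	// Before: the job channel buffer was tied to maxBinlogItemCount, so
	// the 64k default could pin a large amount of memory per syncer.
	oldChan := make(chan *job, maxBinlogItemCount)

	// After (sketch): a small fixed buffer that no longer depends on
	// maxBinlogItemCount.
	newChan := make(chan *job, 1024)

	_, _ = oldChan, newChan
}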

[Screenshot: Screen Shot 2019-08-30 at 2.43.43 PM]

In the following, v1 means f9e4589, and v2 means v1 plus one more commit, 3074b3b.

  • before 10:53: version v2.1.16, maxBinlogItemCount = 16
  • starting at 10:53: version v1, maxBinlogItemCount = 16
  • starting at 11:17: version v2.1.16, maxBinlogItemCount = 64k (the previous default value)
  • starting at 11:47: version v1, maxBinlogItemCount = 0

All runs use worker-count = 256 and txn-batch = 20. Note that when the drainer restarts, it runs in safe mode for the first 5 minutes; the event metrics count the changed events.

Another test:

The upstream data contains one binlog of nearly 60M.

  • starting at 19:25: version v1
  • starting at 19:45: version v2
[Screenshot: Screen Shot 2019-08-30 at 7.56.23 PM]
[Screenshot: Screen Shot 2019-08-30 at 7.56.14 PM]

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)

@july2993 (Contributor, Author)

/run-all-tests

@july2993 (Contributor, Author)

/run-all-tests

july2993 changed the title from "drainer/: Change unreasonable buff size" to "drainer/: Change unreasonable buff size and reduce memory usage" on Aug 30, 2019
@july2993 (Contributor, Author)

/run-all-tests

@ericsyh (Contributor) commented Aug 30, 2019

/run-all-tests

@IANTHEREAL (Collaborator)

lgtm

@WangXiangUSTC (Contributor)

/run-all-tests

@zier-one (Contributor) commented Aug 30, 2019

Why did CI fail?

@july2993 (Contributor, Author)

> Why did CI fail?

I don't know. It failed at the status test, but that's not related to this PR. (It runs OK locally, and CI also failed at another PR against 2.1.)

@zier-one (Contributor)

LGTM

zier-one previously approved these changes Aug 30, 2019
@lichunzhu (Contributor)

/run-integration-test

@lichunzhu (Contributor) commented Sep 3, 2019

IMO, CI fails because the drainer hasn't consumed pump 8250's stored binlogs within 15 seconds of closing the pump. This problem can be reproduced on my server.

@july2993 (Contributor, Author) commented Sep 3, 2019

> IMO, CI fails because the drainer hasn't consumed pump 8250's stored binlogs within 15 seconds of closing the pump. This problem can be reproduced on my server.

Is it related to this PR's change? Can you fix this?

@lichunzhu (Contributor)

> IMO, CI fails because the drainer hasn't consumed pump 8250's stored binlogs within 15 seconds of closing the pump. This problem can be reproduced on my server.
>
> Is it related to this PR's change? Can you fix this?

I'm not sure yet. I'm working on it now and will reach a conclusion later.

@july2993 (Contributor, Author) commented Sep 3, 2019

/run-all-tests

@lichunzhu (Contributor)

/run-all-tests

@july2993 (Contributor, Author) commented Sep 5, 2019

/run-all-tests

@lichunzhu (Contributor)

There is still a 4s gap between the drainer and the pump after 15s. I guess that when the drainer has a cache, it can buffer some binlogs from the pump, so the drainer can still sort and execute the cached binlogs even though the pump is paused. Maybe binlogChanSize shouldn't be set to 0. Or we can make the check time in check_status bigger.
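
A small sketch of the caching effect described above (binlogChanSize is the config name from the comment; the channel and values here are illustrative): with a buffered channel, the drainer can keep draining binlogs that were cached before the pump paused, while a size of 0 leaves nothing to consume.

package main

import "fmt"

func main() {
	// With binlogChanSize > 0, some binlogs are already buffered when
	// the pump pauses, so the drainer can still sort and execute them.
	cached := make(chan int, 4)
	for i := 0; i < 4; i++ {
		cached <- i // the pump fills the buffer, then pauses
	}
	close(cached) // stands in for the paused pump

	for b := range cached {
		fmt.Println("still consuming cached binlog", b)
	}

	// With binlogChanSize = 0, nothing is buffered: once the pump
	// pauses, the drainer has no cached binlogs left to apply.
}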

@july2993 (Contributor, Author) commented Sep 8, 2019

/run-all-tests

@july2993 (Contributor, Author) commented Sep 8, 2019

/run-all-tests

@july2993 (Contributor, Author) commented Sep 8, 2019

/run-all-tests

@july2993 (Contributor, Author) commented Sep 9, 2019

/run-all-tests

@july2993 (Contributor, Author) commented Sep 9, 2019

@lichunzhu PTAL. The root cause is that when it runs the status test, it starts with commit-ts = 0 and fetches data from the oldest point.

@lichunzhu (Contributor) left a review comment:

LGTM

july2993 merged commit 9d7fb86 into pingcap:release-2.1 on Sep 9, 2019
july2993 deleted the hjh/var branch on September 9, 2019 at 03:07
lichunzhu added a commit that referenced this pull request Sep 25, 2019
* Avoid any buffering at upstream to prevent excessive memory usage without decreasing performance.
* Add txnManager in loader to manage cached txn memory usage. Without txnManager, the performance of the loader would be affected.