Optimize cpu sketch allreduce for sparse data. #6009

Merged on Aug 19, 2020 (6 commits)

trivialfis
Member

@trivialfis trivialfis commented Aug 12, 2020

Follow-up to #5880. @ShvetsKS @SmirnovEgorRu Previously, sketching the URL dataset in a distributed environment was not possible because memory usage blew up. After this PR I can unify the sketching for hist and approx and proceed with further refactoring to unify the codebase.

This PR also fixes a bug in distributed training caused by an incorrect allreduce call, and adds tests.
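
To illustrate why exchanging per-feature sketches between workers can blow up memory on sparse data such as URL, here is a minimal sketch. It is not the code from this PR. It assumes each feature's sketch is a list of summary entries and compares two hypothetical serialization layouts: padding every feature to the size of the largest sketch, versus packing the entries contiguously with an offsets array (CSR-style). All function names and sizes are made up for illustration.

```python
from typing import List, Tuple


def dense_layout_size(sketch_sizes: List[int]) -> int:
    """Number of entries needed when every feature is padded to the largest sketch."""
    if not sketch_sizes:
        return 0
    return len(sketch_sizes) * max(sketch_sizes)


def pack_sketches(sketches: List[List[float]]) -> Tuple[List[int], List[float]]:
    """Pack per-feature sketches into one flat buffer plus an offsets array."""
    offsets = [0]
    values: List[float] = []
    for sketch in sketches:
        values.extend(sketch)
        offsets.append(len(values))  # end of this feature's slice
    return offsets, values


def unpack_sketches(offsets: List[int], values: List[float]) -> List[List[float]]:
    """Recover the per-feature sketches from the packed layout."""
    return [values[offsets[i]:offsets[i + 1]] for i in range(len(offsets) - 1)]


# Hypothetical sparse input: 1000 features, only 10 of which have any entries.
sketches = [[float(i)] * 256 if i < 10 else [] for i in range(1000)]

offsets, values = pack_sketches(sketches)
dense = dense_layout_size([len(s) for s in sketches])
packed = len(values) + len(offsets)

print("dense entries:", dense)
print("packed entries (values + offsets):", packed)
```

With mostly empty features, the padded layout scales with the number of features times the largest sketch, while the packed layout scales with the number of entries actually present, which is the kind of gap that matters when buffers are gathered from every worker.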

Perf

The "Before" numbers are taken from #5880.

Single Node

Before

  • URL
    GmatInitialization: 10.3033s, 1 calls @ 10303323us
    GmatInitialization: 10.2406s, 1 calls @ 10240576us
    GmatInitialization: 10.224s, 1 calls @ 10224020us

  • HIGGS
    GmatInitialization: 3.74322s, 1 calls @ 3743224us
    GmatInitialization: 3.63079s, 1 calls @ 3630793us
    GmatInitialization: 3.67905s, 1 calls @ 3679049us

After

  • HIGGS
    GmatInitialization: 3.7794s, 1 calls @ 3779401us
    GmatInitialization: 3.81085s, 1 calls @ 3810851us
    GmatInitialization: 3.77258s, 1 calls @ 3772581us
    GmatInitialization: 3.76844s, 1 calls @ 3768436us

  • URL
    GmatInitialization: 10.315s, 1 calls @ 10314976us
    GmatInitialization: 10.3202s, 1 calls @ 10320246us
    GmatInitialization: 10.3059s, 1 calls @ 10305877us

Multi Node 4x4

Before

  • HIGGS
    GmatInitialization: 3.7682s, 1 calls @ 3768198us
    GmatInitialization: 3.707s, 1 calls @ 3707000us
    GmatInitialization: 3.70828s, 1 calls @ 3708276us

  • URL
    N/A (sketching could not complete because memory usage blew up)

After

  • HIGGS
    GmatInitialization: 3.64623s, 1 calls @ 3646232us
    GmatInitialization: 3.5942s, 1 calls @ 3594197us
    GmatInitialization: 3.73079s, 1 calls @ 3730792us

  • URL
    GmatInitialization: 6.53294s, 1 calls @ 6532936us
    GmatInitialization: 6.43215s, 1 calls @ 6432153us
    GmatInitialization: 6.67619s, 1 calls @ 6676192us
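
The repeated runs above can be averaged with a short script. This is a minimal sketch, assuming the monitor lines always follow the `label: <seconds>s, <n> calls @ <microseconds>us` shape seen in this comment; the regular expression and the helper name `parse_timings` are illustrative and not part of XGBoost.

```python
import re
from statistics import mean

# Pattern inferred from the log lines quoted in this PR description.
LINE_RE = re.compile(r"^\s*(\w+): ([\d.]+)s, (\d+) calls @ (\d+)us\s*$")


def parse_timings(text):
    """Return {label: [elapsed seconds, ...]} using the microsecond field."""
    result = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip blank lines, bullets, and anything that is not a timing line
        label, _seconds, _calls, micros = match.groups()
        result.setdefault(label, []).append(int(micros) / 1e6)
    return result


# Multi-node 4x4, URL, after this PR (copied from the lines above).
url_multi_node_after = """
GmatInitialization: 6.53294s, 1 calls @ 6532936us
GmatInitialization: 6.43215s, 1 calls @ 6432153us
GmatInitialization: 6.67619s, 1 calls @ 6676192us
"""

timings = parse_timings(url_multi_node_after)
average = mean(timings["GmatInitialization"])
print(f"GmatInitialization mean: {average:.3f}s over {len(timings['GmatInitialization'])} runs")
```

The same function can be applied to the other blocks to compare the before and after runs side by side.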

@ShvetsKS I might also have found the cause of the slowdown you mentioned. This is the memory usage reported by massif when training on URL in a distributed environment; it measures memory rather than time, but it should be related to the performance issue:
Screenshot from 2020-08-13 05-15-24

I can't get a complete run as my home machine only has 32 GB of memory. Hope that helps.

@trivialfis trivialfis force-pushed the optimize-cpu-sketch-allreduce branch 3 times, most recently from 13e690e to a0757a1 Compare August 17, 2020 18:17
@trivialfis
Member Author

@ShvetsKS Could you please help take a look?

src/common/quantile.cc (review thread, resolved)
@trivialfis trivialfis force-pushed the optimize-cpu-sketch-allreduce branch from a0757a1 to 51a285e Compare August 18, 2020 03:45
@codecov-commenter

codecov-commenter commented Aug 18, 2020

Codecov Report

Merging #6009 into master will increase coverage by 0.10%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #6009      +/-   ##
==========================================
+ Coverage   79.04%   79.14%   +0.10%     
==========================================
  Files          12       12              
  Lines        3025     3040      +15     
==========================================
+ Hits         2391     2406      +15     
  Misses        634      634              

Impacted Files                       Coverage Δ
python-package/xgboost/sklearn.py    91.44% <0.00%> (+0.06%) ⬆️
python-package/xgboost/core.py       78.22% <0.00%> (+0.08%) ⬆️
python-package/xgboost/data.py       59.40% <0.00%> (+0.85%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a418278...e3c53e8.

@trivialfis
Member Author

@ShvetsKS I'm merging this because it contains a bug fix that's blocking the CI. If you have concerns, feel free to revert it or let me know how I should make follow-up PRs.

@trivialfis trivialfis merged commit 29b7fea into dmlc:master Aug 19, 2020
@trivialfis trivialfis deleted the optimize-cpu-sketch-allreduce branch August 19, 2020 02:03