Optimize cpu sketch allreduce for sparse data. #6009
Conversation
Force-pushed from 13e690e to a0757a1.
@ShvetsKS Could you please help take a look?
Force-pushed from a0757a1 to 51a285e.
Codecov Report
@@            Coverage Diff             @@
##           master    #6009      +/-   ##
==========================================
+ Coverage   79.04%   79.14%   +0.10%
==========================================
  Files          12       12
  Lines        3025     3040      +15
==========================================
+ Hits         2391     2406      +15
  Misses        634      634
@ShvetsKS I'm merging this as it contains a bug fix that's blocking the CI. If you have concerns, feel free to revert, or let me know how I should structure follow-up PRs.
Follow-up on #5880. @ShvetsKS @SmirnovEgorRu Previously, sketching on the URL dataset in a distributed environment was not possible due to a memory usage blow-up. After this PR I can unify the sketching for `hist` and `approx` and proceed with other refactoring to unify the codebase. Also fixed a bug in distributed training caused by an incorrect allreduce call, and added tests. A rough sketch of the memory issue is shown below.
Perf
Previous perf numbers are from #5880.
Single Node
Before
URL
GmatInitialization: 10.3033s, 1 calls @ 10303323us
GmatInitialization: 10.2406s, 1 calls @ 10240576us
GmatInitialization: 10.224s, 1 calls @ 10224020us
HIGGS
GmatInitialization: 3.74322s, 1 calls @ 3743224us
GmatInitialization: 3.63079s, 1 calls @ 3630793us
GmatInitialization: 3.67905s, 1 calls @ 3679049us
After
HIGGS
GmatInitialization: 3.7794s, 1 calls @ 3779401us
GmatInitialization: 3.81085s, 1 calls @ 3810851us
GmatInitialization: 3.77258s, 1 calls @ 3772581us
GmatInitialization: 3.76844s, 1 calls @ 3768436us
URL
GmatInitialization: 10.315s, 1 calls @ 10314976us
GmatInitialization: 10.3202s, 1 calls @ 10320246us
GmatInitialization: 10.3059s, 1 calls @ 10305877us
Multi Node 4x4
Before
HIGGS
GmatInitialization: 3.7682s, 1 calls @ 3768198us
GmatInitialization: 3.707s, 1 calls @ 3707000us
GmatInitialization: 3.70828s, 1 calls @ 3708276us
URL
NaN
After
HIGGS
GmatInitialization: 3.64623s, 1 calls @ 3646232us
GmatInitialization: 3.5942s, 1 calls @ 3594197us
GmatInitialization: 3.73079s, 1 calls @ 3730792us
URL
GmatInitialization: 6.53294s, 1 calls @ 6532936us
GmatInitialization: 6.43215s, 1 calls @ 6432153us
GmatInitialization: 6.67619s, 1 calls @ 6676192us
@ShvetsKS Also, I might have found the cause of the slowdown you mentioned. This is the memory usage from massif when training on URL in a distributed environment; it's a memory profile, but it should be related to the performance: [massif memory-usage snapshot]
I can't get a complete run as my home machine only has 32 GB of memory. Hope that helps.
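For anyone reproducing the profile: massif is valgrind's heap profiler, so a typical invocation (standard valgrind usage, not a command taken from this PR) is `valgrind --tool=massif <training command>`, which writes a `massif.out.<pid>` file that can be rendered with `ms_print`.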