Gradient based sampling for external memory mode on GPU #5093

Merged: 70 commits into dmlc:master on Feb 4, 2020

Conversation

rongou (Contributor) commented on Dec 6, 2019

In GPU external memory mode, rely on gradient-based sampling to allow for bigger datasets. The final step for #4357.

The main idea is from https://arxiv.org/abs/1803.00841.

High-level design:

  • At the beginning of each round, take the gradient pairs and perform weighted sampling without replacement, based on the absolute value of the gradient (see the sketch after this list).
  • Compact all the sampled rows in the DMatrix into a single ELLPACK page.
  • Construct the tree the same way as in-memory mode, using the compacted page that's kept in memory.
  • After the tree is constructed, loop through all the original pages to finalize the position of all the rows.
  • Repeat for each round.
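
A minimal CPU-only sketch of the weighted sampling step described above. The function name `WeightedSampleRows`, the key-based selection (Efraimidis–Spirakis keys), and the zero-weight guard are illustrative assumptions; the PR's actual sampler runs on the GPU and may use a different algorithm.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical helper: pick `sample_rows` row indices without replacement,
// with probability proportional to |gradient|.
std::vector<size_t> WeightedSampleRows(const std::vector<float>& abs_gradients,
                                       size_t sample_rows, unsigned seed = 0) {
  std::mt19937 rng(seed);
  std::uniform_real_distribution<double> uniform(0.0, 1.0);

  // Assign each row a random key u^(1/w); rows with larger keys are kept.
  std::vector<double> keys(abs_gradients.size());
  for (size_t i = 0; i < abs_gradients.size(); ++i) {
    double w = std::max<double>(abs_gradients[i], 1e-12);  // guard against zero weights
    keys[i] = std::pow(uniform(rng), 1.0 / w);
  }

  // Keep the `sample_rows` rows with the largest keys.
  std::vector<size_t> idx(abs_gradients.size());
  std::iota(idx.begin(), idx.end(), 0);
  sample_rows = std::min(sample_rows, idx.size());
  std::nth_element(idx.begin(), idx.begin() + sample_rows, idx.end(),
                   [&](size_t a, size_t b) { return keys[a] > keys[b]; });
  idx.resize(sample_rows);
  return idx;  // rows to compact into the in-memory ELLPACK page
}
```

The returned row indices correspond to the rows that would then be compacted into the single in-memory ELLPACK page used to grow the tree for that round.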

On a generated synthetic dataset (1 million rows, 50 features), gradient-based sampling still works down to about 10% of the total rows sampled with little loss of accuracy, while uniform random sampling only works at 50% of the data and completely fails to converge at 10%.

On a larger synthetic dataset (30 million rows, 200 features), CPU external memory mode actually runs out of memory on my 32GB desktop, while GPU external memory mode works fine on a Titan V (12GB).

Some benchmark numbers for the synthetic 1 million row, 50 feature dataset:

  • number of rounds: 500
  • max_depth = 0
  • max_leaves = 256
  • grow_policy = lossguide

Mode                                    Eval Error   Time (seconds)
CPU in-core                             0.0083       508.40
CPU external memory                     0.0096       513.58
GPU in-core                             0.0086       31.13
GPU external memory, sample all rows    0.0093       46.30
GPU external memory, sample 50%         0.0081       124.44
GPU external memory, sample 10%         0.0085       121.29
GPU external memory, sample 5%          0.0098       120.48
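
For context, a hedged sketch of how a run like the ones above could be configured through the XGBoost C API. The `sampling_method=gradient_based` and `subsample` parameter names follow the public XGBoost documentation for this feature; the file name and `#dtrain.cache` suffix are placeholders, and return-code checks are omitted for brevity.

```cpp
#include <xgboost/c_api.h>

int main() {
  // External-memory DMatrix: the "#...cache" suffix asks XGBoost to page the
  // data to disk instead of holding it all in host memory.
  DMatrixHandle dtrain;
  XGDMatrixCreateFromFile("train.libsvm#dtrain.cache", 0, &dtrain);

  BoosterHandle booster;
  XGBoosterCreate(&dtrain, 1, &booster);

  // GPU hist with gradient-based sampling of roughly 10% of the rows per round.
  XGBoosterSetParam(booster, "tree_method", "gpu_hist");
  XGBoosterSetParam(booster, "sampling_method", "gradient_based");
  XGBoosterSetParam(booster, "subsample", "0.1");
  XGBoosterSetParam(booster, "grow_policy", "lossguide");
  XGBoosterSetParam(booster, "max_depth", "0");
  XGBoosterSetParam(booster, "max_leaves", "256");

  for (int iter = 0; iter < 500; ++iter) {
    XGBoosterUpdateOneIter(booster, iter, dtrain);
  }

  XGBoosterFree(booster);
  XGDMatrixFree(dtrain);
  return 0;
}
```

With `subsample` set to 0.1, this roughly corresponds to the "sample 10%" row in the table above.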

@RAMitchell @trivialfis @sriramch

hcho3 (Collaborator) commented on Dec 6, 2019

@rongou Can you provide a summary of what this pull request does? Why a new kind of sampling? To reduce the workload of making new releases, I'd like to start writing summaries each month, similar to https://discuss.tvm.ai/t/tvm-monthly-nov-2019/5038

trivialfis (Member) commented

@hcho3 I think it's this one: https://arxiv.org/abs/1901.09047

rongou (Contributor, Author) commented on Dec 6, 2019

@hcho3 @trivialfis I'll write a more detailed summary once I'm confident that it's actually working properly. The main reason for this is that the naive implementation of external memory mode on GPU requires reading back all the pages for every tree node, which is pretty expensive since data are moved over the PCIe bus. By doing some kind of smart sampling and keeping the sampled page in GPU memory, we can hope to get reasonable performance without degrading the accuracy too much. The main idea is from https://arxiv.org/abs/1803.00841, but LightGBM also does something similar.

hcho3 (Collaborator) commented on Dec 6, 2019

Thanks a lot for the paper references.

trivialfis (Member) left a review comment:

Just before reviewing the PR, could you provide a general pipeline as a code comment, something similar to:

external data -> adaptor -> sparsepage -> ellpack page -> sampling algorithm -> sampled ?? page -> tree updater


XGBOOST_DEVICE GradientPairInternal<T> operator/(float divider) const {
  GradientPairInternal<T> g;
  g.grad_ = grad_ / divider;
  g.hess_ = hess_ / divider;  // divide the hessian as well and return the new pair
  return g;
}
Review comment (Member):

Be careful for 0 division.

Reply (rongou, Contributor Author):

Turns out we don't really need these. Removed.

rongou (Contributor, Author) commented on Jan 23, 2020

The code is ready, but it looks like there is a failing test involving XGBRFClassifier. I need to do some debugging.

rongou (Contributor, Author) commented on Jan 29, 2020

@RAMitchell @trivialfis I refactored the code; the existing behavior is preserved for uniform sampling. This PR actually reverts some of the changes made to gpu_hist in my last PR, so it's arguably less risky.

trivialfis (Member) commented

@rongou The next release should be really close; the last remaining blocker is model IO for the scikit-learn interface. I will try to merge this PR once we can split up the branch. It should be fine, as I believe we can make it into the next RAPIDS release.

trivialfis (Member) commented

@RAMitchell I think it's ready for merging. WDYT?

GradientBasedSample ExternalMemoryNoSampling::Sample(common::Span<GradientPair> gpair,
                                                     DMatrix* dmat) {
  if (!page_concatenated_) {
    // Concatenate all the external memory ELLPACK pages into a single in-memory page.
Review comment (Member):

Why is it even possible to do this? Seems redundant to allow a user to build external memory pages only to concatenate them.

Reply (rongou, Contributor Author):

I think this option is here mostly for completeness' sake. If the whole dataset fits in GPU memory, then presumably you wouldn't want to use external memory; if it doesn't fit, you probably want to play around with sampling.

I have noticed that writing out external pages and then concatenating them might allow you to train on slightly larger datasets than keeping everything in memory, probably because of the lower working-memory requirement. I'm not sure how useful that is, though.

trivialfis (Member) commented

Restarted the test. Will merge unless @rongou has other comments regarding @RAMitchell's review.

codecov-io commented

Codecov Report

Merging #5093 into master will increase coverage by 5.94%.
The diff coverage is 92.72%.

@@            Coverage Diff             @@
##           master    #5093      +/-   ##
==========================================
+ Coverage   77.89%   83.83%   +5.94%     
==========================================
  Files          11       11              
  Lines        2330     2407      +77     
==========================================
+ Hits         1815     2018     +203     
+ Misses        515      389     -126
Impacted Files                       Coverage Δ
python-package/xgboost/compat.py     53.95% <83.33%> (+5.17%) ⬆️
python-package/xgboost/sklearn.py    90.88% <97.29%> (+1.66%) ⬆️
python-package/xgboost/rabit.py      67.1% <0%> (+3.94%) ⬆️
python-package/xgboost/tracker.py    93.97% <0%> (+15.66%) ⬆️
python-package/xgboost/dask.py       90.3% <0%> (+28.42%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 97ae33d...7fd7c31.

trivialfis merged commit e4b74c4 into dmlc:master on Feb 4, 2020
hcho3 mentioned this pull request on Feb 21, 2020
lock bot locked as resolved and limited conversation to collaborators on May 5, 2020
rongou deleted the gradient-based-sampler branch on November 18, 2022