Gradient based sampling for external memory mode on GPU #5093
Conversation
@rongou Can you provide a summary of what this pull request does? Why a new kind of sampling? To reduce the workload of making new releases, I'd like to start writing summaries each month, similar to https://discuss.tvm.ai/t/tvm-monthly-nov-2019/5038
@hcho3 I think it's this one: https://arxiv.org/abs/1901.09047
@hcho3 @trivialfis I'll write a more detailed summary once I'm confident that it's actually working properly. The main reason for this is that the naive implementation of external memory mode on GPU requires reading back all the pages for every tree node, which is pretty expensive since data are moved over the PCIe bus. By doing some kind of smart sampling and keeping the sampled page in GPU memory, we can hope to get reasonable performance without degrading the accuracy too much. The main idea is from https://arxiv.org/abs/1803.00841, but LightGBM also does something similar.
Thanks a lot for the paper references. |
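For context, a minimal host-side sketch of the sampling idea: rows are kept with probability proportional to their gradient/hessian magnitude, so hard-to-fit rows survive with high probability. The weighting below is an illustrative assumption, not the exact formula from the papers or from this PR.

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Toy gradient pair; the real code uses xgboost::GradientPair.
struct GradPair { float grad; float hess; };

// Keep roughly sample_rate * n rows, biased towards rows with large gradients.
std::vector<std::size_t> GradientBasedSample(const std::vector<GradPair>& gpair,
                                             double sample_rate, unsigned seed) {
  double total = 0.0;
  for (const auto& g : gpair) total += std::sqrt(g.grad * g.grad + g.hess * g.hess);
  std::vector<std::size_t> selected;
  if (total == 0.0) return selected;  // nothing informative to sample from
  std::mt19937 rng(seed);
  std::uniform_real_distribution<double> unif(0.0, 1.0);
  for (std::size_t i = 0; i < gpair.size(); ++i) {
    double score = std::sqrt(gpair[i].grad * gpair[i].grad + gpair[i].hess * gpair[i].hess);
    // Expected number of kept rows is approximately sample_rate * n.
    double p = std::min(1.0, sample_rate * gpair.size() * score / total);
    if (unif(rng) < p) selected.push_back(i);
  }
  return selected;
}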
Just before reviewing the PR, could you provide the general pipeline as a code comment, something similar to:
external data -> adaptor -> sparsepage -> ellpack page -> sampling algorithm -> sampled ?? page -> tree updater
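As an illustration, such a header comment could look roughly like the following; the stage names follow the arrow diagram above, and the reference to ExternalMemoryNoSampling anticipates the code discussed later in this thread.

// Data flow for GPU external-memory training with sampling (high level):
//
//   external data --> adaptor --> SparsePage (on disk) --> EllpackPage batches
//     --> gradient-based sampling --> single sampled EllpackPage in GPU memory
//     --> GPU tree updater
//
// Without sampling, the EllpackPage batches are instead concatenated into one
// in-memory page (see ExternalMemoryNoSampling).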
include/xgboost/base.h
Outdated
XGBOOST_DEVICE GradientPairInternal<T> operator/(float divider) const {
  GradientPairInternal<T> g;
  g.grad_ = grad_ / divider;
Be careful about division by zero.
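For illustration, one way such a guard could look, using a toy stand-in type rather than the actual GradientPairInternal (and noting that, as the reply below says, the operator was ultimately removed):

// Toy stand-in just to show the guard; not the real xgboost type.
template <typename T>
struct GradientPairSketch {
  T grad {0};
  T hess {0};

  GradientPairSketch operator/(float divider) const {
    GradientPairSketch g;
    // Guard against division by zero: return a zero pair. Clamping the divider
    // or asserting are alternative policies.
    if (divider == 0.0f) return g;
    g.grad = grad / divider;
    g.hess = hess / divider;
    return g;
  }
};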
Turns out we don't really need these. Removed.
The code is ready, but it looks like there is a failing test involving …
@RAMitchell @trivialfis I refactored the code; the existing behavior is preserved for uniform sampling. This PR actually reverts some of the changes made to …
@rongou The next release should be really close; the last remaining blocker is model IO for the Scikit-Learn interface. I will try to merge this PR once we can split up the branch. It should be fine, as I believe we can make it into the next RAPIDS release.
@RAMitchell I think it's ready for merging. WDYT?
GradientBasedSample ExternalMemoryNoSampling::Sample(common::Span<GradientPair> gpair,
                                                     DMatrix* dmat) {
  if (!page_concatenated_) {
    // Concatenate all the external memory ELLPACK pages into a single in-memory page.
Why is it even possible to do this? Seems redundant to allow a user to build external memory pages only to concatenate them.
I think this option is here mostly for completeness' sake. If the whole dataset fits in GPU memory, then presumably you wouldn't want to use external memory; if it doesn't fit, you probably want to play around with sampling.
I have noticed that writing out external pages and then concatenating them together might allow you to train on slightly larger datasets versus keeping everything in memory, probably because of lower working memory requirement. Not sure how useful that is though.
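In plain terms, the no-sampling path just appends every streamed page into one resident buffer, done once and then reused for every tree. A toy version of that idea (not the actual ELLPACK layout or XGBoost internals):

#include <cstddef>
#include <vector>

// Toy "page": a flat buffer of encoded rows plus a row count.
struct Page {
  std::vector<int> buffer;
  std::size_t n_rows {0};
};

// Append each streamed page into a single in-memory page, mirroring the
// page_concatenated_ flag in the snippet above.
Page ConcatenatePages(const std::vector<Page>& pages) {
  Page out;
  for (const auto& page : pages) {
    out.buffer.insert(out.buffer.end(), page.buffer.begin(), page.buffer.end());
    out.n_rows += page.n_rows;
  }
  return out;
}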
Restarted the test. Will merge unless @rongou has other comments regarding @RAMitchell's review.
Codecov Report
@@ Coverage Diff @@
## master #5093 +/- ##
==========================================
+ Coverage 77.89% 83.83% +5.94%
==========================================
Files 11 11
Lines 2330 2407 +77
==========================================
+ Hits 1815 2018 +203
+ Misses 515 389 -126
Continue to review full report at Codecov.
In GPU external memory mode, rely on gradient-based sampling to allow for bigger datasets. This is the final step for #4357.
The main idea is from https://arxiv.org/abs/1803.00841.
High level design:
On a generated synthetic dataset (1 million rows, 50 features), gradient-based sampling still works down to sampling about 10% of the rows with little loss of accuracy, while uniform random sampling only works at 50% of the data and completely fails to converge at 10%.
On a larger synthetic dataset (30 million rows, 200 features), CPU external memory mode actually runs out of memory on my 32GB desktop, while GPU external memory mode works fine on a Titan V (12GB).
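For anyone who wants to try this end to end, a rough usage sketch via the C API is below. The parameter names (tree_method=gpu_hist, sampling_method=gradient_based, subsample) and the external-memory cache suffix in the file name reflect the documented interface for this feature and should be read as assumptions rather than as part of this PR's diff; error checking is omitted.

#include <xgboost/c_api.h>

int main() {
  // The "#cache" suffix asks XGBoost to build the DMatrix in external memory
  // mode, spilling pages to files prefixed with "cache".
  DMatrixHandle dtrain;
  XGDMatrixCreateFromFile("train.libsvm#cache", 0, &dtrain);

  BoosterHandle booster;
  XGBoosterCreate(&dtrain, 1, &booster);
  XGBoosterSetParam(booster, "tree_method", "gpu_hist");
  // Assumed parameter names: sample ~10% of rows per tree, chosen with
  // gradient-based sampling instead of uniform sampling.
  XGBoosterSetParam(booster, "sampling_method", "gradient_based");
  XGBoosterSetParam(booster, "subsample", "0.1");

  for (int iter = 0; iter < 100; ++iter) {
    XGBoosterUpdateOneIter(booster, iter, dtrain);
  }

  XGBoosterFree(booster);
  XGDMatrixFree(dtrain);
  return 0;
}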
Some benchmark numbers for the synthetic 1 million row, 50 feature dataset:
@RAMitchell @trivialfis @sriramch