Use mmap for external memory. #9282
Conversation
@@ -809,12 +810,11 @@ class GPUHistMaker : public TreeUpdater {
  collective::Broadcast(&column_sampling_seed, sizeof(column_sampling_seed), 0);

  auto batch_param = BatchParam{param->max_bin, TrainParam::DftSparseThreshold()};
  auto page = (*dmat->GetBatches<EllpackPage>(ctx_, batch_param).begin()).Impl();
This initiates an iteration on the sparse DMatrix but doesn't finish it. As a result, we run sketching twice before this PR. There are a couple of places where we could eliminate batch fetching, but I will leave that as a future optimization:
Line 638 in e70810b
xgboost/src/data/sparse_page_dmatrix.cc, line 170 in e70810b
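To make the pattern concrete, here is a hedged, illustrative fragment (not the actual change in this PR) contrasting dereferencing only the first batch with walking the whole range. It assumes the surrounding GPUHistMaker scope from the diff above (ctx_, dmat, param); whether the caller needs every page depends on the updater.

```cpp
// Illustrative fragment only: assumes the surrounding GPUHistMaker scope from
// the diff above (ctx_, dmat, param); not the actual change in this PR.
auto batch_param = BatchParam{param->max_bin, TrainParam::DftSparseThreshold()};

// Dereferencing begin() starts an iteration over the sparse DMatrix without
// finishing it, which is the pattern described in the comment above:
auto page = (*dmat->GetBatches<EllpackPage>(ctx_, batch_param).begin()).Impl();

// Walking the whole range instead lets the iteration run to completion, so the
// cached pages can be reused rather than sketching being triggered again:
for (auto const& ellpack : dmat->GetBatches<EllpackPage>(ctx_, batch_param)) {
  (void)ellpack;  // touch every page so the underlying iterator finishes
}
```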
@rongou Would you like to take a look when you are available? I have some experiments in mind for external memory; this PR is the first step. I will follow up with some of them after 2.0:
* Maybe there are other types of memory-reduction algorithms, and I would also like to learn more about them.
Very nice docs as usual.
On a system with 128GB memory, I am able to train a model with hist on a dataset of about 290GB.
* Windows support is removed.
* Performance is bounded by IO, and that's unlikely to change for the foreseeable future. I ran the test on a PCIe 4 NVMe drive. Observing the run from htop/iotop, the disk read is relatively consistent; fetching the gradient index runs at about 3GB/s throughout the training.
There are still many limiting factors in scaling with storage since we only batch the predictor, but this PR makes external memory a bit more practical.
I will do some more experiments in the coming days. At the moment, using mmap can help reuse the Linux page cache (not in this branch yet); allocating large chunks of memory is an extremely expensive operation when memory is already under pressure.
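As a rough illustration of why mmap helps here, the sketch below memory-maps a cache file read-only with POSIX mmap so that the kernel page cache backs the pages instead of a fresh large allocation. The helper name, error handling, and overall structure are assumptions for illustration, not XGBoost code.

```cpp
// Illustrative POSIX sketch, not XGBoost code: map an external-memory page
// file read-only so the kernel page cache backs it, instead of allocating a
// large buffer and copying the file into it.
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#include <cstddef>
#include <stdexcept>
#include <utility>

std::pair<void*, std::size_t> MapPageFile(char const* path) {
  int fd = ::open(path, O_RDONLY);
  if (fd < 0) { throw std::runtime_error("open failed"); }

  struct stat st;
  if (::fstat(fd, &st) != 0) {
    ::close(fd);
    throw std::runtime_error("fstat failed");
  }

  // Read-only, private mapping: pages are faulted in lazily and stay in the
  // kernel page cache, so re-reading them avoids another large allocation.
  void* ptr = ::mmap(nullptr, static_cast<std::size_t>(st.st_size), PROT_READ,
                     MAP_PRIVATE, fd, 0);
  ::close(fd);  // the mapping keeps the underlying file referenced
  if (ptr == MAP_FAILED) { throw std::runtime_error("mmap failed"); }

  return {ptr, static_cast<std::size_t>(st.st_size)};
}

// The caller releases the mapping with munmap(ptr, size) when done.
```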
The next step after this PR is to make the DMatrix pages' size immutable after construction. This way, we can reuse the pointer from mmap in structures like SparsePage. The end goal is to make sure XGBoost can use the Linux cache efficiently.
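A minimal sketch of what that could look like, assuming a hypothetical non-owning view type; the names and layout below are illustrative and do not reflect XGBoost's actual SparsePage.

```cpp
// Hypothetical sketch: once a page's size is fixed after construction, its
// storage can be a non-owning view over the mmap-ed cache file instead of an
// owned buffer. These types are illustrative, not XGBoost's SparsePage.
#include <cstddef>
#include <cstdint>

template <typename T>
struct MappedSpan {  // non-owning view into the memory-mapped region
  T const* data{nullptr};
  std::size_t size{0};
};

struct PageView {
  // CSR-style layout: row offsets plus (feature index, value) arrays, all
  // pointing directly into the mapping rather than copies in process memory.
  MappedSpan<std::size_t> offset;
  MappedSpan<std::uint32_t> findex;
  MappedSpan<float> fvalue;
};
```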