
Improve sparse dataset memory usage for hist tree method. #4503

Closed
trivialfis opened this issue May 26, 2019 · 8 comments · Fixed by #4625

Comments

@trivialfis
Member

trivialfis commented May 26, 2019

@hcho3 @RAMitchell

As part of refactoring DMatrix, I'm trying to profile memory usage issues across XGBoost. One glaring issue in the hist method is that it uses the feature index to size the histogram cut matrix. For dense datasets this is optimal, but for sparse datasets it results in a huge waste of memory. One dataset that reproduces the problem is URL.

The linked dataset has a total size of 2.2 GB in text form. After some optimizations and limiting to the first 450,000 lines of the dataset, it still costs 28.6 GB of memory. Here is a simple report from valgrind:

[Screenshot: valgrind memory report, 2019-05-26 21-14-27]

In some selected rows of the above dataset, the largest feature index is 3,231,962, but the actual number of non-zero columns is just 394. Is there any way we can compress histogram cut building into a form that doesn't depend on the feature index?
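One possible direction, sketched below as a minimal illustration rather than xgboost's actual implementation, is to remap the feature indices that actually occur in the data to a contiguous dense range, so that cut and histogram structures are sized by the number of present features instead of the largest feature index:

```python
# Hypothetical sketch: remap occurring feature indices to a dense range.
# Names (build_feature_remap, rows) are illustrative, not xgboost code.

def build_feature_remap(row_indices_list):
    """Map each feature index that actually occurs to a compact position."""
    present = sorted({j for row in row_indices_list for j in row})
    return {j: pos for pos, j in enumerate(present)}

# Rows given as lists of (feature_index, value) pairs, CSR-style.
rows = [
    [(3, 1.0), (3231962, 0.5)],
    [(3, 2.0), (1048576, 1.5)],
]
remap = build_feature_remap([[j for j, _ in row] for row in rows])
compact_rows = [[(remap[j], v) for j, v in row] for row in rows]
# Column-indexed structures now need len(remap) == 3 slots
# instead of 3,231,963.
```

With such a mapping, the cut matrix for the URL example above would need 394 columns rather than 3.2 million, at the cost of one index translation per lookup.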

@trivialfis
Member Author

And I need to see if it's possible to do feature grouping before building the histogram cuts.

@CodingCat
Member

When I was working on distributed hist, I found that histograms were still built for all features even with column sampling. I thought this was the bottleneck and worked on it.

At that time, I added something like a featureIdToHistIndex (but I didn't merge it since I found the real bottleneck was rabit)... is a similar approach enough here?

@trivialfis
Member Author

@CodingCat This is still at the quantile cut building stage; I'm not sure about the hist updater yet. Your approach was my initial idea, but there may be other problems I'm not aware of, so I figured I should ask for suggestions here first.

@hcho3 hcho3 pinned this issue May 26, 2019
@RAMitchell
Member

Yes, we should definitely look into the scalability of cut building with the number of features. URL is a great test case. I would like to apply what we learn to the GPU algorithms as well.

Regarding training, I believe LightGBM has the concept of a sparse histogram. Another approach I have seen is to switch to exact-style split search when the number of training examples is low (e.g. less than the number of features).
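The sparse-histogram idea can be illustrated with a small sketch (this is not LightGBM's actual code, just the concept): instead of a dense array with one slot per bin, store gradient statistics only for the bins that actually receive data in a leaf's partition.

```python
# Illustrative sketch of a sparse histogram: only non-empty bins are stored.
from collections import defaultdict

def build_sparse_histogram(bin_indices, grads, hessians):
    """Accumulate (gradient, hessian) sums keyed by bin index."""
    hist = defaultdict(lambda: [0.0, 0.0])
    for b, g, h in zip(bin_indices, grads, hessians):
        hist[b][0] += g
        hist[b][1] += h
    return dict(hist)

# A feature may have up to 255 bins, but only 3 occur in this partition.
hist = build_sparse_histogram(
    bin_indices=[7, 7, 200, 42],
    grads=[0.1, 0.2, -0.3, 0.4],
    hessians=[1.0, 1.0, 1.0, 1.0],
)
# len(hist) == 3, versus 255 entries in a dense layout.
```

As leaves get smaller deep in the tree, the gap between occupied bins and the dense bin count grows, which is where this representation pays off.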

@thvasilo
Contributor

thvasilo commented May 27, 2019

I think this is quite related to the problem that @hcho3 and I tried to solve in our recent paper on block-distributed GBTs.

Using a sparse histogram representation gave us orders-of-magnitude improvements in histogram size. We could/should try to find an efficient sparse histogram representation that works within Rabit/MPI's limitation of needing to know the object size in advance.

@trivialfis
Member Author

@thvasilo Interesting. Will read the paper later. If there's available experimental source code then maybe I can help. :)

@thvasilo
Contributor

@trivialfis We implemented these algorithms from scratch but leaning heavily on the xgboost codebase. @hcho3 looked into open-sourcing this (work was done at Amazon) but there are some licensing issues there that are holding us back.

I think the important idea to consider is how we can get rid of all the redundant zeros that are currently being communicated in the gradient histograms.

Briefly: XGBoost creates histograms that, for each feature, have as many bins as min(numBins, featureUniqueValues). For every leaf, which has its own partition of the data, we create and communicate these histograms.

The issue is that as we grow the tree, it becomes more likely that buckets will have a zero value, as each leaf's partition ends up with fewer and fewer data points. Fewer data points -> fewer unique values to populate the bins.

For a continuous feature that would usually easily cover the, say, 255 buckets, we will always communicate 255 values, even if only a couple are non-zero. This is because Rabit and MPI require that the size of the objects being communicated be known in advance. This leads to large amounts of redundant communication and memory use.
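To make the redundancy concrete, here is a toy sketch (not Rabit's API, just an illustration of the encoding trade-off): a dense buffer always ships every bin, while a sparse encoding ships only (bin_index, value) pairs but has a length that is not known ahead of time.

```python
# Toy illustration of dense vs. sparse histogram payloads.

def dense_to_sparse(dense):
    """Encode only the non-zero entries as (bin_index, value) pairs."""
    return [(i, v) for i, v in enumerate(dense) if v != 0.0]

dense = [0.0] * 255   # fixed-size buffer, as a Rabit/MPI allreduce expects
dense[12] = 1.5
dense[200] = -0.7
sparse = dense_to_sparse(dense)
# The dense payload always carries 255 values; the sparse payload here
# carries 2 pairs, but its size depends on the data, which is exactly
# what fixed-size allreduce cannot accommodate.
```

This is why the parameter-server setting described above, where workers can send objects of arbitrary size, is able to exploit the sparsity that Rabit/MPI cannot.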

Similarly, for the feature quantile sketches, we communicate their maximum possible size at the beginning of learning. In reality each worker's feature sketches will be much smaller, but because of the practical limitations of Rabit/MPI we need to communicate the max size. In our work we use the parameter server, where each worker can send objects of arbitrary size, and the reduction in communication is even more pronounced:

[Figure: communication volume with the parameter server vs. fixed-size communication]

These kinds of gains would allow the quantile sketches to be communicated for each leaf, enabling equivalent accuracy with smaller sketches, as shown in Figure 3 of the original XGBoost paper (they call them local vs. global proposals):

[Figure 3 from the XGBoost paper: local vs. global sketch proposals]

I hope to write a longer paper about these techniques in the future; we can talk with @hcho3 if you're interested in helping out.

@trivialfis
Member Author

@thvasilo Thanks for the explanation. I'll review the paper once #4488 is resolved. This could greatly improve the current histogram method.

@hcho3 hcho3 unpinned this issue Jul 4, 2019
@lock lock bot locked as resolved and limited conversation to collaborators Oct 3, 2019