
Improve sparse dataset memory usage for hist tree method. #4503

Closed
trivialfis opened this issue May 26, 2019 · 8 comments · Fixed by #4625

Comments

@trivialfis
Member

trivialfis commented May 26, 2019

@hcho3 @RAMitchell

As part of refactoring DMatrix, I'm trying to profile memory usage issues across XGBoost. One glaring issue in the hist method is that it uses the feature index to size the histogram cut matrix. For dense datasets this is optimal, but for sparse datasets it results in a huge waste of memory. One dataset that reproduces the problem is URL.

The linked dataset has a total size of 2.2 GB in text form. After some optimizations and limiting to the first 450,000 lines of the dataset, it still costs 28.6 GB of memory. Here is a simple report from valgrind:

[Screenshot: valgrind memory report, 2019-05-26 21-14-27]

In some selected rows of the above dataset, the largest feature index is 3,231,962, but the actual number of non-zero columns is just 394. Is there any way we can compress histogram cut building into a form that doesn't depend on the feature index?
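One possible direction, sketched below as a minimal illustration rather than xgboost's actual implementation, is to remap the feature indices that actually occur in the data to a contiguous dense range, so that cut and histogram structures are sized by the number of present features instead of the largest feature index:

```python
# Hypothetical sketch: remap occurring feature indices to a dense range.
# Names (build_feature_remap, rows) are illustrative, not xgboost code.

def build_feature_remap(row_indices_list):
    """Map each feature index that actually occurs to a compact position."""
    present = sorted({j for row in row_indices_list for j in row})
    return {j: pos for pos, j in enumerate(present)}

# Rows given as lists of (feature_index, value) pairs, CSR-style.
rows = [
    [(3, 1.0), (3231962, 0.5)],
    [(3, 2.0), (1048576, 1.5)],
]
remap = build_feature_remap([[j for j, _ in row] for row in rows])
compact_rows = [[(remap[j], v) for j, v in row] for row in rows]
# Column-indexed structures now need len(remap) == 3 slots
# instead of 3,231,963.
```

With such a mapping, the cut matrix for the URL example above would need 394 columns rather than 3.2 million, at the cost of one index translation per lookup.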

@trivialfis
Member Author

And I need to see if it's possible to do feature grouping before building the histogram cuts.

@CodingCat
Member

When I was working on distributed hist, I found that histograms were still built for all features even with column sampling. I thought this was the bottleneck and worked on it.

At that time, I added something like a featureIdToHistIndex (but I didn't merge it since I found the real bottleneck was rabit)... is a similar approach enough here?

@trivialfis
Member Author

@CodingCat This is still at the quantile cut building stage; I'm not sure about the hist updater yet. Your approach was my initial idea, but there may be other problems I'm not aware of, so I figured I should ask for suggestions here first.

@hcho3 hcho3 pinned this issue May 26, 2019
@RAMitchell
Member

Yes, we should definitely look into the scalability of cut building with the number of features. URL is a great test case. I would like to apply what we learn to the GPU algorithms as well.

Regarding training, I believe LightGBM has the concept of a sparse histogram. Another approach I have seen is to switch to exact-style split search when the number of training examples is low (e.g. less than the number of features).
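The sparse-histogram idea can be illustrated with a small sketch (this is not LightGBM's actual code, just the concept): instead of a dense array with one slot per bin, store gradient statistics only for the bins that actually receive data in a leaf's partition.

```python
# Illustrative sketch of a sparse histogram: only non-empty bins are stored.
from collections import defaultdict

def build_sparse_histogram(bin_indices, grads, hessians):
    """Accumulate (gradient, hessian) sums keyed by bin index."""
    hist = defaultdict(lambda: [0.0, 0.0])
    for b, g, h in zip(bin_indices, grads, hessians):
        hist[b][0] += g
        hist[b][1] += h
    return dict(hist)

# A feature may have up to 255 bins, but only 3 occur in this partition.
hist = build_sparse_histogram(
    bin_indices=[7, 7, 200, 42],
    grads=[0.1, 0.2, -0.3, 0.4],
    hessians=[1.0, 1.0, 1.0, 1.0],
)
# len(hist) == 3, versus 255 entries in a dense layout.
```

As leaves get smaller deep in the tree, the gap between occupied bins and the dense bin count grows, which is where this representation pays off.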

@thvasilo
Contributor

thvasilo commented May 27, 2019

I think this is quite related to the problem that @hcho3 and I tried to solve in our recent paper on block-distributed GBTs.

Using a sparse histogram representation gave us orders-of-magnitude improvements in histogram size. We could/should try to find an efficient sparse histogram representation that works within Rabit/MPI's limitation of needing to know the object size in advance.

@trivialfis
Member Author

@thvasilo Interesting. Will read the paper later. If there's available experimental source code then maybe I can help. :)

@thvasilo
Contributor

@trivialfis We implemented these algorithms from scratch but leaning heavily on the xgboost codebase. @hcho3 looked into open-sourcing this (work was done at Amazon) but there are some licensing issues there that are holding us back.

I think the important idea to consider is how we can get rid of all the redundant zeros that are currently being communicated in the gradient histograms.

Briefly: XGBoost creates histograms that, for each feature, have as many bins as min(numBins, featureUniqueValues). For every leaf, which has its own partition of the data, we create and communicate these histograms.

The issue is that as we grow the tree, it becomes more likely that buckets will have a zero value, as each leaf's partition ends up with fewer and fewer data points. Fewer data points -> fewer unique values to populate the bins.

For a continuous feature that would usually easily cover the, say, 255 buckets, we will always communicate 255 values, even if only a couple are non-zero. This is because Rabit and MPI require that the size of the objects being communicated be known in advance. This leads to large amounts of redundant communication and memory use.
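To make the redundancy concrete, here is a toy sketch (not Rabit's API, just an illustration of the encoding trade-off): a dense buffer always ships every bin, while a sparse encoding ships only (bin_index, value) pairs but has a length that is not known ahead of time.

```python
# Toy illustration of dense vs. sparse histogram payloads.

def dense_to_sparse(dense):
    """Encode only the non-zero entries as (bin_index, value) pairs."""
    return [(i, v) for i, v in enumerate(dense) if v != 0.0]

dense = [0.0] * 255   # fixed-size buffer, as a Rabit/MPI allreduce expects
dense[12] = 1.5
dense[200] = -0.7
sparse = dense_to_sparse(dense)
# The dense payload always carries 255 values; the sparse payload here
# carries 2 pairs, but its size depends on the data, which is exactly
# what fixed-size allreduce cannot accommodate.
```

This is why the parameter-server setting described above, where workers can send objects of arbitrary size, is able to exploit the sparsity that Rabit/MPI cannot.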

Similarly, for the feature quantile sketches, we communicate their maximum possible size at the beginning of learning. In reality each worker's feature sketches will be much smaller, but because of the practical limitations of Rabit/MPI we need to communicate the max size. In our work we use the parameter server, where each worker can send objects of arbitrary size, and the reduction in communication is even more pronounced:

[Figure: communication volume with the parameter server vs. fixed-size communication]

These kinds of gains would allow the quantile sketches to be communicated for each leaf, enabling equivalent accuracy with smaller sketches, as shown in Figure 3 of the original XGBoost paper (they call them local vs. global proposals):

[Figure 3 from the XGBoost paper: local vs. global sketch proposals]

I hope to write a longer paper about these techniques in the future; we can talk with @hcho3 if you're interested in helping out.

@trivialfis
Member Author

@thvasilo Thanks for the explanation. I'll review the paper once #4488 is resolved. This could greatly improve the current histogram method.

@hcho3 hcho3 unpinned this issue Jul 4, 2019
@lock lock bot locked as resolved and limited conversation to collaborators Oct 3, 2019