Improve sparse dataset memory usage for hist tree method. #4503
Comments
And I need to see if it's possible to do feature grouping before building the histogram cuts.
When I was working on distributed hist, I found histograms were still built for all features even with column sampling. I thought this was the bottleneck and worked on it at the time: I added something like a featureIdToHistIndex mapping (but I didn't merge it since I found the real bottleneck was rabit)... is a similar approach enough here?
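For illustration, a minimal sketch of what such a featureIdToHistIndex mapping could look like (assumptions only, not the unmerged patch; all names other than the mapping idea are made up): sampled feature ids are translated to compact slots, so histogram storage scales with the number of sampled features rather than the largest feature index.

```cpp
#include <cstdint>
#include <iostream>
#include <unordered_map>
#include <vector>

using FeatureId = uint32_t;

// Hypothetical helper: map original feature id -> dense histogram slot,
// built only over the features kept by column sampling.
std::unordered_map<FeatureId, size_t> BuildFeatureIdToHistIndex(
    const std::vector<FeatureId>& sampled_features) {
  std::unordered_map<FeatureId, size_t> feature_to_hist;
  feature_to_hist.reserve(sampled_features.size());
  for (size_t slot = 0; slot < sampled_features.size(); ++slot) {
    feature_to_hist.emplace(sampled_features[slot], slot);
  }
  return feature_to_hist;
}

int main() {
  // Suppose column sampling kept only these 3 features out of millions.
  std::vector<FeatureId> sampled = {7, 4096, 3231962};
  auto feature_to_hist = BuildFeatureIdToHistIndex(sampled);

  // Histogram storage is sized by the number of sampled features,
  // not by the largest feature id.
  constexpr size_t kBins = 256;
  std::vector<double> hist(sampled.size() * kBins, 0.0);

  // Accumulate a gradient for feature 3231962, bin 10.
  FeatureId fid = 3231962;
  size_t bin = 10;
  hist[feature_to_hist.at(fid) * kBins + bin] += 0.5;

  std::cout << "histogram entries allocated: " << hist.size() << "\n";
  return 0;
}
```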
@CodingCat This is still at the quantile cut building stage; I'm not sure about the hist updater yet. Your approach is my initial idea, but there may be other problems that I don't know about, so I figured I should ask for suggestions here first.
Yes, we should definitely look into cut building scalability with the number of features. URL is a great test case. I would like to apply what we learn to the GPU algorithms as well. Regarding training, I believe LightGBM has the concept of a sparse histogram. Another approach that I have seen is to switch to exact-style split search when the number of training examples is low (e.g. less than the number of features).
I think this is quite related to the problem we tried to solve with @hcho3 in our recent paper on block-distributed GBTs. Using a sparse histogram representation gave us orders of magnitude improvements in histogram size. We could/should try to find a way to get an efficient sparse histogram representation that works with Rabit/MPI's limitation of knowing the object size in advance.
@thvasilo Interesting. Will read the paper later. If there's experimental source code available, maybe I can help. :)
@trivialfis We implemented these algorithms from scratch, but leaning heavily on the xgboost codebase. @hcho3 looked into open-sourcing this (the work was done at Amazon), but there are some licensing issues there that are holding us back.

I think the important idea to consider is how we can get rid of all the redundant zeros that are currently being communicated in the gradient histograms. Briefly: for each feature, XGBoost will create histograms with as many bins as the configured maximum (the max_bin parameter). The issue is that as we grow the tree, it becomes more likely that buckets will have a zero value, since each leaf's partition ends up with fewer and fewer data points. Fewer data points mean fewer unique values to populate the bins. For a continuous feature that would usually easily cover, say, 255 buckets, we will always communicate 255 values, even if only a couple are non-zero. This is because Rabit and MPI require the size of the objects being communicated to be known in advance. This leads to great amounts of redundant communication and memory use.

Similarly for the feature quantile sketches, we communicate their maximum possible size at the beginning of learning. In reality each worker's feature histograms will be much smaller, but because of the practical limitations of Rabit/MPI we need to communicate the max size. In our work we're using the parameter server, where each worker can send objects of arbitrary size, and the reduction in communication is even more pronounced.

These types of gains would allow the quantile sketches to be communicated for each leaf, allowing for equivalent accuracy with smaller sketches, as shown in the original XGBoost paper (they call them local vs. global proposals) in Figure 3.

I hope to write a longer paper about these techniques in the future; we can talk with @hcho3 if you're interested in helping out.
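As a rough illustration of the sparse-histogram idea (assumed types and names, not code from the paper or from XGBoost): a dense per-node histogram is converted to a list of (bin index, gradient sum) pairs before communication, so all-zero bins cost nothing.

```cpp
#include <cstdint>
#include <iostream>
#include <utility>
#include <vector>

// Hypothetical gradient pair: sum of gradients and hessians in one bin.
struct GradPair {
  double grad;
  double hess;
  bool IsZero() const { return grad == 0.0 && hess == 0.0; }
};

// Dense histogram: one GradPair per (feature, bin).
using DenseHistogram = std::vector<GradPair>;
// Sparse histogram: only non-zero bins, keyed by their flat index.
using SparseHistogram = std::vector<std::pair<uint32_t, GradPair>>;

SparseHistogram ToSparse(const DenseHistogram& dense) {
  SparseHistogram sparse;
  for (size_t i = 0; i < dense.size(); ++i) {
    if (!dense[i].IsZero()) {
      sparse.emplace_back(static_cast<uint32_t>(i), dense[i]);
    }
  }
  return sparse;
}

int main() {
  // A deep leaf: 2 features x 255 bins, but only a handful of bins populated.
  DenseHistogram dense(2 * 255);  // value-initialized to all zeros
  dense[3] = {0.7, 1.0};
  dense[258] = {-0.2, 1.0};

  SparseHistogram sparse = ToSparse(dense);
  std::cout << "dense bins: " << dense.size()
            << ", non-zero bins actually needed: " << sparse.size() << "\n";
  return 0;
}
```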
@hcho3 @RAMitchell
As part of refactoring DMatrix, I'm trying to profile memory usage issues across XGBoost. One glaring issue in the hist method is that it uses the feature index to build the histogram cut matrix. For dense datasets this is optimal, but for sparse datasets it results in a huge waste of memory. One reproducible example is the URL dataset.
The linked dataset has a total size of 2.2 GB in text form. After some optimizations and limiting it to the first 450,000 lines, it still uses 28.6 GB of memory. Here is a simple report from valgrind:
In some selected rows of the above dataset, the largest feature index is 3231962, but the actual number of non-zero columns is just 394. Is there any way we can compress the histogram cut building into a form that doesn't depend on the feature index?
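One possible direction, sketched below with hypothetical names (this is not the existing cut-building code): collect the set of columns that actually contain entries and size the cut storage by that set, translating a raw feature index to its compact slot on lookup, so memory scales with the ~394 active columns rather than a 3,231,962-wide feature space.

```cpp
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <vector>

int main() {
  // Column indices of the non-zero entries, CSR-style (toy example).
  std::vector<uint32_t> col_idx = {3, 3231962, 42, 3, 42, 1048576};

  // Sorted list of distinct active columns.
  std::vector<uint32_t> active_cols = col_idx;
  std::sort(active_cols.begin(), active_cols.end());
  active_cols.erase(std::unique(active_cols.begin(), active_cols.end()),
                    active_cols.end());

  // Cut-related storage is now sized by active columns only.
  constexpr size_t kMaxBins = 256;
  std::vector<float> cut_values;
  cut_values.reserve(active_cols.size() * kMaxBins);

  // Translate a raw feature index to its compact slot via binary search.
  auto CompactIndex = [&](uint32_t fid) {
    return std::lower_bound(active_cols.begin(), active_cols.end(), fid) -
           active_cols.begin();
  };

  std::cout << "active columns: " << active_cols.size()
            << ", reserved cut slots: " << cut_values.capacity()
            << ", compact slot of feature 3231962: " << CompactIndex(3231962)
            << "\n";
  return 0;
}
```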