
Prediction by indices (subsample < 1) #6683

Merged
merged 3 commits into dmlc:master on Mar 16, 2021

Conversation

@RukhovichIV (Contributor) commented Feb 5, 2021

Currently, the prediction cache is not enabled for models with subsample < 1, so a full prediction is made after each training iteration. This PR makes it possible to partially update the predictions using the existing cache and run the actual prediction only on the rows that were not used for training.
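For illustration only, here is a minimal sketch of the idea in plain C++; it is not the PR's actual code, and `Node`, `PredictRow`, and `UpdateCacheForUnusedRows` are hypothetical stand-ins. Rows that participated in training already have the new tree's leaf values accumulated into the cache by the updater (their leaf assignment is known from the row partition), so only the complement needs a real tree traversal:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical minimal tree node: a node is a leaf iff left == -1.
struct Node {
  int left = -1, right = -1;  // child indices; -1 marks a leaf
  int feature = 0;            // split feature index
  float threshold = 0.f;      // split threshold
  float value = 0.f;          // leaf value
};

// Walk one row down one tree and return its leaf value.
float PredictRow(const std::vector<Node>& tree,
                 const std::vector<float>& row) {
  int nid = 0;
  while (tree[nid].left != -1) {
    nid = row[tree[nid].feature] < tree[nid].threshold ? tree[nid].left
                                                       : tree[nid].right;
  }
  return tree[nid].value;
}

// Rows used in training already had this tree's leaf values added to
// the cache, so only the unused rows need an actual traversal.
void UpdateCacheForUnusedRows(const std::vector<Node>& tree,
                              const std::vector<std::vector<float>>& data,
                              const std::vector<std::size_t>& unused_rows,
                              std::vector<float>* out_preds) {
  for (std::size_t ridx : unused_rows) {
    (*out_preds)[ridx] += PredictRow(tree, data[ridx]);
  }
}
```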

We noticed an almost 2x acceleration for the PredictRaw section (on Santander, where subsample == 0.5).
We also encountered a slowdown in the InitData section, where we now do more computation (this will only be observed with subsample < 1). This could be improved in future updates.

UPD:
Here are some measurements:

| Higgs, 1m, subsample == 0.9 | PredictRaw | InitData | Overall time |
| --- | --- | --- | --- |
| current master branch, s | 2.57 | 7.12 | 24.70 |
| #6683, s | 1.51 | 8.51 | 24.86 |
| speedup, ratio | 1.7x | 0.84x | 0.994x |

| Airline, 1m, subsample == 0.9 | PredictRaw | InitData | Overall time |
| --- | --- | --- | --- |
| current master branch, s | 37.48 | 7.46 | 81.47 |
| #6683, s | 8.17 | 8.67 | 51.71 |
| speedup, ratio | 4.59x | 0.86x | 1.58x |

| Mortgage, 9m, subsample == 0.9 | PredictRaw | InitData | Overall time |
| --- | --- | --- | --- |
| current master branch, s | 5.19 | 6.4 | 25.93 |
| #6683, s | 1.73 | 7.6 | 23.48 |
| speedup, ratio | 3x | 0.84x | 1.1x |

| Santander, subsample == 0.5 | PredictRaw | InitData | Overall time |
| --- | --- | --- | --- |
| master just before #6696, s | 79.11 | 14.08 | 163.41 |
| current master branch, s | 56.81 | 15.78 | 145.52 |
| #6683, s | 37.77 | 21.40 | 129.41 |
| total speedup, ratio | 2.1x | 0.66x | 1.26x |

@trivialfis self-requested a review on February 5, 2021, 13:08
@trivialfis (Member) left a comment


Can we reconsider this? This PR seems to be an optimization for the case where subsample is used with hist, which is a limited use case, and it also complicates and duplicates the existing code.

@RukhovichIV (Contributor, author)

@trivialfis, we tried to simplify this code as much as possible; it now seems to me that this is one of the shortest ways to do this optimization. But we still need to partially duplicate the code of PredictBatchByBlockOfRowsKernel(). That function was created to make predictions with all new trees at once (in the current iteration), because each row of the training sample must be used in each tree, and that is the fastest way to do it. But now (when subsample < 1) we have a separate rowset to predict on for each tree, which is why we have to process each tree separately. We tried to unify these branches as much as possible, but some code is still duplicated.
As for optimizing only hist: yes, at the moment the acceleration is obtained only there. Of course, we can do the same optimization for other methods in the future, but PredictRaw does not seem to be the major bottleneck in the other methods. Let's first try to do this only for hist.
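For illustration only, a sketch of the two prediction paths being contrasted, with hypothetical names rather than the PR's code:

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Stand-in for a single-tree traversal that adds tree `tid`'s leaf
// value for row `ridx` into the prediction cache.
using AccumulateFn = std::function<void(std::size_t tid, std::size_t ridx)>;

// subsample == 1: every row passes through every new tree, so one pass
// over the rows can apply all trees at once; this is the cache-friendly
// layout that PredictBatchByBlockOfRowsKernel() exploits.
void PredictAllTreesAtOnce(std::size_t n_rows, std::size_t n_trees,
                           const AccumulateFn& accumulate) {
  for (std::size_t ridx = 0; ridx < n_rows; ++ridx) {
    for (std::size_t tid = 0; tid < n_trees; ++tid) {
      accumulate(tid, ridx);
    }
  }
}

// subsample < 1: each tree has its own set of rows still missing from
// the cache, so trees must be processed one by one over different row
// sets; this is the separate branch the PR adds.
void PredictPerTreeRowSets(const std::vector<std::vector<std::size_t>>& unused,
                           const AccumulateFn& accumulate) {
  for (std::size_t tid = 0; tid < unused.size(); ++tid) {
    for (std::size_t ridx : unused[tid]) {
      accumulate(tid, ridx);
    }
  }
}
```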

@RukhovichIV marked this pull request as ready for review on February 17, 2021, 05:20
@RukhovichIV changed the title from "WIP: Prediction by indices (subsample < 1)" to "Prediction by indices (subsample < 1)" on Feb 17, 2021
@trivialfis (Member)

Sorry for the late reply. I think this PR is a workaround for hist not having prediction caching when subsample is enabled. Feel free to correct me. But I think it's possible to have a prediction cache even when subsample is used; so far, that's what the GPU hist does.

@RukhovichIV (Contributor, author)

I'm not really aware of what's going on in GPU hist, but here's what we're doing on the CPU:
https://github.com/dmlc/xgboost/blob/master/src/tree/updater_quantile_hist.cc#L115
https://github.com/dmlc/xgboost/blob/master/src/tree/updater_quantile_hist.cc#L131
We simply skip the accumulated prediction cache when subsample < 1, despite the fact that we could partially use it. This PR enables its use for hist.
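For context, a paraphrase of that guard; this is not the verbatim source, and the struct and function names here are illustrative:

```cpp
#include <vector>

struct TrainParam {
  float subsample = 1.0f;
};

// Paraphrased shape of the old guard in UpdatePredictionCache: with
// subsampling enabled, the cache was declared unusable outright, even
// though the sampled rows already had known leaf assignments.
bool UpdatePredictionCacheOld(const TrainParam& param,
                              std::vector<float>* out_preds) {
  if (param.subsample < 1.0f) {
    return false;  // caller falls back to a full prediction pass
  }
  // ... accumulate the new tree's leaf values into *out_preds using
  // the row partition built during training ...
  (void)out_preds;
  return true;
}
```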

We added unit tests for the new CPU prediction branch (as part of the UpdatePredictionCache test) and extended TestInitDataSampling for the new behaviour.

It looks like there's been some strange error in Travis (https://travis-ci.org/github/dmlc/xgboost/jobs/760854002#L18269). Should I re-push this to restart the check?

@RAMitchell (Member) left a comment


Could you just ensure leaf_value_cache_ is up-to-date at the end of training (maybe here).

We want to avoid communicating internals of separate parts of the program as much as possible. If you do it this way, none of the interfaces change.

@RukhovichIV (Contributor, author) commented Mar 3, 2021

Could you please explain in more detail what you want us to check? I checked the code again: it looks like leaf_value_cache_ is not used at all. It can easily be removed and nothing will change; I think I can even remove it in this PR if there are no objections from your side.
As for checking for cache availability, we do such a check here: https://github.com/RukhovichIV/xgboost/blob/prediction_by_indices/src/predictor/cpu_predictor.cc#L260
If UpdatePredictionCache returns false for some reason, then tree_begin will be less than tree_end and we will fall into the default prediction branch.
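A sketch of that control flow with hypothetical names (the real check is at the link above):

```cpp
#include <cstdint>

// If the updater refreshed the cache, every tree is already covered and
// the loop body never runs; if UpdatePredictionCache returned false,
// tree_begin stays 0 and the default prediction branch handles all trees.
void PredictWithCacheFallback(bool cache_updated, std::uint32_t n_trees) {
  std::uint32_t tree_begin = cache_updated ? n_trees : 0;
  std::uint32_t tree_end = n_trees;
  for (std::uint32_t tid = tree_begin; tid < tree_end; ++tid) {
    (void)tid;  // default branch: run a full prediction with tree `tid`
  }
}
```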

> We want to avoid communicating internals of separate parts of the program as much as possible

We've been thinking about the best way to make this optimization and arrived at this design. The point is that we obtain the indices of the rows on which we need to make a prediction in the InitData part of the Updater, while it is logical to make the prediction in the part designated for it, PredictRaw. So we must somehow transfer this array between these parts of the program. If you think there's a better solution for this, let's discuss it.

@RAMitchell (Member)

Ah, I didn't notice that leaf_value_cache_ is no longer used. In that case, can you ensure the row set collection is complete at the end of Update, so that it contains all required rows?

The current design is problematic. It makes life harder in other parts of xgboost to serve a very specific optimisation case.

@RAMitchell (Member)

To clarify, I don't mean just adding a check, I mean ensuring row_set_collection_ actually contains all rows at the end of training and removing unused_indices.

Sorry for the ambiguity.

@RukhovichIV (Contributor, author)

> ensuring row_set_collection_ actually contains all rows at the end of training

But how can we verify this? If subsample >= 1, this is automatically true, as it always was, so why would we need to check it? And if subsample < 1, the collection will only contain about nrows * subsample rows randomly drawn from the full set. What can we check here? If you mean adding unit tests for it, they have been added.

> and removing unused_indices

Do you object to this field? Yes, we could use row_set_collection_ to derive this array, but we think it would be much slower than creating such an array directly in InitSampling(). And it still wouldn't save us from communication between Updater and Predictor, because this field is also contained in the Updater.
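For illustration, a sketch (hypothetical signature, not the PR's code) of why collecting the complement during sampling is essentially free: Bernoulli-style sampling already visits every row once, so the unused rows can be gathered in the same pass instead of being reconstructed from row_set_collection_ afterwards.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Sample rows with probability `subsample`; the complement (unused
// rows) is collected in the same single pass over the data.
void InitSampling(float subsample, std::size_t n_rows, std::mt19937* rng,
                  std::vector<std::size_t>* used_rows,
                  std::vector<std::size_t>* unused_rows) {
  std::bernoulli_distribution coin(subsample);
  used_rows->clear();
  unused_rows->clear();
  for (std::size_t ridx = 0; ridx < n_rows; ++ridx) {
    (coin(*rng) ? used_rows : unused_rows)->push_back(ridx);
  }
}
```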

@RukhovichIV (Contributor, author)

It seems to me that the only way to avoid any communication between Updater and Predictor is to make predictions on these unused_indices right in the Updater (for example, at the end of UpdatePredictionCache()), but I think that this is not what we expect from the Updater class, and it will also lead to copying some parts of the existing code from Predictor.

@RAMitchell (Member)

Yes, that is what I want you to do. Prediction on a single tree is not complicated; give it a try and see what it looks like.

@RAMitchell (Member) left a comment


Looks good in general, just a few comments.

src/tree/updater_quantile_hist.cc (review thread on an outdated diff; resolved)

```cpp
@@ -680,30 +683,63 @@ bool QuantileHistMaker::Builder<GradientSumT>::UpdatePredictionCache(
    }
  });

  if (param_.subsample < 1.0f) {
```
Member:

Is it potentially easier to complete this step at the end of the update instead of in UpdatePredictionCache? That way we don't need p_last_fmat_mutable_, as we are guaranteed to have a valid DMatrix.

Contributor (author):

I don't think it will be easier. We already have other things in UpdatePredictionCache() that are needed for predicting, such as out_preds, the number of trees, and the index of the current tree. Moving them to Update() would also cost us a few extra fields in the Updater. But we can probably remove const from the existing p_last_fmat_ field.

Member:

OK, that seems fair. I don't like this p_last_fmat_ business much; it's a raw pointer that we are making assumptions about. For example, if someone changed other parts of the program, it would be easy to invalidate its use in the updater in unpredictable situations. If you can think of a better way to handle this in general, that could be a nice change for the future.

Contributor (author):

Thank you for sharing your vision. We are going to do some refactoring for hist in the near future. At first glance, we can replace the raw pointers with smart ones. Next, perhaps, we will think about some hashing for DMatrix.

Member:

You should definitely chat with @trivialfis about hist refactoring and work on an RFC.

```cpp
// tree rows that were not used for current training
std::vector<size_t> unused_rows_;
// feature vectors for subsampled prediction
std::vector<RegTree::FVec> feat_vecs_;
```
Member:
Do you get any performance benefit from keeping this as a member variable instead of creating it locally? Avoid extra member variables if possible.

Contributor (author):

Yes, this field lets us avoid allocating nthread * nfeatures units of memory on each UpdatePredictionCache() call. Here are the results for one of our datasets:

| Santander, subsample == 0.5 | UpdatePredictionCache, s | Overall time, s |
| --- | --- | --- |
| with feat_vecs_ as a member variable | 42.7465 | 142.7305 |
| with local allocations | 49.3696 | 147.4053 |
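A toy illustration of the trade-off, with hypothetical classes; `FVec` here stands in for RegTree::FVec:

```cpp
#include <cstddef>
#include <vector>

// Stand-in for RegTree::FVec: a dense feature buffer used during
// tree traversal.
struct FVec {
  std::vector<float> data;
};

// Keeping the per-thread buffers as a member means the
// nthread * nfeatures allocation happens once per Builder, not once
// per UpdatePredictionCache() call (i.e. once per boosting iteration).
class Builder {
 public:
  Builder(std::size_t n_threads, std::size_t n_features)
      : feat_vecs_(n_threads) {
    for (FVec& fv : feat_vecs_) {
      fv.data.resize(n_features);
    }
  }

  void UpdatePredictionCache() {
    // reuse feat_vecs_[thread_id] inside the parallel prediction loop;
    // a local std::vector<FVec> here would reallocate on every call
  }

 private:
  std::vector<FVec> feat_vecs_;  // one reusable buffer per thread
};
```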

@RAMitchell merged commit 19a2c54 into dmlc:master on Mar 16, 2021