Provide quantiles and average binning for RF histograms #2309
Conversation
The main concerns are:
- How does this map to the DPC++ part of the algorithm? Are similar changes planned to be propagated there?
- Test coverage needs to be extended to cover both binning strategies: `quantiles` and `averages`.
Please also see the other comments from this review.
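To illustrate the difference between the two strategies under review, here is a hedged sketch (function names are illustrative, not the oneDAL API): "averages" binning splits the value range into equal-width intervals, while "quantiles" binning places bin edges so each bin holds roughly the same number of rows.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Equal-width ("averages") bin edges over the value range of a sorted column.
// Illustrative sketch only; not the oneDAL implementation.
std::vector<double> equalWidthEdges(const std::vector<double> & sorted, std::size_t maxBins)
{
    std::vector<double> edges;
    const double width = (sorted.back() - sorted.front()) / static_cast<double>(maxBins);
    for (std::size_t b = 1; b < maxBins; ++b) edges.push_back(sorted.front() + b * width);
    return edges;
}

// Quantile bin edges: each bin receives roughly the same number of rows.
std::vector<double> quantileEdges(const std::vector<double> & sorted, std::size_t maxBins)
{
    std::vector<double> edges;
    for (std::size_t b = 1; b < maxBins; ++b) edges.push_back(sorted[b * sorted.size() / maxBins]);
    return edges;
}
```

With an outlier-heavy column such as `{0, 1, 2, 3, 100}` and two bins, the equal-width edge lands at 50 (leaving four rows in one bin), while the quantile edge lands at 2 (splitting the rows evenly), which is why the choice of strategy matters for histogram quality.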
append(_bins, nBins, newBinSize);
i += newBinSize;
}

// collect the remaining data rows in the final bin
Wouldn't it be better to distribute the residual data rows more uniformly across the bins?
Agreed, but it's what we have been doing all along. I did not change the logic, only added the comment.
Don't worry y'all. Vika, you were correct that it was bad, and it was the cause of a bug. The remainder is now distributed across the bins using Bresenham's algorithm, which should be relatively uniform (given the discrete nature). There is some interplay with replicated values that plays around with this, but the effect should be small with respect to the bin size after the remainder distribution (on the order of a single data point).
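The Bresenham-style remainder distribution mentioned above can be sketched as follows (a minimal standalone example, not the actual oneDAL code): the `nRows % nBins` leftover rows are spread across the bins with an error accumulator, so no bin receives more than one extra row and the extras are spaced evenly.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Distribute nRows rows over nBins bins: every bin gets the base size
// nRows / nBins, and the remainder is spread with a Bresenham-style error
// accumulator so bin sizes differ by at most one row.
std::vector<std::size_t> distributeRows(std::size_t nRows, std::size_t nBins)
{
    std::vector<std::size_t> sizes(nBins, nRows / nBins);
    const std::size_t remainder = nRows % nBins;
    std::size_t error = 0; // accumulates `remainder` per bin, overflows every nBins/remainder bins
    for (std::size_t b = 0; b < nBins; ++b)
    {
        error += remainder;
        if (error >= nBins)
        {
            error -= nBins;
            ++sizes[b]; // this bin absorbs one leftover row
        }
    }
    return sizes;
}
```

For example, 10 rows over 4 bins yields sizes `{2, 3, 2, 3}`: the two leftover rows land in alternating bins rather than all in the final bin.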
size_t nBins = 0;
size_t i = 0;
algorithmFPType binSize = (index[nRows - 1].key - index[0].key) / _prm.maxBins;
Could a division by zero happen here?
No, because we never create the binning task when maxBins=0 is selected. Nevertheless, I have added a DAAL_ASSERT for debugging in case this changes later.
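The guard discussed above can be sketched like this (a hypothetical helper with a plain `assert` standing in for `DAAL_ASSERT`; the real code computes the width inline as shown in the diff):

```cpp
#include <cassert>
#include <cstddef>

// Equal-width bin size over [minVal, maxVal]. The precondition documents the
// invariant from the review: the binning task is never created with maxBins == 0,
// so the division cannot be by zero. Debug builds assert this; DAAL_ASSERT plays
// the same role in the actual code base.
double computeBinSize(double minVal, double maxVal, std::size_t maxBins)
{
    assert(maxBins > 0 && "binning task must not be created with maxBins == 0");
    return (maxVal - minVal) / static_cast<double>(maxBins);
}
```

In release builds the assert compiles away, so the check documents the invariant without adding a branch on the hot path.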
@@ -1044,7 +1044,7 @@ services::Status ClassificationTrainBatchKernel<algorithmFPType, method, cpu>::c
 {
     if (!par.memorySavingMode)
     {
-        BinParams prm(par.maxBins, par.minBinSize);
+        BinParams prm(par.maxBins, par.minBinSize, par.binningStrategy);
         s = indexedFeatures.init<algorithmFPType, cpu>(*x, &featTypes, &prm);
It seems line #1078, `s = indexedFeatures.init<algorithmFPType, cpu>(*x, &featTypes);`, should be changed accordingly, i.e. `binningStrategy` needs to be added there as well.
cpp/daal/src/algorithms/dtrees/gbt/classification/gbt_classification_train_dense_default_impl.i
cpp/daal/src/algorithms/dtrees/gbt/regression/gbt_regression_train_dense_default_impl.i
...src/algorithms/dtrees/gbt/regression/oneapi/gbt_regression_train_dense_default_oneapi_impl.i
/intelci: run
@Mergifyio rebase
✅ Branch has been successfully rebased
/intelci: run
Improve the API that allows feature discretization for RF training.
The quantile indexing that is currently hardcoded and tough to disable should be
Kicked off by uxlfoundation/scikit-learn-intelex#1090, this draft already improves the MSE on the reported dataset. In a test run I find
I will run more tests as soon as the API is cleaned up.