
Provide quantiles and average binning for RF histograms #2309

Conversation

Contributor
@ahuber21 ahuber21 commented Apr 5, 2023

Improve the API that allows feature discretization for RF training.

The quantile indexing that is currently hardcoded and tough to disable should be

  • easier to disable / configure (maxBins, binWidth, memorySavingMode, bUseIndexedFeatures, etc.)
  • performed in a more transparent fashion
  • extended with average binning

Kicked off by uxlfoundation/scikit-learn-intelex#1090, this draft already improves the MSE on the reported dataset. In a test run I find:

Intel® extension for Scikit-learn Mean Squared Error using quantiles binning: 64401.01440419433  *current release*
Intel® extension for Scikit-learn Mean Squared Error using averages binning: 21383.80560598015  *new option*
stock-learn Mean Squared Error: 20962.250402202062

I will run more tests as soon as the API is cleaned up.

@ahuber21 ahuber21 marked this pull request as draft April 5, 2023 15:38
Vika-F previously requested changes Apr 11, 2023
Contributor
@Vika-F left a comment


The main concerns are:

  1. How is this mapped to the DPC++ part of the algorithm? Are similar changes planned to be propagated there?
  2. Test coverage needs to be extended to cover both binning strategies: quantiles and averages.

Please also see the other comments from this review.

append(_bins, nBins, newBinSize);
i += newBinSize;
}

// collect the remaining data rows in the final bin
Contributor

Wouldn't it be better to distribute the residual data rows more uniformly across the bins?

Contributor Author
@ahuber21 May 15, 2023

Agreed, but it's what we have been doing all along. I did not change the logic, only added the comment.

Contributor
@icfaust May 19, 2023

Don't worry, y'all. Vika, you were correct that it was bad, and it was the cause of a bug. The remainder is now distributed across the bins using Bresenham's algorithm, which should be relatively uniform (given the discrete nature). There is some interplay with replicated values that perturbs this somewhat, but the effect should be small with respect to the bin size after the remainder distribution (on the order of a single data point).


size_t nBins = 0;
size_t i = 0;
algorithmFPType binSize = (index[nRows - 1].key - index[0].key) / _prm.maxBins;
Contributor

Could a division by zero happen here?

Contributor Author

No, because we never create the binning task when maxBins = 0 is selected. Nevertheless, I have added a DAAL_ASSERT so the assumption is checked in debug builds in case this changes later.

@@ -1044,7 +1044,7 @@ services::Status ClassificationTrainBatchKernel<algorithmFPType, method, cpu>::c
{
if (!par.memorySavingMode)
{
BinParams prm(par.maxBins, par.minBinSize);
BinParams prm(par.maxBins, par.minBinSize, par.binningStrategy);
s = indexedFeatures.init<algorithmFPType, cpu>(*x, &featTypes, &prm);
Contributor

It seems line #1078, s = indexedFeatures.init<algorithmFPType, cpu>(*x, &featTypes);, should be changed accordingly, i.e. binningStrategy needs to be passed there as well.

@ahuber21 ahuber21 force-pushed the dev/ahuber/rf-feature-binning-optimization branch from e14f975 to 67c591a on May 15, 2023 16:03
@ahuber21 ahuber21 marked this pull request as ready for review May 15, 2023 16:03
@ahuber21 ahuber21 force-pushed the dev/ahuber/rf-feature-binning-optimization branch from 67c591a to 377250c on May 15, 2023 16:10
@ahuber21 ahuber21 changed the title feat: (draft) provide quantiles and average binning for RF histograms Provide quantiles and average binning for RF histograms May 15, 2023
@ahuber21 ahuber21 force-pushed the dev/ahuber/rf-feature-binning-optimization branch from 377250c to f2a4c41 on May 15, 2023 16:56
@ahuber21
Contributor Author

/intelci: run

@Alexsandruss
Contributor

@Mergifyio rebase

@mergify
Contributor

mergify bot commented May 26, 2023

rebase

✅ Branch has been successfully rebased

@Alexsandruss Alexsandruss force-pushed the dev/ahuber/rf-feature-binning-optimization branch from f2a4c41 to abf268f on May 26, 2023 20:43
@Alexsandruss
Contributor

/intelci: run

1 similar comment
@Alexsandruss
Contributor

/intelci: run

@ahuber21 ahuber21 force-pushed the dev/ahuber/rf-feature-binning-optimization branch from ac25b0e to 201cb31 on May 30, 2023 19:09
@Alexsandruss Alexsandruss merged commit 0d46ed1 into uxlfoundation:master May 31, 2023
@ahuber21 ahuber21 deleted the dev/ahuber/rf-feature-binning-optimization branch May 31, 2023 10:44
5 participants