
Provide quantiles and average binning for RF histograms #2309

Conversation

Contributor
@ahuber21 ahuber21 commented Apr 5, 2023

Improve the API that allows feature discretization for RF training.

The quantile indexing that is currently hardcoded and tough to disable should be

  • easier to disable / configure (maxBins, binWidth, memorySavingMode, bUseIndexedFeatures, etc.)
  • performed in a more transparent fashion
  • extended with average binning

Kicked off by uxlfoundation/scikit-learn-intelex#1090, this draft already improves the MSE on the reported dataset. In a test run I find:

Intel® extension for Scikit-learn Mean Squared Error using quantiles binning: 64401.01440419433  *current release*
Intel® extension for Scikit-learn Mean Squared Error using averages binning: 21383.80560598015  *new option*
stock-learn Mean Squared Error: 20962.250402202062

I will run more tests as soon as the API is cleaned up.

@ahuber21 ahuber21 marked this pull request as draft April 5, 2023 15:38
Vika-F previously requested changes Apr 11, 2023
Contributor
@Vika-F left a comment


The main concerns are:

  1. How is this mapped to the DPC++ part of the algorithm? Are similar changes planned to be propagated there?
  2. Test coverage needs to be extended to cover both binning strategies: quantiles and averages.

Please also see the other comments from this review.

append(_bins, nBins, newBinSize);
i += newBinSize;
}

// collect the remaining data rows in the final bin
Contributor

Wouldn't it be better to distribute the residual data rows more uniformly across the bins?

Contributor Author
@ahuber21 May 15, 2023

Agreed, but it's what we have been doing all along. I did not change the logic, only added the comment.

Contributor
@icfaust May 19, 2023

Don't worry, y'all. Vika, you were correct that it was bad, and it was the cause of a bug. The remainder is now distributed across the bins using Bresenham's algorithm, which should be relatively uniform (given the discrete nature). There is some interplay with replicated values that perturbs this somewhat, but the effect should be small with respect to the bin size after the remainder distribution (on the order of a single data point).


size_t nBins = 0;
size_t i = 0;
algorithmFPType binSize = (index[nRows - 1].key - index[0].key) / _prm.maxBins;
Contributor

Could a division by zero happen here?

Contributor Author

No, because we never create the binning task when maxBins = 0 is selected. Nevertheless, I have added a DAAL_ASSERT so the assumption is checked in debug builds in case this changes later.

@@ -1044,7 +1044,7 @@ services::Status ClassificationTrainBatchKernel<algorithmFPType, method, cpu>::c
{
if (!par.memorySavingMode)
{
BinParams prm(par.maxBins, par.minBinSize);
BinParams prm(par.maxBins, par.minBinSize, par.binningStrategy);
s = indexedFeatures.init<algorithmFPType, cpu>(*x, &featTypes, &prm);
Contributor

It seems line #1078, s = indexedFeatures.init<algorithmFPType, cpu>(*x, &featTypes);, should be changed accordingly, i.e. binningStrategy needs to be passed there as well.

@ahuber21 ahuber21 force-pushed the dev/ahuber/rf-feature-binning-optimization branch from e14f975 to 67c591a on May 15, 2023 16:03
@ahuber21 ahuber21 marked this pull request as ready for review May 15, 2023 16:03
@ahuber21 ahuber21 force-pushed the dev/ahuber/rf-feature-binning-optimization branch from 67c591a to 377250c on May 15, 2023 16:10
@ahuber21 ahuber21 changed the title feat: (draft) provide quantiles and average binning for RF histograms Provide quantiles and average binning for RF histograms May 15, 2023
@ahuber21 ahuber21 force-pushed the dev/ahuber/rf-feature-binning-optimization branch from 377250c to f2a4c41 on May 15, 2023 16:56
@ahuber21
Contributor Author

/intelci: run

@Alexsandruss
Contributor

@Mergifyio rebase

@mergify
Contributor

mergify bot commented May 26, 2023

rebase

✅ Branch has been successfully rebased

@Alexsandruss Alexsandruss force-pushed the dev/ahuber/rf-feature-binning-optimization branch from f2a4c41 to abf268f on May 26, 2023 20:43
@Alexsandruss
Contributor

/intelci: run

1 similar comment
@Alexsandruss
Contributor

/intelci: run

@ahuber21 ahuber21 force-pushed the dev/ahuber/rf-feature-binning-optimization branch from ac25b0e to 201cb31 on May 30, 2023 19:09
@Alexsandruss Alexsandruss merged commit 0d46ed1 into uxlfoundation:master May 31, 2023
@ahuber21 ahuber21 deleted the dev/ahuber/rf-feature-binning-optimization branch May 31, 2023 10:44
5 participants