[SPARK-41008][MLLIB] Follow-up isotonic regression features deduplica… #38996
Conversation
…tion Update documentation and remove redundant sorting by labels
docs/mllib-isotonic-regression.md (outdated)

> the same feature then these tuples are aggregated into a single tuple as follows:
>
> * Aggregated label is the weighted average of all labels.
> * Aggregated feature is the weighted average of all equal features. It is possible
(Sorry if this double-posts)
Yeah, I was confused about this too. If feature values are pooled only when they are exactly equal, why average them? If anything, that introduces tiny errors.
Equality is not exact; it is defined by org.apache.commons.math3.util.Precision.equals(double x, double y), which is true whenever x and y are exactly equal or are adjacent doubles (i.e. no other representable double lies between them). So there has to be pooling of features too. The weighted average here just guarantees fairness; scikit-learn uses the first encountered (i.e. minimum) feature.
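For intuition, this kind of equality treats two doubles as equal when they are identical or immediately adjacent on the floating-point number line. A minimal Python sketch of that check (`ulp_equals` is a hypothetical name, not the commons-math or Spark API):

```python
import math

def ulp_equals(x: float, y: float) -> bool:
    # True when x and y are exactly equal or are adjacent doubles,
    # i.e. no other representable double lies strictly between them.
    # Mirrors the default (1-ulp) behavior of Precision.equals.
    return x == y or math.nextafter(x, y) == y

a = 1.0
b = math.nextafter(a, 2.0)  # the smallest double greater than 1.0
c = math.nextafter(b, 2.0)  # two ulps above 1.0
```

Here `ulp_equals(a, b)` holds while `ulp_equals(a, c)` does not, since one representable double (`b`) lies between `a` and `c`.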
The definition of equality in scikit-learn is different too:
```python
eps = np.finfo(np.floating).resolution  # 1e-15
if x - current_x >= eps:
    ...  # different
else:
    ...  # equal
```

We can use the minimum for consistency, to avoid accumulative errors, and for better performance too. Do you agree?
What do you suggest regarding definition of equality?
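To make the contrast concrete, here is a runnable sketch (hypothetical helper name, not the actual scikit-learn code) of that eps-based grouping over sorted feature values: a new group starts whenever a value differs from the current group's first value by at least the floating-point resolution.

```python
import numpy as np

def group_close_features(sorted_x):
    # Sketch of scikit-learn-style tie grouping (hypothetical helper):
    # a new group starts whenever a value differs from the current
    # group's first value by at least eps; smaller gaps are pooled.
    eps = np.finfo(np.float64).resolution  # 1e-15
    groups = [[sorted_x[0]]]
    for x in sorted_x[1:]:
        if x - groups[-1][0] >= eps:
            groups.append([x])    # different
        else:
            groups[-1].append(x)  # considered equal
    return groups
```

Under this rule even non-identical values can be pooled, whereas exact equality (the approach this PR settles on) only pools bitwise-equal duplicates.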
Oh OK, I see it. But why bother? Just deal with duplicates that are exactly equal and skip this. Is it to match scikit-learn? If values differ even by a small amount, they are already distinct and ordered.
It's just the default instinct when comparing doubles. On the other hand, other parts of the implementation don't bother: partitioning by features and combining labels into isotonic buckets both use exact comparison. So why not do the same here as well; exact equality sounds like a fit.
Now it is exact equality, like the other parts of the implementation that don't handle approximation errors; it should be the user's responsibility to handle such cases before fitting the model.
…tonic regression For consistency with scikit-learn, to avoid accumulative errors, and for better performance compared to the weighted average.
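As a sketch of the resulting behavior (hypothetical Python, not the actual Spark implementation): points whose features are exactly equal are merged, the merged label is the weighted average of their labels, and their weights are summed.

```python
from collections import defaultdict

def aggregate_duplicates(points):
    # points: iterable of (label, feature, weight) tuples.
    # Merge points with exactly equal features: the merged label is the
    # weighted average of the labels, the merged weight is the sum.
    # Hypothetical sketch of the post-change behavior.
    acc = defaultdict(lambda: [0.0, 0.0])  # feature -> [sum(label*w), sum(w)]
    for label, feature, weight in points:
        acc[feature][0] += label * weight
        acc[feature][1] += weight
    return sorted(
        ((lw / w, f, w) for f, (lw, w) in acc.items()),
        key=lambda t: t[1],  # sort by feature only
    )
```

For example, two points `(1.0, 0.5, 1.0)` and `(3.0, 0.5, 1.0)` collapse into `(2.0, 0.5, 2.0)`: the feature value itself needs no averaging because the duplicates are bitwise equal.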
Can one of the admins verify this patch?
…re de-duplication It is the user's responsibility to handle double-value representation errors.
srowen left a comment:
OK, seems plausible. Do the tests diverge any further from sklearn or from the current code after this change? I see there are new tests; I'm just trying to assess whether this will introduce other behavior changes.
Just modified the tests I added in my previous PR. In one, I used the same values as in the ticket; in the other, I changed it to test exact equality instead of approximate equality.
SGTM. Merged to master.
Closes apache#38996 from ahmed-mahran/ml-isotonic-reg-dups-follow-up.
Authored-by: Ahmed Mahran <ahmed.mahran@mashin.io>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
A follow-up on #38966 to update the relevant documentation and remove the redundant sort key.
### Why are the changes needed?
For isotonic regression, another method for breaking ties of repeated features was introduced in #38966: points having the same feature value are aggregated by computing the weighted average of their labels.
- This only requires points to be sorted by features instead of by features and labels, so label should be removed as a secondary sorting key.
- The isotonic regression documentation needs to be updated to reflect the new behavior.
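Since duplicates are aggregated per feature value before the isotonic fit, the pre-processing sort only needs the feature as its key. A hypothetical one-line sketch (not the actual Spark code, which operates on RDDs):

```python
def sort_points(points):
    # points: (label, feature, weight) tuples. With per-feature
    # aggregation, labels no longer need to be a secondary sort key,
    # so sorting by feature alone is sufficient (hypothetical helper).
    return sorted(points, key=lambda p: p[1])
```

Dropping the secondary key avoids an unnecessary comparison per element during the sort.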
### Does this PR introduce any user-facing change?
Yes, an isotonic regression documentation update. The documentation described the behavior of the algorithm when the input contains points with repeated features. Since this behavior has changed, the documentation now describes the new behavior.
### How was this patch tested?
Existing tests passed. No new tests are needed since the existing tests are already comprehensive.