[SPARK-41008][MLLIB] Follow-up isotonic regression features deduplica… #38996
Conversation
…tion Update documentation and remove redundant sorting by labels
docs/mllib-isotonic-regression.md (outdated)

> the same feature then these tuples are aggregated into a single tuple as follows:
>
> * Aggregated label is the weighted average of all labels.
> * Aggregated feature is the weighted average of all equal features. It is possible
(Sorry if this double-posts)
Yeah, I was confused about this too. If feature values are pooled only when they are exactly equal, why average them? If anything, that introduces tiny errors.
Equality is not exact; it is defined by org.apache.commons.math3.util.Precision.equals(double x, double y), which is true whenever x and y are exactly equal or are adjacent doubles (i.e. no other representable double lies between them). So there has to be pooling of features too. The weighted average here just guarantees fairness; scikit-learn uses the first encountered (i.e. minimum) feature.
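For intuition, this kind of equality treats two doubles as equal when they are identical or immediately adjacent on the floating-point number line. A minimal Python sketch of that check (`ulp_equals` is a hypothetical name, not the commons-math or Spark API):

```python
import math

def ulp_equals(x: float, y: float) -> bool:
    # True when x and y are exactly equal or are adjacent doubles,
    # i.e. no other representable double lies strictly between them.
    # Mirrors the default (1-ulp) behavior of Precision.equals.
    return x == y or math.nextafter(x, y) == y

a = 1.0
b = math.nextafter(a, 2.0)  # the smallest double greater than 1.0
c = math.nextafter(b, 2.0)  # two ulps above 1.0
```

Here `ulp_equals(a, b)` holds while `ulp_equals(a, c)` does not, since one representable double (`b`) lies between `a` and `c`.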
The definition of equality in scikit-learn is different too:
```python
eps = np.finfo(np.floating).resolution  # 1e-15
if x - current_x >= eps:
    ...  # different
else:
    ...  # equal
```

We can use the minimum for consistency, to avoid accumulative errors, and for better performance too. Do you agree?
What do you suggest regarding definition of equality?
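To make the contrast concrete, here is a runnable sketch (hypothetical helper name, not the actual scikit-learn code) of that eps-based grouping over sorted feature values: a new group starts whenever a value differs from the current group's first value by at least the floating-point resolution.

```python
import numpy as np

def group_close_features(sorted_x):
    # Sketch of scikit-learn-style tie grouping (hypothetical helper):
    # a new group starts whenever a value differs from the current
    # group's first value by at least eps; smaller gaps are pooled.
    eps = np.finfo(np.float64).resolution  # 1e-15
    groups = [[sorted_x[0]]]
    for x in sorted_x[1:]:
        if x - groups[-1][0] >= eps:
            groups.append([x])    # different
        else:
            groups[-1].append(x)  # considered equal
    return groups
```

Under this rule even non-identical values can be pooled, whereas exact equality (the approach this PR settles on) only pools bitwise-equal duplicates.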
Oh OK, I see it. But why bother? Just deal with duplicates that are exactly equal and skip this. Is it to match scikit-learn? If values differ even by a small amount, they are already distinct and ordered.
It's just the default instinct when comparing doubles. On the other hand, other parts of the implementation don't bother: partitioning by features and combining labels into isotonic buckets both use exact comparison. So why not do the same here as well; exact equality sounds like a fit.
Now it is exact equality, like the other parts of the implementation that don't handle approximation errors; it should be the user's responsibility to handle such cases before fitting the model.
…tonic regression For consistency with scikit-learn, to avoid accumulative errors, and for better performance compared to the weighted average.
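As a sketch of the resulting behavior (hypothetical Python, not the actual Spark implementation): points whose features are exactly equal are merged, the merged label is the weighted average of their labels, and their weights are summed.

```python
from collections import defaultdict

def aggregate_duplicates(points):
    # points: iterable of (label, feature, weight) tuples.
    # Merge points with exactly equal features: the merged label is the
    # weighted average of the labels, the merged weight is the sum.
    # Hypothetical sketch of the post-change behavior.
    acc = defaultdict(lambda: [0.0, 0.0])  # feature -> [sum(label*w), sum(w)]
    for label, feature, weight in points:
        acc[feature][0] += label * weight
        acc[feature][1] += weight
    return sorted(
        ((lw / w, f, w) for f, (lw, w) in acc.items()),
        key=lambda t: t[1],  # sort by feature only
    )
```

For example, two points `(1.0, 0.5, 1.0)` and `(3.0, 0.5, 1.0)` collapse into `(2.0, 0.5, 2.0)`: the feature value itself needs no averaging because the duplicates are bitwise equal.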
Can one of the admins verify this patch?
…re de-duplication It is the user's responsibility to handle double-value representation errors.
srowen left a comment:
OK, seems plausible. Do the tests diverge any further from sklearn or from the current code after this change? I see there are new tests; I'm just trying to assess whether this will introduce other behavior changes.
Just modified the tests I added in my previous PR. In one, I used the same values as in the ticket; in the other, I changed it to test exact equality instead of approximate equality.
SGTM. Merged to master.
Closes apache#38996 from ahmed-mahran/ml-isotonic-reg-dups-follow-up.
Authored-by: Ahmed Mahran <ahmed.mahran@mashin.io>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
A follow-up on #38966 to update the relevant documentation and remove the redundant sort key.
### Why are the changes needed?
For isotonic regression, another method for breaking ties of repeated features was introduced in #38966: points having the same feature value are aggregated by computing the weighted average of their labels.
- This only requires points to be sorted by features instead of by features and labels, so label should be removed as a secondary sorting key.
- The isotonic regression documentation needs to be updated to reflect the new behavior.
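Since duplicates are aggregated per feature value before the isotonic fit, the pre-processing sort only needs the feature as its key. A hypothetical one-line sketch (not the actual Spark code, which operates on RDDs):

```python
def sort_points(points):
    # points: (label, feature, weight) tuples. With per-feature
    # aggregation, labels no longer need to be a secondary sort key,
    # so sorting by feature alone is sufficient (hypothetical helper).
    return sorted(points, key=lambda p: p[1])
```

Dropping the secondary key avoids an unnecessary comparison per element during the sort.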
### Does this PR introduce any user-facing change?
Yes, an isotonic regression documentation update. The documentation described the behavior of the algorithm when the input contains points with repeated features. Since this behavior has changed, the documentation now describes the new behavior.
### How was this patch tested?
Existing tests passed. No new tests are needed since the existing tests are already comprehensive.