
Conversation

@ahmed-mahran (Contributor)

What changes were proposed in this pull request?

A follow-up on #38966 to update relevant documentation and remove redundant sort key.

Why are the changes needed?

For isotonic regression, another method for breaking ties of repeated features was introduced in #38966. This will aggregate points having the same feature value by computing the weighted average of the labels.

  • This only requires points to be sorted by features instead of features and labels. So, we should remove label as a secondary sorting key.
  • Isotonic regression documentation needs to be updated to reflect the new behavior.
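
The label-averaging step in the first bullet can be sketched as follows (a hypothetical plain-Python illustration; `aggregate_ties` and the `(label, feature, weight)` tuple layout are invented for this sketch, not MLlib's actual representation):

```python
# Hypothetical sketch of the tie-breaking step described above; this is
# plain Python for illustration, not the actual Spark MLlib code.
from itertools import groupby

def aggregate_ties(points):
    """points: list of (label, feature, weight) tuples, pre-sorted by feature."""
    aggregated = []
    for feature, group in groupby(points, key=lambda p: p[1]):
        group = list(group)
        total_weight = sum(w for _, _, w in group)
        avg_label = sum(label * w for label, _, w in group) / total_weight
        aggregated.append((avg_label, feature, total_weight))
    return aggregated

# Two points share feature 0.5; their labels 1.0 and 3.0 average to 2.0.
print(aggregate_ties([(1.0, 0.5, 1.0), (3.0, 0.5, 1.0), (2.0, 0.7, 2.0)]))
```

Note that this only relies on the input being sorted by feature, which is why the secondary sort by label becomes redundant.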

Does this PR introduce any user-facing change?

Isotonic regression documentation update. The documentation described the behavior of the algorithm when there are points in the input with repeated features. Since this behavior has changed, documentation needs to describe the new behavior.

How was this patch tested?

Existing tests passed. No need to add new tests since existing tests are already comprehensive.

@srowen

…tion

Update documentation and remove redundant sorting by labels
Review comment on the documentation diff:

> the same feature then these tuples are aggregated into a single tuple as follows:
>
> * Aggregated label is the weighted average of all labels.
> * Aggregated feature is the weighted average of all equal features. It is possible
@srowen (Member):
(Sorry if this double-posts)
Yeah, I was confused about this too. If feature values are pooled when exactly equal, why average them? If anything, that introduces tiny errors.

@ahmed-mahran (Contributor, Author):
Equality is not exact; it is defined by org.apache.commons.math3.util.Precision.equals(double x, double y), which is true whenever x and y are exactly equal or are adjacent doubles (i.e. no other representable doubles exist between them). So there has to be pooling of features too. The weighted average here just guarantees fairness; scikit-learn uses the first encountered (i.e. minimum) feature.
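
For illustration, that "adjacent doubles" notion of equality can be mimicked in Python with numpy.nextafter (a sketch assuming the single-ulp tolerance of the two-argument Precision.equals overload; almost_equal is an invented name, not an API of either library):

```python
import numpy as np

def almost_equal(x, y):
    """True when x == y or when y is an adjacent representable double of x,
    i.e. no other representable doubles exist between them."""
    return x == y or np.nextafter(x, y) == y

a = 1.0
b = np.nextafter(1.0, 2.0)  # the very next representable double after 1.0
print(almost_equal(a, b))          # adjacent doubles compare as equal
print(almost_equal(a, 1.0 + 1e-12))  # many ulps apart: not equal
```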

Definition of equality in scikit-learn is different too:

import numpy as np

eps = np.finfo(np.float64).resolution  # 1e-15
if x - current_x >= eps:
    pass  # different
else:
    pass  # equal

We can use the minimum for consistency with scikit-learn, to avoid accumulated errors, and for better performance too. Do you agree?

What do you suggest regarding the definition of equality?

@srowen (Member):

Oh OK, I see it. But why bother? Just deal with duplicates that are exactly equal and skip this. Is it to match scikit-learn? The values are already distinct and ordered if they differ even by a small amount.

@ahmed-mahran (Contributor, Author):

Approximate comparison is just the default instinct when it comes to comparing doubles. On the other hand, other parts of the implementation don't bother: neither partitioning by features nor combining labels into isotonic buckets does. So why not do the same here as well; exact equality sounds like a fit.

@ahmed-mahran (Contributor, Author):

Now it uses exact equality, like the other parts of the implementation that don't handle approximation errors, and it should be the user's responsibility to handle such cases before fitting the model.

…tonic regression

For consistency with scikit-learn, and also to avoid accumulated errors and for better performance, as opposed to the weighted average.
@AmplabJenkins

Can one of the admins verify this patch?

…re de-duplication

It is the user's responsibility to handle double-value representation errors
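
As a hypothetical example of that user-side responsibility, a user could normalize away representation noise (e.g. by rounding features) before fitting; the rounding precision and the tuple layout here are arbitrary choices for illustration:

```python
# Hypothetical pre-processing step: features that differ only by
# floating-point representation noise are rounded so that they become
# exactly equal and will be pooled by the de-duplication step.
data = [
    (1.0, 0.1 + 0.2, 1.0),  # feature evaluates to 0.30000000000000004
    (2.0, 0.3, 1.0),        # feature is exactly 0.3
]

cleaned = [(label, round(feature, 12), weight) for label, feature, weight in data]

# After rounding, both features are exactly 0.3 and will be de-duplicated.
print(cleaned)
```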
@srowen (Member) left a comment:

OK, seems plausible. Do the tests vary any more from sklearn or from the current code after this change? I see there are new tests; I'm just trying to assess whether this will introduce other behavior changes.

@ahmed-mahran (Contributor, Author):

I only modified tests I added in my previous PR. In one, I used the same values as in the ticket; in the other, I changed it to test exact equality instead of approximate equality.

@srowen (Member) commented Dec 11, 2022:

SGTM. Merged to master

@srowen srowen closed this in f92c827 Dec 11, 2022
beliefer pushed a commit to beliefer/spark that referenced this pull request Dec 18, 2022
Closes apache#38996 from ahmed-mahran/ml-isotonic-reg-dups-follow-up.

Authored-by: Ahmed Mahran <ahmed.mahran@mashin.io>
Signed-off-by: Sean Owen <srowen@gmail.com>