Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: RepeatingBasisFunction.inverse_transform #687

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

Alex-Cremers
Copy link

@Alex-Cremers Alex-Cremers commented Jul 11, 2024

Description

This PR implements get_feature_names_out() for _RepeatingBasisFunction, and RepeatingBasisFunction (which inherits feature names from the former).

This PR also implements an inverse_transform for _RepeatingBasisFunction in passing. I did not include a more general implementation for RepeatingBasisFunction (as the one I use requires importing pandas), but the inverse_transform() can be accessed from the fitted transformer via .pipeline_.named_transformers_['repeatingbasis'].inverse_transform()). It's a rare use case, but it shouldn't affect other uses in any way, so I figured I'd include it. Note that the transformation is only invertible if the original values are in the input_range (upper bound excluded). Otherwise, the reconstructed values are only equal modulo the width of the range.

Two tests have been added to test_repeatingbasisfunction.py:

  • test that set_output(transform='pandas') works properly with RepeatingBasisFunction
  • test that the new inverse_transform() for _RepeatingBasisFunction truly recovers the original values (as long as they fall within the input_range).

Note that for feature names, ClassNamePrefixFeaturesOutMixin could also be used if self.n_periods was renamed to self._n_features_out, but I didn't see a simple way of keeping the original column name as prefix, so I adopted a solution perhaps more idiosyncratic.

It's a minor PR overall, so I didn't ping before submitting. I hope it's okay.

Progress on Issue #543

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Checklist:

  • My code follows the style guidelines (ruff)
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation (also to the readme.md)
  • I have added tests that prove my fix is effective or that my feature works
  • I have added tests to check whether the new feature adheres to the sklearn convention
  • New and existing unit tests pass locally with my changes

If you feel your PR is ready for a review, ping @FBruzzesi or @koaning.

…verse_transform() in the underlying _RepeatingBasisFunction, which can be accessed through self.pipeline_.named_transformers_['repeatingbasis'].inverse_transform() is needed (rare use case for sure).
…tting output format to pandas work, a second to make sure that the inverse_transform() truly returns the original column (as long as original are within the input_range).
@Alex-Cremers
Copy link
Author

Requesting review and approval for the tests, @koaning

@koaning
Copy link
Owner

koaning commented Jul 11, 2024

I am open to the inverse transform, but I think the implementation may change in the future in favor our scikit-learn pipelines. This transformer was added before the SplineTransformer was added to core scikit-learn and I might recommend folks to use that going forward. The extrapolation="periodic" setting basically covers the use-case here. We might just refer to that under the hood.

@FBruzzesi figured I'd check with you, do you have any comments/opinions on this one? I am open to adding it, just wondering what might be most pragmatic given the sklearn estimator that now exists.

@Alex-Cremers
Copy link
Author

Thanks, I had missed the possibility to set SplineTransformer to periodic! The result is pretty similar indeed.

Now I see that sklearn also has an open issue regarding inverse transform on Spline transformer: scikit-learn/scikit-learn#28551

@FBruzzesi
Copy link
Collaborator

Hey @Alex-Cremers , thanks for the PR, it is much appreciated!

I can take a close look during the weekend, from a sneakpeak I would say that some linting is required. Every other technical comment will follow in the weekend 😂

@koaning

just wondering what might be most pragmatic given the sklearn estimator that now exists.

I am generally in favor to keep what we have bug free but not expanding (and maintaining) new features if they are already in scikit-learn - which seems not to be the case for inverse_transform , therefore I am open to have it implemented here

@FBruzzesi FBruzzesi added the enhancement New feature or request label Jul 13, 2024
@FBruzzesi FBruzzesi changed the title Work on rbf feat: RepeatingBasisFunction.inverse_transform Jul 13, 2024
Copy link
Collaborator

@FBruzzesi FBruzzesi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again for the PR. I think it is a nice feature to have in here!

As first comment, please run the following to format and lint the code:

python -m pip install ruff 
make lint

I think I am missing something on why RepeatingBasisFunction re-implements some logic in the inverse_transform and does not just call

self.pipeline_.named_transformers_['repeatingbasis'].inverse_transform(X)

I feel like we are separating the logic while there is no real need for it - in fact they return different shape objects and I would not be expecting that.

Comment on lines +126 to +128
def inverse_transform(self, X):
"""Transform RBF features back to the input range. Outputs a numpy array.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add what you mention in the description in the docstring Notes of the method?

In particular I am referring to:

Note that the transformation is only invertible if the original values are in the input_range (upper bound excluded). Otherwise, the reconstructed values are only equal modulo the width of the range.

check_is_fitted(self, ["pipeline_"])

if isinstance(X,np.ndarray):
Xarr = check_array(X[:,:self.n_periods], estimator=self, ensure_2d=True)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would be more comfortable with something along the following lines:

If line 144 becomes

Xarr = check_array(X, estimator=self, ensure_2d=True, ensure_min_features=self.n_periods)[:, :self.n_periods]

which also convert to array (some) dataframe-like objects and maybe it's even possible to avoid checking for array instance.

Z = tf.fit(X, y).transform(X)
assert np.allclose(
X["a"],
tf.pipeline_.named_transformers_['repeatingbasis'].inverse_transform(Z),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels a bit too nested to me.

Why don't we

Suggested change
tf.pipeline_.named_transformers_['repeatingbasis'].inverse_transform(Z),
tf.inverse_transform(Z),

?

They should suppose to return the same or am I missing something?

if isinstance(X,np.ndarray):
Xarr = check_array(X[:,:self.n_periods], estimator=self, ensure_2d=True)
new_x = self.pipeline_.named_transformers_['repeatingbasis'].inverse_transform(Xarr)
Xarr = np.hstack((new_x.reshape(-1, 1),X[:,self.n_periods:]))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we returning the input as well? kind of related to the comment in tests

@@ -208,7 +243,27 @@ def transform(self, X):

# apply rbf function to series for each basis
return self._rbf(base_distances)


def inverse_transform(self, X):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is awesome ✨

@Alex-Cremers
Copy link
Author

Alex-Cremers commented Jul 13, 2024

Thanks again for the PR. I think it is a nice feature to have in here!

As first comment, please run the following to format and lint the code:

python -m pip install ruff 
make lint

I think I am missing something on why RepeatingBasisFunction re-implements some logic in the inverse_transform and does not just call

self.pipeline_.named_transformers_['repeatingbasis'].inverse_transform(X)

I feel like we are separating the logic while there is no real need for it - in fact they return different shape objects and I would not be expecting that.

Thanks for the thorough review!

Sorry for the lint, I only ran "ruff check" and it passed, so I didn't dig deeper!

Regarding the inverse_transform: I actually meant to remove it from RepeatingBasisFunction before submitting, precisely because it was a big confusing, but it turns out I only removed part of it. There are two reasons why it's more complicated than _RepeatingBasisFunction:

  1. I wanted it to deal with both pandas df and numpy arrays as possible inputs, and if a pandas, I wanted to retrieve the original column name (see this sklearn issue: Add inverse-transform to _set_output scikit-learn/scikit-learn#27891). I then removed the pandas part of the code to avoid adding an extra dependency, but it looks like I left the numpy part behind in the end.
  2. Unlike _RepeatingBasisFunction which only deals with one column input / n_periods column outputs, RepeatingBasisFunction deals with a whole df/array, selects one of its columns as input, and might pass the other columns as well if remainder='passthrough'. Because of the passthrough possibility, the inverse transform should check for the presence of other columns and pass them back. In hindsight, I actually think that this is a bad idea with numpy arrays, because the order is lost (the transform/inverse_transform composition moves the target column first). With pandas it's fine, because the columns are named. EDIT: actually, we could use the self.column to move the reconstructed column back to the right position in a numpy array.

As far as I can see, the best would probably be to remove the inverse_transform from RepeatingBasisFunction entirely for now and wait for scikit-learn/scikit-learn#11463 (inverse_transform for ColumnTransformer) and the other issue mentioned above to be solved, as there would then be a proper way to implement the inverse transform without relying on funny tricks. If a user really needs the inverse transform in the meantime, they can recover the function as I did in the test.

If you agree with this, I'll remove RepeatingBasisFunction.inverse_transform() and address the remaining comments.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants