Fix dataframe indexing for `sknnr` estimators #26

aazuspan · 2024-06-24T22:01:26Z

This closes #25 and HOPEFULLY solves feature name warnings for the last time.

The previous fix for feature name warnings (#23) was to prevent fitting wrapped estimators with dataframe data, but this had an unintended side effect of breaking any functionality that depended on fitting with dataframes, like returning dataframe indexes in sknnr estimators.

I removed the array conversion to restore dataframe indexing, and added a function `suppress_feature_name_warnings` that suppresses missing feature name warnings in wrapped functions. By wrapping the functions applied by `apply_gufunc` (e.g. `predict` and `kneighbors`), we can suppress the warning when it arises at compute time, rather than during the eager call to `apply_gufunc`. We'll need to use that function any time that sklearn runs feature name checks on chunked arrays (e.g. `transform` in #16), so it may be worth abstracting some of that repeated code out eventually.
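For readers following along, here is a minimal sketch of what a wrapper like `suppress_feature_name_warnings` could look like. It is not the code from this PR: the implementation and the matched message prefix are assumptions based on the wording of scikit-learn's feature name warning, and `fake_predict` is a hypothetical stand-in for an estimator method that emits it.

```python
import functools
import warnings

# Assumed prefix of the scikit-learn warning raised when an estimator that was
# fit with feature names receives an unnamed array.
_FEATURE_NAME_MSG = "X does not have valid feature names"


def suppress_feature_name_warnings(func):
    """Return a version of func that silences the feature name warning.

    The filter is installed inside the wrapper, so it is active whenever the
    wrapped function actually runs (e.g. at compute time on a chunk).
    """

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        with warnings.catch_warnings():
            warnings.filterwarnings(
                "ignore", message=_FEATURE_NAME_MSG, category=UserWarning
            )
            return func(*args, **kwargs)

    return wrapper


def fake_predict(X):
    """Hypothetical stand-in for a wrapped estimator's predict method."""
    warnings.warn(
        "X does not have valid feature names, but FakeEstimator was fitted "
        "with feature names",
        UserWarning,
    )
    return [0 for _ in X]


quiet_predict = suppress_feature_name_warnings(fake_predict)
```

Because the filter lives inside the wrapper rather than around the call that builds the task graph, it still applies when a lazy backend runs the function later.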

I also shuffled the utils modules around to avoid some circular dependency issues. I didn't put a ton of thought into this, so there might be a better way to organize things.

grovduck

Great changes here! It also simplifies the code, since we no longer have to deal with the array with stripped names until the last possible moment. Very minor stuff from me - otherwise looks great!

> I also shuffled the utils modules around to avoid some circular dependency issues. I didn't put a ton of thought into this, so there might be a better way to organize things.

Your organization of the utility functions makes sense to me.

src/sknnr_spatial/utils/estimator.py

grovduck · 2024-06-25T21:09:13Z

src/sknnr_spatial/image/_dask_backed.py

+ # If the wrapped estimator was fit with a dataframe, it will warn about
+ # missing feature names because this passes unnamed arrays. Suppress that
+ # and let the wrapper handle feature name checks.
+ suppress_feature_name_warnings(estimator._wrapped.predict),


The placement of the comment here and on lines 92-95 caught me a bit off guard. One, because it's the identical comment and presumably it might be repeated for other functions (e.g. `transform`). You mention this in the PR description. Two, because the comment itself is within the `apply_gufunc` function call, it might be read as if it applies to all arguments. Could there be a private static method in `DaskBackedWrapper` like:

    class DaskBackedWrapper(ImageWrapper[DaskBackedType]):
        @staticmethod
        def _suppressed_estimator_function(func):
            # If the wrapped estimator was fit with a dataframe, it will warn about
            # missing feature names because this passes unnamed arrays. Suppress that
            # and let the wrapper handle feature name checks.
            return suppress_feature_name_warnings(func)

        def predict(self, ...):
            ...
            y_pred = da.apply_gufunc(
                self._suppressed_estimator_function(estimator._wrapped.predict),
                ...
            )

You were suggesting abstracting out the repeated code, so perhaps you're already ahead of me on this one.

Agreed, it felt weird to put the same comment in both places! I like the static method suggestion, but saw an opportunity to reduce a little more duplication by refactoring out the shared arguments between the `apply_gufunc` calls into a private `_apply_gufunc` method, which also allowed for only suppressing warnings in one spot.

Let me know what you think of that choice. Also, maybe there's a more specific name than `_apply_gufunc`? I thought about `_apply_estimator_gufunc` or `_apply_sklearn_gufunc` or something similar, but I guess there's no reason this wouldn't work with other functions.
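As a rough illustration of the refactor described above (not the merged code), the sketch below routes both `predict` and `kneighbors` through a single private `_apply_gufunc` method, so the warning suppression and shared arguments live in one place. Everything here is hypothetical: `WrapperSketch`, `NoisyEstimator`, the signature strings, the shared keyword arguments, and `eager_apply_gufunc`, which stands in for `dask.array.apply_gufunc` so the example runs without dask.

```python
import functools
import warnings


def suppress_feature_name_warnings(func):
    """Hypothetical stand-in for the helper named in this PR."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        with warnings.catch_warnings():
            warnings.filterwarnings(
                "ignore",
                message="X does not have valid feature names",
                category=UserWarning,
            )
            return func(*args, **kwargs)

    return wrapper


def eager_apply_gufunc(func, signature, *args, **kwargs):
    """Stand-in for dask.array.apply_gufunc that calls func immediately."""
    return func(*args)


class WrapperSketch:
    # Hypothetical shared keyword arguments; the real ones are not shown here.
    _shared_kwargs = {"allow_rechunk": True}

    def __init__(self, wrapped, apply_gufunc=eager_apply_gufunc):
        self._wrapped = wrapped
        self._apply = apply_gufunc

    def _apply_gufunc(self, func, signature, *args, **kwargs):
        # Single place where warnings are suppressed and shared args are merged.
        merged = {**self._shared_kwargs, **kwargs}
        return self._apply(
            suppress_feature_name_warnings(func), signature, *args, **merged
        )

    def predict(self, X):
        return self._apply_gufunc(self._wrapped.predict, "(x)->(y)", X)

    def kneighbors(self, X):
        return self._apply_gufunc(self._wrapped.kneighbors, "(x)->(k)", X)


class NoisyEstimator:
    """Hypothetical estimator whose methods emit the feature name warning."""

    def _warn(self):
        warnings.warn(
            "X does not have valid feature names, but NoisyEstimator was "
            "fitted with feature names",
            UserWarning,
        )

    def predict(self, X):
        self._warn()
        return [0 for _ in X]

    def kneighbors(self, X):
        self._warn()
        return [[0] for _ in X]
```

With this shape, adding another method such as `transform` would only need one more call to `_apply_gufunc`, with no repeated comment or suppression logic.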

> Agreed, it felt weird to put the same comment in both places! I like the static method suggestion, but saw an opportunity to reduce a little more duplication by refactoring out the shared arguments between the `apply_gufunc` calls into a private `_apply_gufunc` method, which also allowed for only suppressing warnings in one spot.

As always, yours is the better solution 😉. This looks good to me and, as you say, takes out a bit more duplication.

> Also, maybe there's a more specific name than `_apply_gufunc`? I thought about `_apply_estimator_gufunc` or `_apply_sklearn_gufunc` or something similar, but I guess there's no reason this wouldn't work with other functions.

The current name seems good to me. It's clear that it's just an enhancement of `da.apply_gufunc` and should be able to be used with other functions as well. I like keeping it as a more generic name. Out of curiosity, I assume much of this code goes away if you move to `xarray.apply_ufunc` (in a future PR), so this seems like reasonable naming for the short term at least.

tests/test_sknnr.py

grovduck · 2024-06-26T18:38:37Z

From what I can tell, this all seems good to me. OK by me to merge.

aazuspan · 2024-06-26T19:05:36Z

Thanks @grovduck!

aazuspan added 2 commits June 24, 2024 14:23

Test returning df indices with kneighbors

869f61b

aazuspan added the bug Something isn't working label Jun 24, 2024

aazuspan requested a review from grovduck June 24, 2024 22:01

aazuspan self-assigned this Jun 24, 2024

grovduck reviewed Jun 25, 2024

aazuspan added 2 commits June 26, 2024 10:03

Check index of all first neighbors

61a7f59

Refactor out duplicated gufunc args

e4f5c54

aazuspan merged commit b28a91b into main Jun 26, 2024
5 checks passed

aazuspan deleted the fix-df-index branch June 26, 2024 19:05
