Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MNT Persist clean up and consistency checks #143

Conversation

BenjaminBossan
Copy link
Collaborator

During the work on adding strict typing (which will most likely not
work), a couple of opportunities for simplifications and clean ups came
up. In order for them not to get lost in the typing PR, they are now
added in this separate PR. Changes are:

  • A few simplifications to the state to avoid unnecessary nesting
  • Consistently raise UnsupportedTypeException
  • Simplify some code without without functional change
  • Consistently use _get_state and _get_instance
  • Enforce consistency on the argument names in get_state/instance
    functions

Regarding the use of _get_state vs get_state and _get_instance vs get_instance:
Since we basically only ever want to use the _get variant and not get (since the
latter are only placeholders for registration), I would suggest that we swap the
names. This will also be better for future contributions to 3rd party libraries,
since right now they would be required to use "private" functions.

Ready for review @skops-dev/maintainers

(PS: Sorry for spurious commits, I'll clean them up)

BenjaminBossan and others added 7 commits September 6, 2022 17:52
After checking a couple of runs, none that took longer than 12 min ever
finished. Therefore, let's cap the timout at 15 min for now.

Later, if/when we factor out the inference tests, we may decrease the
timeout even further.
During the work on adding strict typing (which will most likely not
work), a couple of opportunities for simplifications and clean ups came
up. In order for them not to get lost in the typing PR, they are now
added in this separate PR. Changes are:

- A few simplifications to the "state" to avoid unnecessary nesting
- Consistently raise UnsupportedTypeException
- Simplify some code without without functional change
- Consistently use _get_state and _get_instance
- Enforce consistency on the argument names in get_state/instance
  functions
Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall pretty nice!

Regarding the use of _get_state vs get_state and _get_instance vs get_instance:
Since we basically only ever want to use the _get variant and not get (since the
latter are only placeholders for registration), I would suggest that we swap the
names. This will also be better for future contributions to 3rd party libraries,
since right now they would be required to use "private" functions.

+1

(PS: Sorry for spurious commits, I'll clean them up)

We squash/merge, don't worry about the commits ;)

try:
# First, try to save object with np.save and allow_pickle=False, which
# should generally work as long as the dtype is not object.
with suppress(ValueError):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the suppress/return pattern makes this harder to understand. Any reason you don't want to stay with try/except? try/except would seem much more readable to me.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer early return to lengthy try...except blocks but I will change it back if you prefer the other way round.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with early returns usually, but this suppress/return is somewhat convoluted!

# convert them to a list and recursively call get_state on it. For this, we
# expect the dtype to be object.
if obj.dtype != object:
raise UnsupportedTypeException(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would be nice to test these. But I'm happy to leave tests for individual methods to a separate PR.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to write a test for it but it wasn't that simple. This is the easiest I could come up with:

class _MockDtype:
    metadata = None
    names = None
    subdtype = None
    str = "<f8"
    hasobject = False

_mock_dtype = _MockDtype()


class _CustomDtypeArray(np.ndarray):
    @property
    def dtype(self):
        return _mock_dtype

    @property
    def itemsize(self):
        # np.save checks this attribute, so (ab)use this to trigger a ValueError
        raise ValueError("Trigger ValueError")


class NumpyUnknownDtypeEstimator(BaseEstimator):
    def fit(self, X, y=None, **fit_params):
        self.x_ = _CustomDtypeArray([1, 2, 3])
        return self


def test_unknown_numpy_dtype_raises(tmp_path):
    est = NumpyUnknownDtypeEstimator().fit(None)
    f_name = tmp_path / "file.skops"
    with pytest.raises(UnsupportedTypeException):
        save_load_round(est, f_name)

I found it interesting that without the itemsize property, saving would actually just work! Loading would fail though.

But I think what this shows us is that it's almost impossible to trigger this code path as a user, even with custom dtypes (if that's even a thing), unless the user goes throw extraordinary lengths to provoke the error. So I don't think this situation will really come up in the wild.

This makes me think that we might have to check the dtype differently inside this function. Not sure if we want to support custom dtypes and hope for the best or if we should check against all existing dtypes and raise if we don't know this one.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sklearn has custom dtypes, in trees, but if that's not triggering this, then not many things probably would, and that means we can remove it?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the pointer. I checked and np.save and np.load seem to work fine on that custom dtype (I suppose you mean this dtype=[('left_child', '<i8'), ...] thing).

It means we could remove the if obj.dtype != object: check, given how we never trigger it. I just wonder if it isn't more cautious to leave it there to have an explicit error? We don't really know what would happen if we have a custom dtype that cannot be saved with np.save, maybe it's better to raise and get a user report? We could encourage reporting by extending the error message with "Please open an issue on...".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, extending the error message makes sense. I'm okay with this line not being tested.

skops/io/tests/test_persist.py Show resolved Hide resolved
- Use try...except instead of suppress
- Add comments to tests
Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think renaming _get_state, _get_instance is also left.

@@ -61,6 +64,66 @@
ATOL = 1e-6 if sys.platform == "darwin" else 1e-7


@pytest.fixture(autouse=True)
def debug_dispatch_functions():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like you're hacking type checks here lol. Nice!

# convert them to a list and recursively call get_state on it. For this, we
# expect the dtype to be object.
if obj.dtype != object:
raise UnsupportedTypeException(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, extending the error message makes sense. I'm okay with this line not being tested.

@BenjaminBossan
Copy link
Collaborator Author

BenjaminBossan commented Sep 29, 2022

I think renaming _get_state, _get_instance is also left.

Let me do this in another PR.

Yeah, extending the error message makes sense.

Done.

I feel like you're hacking type checks here lol. Nice!

Maybe a little bit ;) but my main concern was the inconsistent naming, e.g. obj when it's a state.

Ready for re-review @adrinjalali

@adrinjalali adrinjalali changed the title Persist clean up and consistency checks MNT Persist clean up and consistency checks Sep 29, 2022
@adrinjalali adrinjalali merged commit 95a718a into skops-dev:main Sep 29, 2022
@BenjaminBossan BenjaminBossan deleted the persist-clean-up-and-consistency-checks branch September 29, 2022 14:28
BenjaminBossan added a commit to BenjaminBossan/skops that referenced this pull request Sep 30, 2022
The functions that should be actually used everywhere are _get_state and
_get_instance, not get_state and get_instance. This is inconvenient and
confusing. Therefore, swap out the names so that the functions being
used everywhere now have the "public" name and the other ones the
"private" name.

This was discussed here:

skops-dev#143 (comment)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants