-
Notifications
You must be signed in to change notification settings - Fork 53
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MNT Persist clean up and consistency checks #143
MNT Persist clean up and consistency checks #143
Conversation
After checking a couple of runs, none that took longer than 12 min ever finished. Therefore, let's cap the timout at 15 min for now. Later, if/when we factor out the inference tests, we may decrease the timeout even further.
During the work on adding strict typing (which will most likely not work), a couple of opportunities for simplifications and clean ups came up. In order for them not to get lost in the typing PR, they are now added in this separate PR. Changes are: - A few simplifications to the "state" to avoid unnecessary nesting - Consistently raise UnsupportedTypeException - Simplify some code without without functional change - Consistently use _get_state and _get_instance - Enforce consistency on the argument names in get_state/instance functions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall pretty nice!
Regarding the use of _get_state vs get_state and _get_instance vs get_instance:
Since we basically only ever want to use the _get variant and not get (since the
latter are only placeholders for registration), I would suggest that we swap the
names. This will also be better for future contributions to 3rd party libraries,
since right now they would be required to use "private" functions.
+1
(PS: Sorry for spurious commits, I'll clean them up)
We squash/merge, don't worry about the commits ;)
skops/io/_numpy.py
Outdated
try: | ||
# First, try to save object with np.save and allow_pickle=False, which | ||
# should generally work as long as the dtype is not object. | ||
with suppress(ValueError): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the suppress/return pattern makes this harder to understand. Any reason you don't want to stay with try/except? try/except would seem much more readable to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer early return to lengthy try...except blocks but I will change it back if you prefer the other way round.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with early returns usually, but this suppress/return is somewhat convoluted!
skops/io/_numpy.py
Outdated
# convert them to a list and recursively call get_state on it. For this, we | ||
# expect the dtype to be object. | ||
if obj.dtype != object: | ||
raise UnsupportedTypeException( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would be nice to test these. But I'm happy to leave tests for individual methods to a separate PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to write a test for it but it wasn't that simple. This is the easiest I could come up with:
class _MockDtype:
metadata = None
names = None
subdtype = None
str = "<f8"
hasobject = False
_mock_dtype = _MockDtype()
class _CustomDtypeArray(np.ndarray):
@property
def dtype(self):
return _mock_dtype
@property
def itemsize(self):
# np.save checks this attribute, so (ab)use this to trigger a ValueError
raise ValueError("Trigger ValueError")
class NumpyUnknownDtypeEstimator(BaseEstimator):
def fit(self, X, y=None, **fit_params):
self.x_ = _CustomDtypeArray([1, 2, 3])
return self
def test_unknown_numpy_dtype_raises(tmp_path):
est = NumpyUnknownDtypeEstimator().fit(None)
f_name = tmp_path / "file.skops"
with pytest.raises(UnsupportedTypeException):
save_load_round(est, f_name)
I found it interesting that without the itemsize
property, saving would actually just work! Loading would fail though.
But I think what this shows us is that it's almost impossible to trigger this code path as a user, even with custom dtypes (if that's even a thing), unless the user goes throw extraordinary lengths to provoke the error. So I don't think this situation will really come up in the wild.
This makes me think that we might have to check the dtype differently inside this function. Not sure if we want to support custom dtypes and hope for the best or if we should check against all existing dtypes and raise if we don't know this one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sklearn has custom dtypes, in trees, but if that's not triggering this, then not many things probably would, and that means we can remove it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the pointer. I checked and np.save
and np.load
seem to work fine on that custom dtype (I suppose you mean this dtype=[('left_child', '<i8'), ...]
thing).
It means we could remove the if obj.dtype != object:
check, given how we never trigger it. I just wonder if it isn't more cautious to leave it there to have an explicit error? We don't really know what would happen if we have a custom dtype that cannot be saved with np.save
, maybe it's better to raise and get a user report? We could encourage reporting by extending the error message with "Please open an issue on...".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, extending the error message makes sense. I'm okay with this line not being tested.
- Use try...except instead of suppress - Add comments to tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think renaming _get_state
, _get_instance
is also left.
@@ -61,6 +64,66 @@ | |||
ATOL = 1e-6 if sys.platform == "darwin" else 1e-7 | |||
|
|||
|
|||
@pytest.fixture(autouse=True) | |||
def debug_dispatch_functions(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like you're hacking type checks here lol. Nice!
skops/io/_numpy.py
Outdated
# convert them to a list and recursively call get_state on it. For this, we | ||
# expect the dtype to be object. | ||
if obj.dtype != object: | ||
raise UnsupportedTypeException( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, extending the error message makes sense. I'm okay with this line not being tested.
... if an unknown dtype causes an error.
Let me do this in another PR.
Done.
Maybe a little bit ;) but my main concern was the inconsistent naming, e.g. Ready for re-review @adrinjalali |
The functions that should be actually used everywhere are _get_state and _get_instance, not get_state and get_instance. This is inconvenient and confusing. Therefore, swap out the names so that the functions being used everywhere now have the "public" name and the other ones the "private" name. This was discussed here: skops-dev#143 (comment)
During the work on adding strict typing (which will most likely not
work), a couple of opportunities for simplifications and clean ups came
up. In order for them not to get lost in the typing PR, they are now
added in this separate PR. Changes are:
state
to avoid unnecessary nestingUnsupportedTypeException
_get_state
and_get_instance
get_state/instance
functions
Regarding the use of
_get_state
vsget_state
and_get_instance
vsget_instance
:Since we basically only ever want to use the
_get
variant and notget
(since thelatter are only placeholders for registration), I would suggest that we swap the
names. This will also be better for future contributions to 3rd party libraries,
since right now they would be required to use "private" functions.
Ready for review @skops-dev/maintainers
(PS: Sorry for spurious commits, I'll clean them up)