-
Notifications
You must be signed in to change notification settings - Fork 54
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
load model from path/str to generate Card #205
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@p-mishra1 Thanks a lot for listening to the feedback, I like the refactor you did. This PR is now close to being ready. I have a few minor comments, please take a look. Also, please add an entry to the changes.rst
.
skops/card/_model_card.py
Outdated
"""Loads the mddel if provided a file path, if already a model instance return it unmodified. | ||
|
||
Parameters | ||
---------- | ||
model : pathlib.path, str, or sklearn estimator | ||
Path/str or the actual model instance. if a Path or str, loads the model on first call. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please add line length limits to the docstring as well?
skops/card/_model_card.py
Outdated
|
||
model_path = Path(model) | ||
if not model_path.exists(): | ||
raise ValueError("Model file does not exist") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please add a test case for this potential error? Also, FileNotFoundError
is a more precise error.
skops/card/_model_card.py
Outdated
@@ -172,7 +210,7 @@ class Card: | |||
|
|||
Parameters | |||
---------- | |||
model: estimator object | |||
model: pathlib.path, str, or sklearn estimator object | |||
Model that will be documented. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please extend the docstring, now that we have more options here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i changed to this "pathlib.path, str, or sklearn estimator object"
what should we add more ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about something along the lines that if a str or Path is passed, the model will be loaded. This could be surprising behavior for some users and should be mentioned.
skops/card/tests/test_card.py
Outdated
assert loaded_model_str.n_jobs is model0.n_jobs | ||
assert loaded_model_path.n_jobs is model0.n_jobs | ||
assert loaded_model_path.n_jobs is loaded_model_str.n_jobs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assert loaded_model_str.n_jobs is model0.n_jobs | |
assert loaded_model_path.n_jobs is model0.n_jobs | |
assert loaded_model_path.n_jobs is loaded_model_str.n_jobs | |
assert loaded_model_str.n_jobs == model0.n_jobs | |
assert loaded_model_path.n_jobs == model0.n_jobs | |
assert loaded_model_path.n_jobs == loaded_model_str.n_jobs |
Although is
works in this case, let's use ==
to compare numbers and then we don't have to worry about the size of the number or the Python interpreter :)
skops/card/tests/test_card.py
Outdated
assert card_from_model0.model.n_jobs == card_from_str.model.n_jobs | ||
|
||
@pytest.mark.parametrize("suffix", [".pkl", ".pickle", ".skops"]) | ||
def test_model_card_path(self, suffix): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this test could be unified with the previous one, as there is only one line difference, by parametrizing it over whether the save_file
is converted to Path
or not.
skops/card/tests/test_card.py
Outdated
def test_model_is_str_pickle(self, destination_path): | ||
model0 = LinearRegression(n_jobs=123) | ||
f_name0 = destination_path / "lin_reg.pickle" | ||
with open(f_name0, "wb") as f: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use save_model_to_file
in this test?
Thanks to you @BenjaminBossan patiently guiding me and giving me feedback every time.. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, we're almost done here.
From my point of view, there are only 3 minor issues left:
- Adjusting the docstring as discussed
- Remove the debug print (see comment)
- Add an entry to
changes.rst
skops/card/tests/test_card.py
Outdated
model0 = LinearRegression(n_jobs=123) | ||
save_file = save_model_to_file(model0, suffix) | ||
save_file = file_type(save_file) | ||
print(type(save_file)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, sorry left it by mistake,
one more thing @BenjaminBossan I'm unable to understand where should i add the changes in changes.rst
it will go under v0.3
right.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it will go under v0.3 right.
Exactly, just add an entry to the end of the list. Also, since you're not on the contributor list yet, add yourself as a contributor at the very end of the document.
@BenjaminBossan the test were failing on windows so I changed the function def save_model_to_file(model_instance, suffix):
save_file_handle, save_file = tempfile.mkstemp(suffix=suffix, prefix="skops-test")
if suffix in (".pkl", ".pickle"):
with open(save_file, "wb") as f:
pickle.dump(model_instance, f)
elif suffix == ".skops":
dump(model_instance, save_file)
return save_file_handle, save_file like this so that i can close the file before removing it somehow it is not a problem in linux/mac. |
@BenjaminBossan do you have an idea why it is failing on windows, i could not figure out exactly why, |
I'm not sure, Windows is always a big riddle. Maybe the string that is being matched contains a character that doesn't work on Windows for some reason? |
If you can't figure it out, I would be okay with skipping the test on Windows (with a comment explaining why). The error really comes from the test itself and is not something a Windows user would encounter in practice when using the model card. |
I will look into it and then we can decide |
@BenjaminBossan i tried somethings and also worked in the windows environment to mitigate the issue have changed the test a little bit hopefully it will work this time. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @p-mishra1
skops/card/_model_card.py
Outdated
if model_path.suffix in (".pkl", ".pickle"): | ||
model = joblib.load(model_path) | ||
elif model_path.suffix == ".skops": | ||
model = load(model_path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we shouldn't rely on the extension here. We can check if the file is a zip file (zipfile.is_zipfile(filename)
¶ and load with skops
if yes, otherwise try to load with joblib
, and catch any exceptions which might be raised during each of them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can't we just keep a key in config file to store which format the model was serialized in?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i agree depending on file extension is not a good way.
let's say the file loaded indeed was a zip then we will be conforming that it is sklearn estimator object. using this
base.BaseEstimator ?
really sorry if it felt too naive. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can't we just keep a key in config file to store which format the model was serialized in?
We could but I don't think this function is called late enough for that to work.
let's say the file loaded indeed was a zip then we will be conforming that it is sklearn estimator object. using this
base.BaseEstimator ?
No, if it's a zip file, then it's a skops
file, which means we can load using skops, otherwise we load with joblib
, and we don't validate the output. We just catch any exception which might have been raised.
skops/card/_model_card.py
Outdated
@model.deleter | ||
def model(self): | ||
del self._model |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For consistency, the method should exist (assuming we leave the rest as is), no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I almost never see a deleter for properties.
skops/card/_model_card.py
Outdated
@property | ||
def model(self): | ||
model = _load_model(self._model) | ||
if model is not self._model: | ||
self._model = model | ||
return model |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should change model
. We should always keep model
attribute as is given by the user, and only load when we need it in other methods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agreed 👍🏻
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I fully understand. You want the model
argument passed by the user to remain unchanged, and when we access it in other methods, we check if it's a string/path and load it? So basically swap around the attribute names model
and _model
? Or load the model each time it's accessed (which would be wasteful, since it's used, for instance, in __repr__
so could be accessed quite frequently).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so instead of using _model
attribute we should use _load_model(self.model)
wherever there is a call to model
,
am i thinking in the right direction @merveenoyan @adrinjalali .
also tagging @BenjaminBossan to have a look on what we discussed earlier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, exactly @p-mishra1 .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes @BenjaminBossan sounds fine will work on this more, and
just curious do we have slack/forum/discord where we can talk about code design before commiting it, or is it the preferred way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for continuing the work.
I think ideally these discussions should come up in the PR or the corresponding issue. I thought there was consensus on the suggested approach here, I guess there was some confusion, sorry for that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def __repr__(self) -> str:
model = getattr(self, "model", None)
if model:
model = _load_model(model) # this line
model_str = self._strip_blank(repr(model))
model_repr = aRepr.repr(f" model={model_str},").strip('"').strip("'")
else:
model_repr = None
def _extract_estimator_config(self) -> str:
"""Extracts estimator hyperparameters and renders them into a vertical table.
Returns
-------
str:
Markdown table of hyperparameters.
"""
model = _load_model(self.model) # This line
hyperparameter_dict = model.get_params(deep=True)
def _generate_card(self) -> ModelCard:
.
.
.
if self.model_diagram is True:
model = _load_model(self.model) # this line
model_plot_div = re.sub(r"\n\s+", "", str(estimator_html_repr(model)))
so these 3 functions will be changed like this(commented with "this line") while doing lazy loading in card class.
also the test for card repr and str either have to be changed to take card instance and path,
or the test class have to be duplicated like passing model file as param instead of estimator.
lmk @adrinjalali @BenjaminBossan which method should we prefer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this is the right approach or if we should keep the property
and just do the _load_model
inside the property. The disadvantage of the suggested approach is that it's very error prone, as soon as new code is added that access the model, it would need to do the same thing. That's why I would prefer using property
. @adrinjalali wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about self.get_model()
which would return the model if self.model
is a str
? acceptable middle ground?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks a lot for working on this!
skops/card/_model_card.py
Outdated
if model_path.suffix in (".pkl", ".pickle"): | ||
model = joblib.load(model_path) | ||
elif model_path.suffix == ".skops": | ||
model = load(model_path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can't we just keep a key in config file to store which format the model was serialized in?
skops/card/_model_card.py
Outdated
@property | ||
def model(self): | ||
model = _load_model(self._model) | ||
if model is not self._model: | ||
self._model = model | ||
return model |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agreed 👍🏻
@BenjaminBossan @adrinjalali have a look at the changes, sorry for the long gap, was busy with some things. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This LGTM now. but since the code has changed a bit since last review from @BenjaminBossan , I'll let him check and merge.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I only have a few minor comments left, after that we should be good 👍
skops/card/_model_card.py
Outdated
|
||
Parameters | ||
---------- | ||
model : pathlib.path, str, or sklearn estimator |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
model : pathlib.path, str, or sklearn estimator | |
model : pathlib.Path, str, or sklearn estimator |
skops/card/_model_card.py
Outdated
@@ -263,6 +304,18 @@ def __init__( | |||
self._extra_sections: list[tuple[str, Any]] = [] | |||
self.metadata = metadata or ModelCardData() | |||
|
|||
def get_model(self) -> Any: | |||
"""Returns sklearn estimator object if ``Path``/``str`` | |||
is provided. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is provided. | |
is provided. |
skops/card/_model_card.py
Outdated
@@ -162,6 +165,41 @@ def metadata_from_config(config_path: Union[str, Path]) -> ModelCardData: | |||
return card_data | |||
|
|||
|
|||
def _load_model(model: Any) -> Any: | |||
"""Loads the mddel if provided a file path, if already a model instance return it | |||
unmodified. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unmodified. | |
unmodified. |
skops/card/_model_card.py
Outdated
documented. If a ``Path`` or ``str`` is provided, model will be loaded | ||
on first use. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
documented. If a ``Path`` or ``str`` is provided, model will be loaded | |
on first use. | |
documented. If a ``Path`` or ``str`` is provided, model will be loaded. |
After the change, it will be loaded each time, so "on first use" is not correct anymore.
skops/card/_model_card.py
Outdated
``Path``/``str`` of the model or the actual model instance that will be | ||
documented. If a ``Path`` or ``str`` is provided, model will be loaded | ||
on first use. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
skops/card/tests/test_card.py
Outdated
assert loaded_model_str.n_jobs == model0.n_jobs | ||
assert loaded_model_path.n_jobs == model0.n_jobs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assert loaded_model_str.n_jobs == model0.n_jobs | |
assert loaded_model_path.n_jobs == model0.n_jobs | |
assert loaded_model_str.n_jobs == 123 | |
assert loaded_model_path.n_jobs == 123 |
This test is a tiny bit more robust this way (in case that model0
is being mutated by _load_model
for some reason).
skops/card/tests/test_card.py
Outdated
|
||
assert loaded_model_str.n_jobs == model0.n_jobs | ||
assert loaded_model_path.n_jobs == model0.n_jobs | ||
assert loaded_model_path.n_jobs == loaded_model_str.n_jobs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is not needed, right? If the two asserts above succeed, this cannot possibly fail.
skops/card/tests/test_card.py
Outdated
|
||
card_from_path = self.path_to_card(file_name) | ||
result_from_path = meth(card_from_path) | ||
print(result_from_path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove
skops/card/tests/test_card.py
Outdated
|
||
@pytest.mark.parametrize("meth", [repr, str]) | ||
@pytest.mark.parametrize("suffix", [".pkl", ".skops"]) | ||
def test_model_card_repr(self, card: Card, meth, suffix): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if this test is strictly necessary. Maybe it is sufficient to show that calling repr
or str
on a card that has a path/str for model works? The fact that it's the same output is implied by the other tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed it to only repr from card_from _path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good. Thanks a lot @p-mishra1 for your work and patience.
Thanks a lot to you guys @BenjaminBossan and @adrinjalali for guiding me and listening to my silly comments and doubts. one request I have to @BenjaminBossan as most of the guidance and reviews were provided by you, can you provide me any feedback you have for me so that I can improve myself? |
Hey @p-mishra1, First of all, it's really hard to give feedback after one PR and without knowing you personally, so there is only so much I can say. But this is my personal impression: I liked the initiative you showed. Your responses were quick and you were receptive to our feedback - even when it was conflicting sometimes ;-). As you know, this all took some time but still, you persisted, which is very good. You were willing to do some of the hard work and came up with solutions to new challenges that occurred on the way. If there was one thing that would make the life of us maintainers easier, it is if right from the start, you would have delivered a more "complete" package, i.e. updating the docs everywhere, providing unit tests, making sure that code style and formatting are consistent with the rest of the code base. When we see that as maintainers right from the start, we're always excited. Hopefully, this helps you a little bit. If you have feedback for us, also feel free to share. |
Thank you @BenjaminBossan i appreciate you taking some time to give me feedback. I know it is quite tough for you to give feedback on the coding part . |
#96 Added required changes and followed the pattern @BenjaminBossan advised.
Creating PR from another branch as the last one had too many commits and reverts.
@BenjaminBossan have a look.