Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

load model in Card class #174

Closed
wants to merge 6 commits into from

Conversation

p-mishra1
Copy link
Contributor

#96 added a function in the Card class to load the model if provided a file path.
sorry if the implementation looks naive

Copy link
Collaborator

@merveenoyan merveenoyan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for contributing 🧡 could you please also write tests making sure that it would work for both a model, a model path containing a pkl file, and a model path containing .skops format?

model_path = Path(model)
if not model_path.exists():
raise ValueError("Model file does not exist")
model = skops.io.load(model)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As of now, we also support pickle files. For this to work, it would be good to add a line that loads a pickle. skops.io.load() loads a model file with skops extension, which is essentially a .zip.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just added a few changes for "skops" as well as "pkl" file suuport

Copy link
Collaborator

@BenjaminBossan BenjaminBossan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking this.

Not a full review but a couple of points:

  1. We need to have a test for this. If you need help implementing the test, let us know.
  2. The docstring of the Card class needs to be updated.
  3. Right now, the model is eagerly loaded during __init__. Recently, we made a move to prevent this (see MNT Model card refactor to move plotting & table #163). However, lazy loading could be costly because it would be done for each call to __repr__. Maybe we can store the actual model in a different attribute (say, _model) when it's called for the first time?

@adrinjalali
Copy link
Member

Right now, the model is eagerly loaded during init. Recently, we made a move to prevent this (see #163). However, lazy loading could be costly because it would be done for each call to repr. Maybe we can store the actual model in a different attribute (say, _model) when it's called for the first time?

But we need to make sure we invalidate that whenever the _model changes.

@BenjaminBossan
Copy link
Collaborator

But we need to make sure we invalidate that whenever the _model changes.

This could be done with a setter but maybe we can come up with an easier solution?

@p-mishra1
Copy link
Contributor Author

p-mishra1 commented Oct 6, 2022

  1. We need to have a test for this. If you need help implementing the test, let us know.

Yes sure, there will be same test function but instead of model instance we have to pass a tmp model file (one for skops and one for pkl).

  1. Right now, the model is eagerly loaded during __init__. Recently, we made a move to prevent this (see MNT Model card refactor to move plotting & table #163). However, lazy loading could be costly because it would be done for each call to __repr__. Maybe we can store the actual model in a different attribute (say, _model) when it's called for the first time?

On this as the same argument is supporting both model and model_path so in representation and in _generate_card we will be doing model loading.
else we can do it in representation on 1st call and set _model to model instance but in that case both model and '_model' will have the instance when user will provide model instance to Card class.

@BenjaminBossan
Copy link
Collaborator

@p-mishra1 Just to make sure, what is the state of this PR? Are you waiting for input from our side?

@p-mishra1
Copy link
Contributor Author

p-mishra1 commented Oct 11, 2022

Yes @BenjaminBossan i changed the functionality to load skops file as well as pkl file and also mentioned some of the things about thr design choice.
Waiting for input from you guys about that.

@BenjaminBossan
Copy link
Collaborator

  1. We need to have a test for this. If you need help implementing the test, let us know.

Yes sure, there will be same test function but instead of model instance we have to pass a tmp model file (one for skops and one for pkl).

So here we would like to see unit tests for the new feature being added, similar to what we have here. Would you please work on that? We can help if needed.

On this as the same argument is supporting both model and model_path so in representation and in _generate_card we will be doing model loading.
else we can do it in representation on 1st call and set _model to model instance but in that case both model and '_model' will have the instance when user will provide model instance to Card class.

So I prefer the second solution. When the model is first accessed, if it's a path, it is loaded, if it's already a model, then we're good. It should be stored on a private attribute like _model. Then the next time, the model is already loaded and we don't need to load it again. When the user changes the model attribute, we need to invalidate _model, since it could now be a different model.

@p-mishra1 p-mishra1 requested review from BenjaminBossan and removed request for adrinjalali October 20, 2022 16:11
Copy link
Collaborator

@BenjaminBossan BenjaminBossan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for working on this @p-mishra1

This is progressing in the right direction, but there is still some work to do. Most significantly, the initial consideration is still not addressed:

Right now, the model is eagerly loaded during __init__. Recently, we made a move to prevent this. However, lazy loading could be costly because it would be done for each call to __repr__. Maybe we can store the actual model in a different attribute (say, _model) when it's called for the first time?

Why is this important? Say, a user creates a card instance with model_path0 and then later re-assigns the model to model_path1, the _load_model function will not be called. Then code further down the line will fail.

I created a minimal example that shows how this problem can be solved with @property and setter:

import pickle
from typing import Any
from pathlib import Path
import joblib
from sklearn.linear_model import LinearRegression, LogisticRegression
from skops.io import load


def _load_model(model: Any) -> Any:
    """Loads the model if provided a file path, if already a model instance,
    return it unmodified.

    Parameters
    ----------
    model : pathlib.path, str, or sklearn estimator
        Path/str or the actual model instance. If a Path or str, loads the model on first call.

    Returns
    -------
    model : object
        Model instance.

    """
    if not isinstance(model, (Path, str)):
        return model

    model_path = Path(model)
    if not model_path.exists():
        raise ValueError("Model file does not exist")

    if model_path.suffix in (".pkl", ".pickle"):
        # I don't think there is any advantage of joblib.load over pickle.load
        model = joblib.load(model_path)
    elif model_path.suffix == ".skops":
        model = load(model_path)
    else:
        msg = f"Cannot interpret model suffix {model_path.suffix}, should be '.pkl', '.pickle' or '.skops'"
        raise ValueError(msg)

    return model


class Card:
    def __init__(self, model):
        self._model = model

    @property
    def model(self):
        model = _load_model(self._model)
        if model is not self._model:
            self._model = model
        return model

    @model.setter
    def model(self, model):
        self._model = model


class TestCardModelAttribute:
    def test_model_estimator(self):
        model0 = LinearRegression()

        card = Card(model0)
        assert card.model is model0

        # re-assigning the model works
        model1 = LogisticRegression()
        card.model = model1
        assert card.model is model1

        # re-assigning back to original works
        card.model = model0
        assert card.model is model0

    def test_model_is_str_pickle(self, tmp_path):
        model0 = LinearRegression(n_jobs=123)
        f_name0 = tmp_path / "lin_reg.pickle"
        with open(f_name0, "wb") as f:
            pickle.dump(model0, f)

        card = Card(f_name0)
        assert isinstance(card.model, LinearRegression)
        assert card.model.n_jobs == 123

        # re-assigning the model works
        model1 = LogisticRegression(n_jobs=456)
        f_name1 = tmp_path / "log_reg.pickle"
        with open(f_name1, "wb") as f:
            pickle.dump(model1, f)

        card.model = f_name1
        assert isinstance(card.model, LogisticRegression)
        assert card.model.n_jobs == 456

        # re-assigning back to original works
        card.model = f_name0
        assert isinstance(card.model, LinearRegression)
        assert card.model.n_jobs == 123

This example should work as is. As you can see, I simplified a lot, but it should demonstrate the general idea. Do you think you could refactor your code to work like this?

Also, as you can see, I added some specific tests for this feature. This is preferred over modifying all the existing tests to check both models passed as str or as actual estimators. So could you please undo your changes to the existing tests and instead add tests that are specific to this problem?

PS: We should also add the possibility to pass a pathlib.Path, not only str, since we generally support that throughout skops.

@p-mishra1
Copy link
Contributor Author

thanks @BenjaminBossan will update the changes

@p-mishra1
Copy link
Contributor Author

thanks @BenjaminBossan for the code snippet and detailed explanation i have added required changes, i also added a deleter for model. have a look.

Copy link
Collaborator

@BenjaminBossan BenjaminBossan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for reworking this PR. We're getting closer to the finish line.

I did a light review of your PR and there are a couple of things that need to be changed.

Formatting

Could you revert the formatting changes you made? It looks like some tool you use (your IDE maybe?) reformatted the files you worked on to use 79 or 80 characters per line. However, we use black with the default settings, which comes down to ~88 characters per line. If you undo these changes, the code will be much easier to review because the formatting changes distract from the real changes.

The best way to achieve this would be if you could first undo your formatting changes, then add black via a pre-commit hook. This step is explained in our contribution pipeline but the gist is that you should do something like:

pip install pre-commit
pre-commit install

from within your Python environment. After doing this, each time when you commit, the code will be checked for black compliance.

Testing

If I understand correctly, you have duplicated a number of tests. Examples are test_plot_model_false_from_path and test_add_from_path. The change you introduced was to have the card initialized from a file instead of passing the model directly, the rest is identical.

IMO, it is not necessary to duplicate all those tests, since the actual thing they're testing is not related to how the model was passed. Instead, it is sufficient to have tests that show that passing a model or a str or a Path do the same thing, and that modifying the .model attribute works (as in the last test class (which btw. should also test deleting, which you correctly added)). The duplicated tests can then be removed.

Another suggestion I have, which is not as clear cut, is to move _load_model out to be a standalone function and not a method on Card. Then add a few tests just for this function. Once we know that this function works well, testing the Card class will be much easier because we know that the model loading part is covered.

Test fixtures

I think there is a misunderstanding regarding the use of pytest fixtures. Let's take this example:

@pytest.mark.parametrize("suffix", [".pkl", ".pickle", ".skops"])
def test_save_model_card_from_path(destination_path, model_card_from_path):
    model_card_from_path.save(Path(destination_path) / "README.md")
    assert (Path(destination_path) / "README.md").exists()

Here you use the fixture suffix but your test doesn't make use of it. This means that the exact same test will be run 3 times, with no additional value. The correct way to use fixtures is to create the decorator (as you did), then add the fixture name as an argument to the function, and then to use that argument in the function (e.g. in test_plot_model_false_from_path you use the fixture correctly).

I hope this helps and if you have more questions feel free to ask us.

@p-mishra1
Copy link
Contributor Author

Formatting

Could you revert the formatting changes you made? It looks like some tool you use (your IDE maybe?) reformatted the files you worked on to use 79 or 80 characters per line. However, we use black with the default settings, which comes down to ~88 characters per line. If you undo these changes, the code will be much easier to review because the formatting changes distract from the real changes.

The best way to achieve this would be if you could first undo your formatting changes, then add black via a pre-commit hook. This step is explained in our contribution pipeline but the gist is that you should do something like:

pip install pre-commit
pre-commit install

from within your Python environment. After doing this, each time when you commit, the code will be checked for black compliance.

I used the contribution setup before raising the PR but i think my local black setup formatted the code before the being formatted by pre-commit.

Testing

If I understand correctly, you have duplicated a number of tests. Examples are test_plot_model_false_from_path and test_add_from_path. The change you introduced was to have the card initialized from a file instead of passing the model directly, the rest is identical.

IMO, it is not necessary to duplicate all those tests, since the actual thing they're testing is not related to how the model was passed. Instead, it is sufficient to have tests that show that passing a model or a str or a Path do the same thing, and that modifying the .model attribute works (as in the last test class (which btw. should also test deleting, which you correctly added)). The duplicated tests can then be removed.

Another suggestion I have, which is not as clear cut, is to move _load_model out to be a standalone function and not a method on Card. Then add a few tests just for this function. Once we know that this function works well, testing the Card class will be much easier because we know that the model loading part is covered.

okay, I will remove the duplicate tests and make standalone _load_model function and write tests for funtion as well as passing path.

Test fixtures

I think there is a misunderstanding regarding the use of pytest fixtures. Let's take this example:

@pytest.mark.parametrize("suffix", [".pkl", ".pickle", ".skops"])
def test_save_model_card_from_path(destination_path, model_card_from_path):
    model_card_from_path.save(Path(destination_path) / "README.md")
    assert (Path(destination_path) / "README.md").exists()

Here you use the fixture suffix but your test doesn't make use of it. This means that the exact same test will be run 3 times, with no additional value. The correct way to use fixtures is to create the decorator (as you did), then add the fixture name as an argument to the function, and then to use that argument in the function (e.g. in test_plot_model_false_from_path you use the fixture correctly).

Fixture parameterisation
here the pytest fixture model_card_from_path has a param named suffix and pytest is implicitly passing the values. have a look at the above stackoverflow link, so the pytest behaviour is correct(i checked the test locally and verfied the output and behaviour before the PR ).

@BenjaminBossan
Copy link
Collaborator

here the pytest fixture model_card_from_path has a param named suffix and pytest is implicitly passing the values. have a look at the above stackoverflow link, so the pytest behaviour is correct(i checked the test locally and verfied the output and behaviour before the PR ).

Ah I see, I didn't know about this new feature. It's kinda cool but also a bit magic. I wonder if it's really preferable over using request.param to parametrize a fixture.

@adrinjalali
Copy link
Member

Oh nice, finally pytest supporting this. Been waiting for it for ages. I personally find it more intuitive than using request.param. But everywhere this pattern is used, I'd leave a comment saying it's a parameter passed to the fixture for future developers to figure it out easily.

@p-mishra1
Copy link
Contributor Author

Created a fresh PR in #205 for this.

@p-mishra1 p-mishra1 closed this Oct 28, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants