
Warning if fitted sklearn model being used #989

Merged
merged 9 commits on Nov 3, 2020

Conversation

Neeratyoy
Contributor

Reference Issue

Addresses #968.

What does this PR implement/fix? Explain your changes.

Prints a warning if the user passes an already fitted sklearn model to run_model_on_task.

How should this PR be tested?

import openml
from sklearn.ensemble import RandomForestClassifier

task = openml.tasks.get_task(23)
clf = RandomForestClassifier()

# unfitted model: no warning expected
run = openml.runs.run_model_on_task(clf, task)
print("Unfit model run")

# fit the model first, then run it again: a warning should now be printed
X, y = task.get_X_and_y()
clf.fit(X, y)
run = openml.runs.run_model_on_task(clf, task)
print("Fit model run")

@Neeratyoy Neeratyoy requested review from mfeurer and PGijsbers and removed request for mfeurer October 30, 2020 13:40
openml/extensions/sklearn/extension.py (outdated diff)
)
except NotFittedError:
    # model is not fitted, as is required
    pass
Collaborator

I thought in the Python call we discussed that perhaps we would check this at the first call to run_model_on_task?
In either case I would extract this to a separate method _raise_warning_if_fitted to make sure the functions don't get too big (they already are).

Contributor Author

Alright, will make it into a function and push.

As for its placement: regardless of whether run_model_on_task or run_flow_on_task is called, this is the function the call ultimately reduces to, so I went ahead and placed the snippet here.

Collaborator

run_model_on_task actually calls run_flow_on_task. That said, we would then need to add a function to the extension interface that indicates whether a model is already fit, otherwise we can't check it in a general way.
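
A minimal sketch of what such a hook on the extension interface could look like (the method name check_if_model_fitted shows up in the traceback later in this thread; the exact signature and docstring here are assumptions):

from abc import ABC, abstractmethod
from typing import Any


class Extension(ABC):
    # excerpt of the extension interface, showing only the new hook

    @abstractmethod
    def check_if_model_fitted(self, model: Any) -> bool:
        """Return True if the model has already been fitted/trained."""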

Collaborator

@mfeurer do you think this is something we should want? Or do we just leave it to the extension devs to implement a warning if they see fit?

Collaborator

@mfeurer mfeurer Nov 2, 2020

I like the idea of having this function as a callback that can be implemented by the extension devs. And yes, I expected this function to be called from the run_model_on_task function.
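
For illustration, a rough sketch (not necessarily the PR's exact code; the warning text and the RuntimeWarning category are assumptions) of how run_model_on_task could invoke that callback before handing off to run_flow_on_task:

import warnings

from openml.extensions import get_extension_by_model
from openml.runs import run_flow_on_task


def run_model_on_task(model, task, **kwargs):
    # sketch: warn if the model was already fitted, then run it as usual
    extension = get_extension_by_model(model)
    if extension.check_if_model_fitted(model):
        warnings.warn(
            "The model is already fitted. Running it on the task will refit it.",
            RuntimeWarning,
        )
    flow = extension.model_to_flow(model)
    return run_flow_on_task(flow, task, **kwargs)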

Collaborator

@mfeurer mfeurer left a comment

This PR looks good now, but you accidentally committed a .orig file. Could you please remove that again?

@Neeratyoy
Contributor Author

This PR looks good now, but you accidentally committed a .orig file. Could you please remove that again?

Done

@mfeurer
Collaborator

mfeurer commented Nov 2, 2020

Hey, it's a bit hard to see, but among all the failures it says:

self = <openml.extensions.sklearn.extension.SklearnExtension object at 0x0000009C4334DA58>
model = RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max...                     n_jobs=None, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)
    def check_if_model_fitted(self, model: Any) -> bool:
        """Returns True/False denoting if the model has already been fitted/trained
    
        Parameters
        ----------
        model : Any
    
        Returns
        -------
        bool
        """
        try:
            # check if model is fitted
            from sklearn.exceptions import NotFittedError
            from sklearn.utils.validation import check_is_fitted
    
>           check_is_fitted(model)  # raises a NotFittedError if the model has not been trained
E           TypeError: check_is_fitted() missing 1 required positional argument: 'attributes'
openml\extensions\sklearn\extension.py:1556: TypeError

Could you please have a look?

@joaquinvanschoren
Contributor

@all-contributors please add @Neeratyoy for code

@allcontributors
Contributor

@joaquinvanschoren

I've put up a pull request to add @Neeratyoy! 🎉

Collaborator

@PGijsbers PGijsbers left a comment

Looks good to me, if the failing unit test that @mfeurer pointed out is fixed.

@PGijsbers
Collaborator

@all-contributors please add @Neeratyoy for code

(sorry for the spam)

@allcontributors
Contributor

@PGijsbers

I've put up a pull request to add @Neeratyoy! 🎉

@Neeratyoy
Contributor Author

Could you please have a look?

Given the error message, it looks like check_is_fitted behaved differently in older versions of sklearn.

I updated the design of the check to be agnostic to the sklearn version and to the kind of model passed.
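
For reference, one version-agnostic way to implement such a check (a sketch under the assumption that this is roughly what the fix does, not necessarily the exact code that landed): newer sklearn releases accept check_is_fitted(model) without the attributes argument, while older releases raise the TypeError shown above, so the fallback below looks for learned attributes ending in an underscore, the same convention check_is_fitted relies on internally.

from typing import Any


def check_if_model_fitted(model: Any) -> bool:
    """Return True if the sklearn model appears to have been fitted already."""
    from sklearn.exceptions import NotFittedError
    from sklearn.utils.validation import check_is_fitted

    try:
        try:
            # sklearn >= 0.22: the `attributes` argument is optional
            check_is_fitted(model)
        except TypeError:
            # older sklearn: emulate the check by looking for learned
            # attributes, which by convention end with a single underscore
            fitted_attrs = [
                attr for attr in vars(model)
                if attr.endswith("_") and not attr.startswith("__")
            ]
            if not fitted_attrs:
                raise NotFittedError("Model has no fitted attributes.")
        return True
    except NotFittedError:
        return False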

@codecov-io

codecov-io commented Nov 3, 2020

Codecov Report

Merging #989 into develop will decrease coverage by 0.05%.
The diff coverage is 71.42%.


@@             Coverage Diff             @@
##           develop     #989      +/-   ##
===========================================
- Coverage    87.91%   87.86%   -0.06%     
===========================================
  Files           36       36              
  Lines         4551     4565      +14     
===========================================
+ Hits          4001     4011      +10     
- Misses         550      554       +4     
Impacted Files Coverage Δ
openml/runs/functions.py 83.16% <50.00%> (-0.17%) ⬇️
openml/extensions/sklearn/extension.py 90.84% <70.00%> (-0.24%) ⬇️
openml/extensions/extension_interface.py 91.66% <100.00%> (+0.49%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@PGijsbers
Collaborator

As far as I can tell these failures are a combination of timeouts and server issues, am I overlooking something?

@mfeurer
Collaborator

mfeurer commented Nov 3, 2020

Yes, except for:

  • pytest workers crashing. This is because the pytest timeout is not longer than what the unit test itself waits for (the timeout is set to 600s, so if the unit test now waits for 600s, it runs over the timeout).
  • a known failure that previously only existed on Windows. I already asked @Neeratyoy to look into this.

In my opinion, this PR can be merged.

@PGijsbers
Collaborator

pytest workers crashing.

Yes this is what I referred to with timeouts (not server timeouts), my bad for leaving it ambiguous.

In my opinion, this PR can be merged.

Will do.
