
Adding sklearn docstring to flow #756

Merged
merged 15 commits into from
Sep 13, 2019

Conversation

Neeratyoy
Contributor

Reference Issue

Addresses #175.

What does this PR implement/fix? Explain your changes.

Adds the sklearn model's docstring description to the OpenML flow description for flows built from sklearn models. It also parses the docstring's Parameters section and fills in the data types and parameter descriptions for the flow's parameters_meta_info attribute.
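As a rough sketch of the parsing described above: pull the numpydoc-style Parameters section out of a model's docstring and map each parameter name to its data type and description. This is a hypothetical helper for illustration, not the code merged in this PR.

```python
import re

def parse_numpydoc_parameters(docstring):
    # Hypothetical sketch, not the actual implementation in this PR.
    # Capture everything between the underlined 'Parameters' header and
    # the next underlined section header (or the end of the docstring).
    match = re.search(
        r"Parameters\n\s*-+\n(.*?)(?=\n\s*\w+\n\s*-+\n|\Z)",
        docstring,
        re.DOTALL,
    )
    if match is None:
        return {}
    params = {}
    name = None
    for line in match.group(1).splitlines():
        if not line.strip():
            continue
        if not line.startswith(" " * 8):
            # A 'name : type' header line: everything before the first
            # colon is the name, everything after it the data type.
            head, sep, dtype = line.strip().partition(":")
            name = head.strip()
            params[name] = {"data_type": dtype.strip() if sep else None,
                            "description": []}
        elif name is not None:
            params[name]["description"].append(line.strip())
    return {
        key: {"data_type": value["data_type"],
              "description": " ".join(value["description"])}
        for key, value in params.items()
    }

# Toy docstring in the sklearn/numpydoc layout, used for illustration.
DOC = """Stochastic gradient descent classifier.

    Parameters
    ----------
    learning_rate : str, default='optimal'
        The learning rate schedule.
    alpha : float
        Constant that multiplies the regularization term.
    """

meta_info = parse_numpydoc_parameters(DOC)
print(meta_info["learning_rate"]["data_type"])
```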

How should this PR be tested?

```python
import openml
from sklearn.linear_model import SGDClassifier

openml.config.start_using_configuration_for_example()
task = openml.tasks.get_task(403)
clf = SGDClassifier()
run, flow = openml.runs.run_model_on_task(clf, task, return_flow=True)
print(flow.description)
print(flow.parameters_meta_info['learning_rate']['data_type'])
print(flow.parameters_meta_info['learning_rate']['description'])
```

@Neeratyoy
Contributor Author

@mfeurer
Hope this is in line with the requirements of the issue.

A couple of things though:

  • In the example given above, when I go to the test server page for the corresponding flow_id generated after run_model_on_task, the description loads and then throws a 'Not Found' error. However, the parameter descriptions come up fine.

  • Currently, I am updating the description for all sklearn flows. However, I am wondering where this will break, given that most of the popular sklearn models already exist as flows. I have noticed that if I set flow_id=None and publish() a flow which already exists, it raises a ValueError for the description and parameter descriptions too.

@mfeurer
Collaborator

mfeurer commented Aug 7, 2019

the description loads and then throws a 'Not Found' error

This is a frontend error on the test server and nothing to worry about here.

Currently, I am updating the description for all sklearn flows

Do you mean on the test or live server? If yes, I don't think there's a reason to do so. In my opinion it's sufficient if this information is added in the future.

Collaborator

@mfeurer mfeurer left a comment


Thanks, this already looks good. I have a few comments which could make the code a bit more resilient. Could you please post a link to a "new" flow and also start adding unit tests?

openml/extensions/sklearn/extension.py (8 resolved review comments)
@Neeratyoy Neeratyoy requested a review from mfeurer August 7, 2019 14:58
@Neeratyoy
Contributor Author

@mfeurer there will be quite a few errors, namely AttributeError and KeyError, which I need to take a look at. The problem I discussed with you should raise ValueErrors specifically.

@mfeurer
Collaborator

mfeurer commented Aug 7, 2019

Hm, it appears that we need to ignore the descriptions for these unit tests.

@Neeratyoy
Contributor Author

Neeratyoy commented Aug 7, 2019

Hm, it appears that we need to ignore the descriptions for these unit tests.

This function checks whether the local flow matches the flow on the server.
Shall I edit it to exempt description and parameters_meta_info from the check?

The result would be that no exceptions are raised and the server flow stays intact, while the local flow carries the newer sklearn documentation.
For new flows, it shouldn't be a problem either way.
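A minimal sketch of such an exemption, in the spirit of the flags already used elsewhere. The signature and dict-based flows here are hypothetical; the real helper in openml.flows.functions operates on flow objects and takes more flags.

```python
def assert_flows_equal(local, server, ignore_parameter_documentation=False):
    # Compare two flows field by field, optionally skipping the
    # documentation fields behind a keyword flag (hypothetical sketch).
    ignored = ({"description", "parameters_meta_info"}
               if ignore_parameter_documentation else set())
    for field in local:
        if field in ignored:
            continue
        if local.get(field) != server.get(field):
            raise ValueError("Flow field %r does not match." % field)

local_flow = {"name": "sklearn.SGDClassifier",
              "description": "new docstring text",
              "parameters_meta_info": {"alpha": {"data_type": "float"}}}
server_flow = {"name": "sklearn.SGDClassifier",
               "description": "old text",
               "parameters_meta_info": {}}

# Passes despite the differing documentation fields:
assert_flows_equal(local_flow, server_flow,
                   ignore_parameter_documentation=True)
```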

More edge cases:

  1. Please see the algorithm parameter, whose description lacks a str data type. If I include more characters in my regex search, it breaks. This is an edge case.

  2. Another one is the test case here, where the estimator used is another object with its own set of parameters, which do not appear in the docstring of the main sklearn object.

Shall the check/assignment be a little looser, so that if no parameter match is found we simply don't add a description for it? This works in the large majority of cases; the remaining edge cases need to be tackled, but specific manual fixes may break in the future, since we rely heavily on the sklearn docs.
I could explicitly call model.get_params() and add the missing parameters as keys with description and data_type set to None.
Kindly advise.
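The get_params() fallback proposed above could be sketched roughly like this, with a plain dict standing in for the result of model.get_params() and a hypothetical helper name:

```python
def complete_meta_info(model_params, parsed_docs):
    # Ensure every parameter the model exposes appears in the flow's
    # parameters_meta_info, falling back to None entries when the
    # docstring had no matching description (hypothetical sketch).
    return {
        name: parsed_docs.get(name, {"data_type": None, "description": None})
        for name in model_params
    }

parsed = {"alpha": {"data_type": "float", "description": "Regularization."}}
model_params = {"alpha": 0.0001, "estimator": None}  # as from get_params()
filled = complete_meta_info(model_params, parsed)
```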

@mfeurer
Collaborator

mfeurer commented Aug 9, 2019

Shall I edit this to exempt the check for description and parameters_meta_info?

Yes, please, but use a flag, as with the other exemptions.

Regarding edge case number 1, I'm not sure why adding more characters to the regex breaks this. However, you could simply split the string into the part before and after the first colon and use these?
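The split suggested above can be sketched in a few lines: everything before the first colon is the parameter name, everything after it the data type, and entries without a colon (like the algorithm edge case) simply get no type. The helper name is made up for illustration.

```python
def split_param_header(header):
    # Partition on the first colon only; sep is '' when no colon exists,
    # in which case we report no data type rather than failing.
    name, sep, dtype = header.partition(":")
    return name.strip(), (dtype.strip() if sep else None)

print(split_param_header("learning_rate : str, default='optimal'"))
print(split_param_header("algorithm"))
```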

Regarding edge case number 2, I think we should just add no description. And yes, the whole thing is under the assumption that 3rd-party packages use the same documentation as scikit-learn. That's a bit unfortunate. Could you please add a comment to the scikit-learn extension describing this?

@codecov-io

codecov-io commented Aug 25, 2019

Codecov Report

Merging #756 into develop will increase coverage by 0.2%.
The diff coverage is 90.19%.


```
@@            Coverage Diff             @@
##           develop     #756     +/-   ##
==========================================
+ Coverage    88.06%   88.26%   +0.2%     
==========================================
  Files           36       36             
  Lines         4114     4278    +164     
==========================================
+ Hits          3623     3776    +153     
- Misses         491      502     +11
```
Impacted Files Coverage Δ
openml/flows/functions.py 88.07% <88.88%> (+0.1%) ⬆️
openml/extensions/sklearn/extension.py 91.24% <90.47%> (-0.11%) ⬇️
openml/utils.py 91.13% <0%> (-0.64%) ⬇️
openml/runs/functions.py 82.07% <0%> (-0.29%) ⬇️
openml/evaluations/functions.py 92.23% <0%> (-0.08%) ⬇️
openml/testing.py 95.92% <0%> (+0.63%) ⬆️
openml/_api_calls.py 88.31% <0%> (+3.89%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@Neeratyoy Neeratyoy marked this pull request as ready for review August 26, 2019 18:45
openml/flows/functions.py (2 resolved review comments)
@Neeratyoy
Contributor Author

Neeratyoy commented Sep 2, 2019

@mfeurer
I am not able to reproduce the failing tests locally; the descriptions match consistently.
The docs for the older sklearn versions 0.19 and 0.20 differ slightly from the current stable release.
Could this be a version issue?

@mfeurer
Collaborator

mfeurer commented Sep 4, 2019

Could this be a version issue?

Most likely. You could do a version switch here and check the fixture for 0.21, and the output of the function (which you had before) for the others?
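The version switch suggested above could look roughly like this; the helper name and the comparison-target labels are made up, and the version string would come from sklearn.__version__ in the real tests.

```python
def doc_comparison_target(sklearn_version):
    # Compare against the hand-written fixture only on sklearn >= 0.21;
    # on older versions, fall back to checking the parser output
    # directly (hypothetical sketch of the test switch).
    major, minor = (int(part) for part in sklearn_version.split(".")[:2])
    return "fixture" if (major, minor) >= (0, 21) else "parser output"

print(doc_comparison_target("0.21.3"))
print(doc_comparison_target("0.18.2"))
```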

@Neeratyoy
Contributor Author

@mfeurer it appears that we are testing with a really old sklearn version, 0.18.2.
Why don't we test with the latest ones, since that is probably what users would be using?
