
Better testing of stubs #754

Closed · JukkaL opened this issue Dec 7, 2016 · 19 comments

Labels: project: policy (Organization of the typeshed project)
Comments

@JukkaL (Contributor) commented Dec 7, 2016

Reviewing stubs for correctness is hard, and the existing stubs have many bugs (which are gradually being fixed). Better tests for stubs could help with both problems. Here's a concrete idea for that (inspired by a proposal I remember seeing on Gitter); a rough sketch of what the checker could look like follows the list:

  • Write a tester that checks that the definitions in a stub conform to what happens at runtime, by importing the target module, parsing the stub, and running some checks.
    • We'd find out whether a stub defines a module-level thing that's not actually there at runtime.
    • We can verify that module attributes have the right types (at least for int, str, bool and other simple types).
    • Check that a class in a stub is a class at runtime (there probably are some exceptions which we can whitelist).
    • Check that a function in a stub is a callable at runtime.
    • Check for the existence of methods.
    • Check that static and class methods are declared as such (assuming that we can reliably introspect this).
    • Check that argument names of functions agree (when we can introspect them).
    • Check that arguments with a None default value have type Optional[x].
  • For 3rd party modules, pip install the package before running the tests. We can have a config file that specifies the versions of 3rd party modules to use to make the tests repeatable.
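To make the shape of this concrete, here is a minimal sketch of such a checker (not a proposal for the exact implementation). It only looks at top-level classes, functions, and annotated module attributes; the module name and stub path at the bottom are placeholders, and a real tool would add the whitelisting, argument-name checks, and Optional-default checks described above.

```python
# Rough sketch: parse a stub with ast, import the real module, and flag
# stub-level definitions that are missing or have the wrong kind at runtime.
import ast
import importlib
import inspect


def check_stub(module_name, stub_path):
    errors = []
    runtime = importlib.import_module(module_name)
    with open(stub_path) as f:
        stub = ast.parse(f.read())
    for node in stub.body:
        if isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            name = node.name
        elif isinstance(node, ast.AnnAssign) and isinstance(node.target, ast.Name):
            name = node.target.id
        else:
            continue  # imports, unannotated assignments, etc. are ignored here
        if not hasattr(runtime, name):
            errors.append("%s.%s is in the stub but missing at runtime" % (module_name, name))
            continue
        obj = getattr(runtime, name)
        if isinstance(node, ast.ClassDef) and not inspect.isclass(obj):
            errors.append("%s.%s is not a class at runtime" % (module_name, name))
        elif isinstance(node, ast.FunctionDef) and not callable(obj):
            errors.append("%s.%s is not callable at runtime" % (module_name, name))
    return errors


if __name__ == "__main__":
    # Module name and stub path are placeholders.
    for error in check_stub("textwrap", "stdlib/3/textwrap.pyi"):
        print(error)
```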

We could blacklist the modules that currently don't pass the test and gradually burn down the blacklist.

This wouldn't be perfect, and several things couldn't be checked automatically:

  • Instance attributes are hard to check for using introspection.
  • Things missing from stubs are hard to check for, since it's not clear how to know whether something is an internal thing or an imported thing that shouldn't be part of the public interface.
  • Types can't generally be tested for correctness, except for some instances of missing Optional[...] and maybe a few other things.
  • Internal things accidentally included in stubs also probably can't be automatically detected.

However, this could still be pretty valuable, and this shouldn't be too hard to implement. We could start with very rudimentary checks and gradually improve the testing tool as we encounter new bugs in the stubs that we could automatically guard against.

@gvanrossum (Member)

IIUC this is similar to comparing the stub to what stubgen (mypy's stub generator) can easily discover, right? So maybe that would be an implementation strategy?

In general I worry that this would still be incredibly imprecise (since stubgen doesn't discover types). I also worry that the amount of test code per module, just to specify exceptions and improvements, could easily be larger than the stub for the module itself, which would give it poor scaling behavior.

@JukkaL (Contributor, Author) commented Dec 7, 2016

We might be able to reuse parts of stubgen.

This would have some nice benefits over just using stubgen:

  • We could check whether stubs still make sense against a newer version of a module with an existing stub. Stubgen is only well suited for creating the initial set of stubs, not for stub evolution.
  • We can check whether changes to existing stubs make sense. We've had some large-scale updates to stubs which have lost some useful information in the process (such as None initializers), some of which would have been preventable if we had a tool like this.
  • Stubgen makes mistakes and the output needs manual tuning. We have no automated way of checking that the manual updates are correct (and sufficient).

This would certainly be imprecise, but I believe that it would still be useful, similar to how the existing typeshed tests are useful and prevent an interesting set of errors. We'd have to experiment to see whether the number of exceptions required would make this impractical.

@matthiaskramm (Contributor)

pytype has an option to test a .pyi file against the .py implementation, too:

pytype --check file.py:file.pyi

It won't do everything on your list, but it will find the following (a made-up illustration appears after the list):

  • Methods, classes or constants declared in the .pyi, but missing in the .py.
  • Argument types in the .pyi that cause type errors in the .py.
  • Methods returning values incompatible with what is declared in the .pyi.
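For illustration only (this is not pytype output, and the module is made up), here is the kind of mismatch such a check would report:

```python
# Hypothetical stub mod.pyi:
#     MAX_RETRIES: int
#     def parse(s: str) -> int: ...

# Hypothetical implementation mod.py:
def parse(s):
    # Returns a str, which conflicts with the `-> int` declared in the stub.
    return s.strip()

# MAX_RETRIES is declared in the .pyi but never defined in the .py, so it
# would be reported as missing from the implementation.
```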

@JukkaL (Contributor, Author) commented Dec 7, 2016

Cool! I wonder if we could use that in the CI scripts?

@JukkaL (Contributor, Author) commented Dec 7, 2016

Another idea would be to check for things present in only the Python 2 or the Python 3 stub, but not both, and give a warning if the same thing is present at runtime in both versions.
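A rough sketch of that check, assuming we compare the top-level names in the two stub variants and consult whichever interpreter the script runs under (it would need to be run under both Python 2 and Python 3 to cover both directions; the module name and paths are placeholders):

```python
# Rough sketch: warn about names that appear in only one of the py2/py3 stubs
# even though the running interpreter's module also has them.
import ast
import importlib


def top_level_names(stub_path):
    with open(stub_path) as f:
        tree = ast.parse(f.read())
    return {node.name for node in tree.body
            if isinstance(node, (ast.FunctionDef, ast.ClassDef))}


def check(module_name, py2_stub, py3_stub):
    only_in_one = top_level_names(py2_stub) ^ top_level_names(py3_stub)
    runtime = importlib.import_module(module_name)
    for name in sorted(only_in_one):
        if hasattr(runtime, name):
            print("%s.%s is in only one stub but exists at runtime"
                  % (module_name, name))


check("textwrap", "stdlib/2/textwrap.pyi", "stdlib/3/textwrap.pyi")
```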

@vlasovskikh (Member)

I'm working on using typeshed stubs in PyCharm and I must admit I have very little confidence in many stub files. I'm especially worried about their incompleteness since it will result in false warnings about missing methods.

As part of the idea of testing typeshed stubs better, I propose adding Python files that use the APIs from the stubs, alongside the stubs, in the test_data directory (see PR #862); a sketch of what such a file might look like follows below.

The DefinitelyTyped repo for TypeScript uses this approach for testing their stubs.
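As a hypothetical illustration of what such a file could look like (the module and expectations here are made up for the example, not taken from PR #862), ordinary annotated usage of a stdlib API is enough to catch regressions in a stub:

```python
# Hypothetical test_data-style file: plain, annotated usage of textwrap.
# A type checker resolving these calls against the stub should report no
# errors; a regression in the stub would turn this file "red".
import textwrap
from typing import List

wrapped: List[str] = textwrap.wrap("some long text", width=20)
filled: str = textwrap.fill("some long text", width=20)
shortened: str = textwrap.shorten("some long text", width=20, placeholder="...")
```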

@matthiaskramm (Contributor)

It seems odd to hand-craft a new .py file for every .pyi file, given that the .pyi already models an existing .py file.

@vlasovskikh (Member)

@matthiaskramm A .py file shows the usage of the API in terms known to the users of a library. It serves as a test that is red before the fix in the .pyi and becomes green afterwards. Just adding a new type hint doesn't mean that the false error seen by the user went away.

In addition, having common test data that requires an analyzer to actually resolve references to symbols and check types makes our type checkers more compatible with each other and more compliant with PEP 484.

@gvanrossum (Member) commented Jan 26, 2017 via email

@vlasovskikh (Member)

When I see a typeshed stub, how can I be sure that it's correct to any extent? And since a stub overrides the whole contents of a .py file, I'm worried about adding any untested typeshed stubs to PyCharm. Luckily, we have many internal PyCharm tests for at least some stdlib modules. So we'll add typeshed stubs to PyCharm gradually as our tests cover more of them over time.

If someone changes things in typeshed in an incompatible way, we at PyCharm will at least notice any regressions. It would be better to check not only for regressions, but for incompatibilities between type checkers as well. This is one of the main reasons I'm proposing to make static tests for stubs a part of typeshed.

Most fixes to the stubs come from real examples of false errors. It's not enough to just fix a stub and forget about it: we have to run a type checker manually to make sure the problem is fixed, and even then there may be incompatibilities between the results of one type checker and the others. Since we already have the code example that contains the problem, why don't we add it to automated tests so there will be no regressions in the future?

Static tests for type checkers could co-exist with checks by introspection. We don't have to pick just one option.

Meanwhile I'll be sending my PRs on top of the master branch without any tests.

@JukkaL (Contributor, Author) commented Jan 27, 2017

I think that hand-crafted .py test files would be valuable, but I don't think that we should expect them to be complete or available for every module. They wouldn't replace other, more automatic tests, but they could be a useful addition in some cases.

Here are some examples where I think that they would be useful:

  • Add a test that exercises only the core functionality of a module. This would act as a basic sanity check for any changes to the stub. Even if it wouldn't catch many possible errors, it would at least make it harder to break the most basic functionality of a stub.
  • Add a test that exercises particularly complex signatures. Some recent examples where this could have been helpful are dict.get (a recent change broke it on mypy) and requests.api (there was a bad signature). If somebody carefully crafts a tricky signature, it would be nice to be able to write a test that makes sure nobody accidentally breaks it (see the sketch after this list).
  • Tests are the only straightforward way I know of to check that types in a stub reflect runtime behavior. It should be possible to both type check the tests and run them using Python. Another plausible approach would be to somehow test types in a stub through running unit tests for the stubbed module, but that would be much harder to implement (e.g. automatically insert runtime type checks based on signatures in a stub). Sometimes it might be feasible to use (annotated) unit tests for a module as a test in typeshed.
  • As people make fixes to a stub, they could add tests for their changes, making the changes easier to review, especially if the test is runnable.
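For instance, a hand-written test for the dict.get signatures mentioned above might look like the following sketch (illustrative only; the exact expectations depend on the stub's overloads):

```python
# Sketch of a signature-focused test for dict.get. A type checker should
# accept every line below if the overloads in the stub are intact.
from typing import Dict, Optional, Union

d: Dict[str, int] = {"a": 1}

x: Optional[int] = d.get("a")             # no default: result may be None
y: int = d.get("a", 0)                    # default of the value type: never None
z: Union[int, str] = d.get("a", "nope")   # default of another type widens the result
```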

Mypy already has a small number of tests like this (https://github.com/python/mypy/blob/master/test-data/unit/pythoneval.test) and they've been occasionally useful.

@vlasovskikh (Member)

The idea of making contributions easier to review is a good point. I've already mentioned the other points above; I think they are all valid.

@JukkaL (Contributor, Author) commented Jan 27, 2017

Here are some more ideas for automated checks that might be pretty easy to implement, at least for an interesting subset of packages:

  • If a package has Sphinx/rst documentation, we could perhaps automatically check that all types, functions, and attributes mentioned in the documentation are included in the stubs for the package. This would probably have to be fuzzy, such as only requiring that an attribute mentioned in the documentation is defined somewhere within the stubs, to avoid a large number of false positives. It might still highlight at least some omissions.
  • If a package has unit tests, we could look at which names the unit tests import from the package and verify that all those names are defined in the stubs. One way to do this would be to run mypy or another tool against the unit tests and only look at import-related errors (see the sketch below).
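A rough sketch of the second idea, assuming we only want the set of names a test file imports from a given package (walking a test directory and comparing the names against the stubs is left out; the path at the bottom is a placeholder):

```python
# Rough sketch: collect the names a unit-test file imports from a package, so
# they can later be checked against what the stubs define.
import ast


def imported_names(test_file, package):
    names = set()
    with open(test_file) as f:
        tree = ast.parse(f.read())
    for node in ast.walk(tree):
        if isinstance(node, ast.ImportFrom):
            if node.module and node.module.split(".")[0] == package:
                names.update(alias.name for alias in node.names)
        elif isinstance(node, ast.Import):
            names.update(alias.name for alias in node.names
                         if alias.name.split(".")[0] == package)
    return names


# Hypothetical path: names the requests test suite pulls in.
print(sorted(imported_names("tests/test_requests.py", "requests")))
```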

@vlasovskikh (Member)

Sent a PR (#917) that suggests both static and run-time checking (via pytest).

@hauntsaninja (Collaborator) commented Jan 21, 2020

Hello everybody!

I've been working on mypy's stubtest.py with the goal of adding tests to typeshed. My WIP branch is visible here: https://github.com/hauntsaninja/mypy/tree/stubtest2/scripts
The idea here, as in the original post, is to compare stubs with what we find via runtime introspection.

This version of stubtest.py can do pretty much everything @JukkaL mentioned in his original message and a couple things more.

Here's a list of errors when running against latest master of typeshed: https://gist.github.com/hauntsaninja/1ecb078ccbd293112b67fb7236727f5d
Here's the same list, but with --concise for concise output:
https://gist.github.com/hauntsaninja/012ad27b253c042f1144bedbd5985554
I've added support for whitelisting; here's the result of running stubtest.py on typeshed with py38, using the whitelist generated for py37:
https://gist.github.com/hauntsaninja/c713dc5e3bf5b419489cca5fd472cd71
Or try it out for yourself and see colorised output! The gists contain the commands I ran, or python stubtest.py --help should get you started. It takes ~20s to check typeshed on my laptop.

In general, it seems to work well. The last 25 PRs visible at https://github.com/python/typeshed/pulls?utf8=✓&q=is%3Apr+author%3Ahauntsaninja were all identified by stubtest.py. (Note: some of these have not yet been merged, so they still appear in the error lists above)

One thing to call out is that I think this sort of testing is really necessary for dealing with differences between Python versions. I recently spent some time trying to complete py38 support in typeshed, and it took a lot of manual effort to trawl through the docs for each module. This would also help prevent regressions for older Python versions.

If this seems useful, I'm curious what suggestions people have for getting this to the point where we could use it in typeshed CI. The script supports outputting and using whitelists (it also notes any unused whitelist entries), which, as Jukka suggested, we could gradually burn down. Let me know if a complete list of its capabilities would be useful.

Please let me know what you think! I'd also really appreciate code review, particularly from folks familiar with mypy internals. It's pretty much a complete rewrite, so there's a lot of scope for mistakes.

In the short term, I'm going to be working on the following:

  • Better documenting the script.
  • Continuing to try and reduce false positives. For instance, many overloads tell white lies, so I can change things to be more permissive of them.
  • 25% of errors come from positional-only arguments not being marked as such, so maybe add a flag to ignore that.
  • Teaching it new tricks!
  • Removing all the f-strings, so we can run it with Python 3.5

I'm also curious if people have opinions about where this should live? I'm partial to the point of view discussed in #996 – that if it continued to live in mypy it could perhaps be hard to coordinate changes to unblock typeshed CI. I could also move it to its own repo?

@JelleZijlstra (Member)

@hauntsaninja thanks, this is great! It would be really helpful if we can run this script in typeshed CI in the future.

Putting it either in the mypy or the typeshed repo could lead to issues with coordinating changes, but because the script relies heavily on mypy internals, I think it makes more sense to put it in the mypy repo. That way, when somebody makes a change to mypy that breaks the guts of the script, they can just fix stubtest in the same commit. On the other hand, if a typeshed change exposes a false positive in stubtest, it should be possible to just update a whitelist. Perhaps it would also be useful to have stubtest support a magic comment like # stubtest: ignore (or even just # type: ignore[stubtest] using mypy's new error-tagging capabilities) to make it shut up about a line.

@srittau (Collaborator) commented Jan 23, 2020

This is indeed fantastic (as evidenced by your relentless onslaught of PRs). I think the long-term goal should be to do the tests here in typeshed, as this is where potentially problematic changes happen and fixes can be applied. But if stubtest is really tied to mypy internals at the moment, I agree that the mypy repo seems more appropriate for the time being.

That said, another long-term goal for both the typeshed and mypy projects should be to untangle the build processes. Ideally, changes to one repo should not cause failures in the other one. I think this is a precondition for moving the stubtest tests here.

@hauntsaninja (Collaborator) commented Jan 24, 2020

Okay, I've gotten some of my short-term work list done, so I've opened up a draft PR against mypy at python/mypy#8325. Assuming it finds a home there, I'd like to propose the following path to stubtest in typeshed CI and look at some scenarios.

We add a test to typeshed CI that uses a pinned version of stubtest to check the stdlib stubs for each Python version from py35 onwards, using a Python-version-specific whitelist that's checked into typeshed.

  • If something breaks stubtest or improvements are made to stubtest, typeshed won't be affected until we update the pinned version of stubtest (and also potentially the relevant whitelists). This is fine, since in the worst case we just maintain the status quo. (Separately, I'll need to find a good way of testing stubtest within the mypy repo that isn't typeshed-related, e.g., by writing tests :-) ).

  • If a new or updated stub-bit in typeshed needs to be whitelisted, it should be fairly easy. The whitelist format is obvious and human-readable; the change would just need to be duplicated in the whitelist for each relevant Python version.

  • If we fix some stub-bit, but don't update the whitelist, stubtest will complain about unused whitelist entries. This will add a little friction to fixing existing issues, but will help our whitelists burn down.

  • When a new python version is released, we can use --generate-whitelist to create a new whitelist.

Some other notes:

  • An idea I toyed with is having a single whitelist, or some sort of union/delta-style logic for whitelists across Python versions, to reduce redundancy. However, this would lose us the ability to detect unused whitelist entries and could cause false negatives.
  • Whitelists support comments. Not sure how useful that is, but maybe we'd want a policy of documenting the reason for each new addition (see the illustration after this list).
  • I'm happy to bunt on this for now while making improvements to typeshed until whitelist sizes become reasonable. And stubtest still has some false positives, e.g., there's an issue with classmethods that I need to deal with.
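For illustration, a per-version whitelist could look something like the snippet below. The file name, entries, and comment are made up; the real format is whatever --generate-whitelist emits (presumably one fully qualified name per line, given the comment support mentioned above).

```
# whitelist-py38.txt (hypothetical name and entries)
# reason for whitelisting goes here
somemodule.SomeClass.some_method
anothermodule.CONSTANT_MISSING_AT_RUNTIME
```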

Edit:
I've been keeping the gists with output up-to-date as we make improvements to both typeshed and stubtest. Since originally posting the gist, the number of errors/false positives it outputs has come down by 20%. The concise output is the easiest to scan: https://gist.github.com/hauntsaninja/012ad27b253c042f1144bedbd5985554
Edit 2: Now down by 35%
Edit 3: Now down by 45%, py38-only errors down by 70%

@hauntsaninja (Collaborator)

I feel stubtest does a good enough job here that we can close this issue.

stubtest provides decent overall coverage and makes stub evolution easy, as I can attest to from Python 3.8 and 3.9 support and several fixed regressions against older Pythons. It's not perfect (for instance, stubtest can't do much about return types), but it should address most of the discussion in this issue.

I think our approach to type checking code to test stubs should therefore be more along the lines of #1339. In particular, it'd be nice to see something like Rust's crater or Black's primer. Doing so could also help reduce our dependence on contributors with the ability to test large internal codebases.
