Typing Stubs and PEP 561 compatibility #28142

Closed
simonjayhawkins opened this issue Aug 25, 2019 · 63 comments
Labels
Needs Discussion Requires discussion from core team before further action Typing type annotations, mypy/pyright type checking

Comments

@simonjayhawkins
Member

xref #28135 (comment)

do we want to make pandas PEP 561 compatible?

https://mypy.readthedocs.io/en/latest/installed_packages.html#making-pep-561-compatible-packages

@simonjayhawkins simonjayhawkins added the Needs Discussion Requires discussion from core team before further action label Aug 25, 2019
@simonjayhawkins simonjayhawkins added this to the Contributions Welcome milestone Aug 25, 2019
@simonjayhawkins simonjayhawkins added the Typing type annotations, mypy/pyright type checking label Aug 25, 2019
@jbrockmendel
Member

If we go down that road, we could consider using stub files in cython to make that code valid python (which would allow us to lint)

@zero323

zero323 commented Jan 10, 2020

@WillAyd @jbrockmendel @simonjayhawkins How would you describe the current state of things? I've seen that many core components already have pretty good API coverage. Is it realistic to expect this to happen in the foreseeable future (let's say the next release or two)?

To explain the context of this question ‒ for the last three years I've been working on stub files for Apache Spark. Over this time the contact surface between Pandas and PySpark has grown significantly, mostly due to the introduction and active development of so-called Pandas udfs. Since Pandas doesn't advertise its annotations, this effectively creates a growing gap that is, in practice, not covered by type checkers.
Additionally, the latest upstream developments utilize type hints in Pandas-dependent components, leading to conflicts between static type checking and upstream runtime requirements.

Furthermore lack of actionable annotations leads to rather ugly escalations in case of polymorphic functions, which accept Pandas objects, as well as other types,

Now... For some time, to partially address the problem, I've been using Protocols and dummy compatibility imports. The idea is basically to:

This approach is not without its own problems, but does the trick. If Pandas is going to adopt PEP 561, these will become obsolete, but if such a move is not going to happen any time soon, I will consider formalizing this approach and agitating for the required adjustments in core PySpark. However, given the amount of red tape, I'd really like to avoid it :)
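As a rough illustration of the Protocol idea mentioned above (this is not zero323's actual code; `DataFrameLike`, `column_names`, and `FakeFrame` are hypothetical names invented for this sketch), a structural type can stand in for a pandas object so that annotations stay checkable even when pandas itself is seen as `Any`:

```python
from typing import Any, Dict, List, Protocol, runtime_checkable


@runtime_checkable
class DataFrameLike(Protocol):
    """Hypothetical structural stand-in for the few DataFrame methods we rely on."""

    def head(self, n: int = 5) -> "DataFrameLike": ...

    def to_dict(self) -> Dict[Any, Any]: ...


def column_names(frame: DataFrameLike) -> List[Any]:
    """Accept any object that structurally looks like a frame."""
    return list(frame.to_dict().keys())


class FakeFrame:
    """Tiny test double so the sketch runs without pandas installed."""

    def __init__(self, data: Dict[str, List[Any]]) -> None:
        self._data = dict(data)

    def head(self, n: int = 5) -> "FakeFrame":
        return FakeFrame({k: v[:n] for k, v in self._data.items()})

    def to_dict(self) -> Dict[Any, Any]:
        return dict(self._data)


print(column_names(FakeFrame({"a": [1, 2], "b": [3, 4]})))
```

The trade-off is that the Protocol has to be kept in sync with the real API by hand, which is part of why visible upstream annotations would make such workarounds unnecessary.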

@WillAyd
Member

WillAyd commented Jan 13, 2020

@zero323 are you asking if we plan on distributing stub files? If so I would say no for now; we have been annotating inline. Not fully complete but you'll see issues like #26766 used to track that progress

@zero323

zero323 commented Jan 16, 2020

Thank you for the answer @WillAyd.

I didn't mean stubs explicitly, but annotations that are visible to external checkers (which tend to assume no annotations, if no stubs packages or py.typed files are present).

For me it is basically a decision between waiting a bit longer (if these are expected to be ready soon; from that perspective even dynamic annotations are good enough) or adding elaborate workarounds.

@mfcabrera

mfcabrera commented Jan 17, 2020

As an end user I am also running into the case of annotating code with, say, pd.DataFrame and mypy interpreting it as Any.

I am a bit confused here. In order to distribute the annotations, isn't it enough to add a py.typed file and modify setup.py? That way mypy will be able to use the inline annotated code, right? Or is there something I am missing?

Update:
Sorry I did not see #28831

However, I am still a bit confused about what is needed to make this happen (if it is happening).

@zero323

zero323 commented Jan 17, 2020

I am a bit confused here. In order to distribute the annotations, isn't it enough to add a py.typed file and modify setup.py?

There is a bit more to that. In general it is good to have annotations that pass type checks with standardish settings, otherwise you're likely to break things downstream.
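For reference, the mechanical part of the question can be sketched as follows. This is a minimal sketch, assuming a setuptools-based build; `mypkg`, `add_py_typed`, and `setup_kwargs` are hypothetical names used only for illustration, and the dictionary merely shows the keyword arguments one might pass to `setuptools.setup()`:

```python
import tempfile
from pathlib import Path
from typing import Any, Dict


def add_py_typed(package_dir: Path) -> Path:
    """Create the empty PEP 561 marker file inside the package directory."""
    marker = package_dir / "py.typed"
    marker.touch()
    return marker


def setup_kwargs(package_name: str) -> Dict[str, Any]:
    """Keyword arguments for setuptools.setup() so the marker is shipped."""
    return {
        "packages": [package_name],
        "package_data": {package_name: ["py.typed"]},
    }


with tempfile.TemporaryDirectory() as tmp:
    pkg = Path(tmp) / "mypkg"  # hypothetical package name
    pkg.mkdir()
    (pkg / "__init__.py").write_text("def f(x: int) -> int:\n    return x\n")
    marker = add_py_typed(pkg)
    print(marker.name, setup_kwargs("mypkg")["package_data"])
```

As the reply above notes, shipping the marker is the easy part; the annotations it exposes also need to hold up under downstream users' checker settings.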

@OliverSieweke

Related issue: #12412

@Dr-Irv
Contributor

Dr-Irv commented Dec 9, 2020

Need to keep an eye on Microsoft's efforts here: https://github.com/microsoft/python-type-stubs/tree/main/pandas

@gramster

gramster commented Jan 5, 2021

Regarding our (Microsoft) stubs:

  • our own tools such as Visual Studio Code/pyright/Pylance work best in the presence of explicit types, as type inference in Python is very hard, so we would like as many packages as possible to have types on their public APIs (while recognizing that "end user code" in Python is generally still going to not use explicit typing much)
  • we strongly favor package authors using inlined typing, with stub files filling in (say for Cython); that said, we recognize that that may not always be practical. https://github.com/microsoft/python-type-stubs#our-use-of-type-stubs
  • we have done some work in that repo to author stub files, with the intent of upstreaming them to typeshed, and in fact most of the stubs we have written have been upstreamed
  • pandas is a critical package for us, which is why we spent quite some effort writing stubs for pandas (these were driven largely from the documentation; @Dr-Irv has been super helpful in finding errors)
  • we ideally would like to get out of the business of having to maintain these stubs ourselves (we're a small team), but they are in no way ready/cleaned up enough to be upstreamed to typeshed, which has a number of requirements for submitted stubs we don't and can't easily meet, and we don't have the resources for doing the work of trying to get them inlined into pandas
  • right now we unfortunately are in an all-or-nothing stage with our tooling - i.e. if stubs are present we use them, else we use the code; we can't combine partial stubs with inlined types, as useful as that would be for incrementally inlining types from stub files into packages*. This means that at least in the near-term, we will continue bundling these pandas stubs with our tools and fixing errors as they are reported. That said, we would be very happy for them to be used by the pandas core team if they are at all useful in moving forward with inline types.

* I believe that if the stub files exist in the package distribution alongside the Python files, then we can use them, and they will take priority over the .py files (@msfterictraut could confirm). So that would be one way we could handle an incremental move to inline types. But it would require package authors to buy into that approach.

@erictraut

@gramster, you're correct that stub files that are packaged alongside the Python source files will take precedence when used by a type checker like pyright or mypy. This is the behavior dictated by PEP 561.

Until recently, library authors had no way to verify the completeness of their type information within a "py.typed" package. We recently added a mode to pyright that analyzes a package and reports missing or incorrect type information. For more details, see this documentation.

@simonjayhawkins
Member Author

also need to keep an eye on cython/cython#3818

@jorisvandenbossche
Member

@gramster or @erictraut could you give some background on the "content" of the Microsoft stubs? (since we also already have type annotations in pandas as well)
Are they much more elaborate than the type annotations we have in pandas? Or did you copy the type annotations for functions in pandas that are already typed, and add stubs for things that were not yet typed in pandas?
(to have a bit of an idea of how divergent they are, or how easily they could be combined somehow)

@gramster

gramster commented Jan 13, 2021

We added a lot more annotations, and because these are stub files, included annotations for a lot of APIs implemented in Cython.

@TomAugspurger
Contributor

Thanks to all the folks who joined the dev call on Wednesday. Here's a recap of (my understanding of) the current status, and a tentative proposal for how to proceed. More notes at https://docs.google.com/document/d/1tGbTiYORHiSPgVMXawiweGJlBw5dOkVJLY-licoBmBU/edit?usp=sharing.

https://github.com/microsoft/python-type-stubs/tree/main/pandas contains type stub (.pyi) files for pandas. Microsoft would like to upstream them to pandas:

  • We (pandas) are better equipped to verify the correctness of the type stubs.
  • We're better equipped to maintain them as pandas evolves, signatures change.

There weren't any objections from the pandas side to taking on maintenance of the stubs. Most of the discussion was around logistics of actually upstreaming them.

pandas is already using inline types wherever possible. I think there's broad agreement that we want to continue that wherever possible (i.e. everywhere but extension modules). However, the mere presence of a .pyi file means that (at least by default) the inline types will be ignored by type checkers for that module. So we don't want to just dump the contents of that repo into the pandas package. That would essentially nullify our inline types which we've been building up over the past year or two.
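To make the shadowing concern concrete, here is a small sketch that lists the modules in a package whose inline annotations would be hidden by a sibling stub. The helper name `shadowed_modules` and the example file names are hypothetical; the rule it encodes is only the one stated above, that a `.pyi` next to a `.py` takes precedence for that module:

```python
import tempfile
from pathlib import Path
from typing import List


def shadowed_modules(package_dir: Path) -> List[str]:
    """Return relative paths of .py files that have a sibling .pyi stub."""
    shadowed = []
    for stub in package_dir.rglob("*.pyi"):
        source = stub.with_suffix(".py")
        if source.exists():
            shadowed.append(source.relative_to(package_dir).as_posix())
    return sorted(shadowed)


with tempfile.TemporaryDirectory() as tmp:
    pkg = Path(tmp) / "mypkg"  # hypothetical package layout
    (pkg / "core").mkdir(parents=True)
    (pkg / "core" / "frame.py").write_text("def f(x: int) -> int:\n    return x\n")
    (pkg / "core" / "frame.pyi").write_text("def f(x: int) -> int: ...\n")
    (pkg / "core" / "series.py").write_text("def g() -> None:\n    return None\n")
    print(shadowed_modules(pkg))
```

A check like this could be used during a file-by-file migration to confirm that a stub has been deleted once its contents were moved inline.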

My suggestion is to manually go through, file-by-file, and migrate the typestub files to inline types. This will require some effort by maintainers who are experienced with typing. Especially early on, there might be some helpers that need to be migrated first before later files can integrate them. My hope is that once we've ironed out the process, we can engage pandas' pool of volunteer contributors to do the long tail of files.

Finally, we'll need to coordinate with the microsoft/python-type-stubs team once we've upstreamed individual files. Perhaps a checklist on a GitHub issue, or a google spreadsheet. Once a module has been upstreamed to pandas, we'd like to avoid changing it (just) in microsoft/python-type-stubs.

One thing I'm not sure of: what should happen once pandas has completely upstreamed all the files? We'd like the inline types to take precedence in code-completion tools in editors, so they should probably be removed from microsoft/python-type-stubs. But the pyright maintainers mentioned some ancillary benefits to .pyi files: They can also be statically parsed for docstrings (which isn't possible for things like read_csv that are dynamically generated). And they're often faster to parse since they're just the types / docstrings, not the code. Perhaps we can regularly regenerate the stubs in microsoft/python-type-stubs from the types packaged with pandas?

@topper-123
Contributor

topper-123 commented Jan 15, 2021

One small point is that read_csv is no longer dynamically created. read_table is also a normal function.

I don't know if that changes the position of the pyright maintainers, or if this is a more general problem...

@gramster

gramster commented Jan 15, 2021 via email

@Dr-Irv
Contributor

Dr-Irv commented Jan 15, 2021

Let's make sure that we track the two use cases of typing:

  1. Typing for pandas developers - making sure that our typing of internals (and maybe externals) of pandas code works with mypy as is tested in the CI today
  2. Typing for pandas users - providing type hints for pandas users

For (2) we need to figure out a way to test that we have properly typed the public API. The work that @gramster has done provides the typing for much of the public API, and that's what we want to integrate in. I still don't know how we create a set of tests for CI that help us verify that the typing of the public API is correct.

One thing we could consider doing now is to add a py.typed file at the top of the pandas distribution, which would mean that type checkers like mypy would look inside the pandas code to check user code. (I verified this with a simple test)

@gramster if we were to do that, how would pyright handle type checking for users with the presence of the type hints in the pandas distribution (with the py.typed file present), plus the type hints that you are shipping with pyright? Would the latter just be ignored because of the presence of the py.typed file in the pandas distribution? Or if a hint wasn't found in the pandas distribution, then the stubs shipped with pyright would be checked?

@MarcoGorelli
Member

Reckon it's OK to add a py.typed file for 1.4, even though annotations aren't complete yet?

@jakebailey

If you add py.typed, that means that the pandas lib should be treated as type complete and the canonical source of info, and type checkers will prefer it over any other installed stubs (including stubs shipped by Pylance, various other stub projects); if the typing isn't complete yet, it may be a worse experience without much of a workaround (the preference is defined in PEP 561).

I know there are a lot of types for pandas still in Microsoft's python-type-stubs repo; I'm not sure where the effort to merge those in here for parity currently stands.

@zero323

zero323 commented Dec 7, 2021

@jakebailey

If you add py.typed, that means that the pandas lib should be treated as type complete and the canonical source of info, and type checkers will prefer it over any other installed stubs

This doesn't seem right. Since you mentioned PEP 561 it states that first in the resolution order are

  1. Stubs or Python source manually put in the beginning of the path. Type checkers SHOULD provide this to allow the user complete control of which stubs to use, and to patch broken stubs/inline types from packages. In mypy the $MYPYPATH environment variable can be used for this.

and inline hints have the second-lowest precedence (with only typeshed being lower):

  4. Inline packages - if there is nothing overriding the installed package, and it opts into type checking, inline types SHOULD be used.

So with inline packages you can still override any type hint, just cannot depend on the typeshed.

This can be easily tested (corresponding code can be found here)


As for this:

be treated as type complete

partial should apply to standard packages as well

Regular packages within namespace packages in stub-package distributions are considered complete unless a py.typed with partial\n is included.
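The resolution order discussed in this comment can be modelled in a few lines. This is only a toy sketch of my reading of the PEP 561 list (manual path entries, user code, stub packages, inline packages, typeshed); the labels and the `pick_source` helper are hypothetical and do not correspond to any type checker's internals:

```python
from typing import Dict, Optional, Set, Tuple

# Highest priority first, following the list in PEP 561.
RESOLUTION_ORDER: Tuple[str, ...] = (
    "manual",     # stubs or source put at the start of the path (e.g. $MYPYPATH)
    "user_code",  # the files the checker is being run on
    "stub_pkg",   # separately installed "<name>-stubs" distributions
    "inline",     # installed packages that opt in via py.typed
    "typeshed",   # stubs bundled with the type checker
)


def pick_source(module: str, available: Dict[str, Set[str]]) -> Optional[str]:
    """Return the first source in the resolution order that provides the module."""
    for source in RESOLUTION_ORDER:
        if module in available.get(source, set()):
            return source
    return None


available = {"inline": {"pandas"}, "typeshed": {"pandas"}}
print(pick_source("pandas", available))  # inline wins over typeshed-like stubs

available["manual"] = {"pandas"}
print(pick_source("pandas", available))  # a manual override still wins
```

Under this model, adding py.typed displaces bundled typeshed-style stubs, while user-supplied overrides and installed stub packages keep priority.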

@Dr-Irv
Contributor

Dr-Irv commented Dec 7, 2021

Reckon it's OK to add a py.typed file for 1.4, even though annotations aren't complete yet?

I'd like to discuss the idea I sent out via email about maintaining the stubs as a pandas-stubs project that can be maintained independently of whatever we put in the pandas project. We could start by pulling over the Microsoft stubs.

@MarcoGorelli
Member

Let's discuss tomorrow - there'd need to be a way of making sure it doesn't go out-of-date. Also, there's already https://github.com/VirtusLab/pandas-stubs

@jakebailey

I'm happy to be wrong on the resolution order; I just recall Pylance's set of bundled stubs behaving more like an extra typeshed such that we wouldn't have to worry about packages becoming py.typed, except that in this instance, our types are preferred. It wouldn't be too hard to test, just install pandas and touch py.typed in its dir (and also the same to use pyright's type completeness report to see where it stands).

IMO inlined types are the end-all-be-all, it's just that they have to be pretty solid to end up with a good UX; @Dr-Irv has kindly put a lot of effort into improving the stubs in the meantime over a pretty long timeframe.

@erictraut

Jake is correct about the resolution order in Pylance. As per PEP 561, users can explicitly configure a "stubPath" that overrides other type information, but the stubs that ship with Pylance are treated like typeshed stubs and have the lowest priority. So if pandas becomes "py.typed", it will override the stubs that Pylance includes. That's fine as long as the inlined type information shipped with pandas is relatively complete (preferably a superset of what is included in the Pylance stubs today).

@zero323

zero323 commented Dec 7, 2021

I hate to make a long comment thread even longer, but is there any reason (i.e. is the design so different) not to consider migrating the Microsoft stubs into pandas as-is and following up with a step-wise migration to inline hints? I am aware that @erictraut might disagree with me here (sorry), but that would be an immediate improvement for the majority of users (and dependent projects).

It also reduces the risk of divergence between independent stubs and the ongoing annotation effort within the project, which might be a lesser concern for end users, but is a pretty serious issue for any typed library that depends on pandas.

@erictraut

@zero323, I think that's a great idea. I don't disagree in the slightest. If there's anything we can do to facilitate this, please let us know.

@Dr-Irv
Contributor

Dr-Irv commented Dec 7, 2021

@erictraut @zero323 I've asked that this be put on the agenda for our monthly pandas development meeting tomorrow (December 8, 1PM Eastern time). All are welcome to join. Details can be found here: https://pandas.pydata.org/docs/development/meeting.html#calendar

@Dr-Irv
Contributor

Dr-Irv commented Dec 12, 2021

We have scheduled a meeting on January 7, 2022, at 5PM UTC time to discuss this topic in more depth. Information about that meeting can be found on the pandas development meeting calendar here: https://pandas.pydata.org/docs/development/meeting.html . All are welcome to attend.

Aside from ongoing efforts to add typing to the pandas source, there have been two efforts to provide type stubs for pandas:

  1. Microsoft has been maintaining stubs that are shipped with Visual Studio Code. See https://github.com/microsoft/python-type-stubs/tree/main/pandas . This allows the built-in type checker pyright to check pandas code.
  2. There is a pandas-stubs project (available on PyPI at https://pypi.org/project/pandas-stubs/ and on GitHub at https://github.com/VirtusLab/pandas-stubs)

One interesting aspect of the pandas-stubs project is that it can be used with other editors (pycharm, etc.) and mypy.

The key developers of both of those projects have been invited to participate in that meeting and have indicated that they will be able to attend.

We last discussed this topic in depth at our January, 2021 development meeting which led to a summary given above #28142 (comment)

In order to provide type stubs that support the pandas public API, I see a few options going forward:

  1. The pandas project and contributors continue with the current efforts to add typing to the pandas code, eventually leading to shipping pandas with a py.typed file indicating that the type stubs are complete.
  2. We integrate the efforts from the 2 aforementioned projects into the pandas project, and take over maintenance going forward, eventually shipping a py.typed file indicating that the type stubs are complete.
  3. The pandas team takes over maintenance of the pandas-stubs project, and we end up publishing/maintaining the pandas-stubs project, making sure that new releases of pandas and pandas-stubs are coordinated.
  4. We de-prioritize efforts to make the pandas public API fully typed, and let the ecosystem projects evolve to support typing with pandas.

If we choose any of the first 3 options, we should also consider how we want to test the provided stubs. In the pandas-stubs project, they have created a testing mechanism that makes sure that "normal" pandas usage passes mypy tests. The testing paradigm also handles multiple python versions. It's not clear to me if their testing mechanism also catches incorrect usage of the pandas library (i.e., if you pass incorrect parameters to a pandas method, mypy should identify it, and this should be tested).

I think there are pros and cons to each of the above options (or maybe there are other options as well), and they can be discussed here or at our meeting on January 7, 2022.

@max-sixty
Contributor

If we choose any of the first 3 options, we should also consider how we want to test the provided stubs.

In Xarray we are attempting to use our pytests as tests for our typing too (e.g. pydata/xarray#5690 (comment); and pydata/xarray#5694 is one attempt at generalizing, currently lingering). That ensures the types are sufficiently broad. Though it doesn't ensure the types are as narrow as possible (i.e. assigning everything as Any would still pass).

Notably, I only realized that tests require some annotation at the test level; e.g. def test_X() -> None in order to test the types.
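A small sketch of that last point, assuming mypy's default settings (where the bodies of functions without annotations are not checked unless `--check-untyped-defs` is passed). The `double` function and the test names are hypothetical:

```python
def double(x: int) -> int:
    """Stand-in for an annotated library function."""
    return x * 2


def test_double_unannotated():
    # No annotations on this test, so mypy skips its body by default and the
    # str argument below goes unreported (it still runs: "ab" * 2 == "abab").
    assert double("ab") == "abab"


def test_double_annotated() -> None:
    # The "-> None" return annotation is enough for mypy to check this body,
    # so a wrongly typed call here would be reported.
    assert double(2) == 4


test_double_unannotated()
test_double_annotated()
```

In other words, an existing test suite only doubles as a typing test for the tests that carry at least a return annotation.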


From an ecosystem POV, speaking as an xarray core-dev — we would benefit from upstream projects like pandas having a py.typed file — while the stubs that ship with pyright are useful when editing, and we can look into using pandas-stubs — having the types shipped with the code allow us to check our compatibility with an exact pandas version.

And notwithstanding the issues around replacing more complete alternative versions that pandas now has, typing doesn't need to be perfect in order to ship with a py.typed file — if some functions aren't typed, mypy handles that fine (please correct me if I'm at all wrong here). Xarray has done this for the past two years, and our typing is still improving!

@zero323

zero323 commented Dec 12, 2021

If we choose any of the first 3 options, we should also consider how we want to test the provided stubs.

That's quite a broad topic, but in general you can have multiple testing strategies. For "positive" cases (annotations are consistent and capture common usage examples) you can:

  • Test annotations for internal consistency (this can cover existing tests) ‒ that's what mypy checks on module or source root will do.
  • Test documentation and examples as long as they're self-contained ‒ effectively running mypy with your package on path, against example scripts or extracted code.

For negative cases (checkers should detect incorrect usage):

  • You can start with confirming that checkers detect incorrect usage patterns, if these are covered by tests. This is useful and virtually free, but depending on the specific approach, can quickly get out of date.
  • For more complex things you typically use targeted data tests. Mypy has some internal utilities that can be used for that (personally, I utilized these in pyspark-stubs). pytest-mypy-plugins uses the same approach, with much better UX IMHO.
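One lightweight way to express a negative case without extra tooling is sketched below. It is not the mypy data-test format or the pytest-mypy-plugins format mentioned above; it relies instead on mypy's `--warn-unused-ignores` flag, and `describe` is a hypothetical stand-in for an annotated API:

```python
def describe(count: int) -> str:
    """Stand-in for an annotated library function (hypothetical)."""
    return f"{count} rows"


def test_describe_accepts_int() -> None:
    # Positive case: correct usage must type-check and run.
    assert describe(3) == "3 rows"


def test_describe_rejects_str() -> None:
    # Negative case: this runs fine, but a checker should flag the argument.
    # If the annotation on `describe` were loosened to accept str, the ignore
    # below would become unused and `mypy --warn-unused-ignores` would report it.
    describe("three")  # type: ignore[arg-type]


test_describe_accepts_int()
test_describe_rejects_str()
```

The limitation is that this only works for incorrect calls that do not raise at runtime, or for files that are type-checked but never executed.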

@gertcuykens

gertcuykens commented Oct 4, 2022

pip3.10 install -U 'psycopg[c]'
pip3.10 install -U --pre SQLAlchemy
pip3.10 install -U pandas

    import pandas as pd
    from sqlalchemy import create_engine, select

    # `A` is the poster's SQLAlchemy mapped class; its definition is not shown.
    engine = create_engine(
        "postgresql+psycopg://gert:p@localhost/gert", echo=False, future=True
    )

    with engine.begin() as conn:
        stmt = select(A)
        df = pd.read_sql(stmt, conn)
        with pd.option_context('display.max_rows', None, 'display.max_columns', None):
            print(df)
        df.to_sql('test', conn, if_exists='replace', index=False)

Type of "to_sql" is partially unknown

pandas-dev/pandas-stubs#353

@Dr-Irv
Contributor

Dr-Irv commented Oct 4, 2022

@gertcuykens please try the pandas-stubs package as that is the supported way for public typing stubs.

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@simonjayhawkins
Member Author

> Thanks @Dr-Irv, but is there any reason why we should not use builtin type checking?

There are a few reasons:

  1. The types inside of pandas are not complete.
  2. The types inside of pandas are meant more for internal type checking of the code, and less for users of pandas.
  3. Many of the API functions are not typed in the pandas source.

That's why we created pandas-stubs. It's meant to support type checking of user code, and we can do a LOT more checking there than is possible within the pandas code itself. We also have a testing mechanism that tests the validity of those type stubs.

Install pandas-stubs and use it. You'll be glad you did!
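For anyone unsure whether the stubs are present in a given environment, a quick check can be sketched with the standard library. The helper name `stubs_installed` is hypothetical; `pandas-stubs` is the distribution name used on PyPI:

```python
from importlib import metadata


def stubs_installed(dist_name: str = "pandas-stubs") -> bool:
    """Return True if the named distribution is installed in this environment."""
    try:
        metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return False
    return True


if stubs_installed():
    print("pandas-stubs found; type checkers can pick it up")
else:
    print("pandas-stubs not found; try: pip install pandas-stubs")
```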

Originally posted by @Dr-Irv in #49865 (comment)

@davetapley
Contributor

@simonjayhawkins hi, just checking in to see if pandas-stubs is still recommended?

@Dr-Irv
Contributor

Dr-Irv commented Nov 10, 2023

@simonjayhawkins hi, just checking in to see if pandas-stubs is still recommended?

I'll answer for him, and the answer is yes.
