Skip to content

DataFrame with generic type #295

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
ramonhagenaars opened this issue Sep 12, 2022 · 10 comments
Open

DataFrame with generic type #295

ramonhagenaars opened this issue Sep 12, 2022 · 10 comments

Comments

@ramonhagenaars
Copy link

Hi there!

I'm the developer/maintainer of nptyping, a little library for type hinting around numpy (and pandas DataFrame soon).

What I'm working on, is to allow a type hint for pandas.DataFrame with a "structure" describing the contents:

from nptyping import DataFrame


df: DataFrame[_["x: Int, y: Int"]]

(ignore the underscore for now)

I'd like to make use of ("extend") the stubs of this repo. For that, I would need the DataFrame to take a generic type.

I've done this same thing with numpy's ndarray. This is how numpy hint their ndarray with generics:

class ndarray(_ArrayOrScalarCommon, Generic[_ShapeType, _DType_co]):
    ...

For DataFrame, I was hoping to find something similar, like:

class DataFrame(NDFrame, Generic[_Structure]):
    ...

This would allow combining these stubs with my type, while keeping the type checkers (MyPy, Pylance, ...) happy. I can imagine though that you may want to wait for Python3.11 with this and use variadic generics. Have you considered / would you consider adding generics to the pandas types?

TLDR

Are there any plans to add generics to the stub for DataFrame (and the other pandas types)?

@twoertwein
Copy link
Member

It is definitely worth thinking about which aspects of a DataFrame should be generic: each column could have its own type, the index could be generic, columns could be generic (they are not always strings!), but making everything generic will create quite some work :). I expect in most cases these arguments will be tricky to handle ("or even worse", require type-checker specific plugins). For example .drop might remove columns/rows - not sure how to handle that from a type checker perspective.

@Dr-Irv
Copy link
Collaborator

Dr-Irv commented Sep 12, 2022

Are there any plans to add generics to the stub for DataFrame (and the other pandas types)?

We have made Series generic in the stubs. It took quite a while to get that into a place that seems to work well.

For DataFrame, we don't have any plans at the moment. We've been focusing on covering as much of the pandas API as we can, making sure that the types are not too narrow or too wide.

Is there any reason that you couldn't just create your own type GenericDataFrame that extends the pandas DataFrame with a Generic that would do what you want?

@ramonhagenaars
Copy link
Author

Thank you for responding!

@twoertwin

It is definitely worth thinking about which aspects of a DataFrame should be generic... but making everything generic will create quite some work :)

True that. Also, users may want to cherry pick what to type hint (e.g. the columns, but not the index). Some users may choose to not use the generics at all. For that reason, I think the upcoming variadic generics are interesting.

I expect in most cases these arguments will be tricky to handle

Agreed. In fact, I think it's impossible for static type checkers (MyPy, Pylance) to fully cover all aspects due to the dynamic nature of pandas types (with methods such as drop indeed). The focus of my library is dynamic checks, while staying MyPy/Pylance compliant.

@Dr-Irv

For DataFrame, we don't have any plans at the moment.

I can imagine that you are quite occupied already. Do you think it will make it to the backlog eventually? And would you welcome any suggestion from my part (in the form of a pull request or otherwise)?

Is there any reason that you couldn't just create your own type GenericDataFrame that extends the pandas DataFrame with a Generic that would do what you want?

Unfortunately, there is. My goal is to remain static type checker compliant and by overriding the pandas.DataFrame with my own (let's say GenericDataFrame), my clients will get type checker complaints that they are providing pandas DataFrame instances to e.g. functions that are expecting GenericDataFrame. Also, there is currently no way to "extend" - change parts of - stub files.

@Dr-Irv
Copy link
Collaborator

Dr-Irv commented Sep 13, 2022

I can imagine that you are quite occupied already.

Yes, aside from a few people that have done one or two PRs, there are 3 of us who have been doing most of the work.

Do you think it will make it to the backlog eventually?

We can keep this issue open as something to consider in the future.

And would you welcome any suggestion from my part (in the form of a pull request or otherwise)?

We are absolutely open to PRs !!

@twoertwein
Copy link
Member

type checker complaints that they are providing pandas DataFrame instances to e.g. functions that are expecting GenericDataFrame

One probably ugly workaround could be to create a script on your side to convert the DataFrame stubs into a Protocol. You could then make this Protocol generic.

@twoertwein
Copy link
Member

One reason for not adding a mostly unused Generic parameter in the stubs is that pyright's reportUnknownParameterType will probably complain about it when the parameter is not explicitly specified.

@ramonhagenaars
Copy link
Author

@Dr-Irv

We can keep this issue open as something to consider in the future.

Makes sense.

We are absolutely open to PRs !!

Great! I might dive into this when I can find the time for it. As I said, I think the approach should be towards variadic generics, but it will require some investigation on my part.

@twoertwein

One probably ugly workaround could be to create a script on your side to convert the DataFrame stubs into a Protocol. You could then make this Protocol generic.

That's a clever workaround (with an acceptable smell to it indeed 😃 ). One problem with Protocols that I already found, is that at least MyPy will take forever to type check once the Protocol becomes larger. It makes sense: it needs to check every method and attribute separately every time it encounters the Protocol. I tried basically what you proposed for numpy.ndarray which, like pandas.DataFrame has an extensive API. I think I will encounter this same issue with DataFrame. Making this Protocol a small subset of your stub would work though.

Another workaround I'm now thinking about is to fork this repo and add the generics. I will need to figure out some synchronization process to keep up to date with the work on this repo.

One reason for not adding a mostly unused Generic parameter in the stubs is that pyright's reportUnknownParameterType will probably complain about it when the parameter is not explicitly specified.

Hmm I see. Maybe that will be resolved when using variadic generics, but I'm not sure.

@twoertwein
Copy link
Member

One problem with Protocols that I already found, is that at least MyPy will take forever to type check once the Protocol becomes larger.

The following might be faster but works only for mypy:

from typing import (
    Protocol,
    TypeVar,
)

import pandas as pd

GenericType = TypeVar("GenericType", covariant=True)

class DataFrame(Protocol[GenericType], pd.DataFrame):
    ...

def foo(x: DataFrame) -> None:
    ...

foo(pd.DataFrame())

Pyright error's with: error: Protocol class "Type[DataFrame[GenericType@DataFrame]]" cannot derive from non-protocol class "Type[DataFrame]"

@Dr-Irv
Copy link
Collaborator

Dr-Irv commented Sep 13, 2022

Pyright error's with: error: Protocol class "Type[DataFrame[GenericType@DataFrame]]" cannot derive from non-protocol class "Type[DataFrame]"

I would venture a guess here that pyright is correct, and mypy is incorrect.

@ramonhagenaars
Copy link
Author

The following might be faster but works only for mypy

Smart trick. However they seem to have fixed this with MyPy==0.971 (latest). It reports similarly to pyright with:

error: All bases of a protocol must be protocols  [misc]

I agree that this indeed is expected behavior.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants