-
Notifications
You must be signed in to change notification settings - Fork 0
Ensure pyarrow/interchange passes type checks by mypy, pyright and ty #2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
| # Package is not installed, parse git tag at runtime | ||
| try: | ||
| import setuptools_scm | ||
| import setuptools_scm # type: ignore[import-untyped] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Example of external dependencies not providing stubs or being typed.
| __version__ = "" | ||
|
|
||
| import pyarrow.lib as _lib | ||
| import pyarrow.lib as _lib # type: ignore[import-not-found] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Example of internal stubs missing.
| return func.call(args, None, memory_pool) | ||
| else: | ||
| def wrapper(*args, memory_pool=None, options=None, **kwargs): | ||
| def wrapper(*args, memory_pool=None, options=None, **kwargs): # type: ignore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a nasty one, as the wrapper signature is inconsistent within two logical branches. I couldn't find a quick fix.
|
|
||
| @property | ||
| def null_count(self) -> int: | ||
| def null_count(self) -> int | None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using | operator here that is only available from python 3.10 - do we rather want to stick to Optional[X], which is supported earlier already? Or maybe rather add from __future__ import annotations? In any case, we should chose one and be consistent within the project.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spec uses Optional and we're currently supporting from 3.9 on and pyarrow-stubs uses |. I'd go with |.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. Given we're supporting 3.9 and 3.9 did not yet have | I suppose we add from __future__ import annotations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Switched all the Optional to | within interchange folder. annotations import was arleady there.
| The metadata for the column. See `DataFrame.metadata` for more details. | ||
| """ | ||
| pass | ||
| return {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This didn't break any tests and seemed "the right thing to do", but it would be great if somebody with more insight into the project looks at this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Empty dict seems ok as per spec this is implementing. Actual metadata would be even better, but that's out of scope :).
| ) | ||
| offset_buff, offset_dtype = ( | ||
| buffers["offsets"] if buffers["offsets"] else (None, None) | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm rather sure this fixes the actual implementation, as I am not really sure TypedDict actually throws any TypeErrors as what the original logic was counting on. I might be wrong, and there was some other reason TypeErrors were expected. Either solution passes the unit tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would this approach work?
validity_buff, validity_dtype = buffers.get("validity", (None, None))There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sadly no, as per the signature of ColumnBuffers validity field is optional and as such the getter actually receives a value (None) which can not be mapped to (None, None). See:
arrow/python/pyarrow/interchange/column.py
Line 125 in a606011
| validity: Optional[Tuple[_PyArrowBuffer, Dtype]] |
Same for offsets.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could rework the ColumnBuffer to make this nicer (e.g. Tuple[_PyArrowBuffer, Dtype] is a good candidate for a dedicated type given that it's used in several places), but I'm trying to keep the changes at the minimum here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the explanation! Agreed, let's keep changes to a minimum.
rok
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for doing this @mpelko!
Some comments, but nothing major.
This will be useful to discuss the inline vs stubs annotation for .py files.
| ) | ||
| offset_buff, offset_dtype = ( | ||
| buffers["offsets"] if buffers["offsets"] else (None, None) | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would this approach work?
validity_buff, validity_dtype = buffers.get("validity", (None, None))| The metadata for the column. See `DataFrame.metadata` for more details. | ||
| """ | ||
| pass | ||
| return {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Empty dict seems ok as per spec this is implementing. Actual metadata would be even better, but that's out of scope :).
|
|
||
| @property | ||
| def null_count(self) -> int: | ||
| def null_count(self) -> int | None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spec uses Optional and we're currently supporting from 3.9 on and pyarrow-stubs uses |. I'd go with |.
python/pyarrow/interchange/column.py
Outdated
| raise ValueError( | ||
| "Column offsets buffer must have 2 or 3 buffers, " | ||
| f"but has {n} buffers: {array.buffers()}" | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we sure this won't break anything? Same above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're sure it didn't break the tests. :-) I lack the insight in the project to know if this would break other things. Are there cases where the pa.Array can have more than 3 buffers? If so, the existing logic (on main) would return None for _get_data_buffer or for _get_offsets_buffer and the new logic would raise a ValueError.
Alternatively we could simply change the signatures of the corresponding functions to allow for returning optional values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For offsets I now allow for None as a return of this function, as this is actually ok in the usage. As per above, the _get_data really should return something as it's packed into something that expects some data so I am keeping an explicit ValueError in case there is no data in pa.Array.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we never have more then three see (I think variadic buffers are handled differently).
Yeah, I was thinking maybe someone is depending on this returning None.
Let's leave this as is.
|
This looks good now. I think we need to wait for the discussion to catch up now :) |
Rationale for this change
Example of fixes required to pass a type check for inline typed python files in pyarrow/interchange.
Allows for passing:
Results of running the above check
BEFORE (main):
AFTER (this branch):
What changes are included in this PR?
Are these changes tested?
All tests still pass locally. Additionally, type checks using
mypy,pyrightandtyon pyarrow/interchange/dataframe.py are now passing.Are there any user-facing changes?
No.