Skip to content

Conversation

@mpelko
Copy link
Owner

@mpelko mpelko commented Aug 20, 2025

Rationale for this change

Example of fixes required to pass a type check for inline typed python files in pyarrow/interchange.

Allows for passing:

pushd arrow/python
mypy pyarrow/interchange
pyright pyarrow/interchange
ty check pyarrow/interchange
popd

Results of running the above check

BEFORE (main):

...
Found 44 errors in 8 files (checked 5 source files)  # by mypy
...
17 errors, 0 warnings, 0 informations  # by pyright
...
Found 14 diagnostics  # by ty

AFTER (this branch):

Success: no issues found in 5 source files  # mypy
0 errors, 0 warnings, 0 informations  # pyright
WARN ty is pre-release software and not ready for production use. Expect to encounter bugs, missing features, and fatal errors.
Checking ------------------------------------------------------------ 5/5 files                                                                                   All checks passed!  # ty

What changes are included in this PR?

  • Combination of required type ignores, mainly for missing pyarrow type stubs from coming from cython files.
  • Some type ignores for untyped external dependencies
  • Minor changes, either fixing the uncovered issues or adding more precision to type handling.

Are these changes tested?

All tests still pass locally. Additionally, type checks using mypy, pyright and ty on pyarrow/interchange/dataframe.py are now passing.

Are there any user-facing changes?

No.

@mpelko mpelko marked this pull request as ready for review August 20, 2025 08:38
# Package is not installed, parse git tag at runtime
try:
import setuptools_scm
import setuptools_scm # type: ignore[import-untyped]
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Example of external dependencies not providing stubs or being typed.

__version__ = ""

import pyarrow.lib as _lib
import pyarrow.lib as _lib # type: ignore[import-not-found]
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Example of internal stubs missing.

return func.call(args, None, memory_pool)
else:
def wrapper(*args, memory_pool=None, options=None, **kwargs):
def wrapper(*args, memory_pool=None, options=None, **kwargs): # type: ignore
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a nasty one, as the wrapper signature is inconsistent within two logical branches. I couldn't find a quick fix.


@property
def null_count(self) -> int:
def null_count(self) -> int | None:
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using | operator here that is only available from python 3.10 - do we rather want to stick to Optional[X], which is supported earlier already? Or maybe rather add from __future__ import annotations? In any case, we should chose one and be consistent within the project.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spec uses Optional and we're currently supporting from 3.9 on and pyarrow-stubs uses |. I'd go with |.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. Given we're supporting 3.9 and 3.9 did not yet have | I suppose we add from __future__ import annotations?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switched all the Optional to | within interchange folder. annotations import was arleady there.

The metadata for the column. See `DataFrame.metadata` for more details.
"""
pass
return {}
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This didn't break any tests and seemed "the right thing to do", but it would be great if somebody with more insight into the project looks at this.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Empty dict seems ok as per spec this is implementing. Actual metadata would be even better, but that's out of scope :).

)
offset_buff, offset_dtype = (
buffers["offsets"] if buffers["offsets"] else (None, None)
)
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm rather sure this fixes the actual implementation, as I am not really sure TypedDict actually throws any TypeErrors as what the original logic was counting on. I might be wrong, and there was some other reason TypeErrors were expected. Either solution passes the unit tests.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this approach work?

validity_buff, validity_dtype = buffers.get("validity", (None, None))

Copy link
Owner Author

@mpelko mpelko Aug 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sadly no, as per the signature of ColumnBuffers validity field is optional and as such the getter actually receives a value (None) which can not be mapped to (None, None). See:

validity: Optional[Tuple[_PyArrowBuffer, Dtype]]

Same for offsets.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could rework the ColumnBuffer to make this nicer (e.g. Tuple[_PyArrowBuffer, Dtype] is a good candidate for a dedicated type given that it's used in several places), but I'm trying to keep the changes at the minimum here.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation! Agreed, let's keep changes to a minimum.

Copy link

@rok rok left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this @mpelko!
Some comments, but nothing major.
This will be useful to discuss the inline vs stubs annotation for .py files.

)
offset_buff, offset_dtype = (
buffers["offsets"] if buffers["offsets"] else (None, None)
)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this approach work?

validity_buff, validity_dtype = buffers.get("validity", (None, None))

The metadata for the column. See `DataFrame.metadata` for more details.
"""
pass
return {}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Empty dict seems ok as per spec this is implementing. Actual metadata would be even better, but that's out of scope :).


@property
def null_count(self) -> int:
def null_count(self) -> int | None:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spec uses Optional and we're currently supporting from 3.9 on and pyarrow-stubs uses |. I'd go with |.

Comment on lines 543 to 546
raise ValueError(
"Column offsets buffer must have 2 or 3 buffers, "
f"but has {n} buffers: {array.buffers()}"
)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we sure this won't break anything? Same above.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're sure it didn't break the tests. :-) I lack the insight in the project to know if this would break other things. Are there cases where the pa.Array can have more than 3 buffers? If so, the existing logic (on main) would return None for _get_data_buffer or for _get_offsets_buffer and the new logic would raise a ValueError.

Alternatively we could simply change the signatures of the corresponding functions to allow for returning optional values?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For offsets I now allow for None as a return of this function, as this is actually ok in the usage. As per above, the _get_data really should return something as it's packed into something that expects some data so I am keeping an explicit ValueError in case there is no data in pa.Array.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we never have more then three see (I think variadic buffers are handled differently).

Yeah, I was thinking maybe someone is depending on this returning None.

Let's leave this as is.

@rok
Copy link

rok commented Aug 22, 2025

This looks good now. I think we need to wait for the discussion to catch up now :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants