Ensure pyarrow/interchange passes type checks by mypy, pyright and ty #2
|
|
@@ -41,24 +41,24 @@ | |
| except ImportError: | ||
| # Package is not installed, parse git tag at runtime | ||
| try: | ||
| import setuptools_scm | ||
| import setuptools_scm # type: ignore[import-untyped] | ||
| # Code duplicated from setup.py to avoid a dependency on each other | ||
|
|
||
| def parse_git(root, **kwargs): | ||
| """ | ||
| Parse function for setuptools_scm that ignores tags for non-C++ | ||
| subprojects, e.g. apache-arrow-js-XXX tags. | ||
| """ | ||
| from setuptools_scm.git import parse | ||
| from setuptools_scm.git import parse # type: ignore[import-untyped] | ||
| kwargs['describe_command'] = \ | ||
| "git describe --dirty --tags --long --match 'apache-arrow-[0-9]*.*'" | ||
| return parse(root, **kwargs) | ||
| __version__ = setuptools_scm.get_version('../', | ||
| parse=parse_git) | ||
| except ImportError: | ||
| __version__ = None | ||
| __version__ = "" | ||
|
|
||
| import pyarrow.lib as _lib | ||
| import pyarrow.lib as _lib # type: ignore[import-not-found] | ||
|
Owner (Author): Example of internal stubs missing.
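One way the internal ignores could eventually go away is a small stub file shipped next to the extension module. The sketch below is an assumption for illustration only (pyarrow does not ship this file, and the signatures are guesses), declaring just enough names for mypy to resolve the import without the inline ignore:

```python
# Hypothetical pyarrow/lib.pyi -- a minimal, partial stub sketch, not real pyarrow code.
from typing import Any

class BuildInfo: ...
class RuntimeInfo: ...
class MonthDayNano: ...

cpp_version: str

def set_timezone_db_path(path: str) -> None: ...
def __getattr__(name: str) -> Any: ...  # names not listed above fall back to Any
```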
||
| from pyarrow.lib import (BuildInfo, CppBuildInfo, RuntimeInfo, set_timezone_db_path, | ||
| MonthDayNano, VersionInfo, build_info, cpp_build_info, | ||
| cpp_version, cpp_version_info, runtime_info, | ||
|
|
||
|
|
@@ -15,7 +15,7 @@ | |
| # specific language governing permissions and limitations | ||
| # under the License. | ||
|
|
||
| from pyarrow._compute import ( # noqa | ||
| from pyarrow._compute import ( # type: ignore[import-not-found] # noqa | ||
| Function, | ||
| FunctionOptions, | ||
| FunctionRegistry, | ||
|
|
@@ -251,7 +251,7 @@ def wrapper(*args, memory_pool=None): | |
| return Expression._call(func_name, list(args)) | ||
| return func.call(args, None, memory_pool) | ||
| else: | ||
| def wrapper(*args, memory_pool=None, options=None, **kwargs): | ||
| def wrapper(*args, memory_pool=None, options=None, **kwargs): # type: ignore | ||
|
Owner (Author): This is a nasty one, as the wrapper signature is inconsistent between the two logical branches. I couldn't find a quick fix.
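For reference, a minimal reproduction of the issue with made-up names (this is not the pyarrow source): the same name is bound to functions with different signatures in two branches, which type checkers reject as an inconsistent conditional definition.

```python
# Sketch of the pattern: `wrapper` is defined twice with different signatures,
# so a checker cannot give the name a single consistent type.
def make_wrapper(takes_options: bool):
    if not takes_options:
        def wrapper(*args, memory_pool=None):
            return ("plain", args, memory_pool)
    else:
        def wrapper(*args, memory_pool=None, options=None, **kwargs):  # flagged
            return ("with-options", args, memory_pool, options, kwargs)
    return wrapper
```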
||
| if arity is not Ellipsis: | ||
| if len(args) < arity: | ||
| raise TypeError( | ||
|
|
||
|
|
@@ -22,7 +22,6 @@ | |
| Any, | ||
| Dict, | ||
| Iterable, | ||
| Optional, | ||
| Tuple, | ||
| ) | ||
|
|
||
|
|
@@ -122,13 +121,13 @@ class ColumnBuffers(TypedDict): | |
| # first element is a buffer containing mask values indicating missing data; | ||
| # second element is the mask value buffer's associated dtype. | ||
| # None if the null representation is not a bit or byte mask | ||
| validity: Optional[Tuple[_PyArrowBuffer, Dtype]] | ||
| validity: Tuple[_PyArrowBuffer, Dtype] | None | ||
|
|
||
| # first element is a buffer containing the offset values for | ||
| # variable-size binary data (e.g., variable-length strings); | ||
| # second element is the offsets buffer's associated dtype. | ||
| # None if the data buffer does not have an associated offsets buffer | ||
| offsets: Optional[Tuple[_PyArrowBuffer, Dtype]] | ||
| offsets: Tuple[_PyArrowBuffer, Dtype] | None | ||
|
|
||
|
|
||
| class CategoricalDescription(TypedDict): | ||
|
|
@@ -139,7 +138,7 @@ class CategoricalDescription(TypedDict): | |
| is_dictionary: bool | ||
| # Python-level only (e.g. ``{int: str}``). | ||
| # None if not a dictionary-style categorical. | ||
| categories: Optional[_PyArrowColumn] | ||
| categories: _PyArrowColumn | None | ||
|
|
||
|
|
||
| class Endianness: | ||
|
|
@@ -314,13 +313,20 @@ def _dtype_from_arrowdtype( | |
| kind = DtypeKind.CATEGORICAL | ||
| arr = self._col | ||
| indices_dtype = arr.indices.type | ||
| _, f_string = _PYARROW_KINDS.get(indices_dtype) | ||
| indices_dtype_tuple = _PYARROW_KINDS.get(indices_dtype) | ||
| if indices_dtype_tuple is None: | ||
| raise ValueError( | ||
| f"Data type {indices_dtype} not supported by interchange protocol" | ||
| ) | ||
| _, f_string = indices_dtype_tuple | ||
| return kind, bit_width, f_string, Endianness.NATIVE | ||
| else: | ||
| kind, f_string = _PYARROW_KINDS.get(dtype, (None, None)) | ||
| if kind is None: | ||
| optional_kind, f_string = _PYARROW_KINDS.get(dtype, (None, "")) | ||
| if optional_kind is None: | ||
| raise ValueError( | ||
| f"Data type {dtype} not supported by interchange protocol") | ||
| f"Data type {dtype} not supported by interchange protocol" | ||
| ) | ||
| kind = optional_kind | ||
|
|
||
| return kind, bit_width, f_string, Endianness.NATIVE | ||
|
|
||
|
|
@@ -379,7 +385,7 @@ def describe_null(self) -> Tuple[ColumnNullType, Any]: | |
| return ColumnNullType.USE_BITMASK, 0 | ||
|
|
||
| @property | ||
| def null_count(self) -> int: | ||
| def null_count(self) -> int | None: | ||
|
Owner (Author): Using …

Reviewer: The spec uses Optional, we're currently supporting Python 3.9 and up, and pyarrow-stubs uses |. I'd go with |.

Owner (Author): Sounds good. Given we're supporting 3.9, and 3.9 did not yet have the native X | Y union syntax, …

Owner (Author): Switched all the annotations to the | spelling.
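For context, a small sketch of why the | spelling is workable on Python 3.9, assuming the modules use (or gain) postponed annotation evaluation; the function below is illustrative, not pyarrow code:

```python
# PEP 604 unions such as `int | None` only gained runtime support in Python 3.10,
# but with postponed evaluation the annotation is stored as a string and never
# evaluated, so this definition also imports cleanly on 3.9.
from __future__ import annotations


def null_count(value: int | None = None) -> int | None:
    # Without the __future__ import, `int | None` would be evaluated at
    # definition time and raise TypeError on Python 3.9.
    return value
```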
||
| """ | ||
| Number of null elements, if known. | ||
|
|
||
|
|
@@ -394,7 +400,7 @@ def metadata(self) -> Dict[str, Any]: | |
| """ | ||
| The metadata for the column. See `DataFrame.metadata` for more details. | ||
| """ | ||
| pass | ||
| return {} | ||
|
Owner (Author): This didn't break any tests and seemed "the right thing to do", but it would be great if somebody with more insight into the project looked at this.

Reviewer: An empty dict seems OK as per the spec this is implementing. Actual metadata would be even better, but that's out of scope :).
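A short illustration of why the explicit return matters (the class here is a stand-in, not the actual _PyArrowColumn): a bare pass leaves the property implicitly returning None at runtime, which does not match the declared Dict[str, Any] return type.

```python
from typing import Any, Dict


class Column:  # illustrative stand-in
    @property
    def metadata(self) -> Dict[str, Any]:
        """The metadata for the column."""
        # Returning an explicit empty dict satisfies the declared return type;
        # `pass` would make the property return None instead.
        return {}
```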
||
|
|
||
| def num_chunks(self) -> int: | ||
| """ | ||
|
|
@@ -403,7 +409,7 @@ def num_chunks(self) -> int: | |
| return 1 | ||
|
|
||
| def get_chunks( | ||
| self, n_chunks: Optional[int] = None | ||
| self, n_chunks: int | None = None | ||
| ) -> Iterable[_PyArrowColumn]: | ||
| """ | ||
| Return an iterator yielding the chunks. | ||
|
|
@@ -486,6 +492,11 @@ def _get_data_buffer( | |
| return _PyArrowBuffer(array.buffers()[1]), dtype | ||
| elif n == 3: | ||
| return _PyArrowBuffer(array.buffers()[2]), dtype | ||
| else: | ||
| raise ValueError( | ||
| "Column data buffer must have 2 or 3 buffers, " | ||
| f"but has {n} buffers: {array.buffers()}" | ||
| ) | ||
|
|
||
| def _get_validity_buffer(self) -> Tuple[_PyArrowBuffer, Any]: | ||
| """ | ||
|
|
@@ -505,7 +516,7 @@ def _get_validity_buffer(self) -> Tuple[_PyArrowBuffer, Any]: | |
| "There are no missing values so " | ||
| "does not have a separate mask") | ||
|
|
||
| def _get_offsets_buffer(self) -> Tuple[_PyArrowBuffer, Any]: | ||
| def _get_offsets_buffer(self) -> Tuple[_PyArrowBuffer, Any] | None: | ||
| """ | ||
| Return the buffer containing the offset values for variable-size binary | ||
| data (e.g., variable-length strings) and the buffer's associated dtype. | ||
|
|
@@ -527,3 +538,4 @@ def _get_offsets_buffer(self) -> Tuple[_PyArrowBuffer, Any]: | |
| else: | ||
| dtype = (DtypeKind.INT, 32, "i", Endianness.NATIVE) | ||
| return _PyArrowBuffer(array.buffers()[1]), dtype | ||
| return None | ||
|
|
@@ -346,7 +346,7 @@ def buffers_to_array( | |||
| buffers: ColumnBuffers, | ||||
| data_type: Tuple[DtypeKind, int, str, str], | ||||
| length: int, | ||||
| describe_null: ColumnNullType, | ||||
| describe_null: tuple[ColumnNullType, int], | ||||
| offset: int = 0, | ||||
| allow_copy: bool = True, | ||||
| ) -> pa.Array: | ||||
|
|
@@ -383,21 +383,20 @@ def buffers_to_array( | |||
| the returned PyArrow array is being used. | ||||
| """ | ||||
| data_buff, _ = buffers["data"] | ||||
| try: | ||||
| validity_buff, validity_dtype = buffers["validity"] | ||||
| except TypeError: | ||||
| validity_buff = None | ||||
| try: | ||||
| offset_buff, offset_dtype = buffers["offsets"] | ||||
| except TypeError: | ||||
| offset_buff = None | ||||
| validity_buff, validity_dtype = ( | ||||
| buffers["validity"] if buffers["validity"] else (None, None) | ||||
| ) | ||||
| offset_buff, offset_dtype = ( | ||||
| buffers["offsets"] if buffers["offsets"] else (None, None) | ||||
| ) | ||||
|
Owner (Author): I'm fairly sure this fixes the actual implementation, as I'm not convinced a TypedDict actually throws the TypeErrors the original logic was counting on. I might be wrong, and there may have been some other reason TypeErrors were expected. Either solution passes the unit tests.

Reviewer: Would this approach work? validity_buff, validity_dtype = buffers.get("validity", (None, None))

Owner (Author): Sadly no, as per the signature in arrow/python/pyarrow/interchange/column.py (line 125 in a606011). Same for offsets.

Owner (Author): We could rework the ColumnBuffers TypedDict to make this nicer (e.g. …).

Reviewer: Thanks for the explanation! Agreed, let's keep changes to a minimum.
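A minimal sketch of the constraint discussed above, with illustrative names (not the actual ColumnBuffers definition): the key is required but its value may be None, so dict.get with a default never helps, and the conditional unpack is what guards against None.

```python
from __future__ import annotations

from typing import Tuple, TypedDict


class Buffers(TypedDict):
    # Mirrors the shape of the interchange buffers dict: "validity" is a
    # required key whose value may be None.
    data: Tuple[bytes, str]
    validity: Tuple[bytes, str] | None


def unpack_validity(buffers: Buffers) -> Tuple[bytes | None, str | None]:
    # buffers.get("validity", (None, None)) would not help: the default only
    # applies when the key is missing, and the key is always present here,
    # so a stored None still comes back as None.
    validity_buff, validity_dtype = (
        buffers["validity"] if buffers["validity"] else (None, None)
    )
    return validity_buff, validity_dtype


print(unpack_validity({"data": (b"\x01", "b"), "validity": None}))  # (None, None)
```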
||||
|
|
||||
| # Construct a pyarrow Buffer | ||||
| data_pa_buffer = pa.foreign_buffer(data_buff.ptr, data_buff.bufsize, | ||||
| base=data_buff) | ||||
|
|
||||
| # Construct a validity pyarrow Buffer, if applicable | ||||
| if validity_buff: | ||||
| assert validity_dtype is not None | ||||
| validity_pa_buff = validity_buffer_from_mask(validity_buff, | ||||
| validity_dtype, | ||||
| describe_null, | ||||
|
|
@@ -416,6 +415,7 @@ def buffers_to_array( | |||
| data_dtype = map_date_type(data_type) | ||||
|
|
||||
| if offset_buff: | ||||
| assert offset_dtype is not None | ||||
| _, offset_bit_width, _, _ = offset_dtype | ||||
| # If an offset buffer exists, construct an offset pyarrow Buffer | ||||
| # and add it to the construction of an array | ||||
|
|
@@ -450,7 +450,7 @@ def buffers_to_array( | |||
| def validity_buffer_from_mask( | ||||
| validity_buff: BufferObject, | ||||
| validity_dtype: Dtype, | ||||
| describe_null: ColumnNullType, | ||||
| describe_null: tuple[ColumnNullType, int], | ||||
| length: int, | ||||
| offset: int = 0, | ||||
| allow_copy: bool = True, | ||||
|
|
@@ -513,7 +513,7 @@ def validity_buffer_from_mask( | |||
| offset=offset) | ||||
|
|
||||
| if sentinel_val == 1: | ||||
| mask_bool = pc.invert(mask_bool) | ||||
| mask_bool = pc.invert(mask_bool) # type: ignore # (missing stubs) | ||||
|
|
||||
| return mask_bool.buffers()[1] | ||||
|
|
||||
|
|
@@ -529,7 +529,7 @@ def validity_buffer_from_mask( | |||
| def validity_buffer_nan_sentinel( | ||||
| data_pa_buffer: BufferObject, | ||||
| data_type: Dtype, | ||||
| describe_null: ColumnNullType, | ||||
| describe_null: tuple[ColumnNullType, int], | ||||
| length: int, | ||||
| offset: int = 0, | ||||
| allow_copy: bool = True, | ||||
|
|
@@ -583,8 +583,8 @@ def validity_buffer_nan_sentinel( | |||
| [None, data_pa_buffer], | ||||
| offset=offset, | ||||
| ) | ||||
| mask = pc.is_nan(pyarrow_data) | ||||
| mask = pc.invert(mask) | ||||
| mask = pc.is_nan(pyarrow_data) # type: ignore # (missing stubs) | ||||
| mask = pc.invert(mask) # type: ignore # (missing stubs) | ||||
| return mask.buffers()[1] | ||||
|
|
||||
| # Check for sentinel values | ||||
|
|
@@ -603,8 +603,9 @@ def validity_buffer_nan_sentinel( | |||
| length, | ||||
| [None, data_pa_buffer], | ||||
| offset=offset) | ||||
| sentinel_arr = pc.equal(pyarrow_data, sentinel_val) | ||||
| mask_bool = pc.invert(sentinel_arr) | ||||
| # typing ignores due to missing stubs | ||||
| sentinel_arr = pc.equal(pyarrow_data, sentinel_val) # type: ignore | ||||
| mask_bool = pc.invert(sentinel_arr) # type: ignore | ||||
| return mask_bool.buffers()[1] | ||||
|
|
||||
| elif null_kind == ColumnNullType.NON_NULLABLE: | ||||
|
|
||||
Example of external dependencies not providing stubs or being typed.
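If the call-site ignores ever become too noisy, one possible follow-up is a local, partial stub for the few pyarrow.compute functions used by the interchange code. The file path and signatures below are assumptions for illustration, not the real pyarrow API surface:

```python
# Hypothetical stubs/pyarrow/compute.pyi (not shipped by pyarrow): typing only
# the functions called here would let the "# type: ignore # (missing stubs)"
# comments above be dropped.
from typing import Any

def invert(values: Any, *, memory_pool: Any = ...) -> Any: ...
def is_nan(values: Any, *, memory_pool: Any = ...) -> Any: ...
def equal(x: Any, y: Any, *, memory_pool: Any = ...) -> Any: ...
```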