TYP: Use Protocols for file-like objects in read/to_* #43951

twoertwein · 2021-10-10T02:23:24Z

Fixes #41610, rebased on top of #43855.

This PR does a few things:

break FilePathOrBuffer apart to not mix basic types and generics
use protocols instead of union of specific classes
define many fine-grained protocols for the to/read methods/functions
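
For readers unfamiliar with the approach, here is a minimal sketch of the idea, not the definitions added in this PR. The names `SketchWriteBuffer`, `save_text` and `ListSink` are hypothetical and exist only for illustration. The path types are kept as a plain union, and the buffer side is a small generic protocol.

```python
import io
import os
from typing import Any, List, Protocol, TypeVar, Union, runtime_checkable

# Contravariant type variable for data that is passed *into* write().
T_con = TypeVar("T_con", str, bytes, contravariant=True)

# Basic path types, kept apart from the generic buffer protocol.
FilePath = Union[str, "os.PathLike[str]"]


@runtime_checkable
class SketchWriteBuffer(Protocol[T_con]):
    """Hypothetical, simplified write protocol (assumption, not pandas API)."""

    def write(self, data: T_con, /) -> Any: ...

    def flush(self) -> Any: ...


def save_text(text: str, target: Union[FilePath, SketchWriteBuffer[str]]) -> None:
    """Write text to a path, or to any object offering write() and flush()."""
    if isinstance(target, (str, os.PathLike)):
        with open(target, "w", encoding="utf-8") as handle:
            handle.write(text)
    else:
        target.write(text)
        target.flush()


class ListSink:
    """Unrelated to the io classes, yet it satisfies the protocol structurally."""

    def __init__(self) -> None:
        self.chunks: List[str] = []

    def write(self, data: str, /) -> int:
        self.chunks.append(data)
        return len(data)

    def flush(self) -> None:
        pass


sink = ListSink()
save_text("a,b\n1,2\n", sink)

buffer = io.StringIO()
save_text("a,b\n1,2\n", buffer)
```

Because the protocol is structural, neither `ListSink` nor `io.StringIO` has to inherit from anything for a type checker to accept them.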

I tested that the protocols are sufficient (need no additional attributes/methods) using mock classes with:

read_csv (python/c/pyarrow; w/wo compression) and to_csv (w/wo compression)
to_json and read_json (each w/wo compression)
to_pickle, read_pickle
to_excel (openpyxl/xlsxwriter) and read_excel (openpyxl)
to_stata and read_stata
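
The mock classes themselves are not shown in this thread. One way such a check could look is sketched below, using only the standard library. `RestrictedReader` and `parse_rows` are hypothetical names, and `parse_rows` stands in for a real reader function.

```python
import csv
import io


class RestrictedReader:
    """Mock that forwards only whitelisted attributes of a wrapped buffer."""

    def __init__(self, wrapped, allowed):
        self._wrapped = wrapped
        self._allowed = frozenset(allowed)
        self.used = set()

    def __getattr__(self, name):
        # Called only when normal lookup fails, i.e. for forwarded names.
        if name not in self._allowed:
            raise AttributeError(f"{name!r} is not part of the protocol under test")
        self.used.add(name)
        return getattr(self._wrapped, name)


def parse_rows(buf):
    """Hypothetical reader that is supposed to need nothing beyond read()."""
    return list(csv.reader(io.StringIO(buf.read())))


def is_sufficient(allowed):
    """Return True if parse_rows works with only the allowed attributes."""
    mock = RestrictedReader(io.StringIO("a,b\n1,2\n"), allowed)
    try:
        parse_rows(mock)
    except AttributeError:
        return False
    return True


mock = RestrictedReader(io.StringIO("a,b\n1,2\n"), allowed={"read"})
rows = parse_rows(mock)
```

Note that special methods such as `__iter__` are looked up on the type, so a mock like this would have to define them explicitly rather than rely on `__getattr__`.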

Future: use many overloads for get_handle to return the (wrapped) fine-grained protocols.
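
A rough sketch of what such overloads could look like is given below. `open_handle` is a hypothetical stand-in, not the signature of `get_handle`, and it uses the broad `IO[...]` types rather than fine-grained protocols to keep the example short.

```python
import io
from typing import IO, Literal, overload


@overload
def open_handle(data: bytes, mode: Literal["rb"]) -> IO[bytes]: ...


@overload
def open_handle(data: bytes, mode: Literal["r"]) -> IO[str]: ...


def open_handle(data, mode):
    """Return a binary or text handle over in-memory data, depending on mode."""
    raw = io.BytesIO(data)
    if mode == "rb":
        return raw
    # Text mode: wrap the binary buffer so that read() returns str.
    return io.TextIOWrapper(raw, encoding="utf-8")


binary_handle = open_handle(b"a,b\n", "rb")
text_handle = open_handle(b"a,b\n", "r")
```

With the overloads in place, a type checker can infer `IO[bytes]` or `IO[str]` from the literal mode instead of a union of both.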

twoertwein · 2021-11-03T02:46:46Z

pandas/core/frame.py

@@ -2674,14 +2675,14 @@ def to_markdown(

        with get_handle(buf, mode, storage_options=storage_options) as handles:
            assert not isinstance(handles.handle, (str, mmap.mmap))
-            handles.handle.writelines(result)
+            handles.handle.write(result)


result is a str; writelines only worked because a str is also a Sequence (of one-character strings).
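
A small standard-library illustration of that point follows. `WriteOnly` is a hypothetical class that offers nothing but `write`, which is all a narrow write protocol would promise.

```python
import io

result = "| a |\n|---|\n| 1 |"

# writelines() accepts any iterable of strings; a str iterates as characters.
via_writelines = io.StringIO()
via_writelines.writelines(result)

via_write = io.StringIO()
via_write.write(result)

same_output = via_writelines.getvalue() == via_write.getvalue()


class WriteOnly:
    """Hypothetical sink that implements write() and nothing else."""

    def __init__(self):
        self.parts = []

    def write(self, data):
        self.parts.append(data)
        return len(data)


sink = WriteOnly()
sink.write(result)  # works; sink.writelines(result) would raise AttributeError
```

Both calls produce the same text, but only `write` is available on an object that implements just the minimal interface.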

twoertwein · 2021-11-03T03:07:55Z

pandas/io/parsers/readers.py

-            # The C engine doesn't need the file-like to have the "__next__"
-            # attribute. However, the Python engine explicitly calls
-            # "__next__(...)" when iterating through such an object, meaning it
+        if is_file_like(f) and engine != "c" and not hasattr(f, "__iter__"):


__iter__ by itself seems to be enough.

Previously TextFileReader._check_file_or_buffer would raise a ValueError for the "python" engine if f was a tempfile.SpooledTemporaryFile (which implements __iter__ but not __next__)
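
As a reminder of why `__iter__` alone can be enough for line-by-line consumption, here is a minimal sketch. `LineSource` is a hypothetical class, not `SpooledTemporaryFile`, whose attributes differ between Python versions.

```python
class LineSource:
    """Hypothetical iterable that defines __iter__ but not __next__."""

    def __init__(self, text):
        self._lines = text.splitlines(keepends=True)

    def __iter__(self):
        # The returned list iterator is what provides __next__.
        return iter(self._lines)


src = LineSource("a,b\n1,2\n")

has_iter = hasattr(src, "__iter__")
has_next = hasattr(src, "__next__")

# A for loop (or a comprehension) only needs iter(src) to succeed.
rows = [line.rstrip("\n").split(",") for line in src]
```

Code that calls `next(obj)` directly on the object would still fail here, which is the distinction the old check was guarding against.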

Just wanted to check: does SpooledTemporaryFile now work in the "python" engine for TextFileReader (so that this test should no longer skip the python parser), or is this an oversight? Sorry if this is a very niche case, but I just ran into it on 1.3.4 and noticed that this logic was recently changed in this PR.

SpooledTemporaryFile works if you open it in text mode, e.g., SpooledTemporaryFile(mode="r+t").

The issue is that SpooledTemporaryFile does not have the attribute/property readable, which io.TextIOWrapper requires (if you open SpooledTemporaryFile in binary mode). If you are lucky, you might be able to convince the CPython maintainers to make SpooledTemporaryFile compatible with io.TextIOWrapper in a future Python version.

Technically, we should add readable to the binary protocols that are wrapped in TextIOWrapper (I think only for read_json and read_csv).
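
The `readable` requirement can be reproduced without `SpooledTemporaryFile`. The sketch below assumes CPython's `io.TextIOWrapper`, which calls `readable()` on the wrapped buffer when it is constructed. `NoReadable`, `WithReadable` and `decode_all` are hypothetical names.

```python
import io


class NoReadable:
    """Hypothetical binary buffer that deliberately lacks readable()."""

    closed = False

    def __init__(self, data: bytes) -> None:
        self._buf = io.BytesIO(data)

    def read(self, n: int = -1) -> bytes:
        return self._buf.read(n)

    def writable(self) -> bool:
        return False

    def seekable(self) -> bool:
        return False

    def flush(self) -> None:
        pass

    def close(self) -> None:
        self.closed = True


class WithReadable(NoReadable):
    """Same buffer, plus the one method TextIOWrapper asks for."""

    def readable(self) -> bool:
        return True


def decode_all(buf) -> str:
    """Wrap a binary buffer in TextIOWrapper and read everything as text."""
    wrapper = io.TextIOWrapper(buf, encoding="utf-8")
    try:
        return wrapper.read()
    finally:
        wrapper.detach()  # leave the underlying buffer open


text = decode_all(WithReadable(b"a,b\n1,2\n"))

try:
    decode_all(NoReadable(b"a,b\n1,2\n"))
    wrapped_without_readable = True
except AttributeError:
    wrapped_without_readable = False
```

This is why a binary protocol that ends up inside `TextIOWrapper` arguably needs `readable` in addition to `read`.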

pep8speaks · 2021-11-03T03:11:25Z

Hello @twoertwein! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-11-16 21:51:31 UTC

jreback

wow looks good, much more explicit.

cc @pandas-dev/pandas-core for comments

jreback · 2021-11-06T23:54:39Z

pandas/_typing.py

+
+
+class ReadCsvBuffer(ReadBuffer[AnyStr_cov], Protocol):
+    def __iter__(self) -> Iterator[AnyStr_cov]:


not for the other engines?

The c-engine is fine with ReadBuffer. I'll test whether ReadBuffer is also sufficient for the pyarrow engine.

pyarrow needs closed
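
To make the layering concrete, here is a simplified sketch of a base read protocol and a CSV-specific extension that adds `__iter__`. The `Sketch*` names are hypothetical and the method sets are reduced compared with the protocols in `pandas/_typing.py`. `runtime_checkable` is used only so the difference can be shown with `isinstance`.

```python
import io
from typing import Iterator, Protocol, TypeVar, runtime_checkable

# Covariant type variable for data that comes *out of* read().
T_co = TypeVar("T_co", str, bytes, covariant=True)


@runtime_checkable
class SketchReadBuffer(Protocol[T_co]):
    """Hypothetical minimal read protocol."""

    def read(self, n: int = ..., /) -> T_co: ...


@runtime_checkable
class SketchReadCsvBuffer(SketchReadBuffer[T_co], Protocol):
    """Adds line iteration on top of read()."""

    def __iter__(self) -> Iterator[T_co]: ...


class ReadOnly:
    """Has read() but cannot be iterated line by line."""

    def read(self, n: int = -1) -> str:
        return ""


read_only = ReadOnly()
string_io = io.StringIO("a,b\n1,2\n")

checks = {
    "read_only is ReadBuffer": isinstance(read_only, SketchReadBuffer),
    "read_only is ReadCsvBuffer": isinstance(read_only, SketchReadCsvBuffer),
    "StringIO is ReadCsvBuffer": isinstance(string_io, SketchReadCsvBuffer),
}
```

An engine that only calls `read()` can accept the base protocol, while an engine that iterates over lines asks for the extended one.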

phofl

Looks really good, small comments

pandas/_typing.py

Dr-Irv · 2021-11-08T14:44:29Z

pandas/io/feather_format.py

@@ -26,7 +28,7 @@
 @doc(storage_options=generic._shared_docs["storage_options"])
 def to_feather(
    df: DataFrame,
-    path: FilePathOrBuffer[bytes],
+    path: FilePath | WriteBuffer[bytes],


Are we going to have to document these new types? The docs will show the argument as FilePath | WriteBuffer[bytes] in the signature, but the docs (in this case) show path : string file path, or file-like object, so I'm worried about confusion when people see that the signature includes a type that isn't documented.

Good point! I think there are probably at least two options to avoid this confusion:

create a new section in one of the IO overviews that contains definitions for these *Buffers, then update the doc-string to point to the definition page

Luckily, most to/read functions work with a Read/WriteBuffer: updating the doc-string to "str/byte readable/writeable file-like object" might be sufficient for them? For functions that have more specific needs (probably only read_csv; truncate (excel) and readlines (pickle) are also fairly standard?) update the doc-string to outline the "unusual" requirements (__iter__ for the python engine).

I don't have strong opinions about this, except that the doc-strings need to be updated to some degree :)

Maybe do both? Create a section that explains the types (in case anyone searches for them), but make all the doc strings explicit, maybe something like "`ReadBuffer[bytes]` (str/byte readable file-like object)".
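
As an illustration of that suggestion, a doc-string in this style might look like the sketch below. `read_payload` is a hypothetical helper, and the wording is an assumption rather than the text that was finally adopted.

```python
import io


def read_payload(path_or_buf):
    """
    Read raw bytes (hypothetical helper, for illustration only).

    Parameters
    ----------
    path_or_buf : str, path object, or ReadBuffer[bytes]
        String path, ``os.PathLike[str]``, or a binary file-like object
        implementing ``read()``.

    Returns
    -------
    bytes
    """
    if hasattr(path_or_buf, "read"):
        return path_or_buf.read()
    with open(path_or_buf, "rb") as handle:
        return handle.read()


payload = read_payload(io.BytesIO(b"\x00\x01"))
```

The parameter line names the type as it appears in the signature, and the description spells out what the object has to be able to do.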

Because the docs will show the argument as...

I'm not sure what is meant by this, the docs don't show the type-hints.

https://pandas.pydata.org/pandas-docs/dev/reference/api/pandas.DataFrame.to_feather.html

The current docs don't because we don't have type hints in 1.3.4, but I think they will show in the future with all the typing work that has been done. But I could be wrong about that.

If they are not showing up, we should discuss whether they should be showing for the entire docs as we keep adding typing to the method/function signatures.

Agreed that this is an issue that is becoming more and more prominent, and something that should be discussed. However, it seems to me that this is already pervasive throughout the code (e.g. "array-like") and this PR doesn't make the issue significantly worse.

I updated the doc-strings to be consistent (without outlining the exact requirements).

Having an overview page of all typing definitions (not just IO) would be great (and probably also necessary/expected when pandas becomes a py.typed library).

I know that Google-style doc-strings allow skipping the type description if type annotations are present. That would at least ensure that the doc-string and the type annotations are in sync.

Dr-Irv

Just concerned about the docs implication of what the signatures say and how we document the arguments for each of the readers and writers

twoertwein · 2021-11-14T20:59:29Z

sorry, I squashed - thought it would make the rebasing easier.

jreback · 2021-11-14T21:02:07Z

sorry, I squashed - thought it would make the rebasing easier.

you can or not - doesn't matter, as it squashes on merge anyway

jreback · 2021-11-17T02:11:42Z

thanks @twoertwein

twoertwein added the Typing type annotations, mypy/pyright type checking label Oct 10, 2021

twoertwein commented Nov 3, 2021

twoertwein marked this pull request as ready for review November 3, 2021 03:23

twoertwein mentioned this pull request Nov 3, 2021

TYP: make IOHandles generic #43855

Merged

jreback requested changes Nov 6, 2021

jreback added this to the 1.4 milestone Nov 6, 2021

phofl reviewed Nov 7, 2021

Dr-Irv reviewed Nov 8, 2021

twoertwein added 2 commits November 15, 2021 20:32

TYP: Use Protocols for file-like objects in read/to_*

115aaaf

doc

69e0534

twoertwein mentioned this pull request Nov 16, 2021

avoid a warning by PyTypeChecker #44478

Closed

twoertwein added 3 commits November 15, 2021 20:44

unusec import

ea0926b

revert is_file_like change

fbc6fde

different error on py38?

0c77828

jreback approved these changes Nov 17, 2021

jreback merged commit 2cc1227 into pandas-dev:master Nov 17, 2021

twoertwein mentioned this pull request Nov 17, 2021

stdlib stubs are unnecessarily strict with file-like objects python/typeshed#4212

Closed

simonjayhawkins mentioned this pull request Nov 18, 2021

TYP: improve typing for DataFrame.to_string #44426

Merged

jeffreykennethli mentioned this pull request Jan 12, 2022

pandas==1.4.0rc0 FilePathOrBuffer deprecation breaks modin import modin-project/modin#3947

Closed

twoertwein mentioned this pull request Sep 24, 2022

TYP: tighten IO Protocols pandas-dev/pandas-stubs#326

Merged

twoertwein Nov 3, 2021

twoertwein Nov 3, 2021

kevinmarsh Nov 24, 2021

twoertwein Nov 24, 2021

twoertwein Nov 24, 2021

jreback Nov 6, 2021

twoertwein Nov 7, 2021

twoertwein Nov 7, 2021

Dr-Irv Nov 8, 2021

twoertwein Nov 8, 2021

Dr-Irv Nov 8, 2021

rhshadrach Nov 8, 2021

Dr-Irv Nov 9, 2021

rhshadrach Nov 13, 2021

twoertwein Nov 16, 2021
