TYP: Use Protocols for file-like objects in read/to_* #43951

Merged
merged 5 commits into from
Nov 17, 2021

Conversation

twoertwein
Member

@twoertwein twoertwein commented Oct 10, 2021

Fixes #41610, rebased on top of #43855.

This PR does a few things:

  1. break FilePathOrBuffer apart to not mix basic types and generics
  2. use protocols instead of union of specific classes
  3. define many fine-grained protocols for the to/read methods/functions
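A minimal sketch of what points 1 to 3 amount to, assuming simplified protocol bodies (the definitions in the PR carry more members, and the `describe` helper is purely hypothetical): the path alias contains only basic types, while the buffers are generic protocols that are matched structurally.

```python
from __future__ import annotations

import io
import os
from typing import Protocol, TypeVar, Union, runtime_checkable

AnyStr_cov = TypeVar("AnyStr_cov", str, bytes, covariant=True)
AnyStr_con = TypeVar("AnyStr_con", str, bytes, contravariant=True)

# Basic (non-generic) types only: anything that names a location on disk.
FilePath = Union[str, "os.PathLike[str]"]


@runtime_checkable
class ReadBuffer(Protocol[AnyStr_cov]):
    # Simplified: only the single method a generic reader needs.
    def read(self, n: int = ...) -> AnyStr_cov: ...


@runtime_checkable
class WriteBuffer(Protocol[AnyStr_con]):
    # Simplified: only the single method a generic writer needs.
    def write(self, data: AnyStr_con) -> object: ...


def describe(target: FilePath | WriteBuffer[str]) -> str:
    """Hypothetical helper showing how the two halves are told apart."""
    if isinstance(target, (str, os.PathLike)):
        return "path"
    if isinstance(target, WriteBuffer):
        return "buffer"
    raise TypeError(f"unsupported target: {type(target).__name__}")


kind_of_path = describe("out.csv")
kind_of_buffer = describe(io.StringIO())
```

Any object with a `write` method satisfies `WriteBuffer` here, whether or not it inherits from an `io` class; that is the practical difference from a union of specific classes.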

I tested that the protocols are sufficient (need no additional attributes/methods) using mock classes with:

  • read_csv (python/c/pyarrow; w/wo compression) and to_csv (w/wo compression)
  • to_json and read_json (each w/wo compression)
  • to_pickle, read_pickle
  • to_excel (openpyxl/xlsxwriter) and read_excel (openpyxl)
  • to_stata and read_stata

Future: use many overloads for get_handle to return the (wrapped) fine-grained protocols.

@twoertwein twoertwein added the Typing type annotations, mypy/pyright type checking label Oct 10, 2021
@@ -2674,14 +2675,14 @@ def to_markdown(

         with get_handle(buf, mode, storage_options=storage_options) as handles:
             assert not isinstance(handles.handle, (str, mmap.mmap))
-            handles.handle.writelines(result)
+            handles.handle.write(result)
Member Author

result is a str; writelines only worked because a str is also a Sequence (of one-character strings).
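A small standard-library illustration of why the old call happened to work (the sample string is made up): `writelines` accepts any iterable of strings, and iterating a `str` yields its characters one by one, so the output is the same as a single `write`.

```python
import io

result = "| a |\n|---|\n| 1 |\n"  # made-up markdown table

via_writelines = io.StringIO()
via_writelines.writelines(result)  # writes one character at a time

via_write = io.StringIO()
via_write.write(result)  # writes the whole string in one call

same_output = via_writelines.getvalue() == via_write.getvalue()
```

A protocol that only declares `write` is therefore enough for `to_markdown`.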

# The C engine doesn't need the file-like to have the "__next__"
# attribute. However, the Python engine explicitly calls
# "__next__(...)" when iterating through such an object, meaning it
if is_file_like(f) and engine != "c" and not hasattr(f, "__iter__"):
Member Author

__iter__ by itself seems to be enough.
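As a generic illustration (plain Python, not the pandas parser; the `OnlyIter` class is hypothetical), an object that defines `__iter__` but no `__next__` can still be consumed line by line, because `iter()` hands back a separate iterator that does have `__next__`.

```python
class OnlyIter:
    """Hypothetical file-like object that defines __iter__ but not __next__."""

    def __init__(self, lines: list[str]) -> None:
        self._lines = lines

    def __iter__(self):
        # The returned list iterator provides __next__; OnlyIter does not.
        return iter(self._lines)


f = OnlyIter(["a,b\n", "1,2\n"])
has_next = hasattr(f, "__next__")
rows = [line.rstrip("\n").split(",") for line in f]
```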

Previously, TextFileReader._check_file_or_buffer would raise a ValueError for the "python" engine if f was a tempfile.SpooledTemporaryFile (which implements __iter__ but not __next__).

Just wanted to check: does SpooledTemporaryFile now work with the "python" engine for TextFileReader (in which case this test should no longer skip the python parser), or is this an oversight? Sorry if this is a very niche case, but I've just run into it on 1.3.4 and noticed that this logic was recently changed in this PR.

Member Author

SpooledTemporaryFile works if you open it in text mode, e.g., SpooledTemporaryFile(mode="r+t").

The issue is that SpooledTemporaryFile does not have the attribute/property readable which io.TextIOWrapper requires (if you open SpooledTemporaryFile in binary mode). If you are lucky, you can convince the cpython maintainers to make SpooledTemporaryFile compatible with io.TextIOWrapper in a future python version?

Member Author

Technically, we should add readable to the binary protocols that are wrapped in TextIOWrapper (I think only for read_json and read_csv).
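A standard-library sketch of the constraint (the `BareReader` class and `can_wrap` helper are hypothetical): `io.TextIOWrapper` queries the wrapped binary object for more than `read`, so a buffer that exposes only `read` cannot be wrapped, while a full `io.BytesIO` can.

```python
import io


class BareReader:
    """Hypothetical binary buffer that exposes read() and nothing else."""

    def __init__(self, data: bytes) -> None:
        self._inner = io.BytesIO(data)

    def read(self, n: int = -1) -> bytes:
        return self._inner.read(n)


def can_wrap(buf) -> bool:
    """Return True if io.TextIOWrapper accepts buf as its binary buffer."""
    try:
        io.TextIOWrapper(buf, encoding="utf-8")
    except AttributeError:
        # Raised when the wrapper asks for a method such as readable()
        # that the buffer does not provide.
        return False
    return True


bare_ok = can_wrap(BareReader(b"a,b\n1,2\n"))
full_ok = can_wrap(io.BytesIO(b"a,b\n1,2\n"))
```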

@pep8speaks

pep8speaks commented Nov 3, 2021

Hello @twoertwein! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-11-16 21:51:31 UTC

@twoertwein twoertwein marked this pull request as ready for review November 3, 2021 03:23
Contributor

@jreback jreback left a comment

wow looks good, much more explicit.

cc @pandas-dev/pandas-core for comments



class ReadCsvBuffer(ReadBuffer[AnyStr_cov], Protocol):
    def __iter__(self) -> Iterator[AnyStr_cov]:
Contributor

not for the other engines?

Member Author

The c-engine is fine with ReadBuffer. I'll test whether ReadBuffer is also sufficient for the pyarrow engine.

Member Author

pyarrow needs closed
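Putting the thread's findings together, a CSV-specific protocol could look roughly like the sketch below. It is an assumption-laden simplification: only the members mentioned in this thread are declared (`read`, `__iter__` for the python engine, `closed` for pyarrow), and the `first_line` helper is hypothetical.

```python
from __future__ import annotations

import io
from typing import Iterator, Protocol, TypeVar

AnyStr_cov = TypeVar("AnyStr_cov", str, bytes, covariant=True)


class ReadBuffer(Protocol[AnyStr_cov]):
    # Sufficient for the C engine, per the discussion above.
    def read(self, n: int = ...) -> AnyStr_cov: ...


class ReadCsvBuffer(ReadBuffer[AnyStr_cov], Protocol):
    # Needed by the python engine, which iterates over the buffer.
    def __iter__(self) -> Iterator[AnyStr_cov]: ...

    # Needed by the pyarrow engine.
    @property
    def closed(self) -> bool: ...


def first_line(buf: ReadCsvBuffer[str]) -> str:
    """Hypothetical consumer that touches only the declared members."""
    if buf.closed:
        raise ValueError("buffer is closed")
    return next(iter(buf), "")


header = first_line(io.StringIO("a,b\n1,2\n"))
```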

@jreback jreback added this to the 1.4 milestone Nov 6, 2021
Member

@phofl phofl left a comment

Looks really good, small comments

@@ -26,7 +28,7 @@
 @doc(storage_options=generic._shared_docs["storage_options"])
 def to_feather(
     df: DataFrame,
-    path: FilePathOrBuffer[bytes],
+    path: FilePath | WriteBuffer[bytes],
Contributor

Are we going to have to document these new types? The docs will show the argument as FilePath | WriteBuffer[bytes] in the signature, but the parameter description (in this case) reads "path : string file path, or file-like object", so I'm worried about confusion when people see a type in the signature that isn't documented.

Member Author

Good point! I think there are probably at least two options to avoid this confusion:

  • create a new section in one of the IO overviews that contains definitions for these *Buffers, then update the doc-string to point to the definition page
  • Luckily, most to/read functions work with a Read/WriteBuffer: updating the doc-string to "str/byte readable/writeable file-like object" might be sufficient for them? For functions that have more specific needs (probably only read_csv; truncate (excel) and readlines (pickle) are also fairly standard?) update the doc-string to outline the "unusual" requirements (__iter__ for the python engine).

I don't have strong opinions about this, except that the doc-strings need to be updated to some degree :)

Contributor

Maybe do both? Create a section that explains the types (in case anyone searches for them), but also make all the doc-strings explicit, maybe something like "ReadBuffer[bytes] (str/byte readable file-like object)".

Member

Because the docs will show the argument as...

I'm not sure what is meant by this, the docs don't show the type-hints.

https://pandas.pydata.org/pandas-docs/dev/reference/api/pandas.DataFrame.to_feather.html

Contributor

The current docs don't because we don't have type hints in 1.3.4, but I think they will show up in the future, given all the typing work that has been done. I could be wrong about that, though.

If they are not showing up, we should discuss whether they should be showing for the entire docs as we keep adding typing to the method/function signatures.

Member

Agreed that this is an issue that is becoming more and more prominent, and something that should be discussed. However, it seems to me that this is already pervasive throughout the code (e.g. "array-like") and this PR doesn't make the issue significantly worse.

Member Author

I updated the doc-strings to be consistent (without outlining the exact requirements).

Having an overview page of all typing definitions (not just IO) would be great (and probably also necessary/expected when pandas becomes a py.typed library).

I know that Google-style doc-strings allow skipping the type description if type annotations are present. That would at least ensure that the doc-string and the type annotations are in sync.

Contributor

@Dr-Irv Dr-Irv left a comment

Just concerned about the docs implication of what the signatures say and how we document the arguments for each of the readers and writers

@twoertwein
Member Author

sorry, I squashed - thought it would make the rebasing easier.

@jreback
Contributor

jreback commented Nov 14, 2021

sorry, I squashed - thought it would make the rebasing easier.

you can or not - doesn't matter, as it gets squashed on merge anyway

@jreback jreback merged commit 2cc1227 into pandas-dev:master Nov 17, 2021
@jreback
Contributor

jreback commented Nov 17, 2021

thanks @twoertwein

Labels
Typing type annotations, mypy/pyright type checking
Development

Successfully merging this pull request may close these issues.

TYP/DOC: Use Protocols for file-like objects in read/to_*
8 participants