-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TYP: Use Protocols for file-like objects in read/to_* #43951
Conversation
@@ -2674,14 +2675,14 @@ def to_markdown( | |||
|
|||
with get_handle(buf, mode, storage_options=storage_options) as handles: | |||
assert not isinstance(handles.handle, (str, mmap.mmap)) | |||
handles.handle.writelines(result) | |||
handles.handle.write(result) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
result
is a str
, writelines
worked because str
also a Sequence
.
# The C engine doesn't need the file-like to have the "__next__" | ||
# attribute. However, the Python engine explicitly calls | ||
# "__next__(...)" when iterating through such an object, meaning it | ||
if is_file_like(f) and engine != "c" and not hasattr(f, "__iter__"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
__iter__
by itself seems to be enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previously TextFileReader._check_file_or_buffer
would raise a ValueError
for the "python" engine if f
was a tempfile.SpooledTemporaryFile
(which implements __iter__
but not __next__
)
Just wanted to check if SpooledTemporaryFile
now works in the "python" engine for TextFileReader
(and thus this test should no longer skip the python parser) or if this is an oversight? Sorry if this is a very niche case but it's something I've just run into on 1.3.4
and noticed that this logic had been recently changed in this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SpooledTemporaryFile
works if you open it in text mode, e.g., SpooledTemporaryFile(mode="r+t")
.
The issue is that SpooledTemporaryFile
does not have the attribute/property readable
which io.TextIOWrapper
requires (if you open SpooledTemporaryFile
in binary mode). If you are lucky, you can convince the cpython maintainers to make SpooledTemporaryFile
compatible with io.TextIOWrapper
in a future python version?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically, we should add readable
to the binary protocols that are wrapped in TextIOWrapper (I think only for read_json and read_read_csv).
Hello @twoertwein! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2021-11-16 21:51:31 UTC |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wow looks good, much more explicit.
cc @pandas-dev/pandas-core for comments
|
||
|
||
class ReadCsvBuffer(ReadBuffer[AnyStr_cov], Protocol): | ||
def __iter__(self) -> Iterator[AnyStr_cov]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not for the othe rengines?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The c-engine is fine with ReadBuffer
. I'll test whether ReadBuffer
is also sufficient for the pyarrow engine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pyarrow needs closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks really good, small comments
@@ -26,7 +28,7 @@ | |||
@doc(storage_options=generic._shared_docs["storage_options"]) | |||
def to_feather( | |||
df: DataFrame, | |||
path: FilePathOrBuffer[bytes], | |||
path: FilePath | WriteBuffer[bytes], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we going to have to document these new types? Because the docs will show the argument as FilePath | WriteBuffer[bytes]
in the signature, but the docs (in this case) show path : string file path, or file-like object
so I'm worried about confusion when people see that the signature includes a type that isn't documented
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point! I think there are probably at least two options to avoid this confusion:
- create a new section in one of the IO overviews that contains definitions for these
*Buffer
s, then update the doc-string to point to the definition page - Luckily, most to/read functions work with a
Read/WriteBuffer
: updating the doc-string to "str/byte readable/writeable file-like object" might be sufficient for them? For functions that have more specific needs (probably only read_csv; truncate (excel) and readlines (pickle) are also fairly standard?) update the doc-string to outline the "unusual" requirements (__iter__
for the python engine).
I don't have strong opinions about this, except that the doc-strings need to be updated to some degree :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe do both? Create a section that explains the types (in case anyone searches for them), but make all the doc strings explicit, maybe something like "`ReadBuffer[bytes] (str/byte readable file-like object)"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because the docs will show the argument as...
I'm not sure what is meant by this, the docs don't show the type-hints.
https://pandas.pydata.org/pandas-docs/dev/reference/api/pandas.DataFrame.to_feather.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current docs don't because we don't have type hints in the 1.3.4, but I think they will show in the future with all the typing work that has been done. But I could be wrong about that.
If they are not showing up, we should discuss whether they should be showing for the entire docs as we keep adding typing to the method/function signatures.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed that this is an issue that is becoming more and more prominent, and something that should be discussed. However, it seems to me that this is already pervasive throughout the code (e.g. "array-like") and this PR doesn't make the issue significantly worse.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated the doc-strings to be consistent (without outlining the exact requirements).
Having an overview page of all typing definitions (not just IO) would be great (and probalby also necesary/expected when pandas becomes a py.typed library).
I know that google-conform doc-strings allow to skip the type descritpion if type annotatiosn are present. That would at least ensure that the doc-string and the type annotations are in sync.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just concerned about the docs implication of what the signatures say and how we document the arguments for each of the readers and writers
sorry, I squashed - thought it would make the rebasing easier. |
you can or not - doesn't matter as squashes in merge anyways |
thanks @twoertwein |
Fixes #41610, rebased on top of #43855.
This PR does a few things:
FilePathOrBuffer
apart to not mix basic types and genericsI tested that the protocols are sufficient (need no additional attributes/methods) using mock classes with:
Future: use many overloads for
get_handle
to return the (wrapped) fine-grained protocols.