support binary file handles in to_csv #35129

Merged
merged 3 commits into from
Aug 7, 2020

Conversation

twoertwein
Member

@twoertwein twoertwein commented Jul 5, 2020

The first commit addresses #35058 (comment): Python's open cannot take an encoding argument when mode contains a 'b' (that is, when the file is opened in binary mode). This avoids an error when executing df.to_csv("output.csv", mode="w+b").
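A minimal sketch of the underlying behaviour, using only the standard library. The helper name `open_for_csv` is hypothetical and is not how pandas implements it; it only illustrates passing `encoding` to `open` when the mode is not binary.

```python
import os
import tempfile


def open_for_csv(path, mode="w", encoding="utf-8"):
    """Hypothetical helper: pass encoding to open() only in text mode."""
    if "b" in mode:
        # Binary mode: open() rejects an encoding argument.
        return open(path, mode)
    return open(path, mode, encoding=encoding, newline="")


error_message = None
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "output.csv")

    # The built-in open() raises ValueError for binary mode plus encoding.
    try:
        open(path, "w+b", encoding="utf-8")
    except ValueError as err:
        error_message = str(err)

    # Dropping the encoding in binary mode avoids the error.
    with open_for_csv(path, "w+b") as handle:
        handle.write("a,b\n1,2\n".encode("utf-8"))

    with open(path, "rb") as handle:
        written = handle.read()
```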

The second commit fixes #35058, #19827, #23854 *, and #13068 *: to_csv supports file handles in binary mode if mode contains a b and it honors encoding. to_csv re-invented a lot that was already done in get_handle. Let get_handle do the heavy lifting and remove all special cases from to_csv.
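To illustrate what "honoring the encoding" on a binary handle can look like, here is a standard-library sketch. It assumes a text wrapper around the binary handle; `write_rows` is a hypothetical name, and this is not claimed to be what get_handle does internally.

```python
import csv
import io


def write_rows(binary_handle, rows, encoding="utf-8"):
    """Hypothetical helper: write CSV rows to a binary handle in `encoding`."""
    wrapper = io.TextIOWrapper(binary_handle, encoding=encoding, newline="")
    try:
        csv.writer(wrapper).writerows(rows)
        wrapper.flush()
    finally:
        # Detach so the caller's handle stays open after we are done.
        wrapper.detach()


buffer = io.BytesIO()
write_rows(buffer, [["name"], ["Zoë"]], encoding="utf-16")
decoded = buffer.getvalue().decode("utf-16")
```

The caller keeps ownership of the handle: it is still open afterwards and contains bytes in the requested encoding.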

The third commit fixes #22555: some compression algorithms did not set the mode to be writeable for file handles. Together with the re-factoring in the second commit, it is now possible to write to binary file handles with compression!
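The point about the write mode can be sketched with the standard library alone. This is an illustration under assumptions, not pandas code, and `write_rows_gzip` is a hypothetical name. Note that `gzip.GzipFile` falls back to read mode when the wrapped object has no discernible mode, so the write mode has to be set explicitly.

```python
import csv
import gzip
import io


def write_rows_gzip(binary_handle, rows, encoding="utf-8"):
    """Hypothetical helper: write gzip-compressed CSV to a binary handle."""
    # mode="wb" must be explicit: BytesIO has no .mode attribute, and
    # GzipFile would otherwise default to reading.
    with gzip.GzipFile(fileobj=binary_handle, mode="wb") as gz:
        text = io.TextIOWrapper(gz, encoding=encoding, newline="")
        csv.writer(text).writerows(rows)
        text.flush()
        text.detach()  # keep gz open until the with-block closes it


buffer = io.BytesIO()
write_rows_gzip(buffer, [["a", "b"], [1, 2]])
restored = gzip.decompress(buffer.getvalue()).decode("utf-8")
```

Closing the GzipFile writes the gzip trailer but leaves the caller's handle open.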

*requesting an encoding for non-binary file handles through to_csv still doesn't work, but imho it also doesn't make sense: specify the encoding yourself when opening the file or use a binary file handle

@twoertwein twoertwein changed the title allow to_csv to be used with binary mode support binary file handles in to_csv Jul 6, 2020
@jreback jreback requested a review from gfyoung July 8, 2020 15:41
@jreback jreback added IO CSV read_csv, to_csv IO Data IO issues that don't fit into a more specific label labels Jul 8, 2020
Contributor

@jreback jreback left a comment

looks ok to me, can you add a whatsnew note in the other enhancements section. (or bug fixes I/O if this is really a bug fix?)

@twoertwein
Member Author

twoertwein commented Jul 8, 2020

thank you @jreback! The first commit is technically a bug fix as it prevents an error from occurring (df.to_csv("output.csv", mode="w+b"); to be fair, there is no good reason for using binary mode when a filename is passed), but the second commit is a new feature as it allows using binary file handles with to_csv. I will add two entries to whatsnew.

I think that this PR should also fix #23854 @eode, #19827 @colobas, and #13068 @graingert (as long as the user puts a 'b' in mode).

@jreback
Contributor

jreback commented Jul 9, 2020

> thank you @jreback! The first commit is technically a bug fix as it prevents an error from occurring (df.to_csv("output.csv", mode="w+b") - to be fair there is no good reason for using binary mode when a filename is passed) but the second commit is a new feature as it allows to use binary file handles with to_csv. I will add two entries to whatsnew.
>
> I think that this PR should also fix #23854 @eode, #19827 @colobas, and #13068 @graingert (as long as the user puts a 'b' in mode).

awesome! can you have tests for each of these issues, edit the top of the PR to say you close, and add to the whatsnew?

@gfyoung
Member

gfyoung commented Jul 9, 2020

This PR looks good! Let's add the tests that @jreback requested, and I think we should be good.

@twoertwein
Member Author

twoertwein commented Jul 10, 2020

#19827: my issue #35058 is actually a duplicate of #19827 (I had a different use case but it is the same underlying technical issue). His code example is already covered in the test. I referenced his issue in the test.

#23854 and #13068: this PR will fix 'half' of both issues. If the file handle is in binary mode it will fix their issue (honor the encoding) but for non-binary file handles the issues are still not addressed. I added a test-case containing sample code from these two issues to validate the encoding for binary file handles.

I considered extending this PR (or opening a new one) to cover the encoding of non-binary file handles as well, by using the StringIO buffer approach for all file handles, but I don't think this is possible: string.encode() returns bytes, whereas a non-binary file handle expects a string. People who provide a file handle and want their encoding to be honored should provide a binary file handle (or set the encoding themselves in the open call when creating a non-binary file handle). I changed the to_csv documentation to note that encoding isn't supported for non-binary file handles. I wouldn't mind also marking #23854 and #13068 as fixed (one half is fixed and the other half is 'not supported').
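A small standard-library sketch of the two points above: a text-mode handle rejects encoded bytes, and the encoding of a text-mode handle is chosen in the open call. The file name and rows are made up for illustration.

```python
import csv
import io
import os
import tempfile

rows = [["name"], ["Zoë"]]

# A text-mode handle expects str, so writing encoded bytes raises TypeError.
try:
    io.StringIO().write("Zoë".encode("utf-8"))
    bytes_rejected = False
except TypeError:
    bytes_rejected = True

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "names.csv")

    # For a text-mode handle, the encoding is fixed when the file is opened.
    with open(path, "w", encoding="utf-16", newline="") as text_handle:
        csv.writer(text_handle).writerows(rows)

    # Read the raw bytes back to confirm which encoding ended up on disk.
    with open(path, "rb") as binary_handle:
        round_trip = binary_handle.read().decode("utf-16")
```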

@twoertwein twoertwein requested a review from jreback July 11, 2020 03:04
Contributor

@jreback jreback left a comment

thanks @twoertwein very comprehensive

just some clarifications and doc requests

doc string updates in to_csv look good - could expand even to include a full example in the main docs (io.rst) and link in the doc-string

doc/source/whatsnew/v1.1.0.rst
doc/source/whatsnew/v1.1.0.rst
doc/source/whatsnew/v1.1.0.rst
pandas/core/generic.py
pandas/io/formats/csvs.py
pandas/io/common.py
pandas/io/common.py
pandas/tests/io/formats/test_to_csv.py
pandas/tests/io/test_common.py
@pep8speaks

pep8speaks commented Jul 11, 2020

Hello @twoertwein! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-08-07 00:24:21 UTC

@twoertwein twoertwein requested a review from jreback July 14, 2020 03:21
@twoertwein
Member Author

twoertwein commented Jul 17, 2020

I overhauled a part of to_csv that re-invented parts of get_handle (my initial PR made use of these re-inventions). The code is smaller and simpler :)

@twoertwein
Member Author

twoertwein commented Jul 18, 2020

There seems to be an unclosed file handle on Windows causing a test failure. I cannot reproduce this error on my Linux machine. @jreback @gfyoung do you have an idea how to debug that? edit: it seems that at least one merged PR triggers the same failure. edit: seems to have been a CI hiccup, it passes now :)

This test has some unclosed file warnings, but other PRs have the same warnings (so hopefully not related to this PR).

Except for the above test failure, I think/hope this PR is done :)

@twoertwein
Member Author

rebased and 1.1 -> 1.2

@jreback @gfyoung

@gfyoung
Member

gfyoung commented Jul 29, 2020

@twoertwein : It looks like there are test failures unrelated to your PR. You may need to rebase or merge master again.

@twoertwein
Member Author

@gfyoung rebased. Master seems to have the same issue, all other tests pass :)

@gfyoung
Member

gfyoung commented Jul 29, 2020

@twoertwein : The changes look okay to me right now. Unfortunately, we will need to wait until the master failure is fixed.

@jreback jreback added this to the 1.2 milestone Aug 3, 2020
Contributor

@jreback jreback left a comment

@twoertwein very nice. minor test comments and a doc request. pls merge master and ping on green.

pandas/tests/io/formats/test_to_csv.py
pandas/tests/io/test_common.py
pandas/tests/io/test_compression.py
@@ -13,6 +13,25 @@ including other versions of pandas.
Enhancements
~~~~~~~~~~~~

.. _whatsnew_120.binary_handle_to_csv:

Support for binary file handles in ``to_csv``
Contributor

can you add this example to the io.rst section as well

Member Author

I added a new subsection in "Data Handling". Is that an appropriate place?

@jreback
Contributor

jreback commented Aug 3, 2020

@gfyoung if any other comments.

pandas/core/generic.py
Contributor

@jreback jreback left a comment

minor comment, pls merge master and ping on green.

doc/source/user_guide/io.rst
@twoertwein
Member Author

@jreback ping

@jreback jreback merged commit 3b88446 into pandas-dev:master Aug 7, 2020
@jreback
Contributor

jreback commented Aug 7, 2020

thanks @twoertwein very nice!

@twoertwein
Member Author

@jreback and @gfyoung Thank you very much for your help getting this PR merged :)

@@ -3080,6 +3086,10 @@ def to_csv(
supported for compression modes 'gzip' and 'bz2'
as well as 'zip'.

.. versionchanged:: 1.2.0

Compression is supported for non-binary file objects.
Member Author

should have been "Compression is supported for binary file objects."! @jreback Should I create a one-line PR for this?

Member Author

I opened a PR: #35615

Labels
IO CSV read_csv, to_csv IO Data IO issues that don't fit into a more specific label
Projects
None yet
4 participants