Skip to content

BUG: Avoids b' prefix for bytes in to_csv() (#9712) #35004

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

sidhant007
Copy link

@sidhant007 sidhant007 commented Jun 26, 2020

Avoids b' prefix written for bytes in the to_csv() method (in accordance with this proposal)

The encoding parameter passing in to_csv() method is used as the encoding scheme to decode the bytes.

Example:

import pandas as pd
import sys
df = pd.DataFrame({
    b'hello': [b'abc', b'def'],
    b'world': ['pqr', 'xyz']
})
df.to_csv(sys.stdout, encoding='utf-8')

After the bug fix will print:

,hello,world
0,abc,pqr
1,def,xyz

Currently the to_csv() method prints:

,b'hello',b'world'
0,b'abc',pqr
1,b'def',xyz

@sidhant007
Copy link
Author

The CI check complains on linting:
##[error]./pandas/core/apply.py:9:1:F401:'pandas._libs.reduction as libreduction' imported but unused
I believe this error is not associated with this PR.

@WillAyd
Copy link
Member

WillAyd commented Jun 26, 2020

We shouldn't need a new keyword for this - can you not just use the encoding parameter?

@WillAyd WillAyd added the IO CSV read_csv, to_csv label Jun 26, 2020
@sidhant007
Copy link
Author

sidhant007 commented Jun 26, 2020

We shouldn't need a new keyword for this - can you not just use the encoding parameter?

So, I had tried (and failed) to get the csv writer behavior for writing bytes changed in Python itself can see this discussion on the bpo issue tracker. Some of the maintainers there raised a very valid point (here) of how there can be potential data-loss((mojibake) in-cases when the bytes to be written have been collected from an unknown source that uses a different encoding scheme compared to the encoding scheme with which the csv file is opened.

For most cases I do agree that encoding parameter will just work, that is why I proposed the bytes_encoding to be an optional parameter, which if not passed takes the value of what the encoding parameter is.

@WillAyd
Copy link
Member

WillAyd commented Jun 26, 2020

I personally agree with what the Python devs said - the current behavior isn't all that terrible or may an error should be raised. If neither of those, then at the very least encoding could be used, but definitely a new keyword here is confusing

@pep8speaks
Copy link

pep8speaks commented Jun 29, 2020

Hello @sidhant007! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-06-29 15:42:06 UTC

@sidhant007
Copy link
Author

sidhant007 commented Jun 29, 2020

I personally agree with what the Python devs said - the current behavior isn't all that terrible or may an error should be raised. If neither of those, then at the very least encoding could be used, but definitely a new keyword here is confusing

If you look at the issue (#9712) there seems to be a consensus that the current behavior is indeed a bug (here) and the participants on that discussion in general showed preference of getting it fixed compared to raising an exception.

I understand your point of view that the new keyword can cause confusion. Accordingly I have updated my PR removing the bytes_encoding additional parameter and using the encoding parameter provided instead.

@WillAyd
Copy link
Member

WillAyd commented Jun 29, 2020

For some perspective that comment is 4 years old and came at at time when we still had Python2 compat, so probably needs to be reconsidered. I am -1 on doing this and think we should match the stdlib, but @TomAugspurger might have other thoughts

@TomAugspurger
Copy link
Contributor

I wasn't aware of the stdlib's behavior here, so feel free to ignore my opinion.

@jzwinck
Copy link
Contributor

jzwinck commented Jul 2, 2020

@WillAyd @TomAugspurger

The stdlib csv behavior is not useful for users. It does satisfy some sort of conceptual purity, but as we know Pandas is not so constrained. We should make pd.to_csv() useful for users, not beholden to an ideal of encodings that many real programs won't achieve (and sometimes can't, because they'd run out of memory due to the double-storing everything plus the fact that unicode takes more space than bytes, so more than 2x space overhead).

If you look at comments from users about this topic, nobody says they prefer the b' prefix in their CSV output as a user. The only people who prefer it are developers.

A lot of data comes into Pandas from outside of Python, and to_csv() is one of the most popular ways to get data out. No other program understands the b' prefixes, so this breaks the main use case. And a lot of data also arrives via NumPy, where 'S' still means bytes, even in Python 3.

@jreback
Copy link
Contributor

jreback commented Jul 2, 2020

i’m raising an error we if detect bytes

transcoding bytes to text seems slightly suspect here; sure it’s convenient but this deliberately bypassing the clear difference between bytes and strings; we and python are very strict about this so relaxing it is -1

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needs discussion; impl is adding a lot of pure tech debt

@jzwinck
Copy link
Contributor

jzwinck commented Jul 2, 2020

@jreback I don't think this would be "bypassing the clear difference between bytes and strings", rather the initial proposal was to have an explicit parameter to control the encoding of bytes into strings. Would you like that version back again? The one where bytes_encoding is a separate parameter? It certainly doesn't let anyone bypass anything, it is just a more performant way of decoding all the bytes instead of doing it column-by-column and maybe running out of memory along the way.

@jreback
Copy link
Contributor

jreback commented Jul 2, 2020

@jzwinck i appreciate your point but what is so hard about .decode()

that is the canonical way to do this (and is clear this is an exceptional case)

@jzwinck
Copy link
Contributor

jzwinck commented Jul 2, 2020

@jreback Would you prefer a solution where to_csv() on bytes raises an error and tells the user to first invoke some helper function (which we could add if it doesn't already exist) to create a copy of their DataFrame with all bytes converted to unicode so they can call to_csv() on that? This would consume >100% extra memory compared with the solution currently in this PR, but at least it could be convenient.

@jreback
Copy link
Contributor

jreback commented Jul 2, 2020

i don’t think we need to provide a helper function at all

just raise an error

@WillAyd
Copy link
Member

WillAyd commented Jul 6, 2020

@sidhant007 can you update this PR to raise instead?

@sidhant007
Copy link
Author

@WillAyd I dont agree with the approach of raising an error and am thus closing this PR. I willl remain open to discuss other suggestions/ideas on how to tackle this issue.

@sidhant007 sidhant007 closed this Jul 8, 2020
@sidhant007 sidhant007 deleted the to-csv-bytes-fix branch July 9, 2020 15:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO CSV read_csv, to_csv
Projects
None yet
Development

Successfully merging this pull request may close these issues.

to_csv and bytes on Python 3.
6 participants