BUG: Avoids b' prefix for bytes in to_csv() (#9712) #35004

sidhant007 · 2020-06-26T04:33:51Z

Avoids b' prefix written for bytes in the to_csv() method (in accordance with this proposal)

The encoding parameter passing in to_csv() method is used as the encoding scheme to decode the bytes.

Example:

import pandas as pd
import sys
df = pd.DataFrame({
    b'hello': [b'abc', b'def'],
    b'world': ['pqr', 'xyz']
})
df.to_csv(sys.stdout, encoding='utf-8')

After the bug fix will print:

,hello,world
0,abc,pqr
1,def,xyz

Currently the to_csv() method prints:

,b'hello',b'world'
0,b'abc',pqr
1,b'def',xyz

closes to_csv and bytes on Python 3. #9712
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

sidhant007 · 2020-06-26T06:45:19Z

The CI check complains on linting:
##[error]./pandas/core/apply.py:9:1:F401:'pandas._libs.reduction as libreduction' imported but unused
I believe this error is not associated with this PR.

WillAyd · 2020-06-26T15:31:48Z

We shouldn't need a new keyword for this - can you not just use the encoding parameter?

sidhant007 · 2020-06-26T16:27:04Z

We shouldn't need a new keyword for this - can you not just use the encoding parameter?

So, I had tried (and failed) to get the csv writer behavior for writing bytes changed in Python itself can see this discussion on the bpo issue tracker. Some of the maintainers there raised a very valid point (here) of how there can be potential data-loss((mojibake) in-cases when the bytes to be written have been collected from an unknown source that uses a different encoding scheme compared to the encoding scheme with which the csv file is opened.

For most cases I do agree that encoding parameter will just work, that is why I proposed the bytes_encoding to be an optional parameter, which if not passed takes the value of what the encoding parameter is.

WillAyd · 2020-06-26T16:32:14Z

I personally agree with what the Python devs said - the current behavior isn't all that terrible or may an error should be raised. If neither of those, then at the very least encoding could be used, but definitely a new keyword here is confusing

pep8speaks · 2020-06-29T14:21:33Z

Hello @sidhant007! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-06-29 15:42:06 UTC

sidhant007 · 2020-06-29T14:25:51Z

I personally agree with what the Python devs said - the current behavior isn't all that terrible or may an error should be raised. If neither of those, then at the very least encoding could be used, but definitely a new keyword here is confusing

If you look at the issue (#9712) there seems to be a consensus that the current behavior is indeed a bug (here) and the participants on that discussion in general showed preference of getting it fixed compared to raising an exception.

I understand your point of view that the new keyword can cause confusion. Accordingly I have updated my PR removing the bytes_encoding additional parameter and using the encoding parameter provided instead.

WillAyd · 2020-06-29T16:15:59Z

For some perspective that comment is 4 years old and came at at time when we still had Python2 compat, so probably needs to be reconsidered. I am -1 on doing this and think we should match the stdlib, but @TomAugspurger might have other thoughts

TomAugspurger · 2020-06-29T16:20:34Z

I wasn't aware of the stdlib's behavior here, so feel free to ignore my opinion.

jzwinck · 2020-07-02T05:57:20Z

@WillAyd @TomAugspurger

The stdlib csv behavior is not useful for users. It does satisfy some sort of conceptual purity, but as we know Pandas is not so constrained. We should make pd.to_csv() useful for users, not beholden to an ideal of encodings that many real programs won't achieve (and sometimes can't, because they'd run out of memory due to the double-storing everything plus the fact that unicode takes more space than bytes, so more than 2x space overhead).

If you look at comments from users about this topic, nobody says they prefer the b' prefix in their CSV output as a user. The only people who prefer it are developers.

A lot of data comes into Pandas from outside of Python, and to_csv() is one of the most popular ways to get data out. No other program understands the b' prefixes, so this breaks the main use case. And a lot of data also arrives via NumPy, where 'S' still means bytes, even in Python 3.

jreback · 2020-07-02T10:20:50Z

i’m raising an error we if detect bytes

transcoding bytes to text seems slightly suspect here; sure it’s convenient but this deliberately bypassing the clear difference between bytes and strings; we and python are very strict about this so relaxing it is -1

jreback

needs discussion; impl is adding a lot of pure tech debt

jzwinck · 2020-07-02T11:03:54Z

@jreback I don't think this would be "bypassing the clear difference between bytes and strings", rather the initial proposal was to have an explicit parameter to control the encoding of bytes into strings. Would you like that version back again? The one where bytes_encoding is a separate parameter? It certainly doesn't let anyone bypass anything, it is just a more performant way of decoding all the bytes instead of doing it column-by-column and maybe running out of memory along the way.

jreback · 2020-07-02T11:15:30Z

@jzwinck i appreciate your point but what is so hard about .decode()

that is the canonical way to do this (and is clear this is an exceptional case)

jzwinck · 2020-07-02T14:04:25Z

@jreback Would you prefer a solution where to_csv() on bytes raises an error and tells the user to first invoke some helper function (which we could add if it doesn't already exist) to create a copy of their DataFrame with all bytes converted to unicode so they can call to_csv() on that? This would consume >100% extra memory compared with the solution currently in this PR, but at least it could be convenient.

jreback · 2020-07-02T14:31:12Z

i don’t think we need to provide a helper function at all

just raise an error

WillAyd · 2020-07-06T18:03:20Z

@sidhant007 can you update this PR to raise instead?

sidhant007 · 2020-07-08T01:00:02Z

@WillAyd I dont agree with the approach of raising an error and am thus closing this PR. I willl remain open to discuss other suggestions/ideas on how to tackle this issue.

sidhant007 force-pushed the to-csv-bytes-fix branch from 0b97788 to 3869dc7 Compare June 26, 2020 05:19

WillAyd added the IO CSV read_csv, to_csv label Jun 26, 2020

sidhant007 force-pushed the to-csv-bytes-fix branch from f7e480e to 636a915 Compare June 29, 2020 14:21

sidhant007 force-pushed the to-csv-bytes-fix branch from 636a915 to 549f577 Compare June 29, 2020 14:23

sidhant007 force-pushed the to-csv-bytes-fix branch from 549f577 to 17cfd73 Compare June 29, 2020 15:35

BUG: Avoids b' prefix for bytes in to_csv() (pandas-dev#9712)

385991b

sidhant007 force-pushed the to-csv-bytes-fix branch from 17cfd73 to 385991b Compare June 29, 2020 15:42

jreback requested changes Jul 2, 2020

View reviewed changes

sidhant007 closed this Jul 8, 2020

sidhant007 deleted the to-csv-bytes-fix branch July 9, 2020 15:56

Uh oh!

BUG: Avoids b' prefix for bytes in to_csv() (#9712) #35004

BUG: Avoids b' prefix for bytes in to_csv() (#9712) #35004

Uh oh!

Conversation

sidhant007 commented Jun 26, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sidhant007 commented Jun 26, 2020

Uh oh!

WillAyd commented Jun 26, 2020

Uh oh!

sidhant007 commented Jun 26, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

WillAyd commented Jun 26, 2020

Uh oh!

pep8speaks commented Jun 29, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Comment last updated at 2020-06-29 15:42:06 UTC

Uh oh!

sidhant007 commented Jun 29, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

WillAyd commented Jun 29, 2020

Uh oh!

TomAugspurger commented Jun 29, 2020

Uh oh!

jzwinck commented Jul 2, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jreback commented Jul 2, 2020

Uh oh!

jreback left a comment

Choose a reason for hiding this comment

Uh oh!

jzwinck commented Jul 2, 2020

Uh oh!

jreback commented Jul 2, 2020

Uh oh!

jzwinck commented Jul 2, 2020

Uh oh!

jreback commented Jul 2, 2020

Uh oh!

WillAyd commented Jul 6, 2020

Uh oh!

sidhant007 commented Jul 8, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

sidhant007 commented Jun 26, 2020 •

edited

Loading

sidhant007 commented Jun 26, 2020 •

edited

Loading

pep8speaks commented Jun 29, 2020 •

edited

Loading

sidhant007 commented Jun 29, 2020 •

edited

Loading

jzwinck commented Jul 2, 2020 •

edited

Loading