BUG: Avoids b' prefix for bytes in to_csv() (#9712) #35004
The CI check complains on linting: |
We shouldn't need a new keyword for this - can you not just use the |
So, I had tried (and failed) to get the csv writer behavior for writing bytes changed in Python itself; you can see this discussion on the bpo issue tracker. Some of the maintainers there raised a very valid point (here) about the potential for data loss (mojibake) when the bytes to be written have been collected from an unknown source that uses a different encoding scheme from the one with which the csv file is opened. For most cases I do agree that |
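To illustrate the mojibake concern raised above, here is a minimal sketch using only the standard library. The sample string and the two codecs are assumptions chosen for illustration; they do not come from the bpo discussion.

```python
# Illustrative only: the text and codecs are assumed, not taken from the thread.
original = "café"

# Bytes produced by a source that used UTF-8.
payload = original.encode("utf-8")

# Decoding with a different codec succeeds silently but corrupts the text.
garbled = payload.decode("latin-1")
print(garbled)  # cafÃ©

# The reverse mismatch fails loudly instead of corrupting silently.
strict_failure = None
try:
    original.encode("latin-1").decode("utf-8")
except UnicodeDecodeError as exc:
    strict_failure = exc
print(type(strict_failure).__name__)
```

Whether a mismatch corrupts silently or raises depends on the codec pair, which is one reason an implicit decode can be risky when the origin of the bytes is unknown.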
I personally agree with what the Python devs said - the current behavior isn't all that terrible, or maybe an error should be raised. If neither of those, then at the very least |
Hello @sidhant007! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2020-06-29 15:42:06 UTC |
If you look at the issue (#9712), there seems to be a consensus that the current behavior is indeed a bug (here), and the participants in that discussion generally preferred getting it fixed over raising an exception. I understand your point of view that the new keyword can cause confusion. Accordingly I have updated my PR removing the |
For some perspective, that comment is 4 years old and came at a time when we still had Python 2 compat, so it probably needs to be reconsidered. I am -1 on doing this and think we should match the stdlib, but @TomAugspurger might have other thoughts |
I wasn't aware of the stdlib's behavior here, so feel free to ignore my opinion. |
The stdlib csv behavior is not useful for users. It does satisfy some sort of conceptual purity, but as we know Pandas is not so constrained. We should make If you look at comments from users about this topic, nobody says they prefer the A lot of data comes into Pandas from outside of Python, and |
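For reference, the stdlib behavior under discussion can be reproduced with a short sketch. The row values are made up for illustration; the only API used is `csv.writer` from the standard library, which writes non-string fields using `str()`.

```python
import csv
import io

# Write one row containing a bytes value and the equivalent str value.
buf = io.StringIO()
csv.writer(buf).writerow([b"abc", "abc"])

# The bytes field is written as its Python repr, including the b'' wrapper.
line = buf.getvalue()
print(repr(line))  # "b'abc',abc\r\n"
```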
I'm for raising an error if we detect bytes. Transcoding bytes to text seems slightly suspect here; sure, it's convenient, but this is deliberately bypassing the clear difference between bytes and strings. We and Python are very strict about this, so relaxing it is -1 |
needs discussion; impl is adding a lot of pure tech debt
@jreback I don't think this would be "bypassing the clear difference between bytes and strings", rather the initial proposal was to have an explicit parameter to control the encoding of bytes into strings. Would you like that version back again? The one where |
@jzwinck I appreciate your point, but what is so hard about .decode()? That is the canonical way to do this (and it is clear this is an exceptional case) |
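A minimal sketch of the explicit `.decode()` route mentioned here, assuming a single bytes column and assuming the bytes are UTF-8. The frame, the column name `name`, and the codec are hypothetical; `Series.str.decode` and `DataFrame.to_csv` are existing pandas methods.

```python
import pandas as pd

# Hypothetical frame with one column of bytes values.
df = pd.DataFrame({"name": [b"abc", b"def"]})

# Without decoding, the bytes are written via their repr (the b' prefix
# this PR is about).
raw_csv = df.to_csv(index=False)

# Decode explicitly before writing; the caller chooses the codec (assumed UTF-8).
decoded = df.assign(name=df["name"].str.decode("utf-8"))
clean_csv = decoded.to_csv(index=False)

print(raw_csv)
print(clean_csv)
```

The decode step has to be repeated for every bytes column, which is the inconvenience the other side of this thread is pointing at.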
@jreback Would you prefer a solution where |
I don't think we need to provide a helper function at all; just raise an error |
@sidhant007 can you update this PR to raise instead? |
@WillAyd I don't agree with the approach of raising an error and am thus closing this PR. I will remain open to discussing other suggestions/ideas on how to tackle this issue. |
Avoids the b' prefix written for bytes in the to_csv() method (in accordance with this proposal).
The encoding parameter passed to the to_csv() method is used as the encoding scheme to decode the bytes.
Example:
After the bug fix, it will print:
Currently the to_csv() method prints:
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff