
Encoding non-ascii characters in to_csv with encodings #1966


Closed · jseabold opened this issue Sep 25, 2012 · 5 comments
Labels
IO Data IO issues that don't fit into a more specific label

Comments

@jseabold (Contributor)

Right now, if you want to use an encoding, you have to make sure that every field containing non-ascii characters is already in that encoding. Maybe this is okay, but I'm working with a lot of data, and doing these checks constantly is cumbersome. For example:

from StringIO import StringIO
import pandas

df = pandas.read_table(StringIO('Ki\xc3\x9fwetter, Wolfgang;Ki\xc3\x9fwetter, Wolfgang'), sep=";", header=None)
# The cumbersome part: every column holding non-ascii text must be
# decoded by hand before to_csv will write it with the target encoding.
df["X.1"] = df["X.1"].apply(lambda x: x.decode('utf-8'))
df.to_csv("blah.csv", encoding="utf-8")

The question is: should the user be worrying about this, or is there a "safe_encode" that could be used instead, similar in idea to #1804?
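For illustration, a minimal sketch of what such a helper might look like (the name `safe_encode` and its behavior are assumptions taken from the question above, not an actual pandas API): encode text values to the target encoding, and pass already-encoded bytes and non-string values through untouched.

```python
def safe_encode(value, encoding='utf-8'):
    # Hypothetical helper, not a pandas API: only text needs encoding;
    # bytes and non-string values (numbers, None, ...) pass through as-is.
    if isinstance(value, str):
        return value.encode(encoding)
    return value
```

A writer could apply this to every cell just before output, instead of requiring the caller to pre-normalize each column.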

@jseabold (Contributor, Author)

Just ran into this again. I'm starting to feel that the I/O operations should do the conversion. For example, to read a utf-8 file and then write it out as latin-1, it would be great if I could do

read_csv(..., encoding='utf-8')
to_csv(..., encoding='latin-1')

but currently you have to go through an intermediate apply step for each string-like column. I might work on a PR for this if I can find some time this afternoon.
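A sketch of the round trip described above, as it behaves in modern (Python 3) pandas, where the I/O layer does the conversion and no per-column decode step is needed; at the time of this issue the intermediate `apply(...decode...)` step was still required.

```python
import io
import tempfile

import pandas as pd

# Read UTF-8 bytes: read_csv decodes every text column for us.
utf8_bytes = 'name;city\nKißwetter;München\n'.encode('utf-8')
df = pd.read_csv(io.BytesIO(utf8_bytes), sep=';', encoding='utf-8')

# Write back out as Latin-1: to_csv encodes on the way out.
with tempfile.NamedTemporaryFile(suffix='.csv', delete=False) as f:
    out_path = f.name
df.to_csv(out_path, encoding='latin-1', index=False)

with open(out_path, 'rb') as f:
    data = f.read()
# In Latin-1, 'ß' is the single byte 0xDF and 'ü' is 0xFC.
```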

@wesm (Member)

wesm commented Sep 28, 2012

Note that you can use x.str.decode('utf-8') now. Not that that helps much.
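Concretely, wesm's suggestion is the vectorized form of the decode step: a Series of raw bytes can be decoded in one `.str.decode` call rather than a Python-level `apply` with a lambda.

```python
import pandas as pd

# A column of raw UTF-8 bytes (b'\xc3\x9f' is 'ß' in UTF-8).
s = pd.Series([b'Ki\xc3\x9fwetter, Wolfgang'])

# One vectorized call replaces s.apply(lambda x: x.decode('utf-8')).
decoded = s.str.decode('utf-8')
```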

@ghost

ghost commented Oct 3, 2012

After a week of digging through the unicode nether-regions of the tree, I support jseabold's comment on strictly enforcing unicode internally and encoding/decoding only at I/O points.

Otherwise, you end up with brittle assumptions about encodings and if-clauses handling corner cases all over the codebase.

@ghost

ghost commented Oct 3, 2012

I apologize for all these useless messages; GitHub is doing unexpected things, and I can't figure out how to purge them. Hopefully they'll go away when I tear down the PR branches.

I'll be more cautious in the future.

@wesm (Member)

wesm commented Nov 27, 2012

Passing encoding='utf-8' yields unicode strings now. Closing this issue.

In [4]: df = pandas.read_table(StringIO('Ki\xc3\x9fwetter, Wolfgang;Ki\xc3\x9fwetter, Wolfgang'), sep=";", header=None, encoding='utf-8')

In [5]: df
Out[5]: 
                    X0                   X1
0  Kißwetter, Wolfgang  Kißwetter, Wolfgang

In [6]: df['X0']
Out[6]: 
0    Kißwetter, Wolfgang
Name: X0

In [7]: df['X0'][0]
Out[7]: u'Ki\xdfwetter, Wolfgang'

@wesm wesm closed this as completed Nov 27, 2012