str accessor fails for object-typed data that is actually numeric #11939

Closed
michaelbilow opened this issue Jan 1, 2016 · 5 comments
Labels: Dtype Conversions (unexpected or buggy dtype conversions), Strings (string extension data type and string data), Usage Question

Comments

@michaelbilow

Since convert_objects has been deprecated (#11221), I've been removing it from my code, and doing so exposes a bug that matters to me. I've reproduced it in simplified form here:

import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [1, 2]})
print df.dtypes  # The dtype of a is int64
df.a = df.a.astype(np.object)
print df.dtypes  # Now the dtype of a is object
df.a.str.decode('utf-8', errors='ignore').str.encode('utf-8')
# AttributeError: Can only use .str accessor with string values,
# which use np.object_ dtype in pandas

In my code, I use convert_objects to convert every object column that can be converted into a more specific dtype, then I use each column's dtype to check whether it can be handled by the unicode decode-encode step.
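As the traceback shows, an object dtype does not guarantee that a column holds strings. A minimal sketch of a stricter check, assuming Python 3 and a pandas version that provides `pandas.api.types.infer_dtype` (neither is what the thread itself uses); the helper name `is_string_column` is hypothetical:

```python
import pandas as pd
from pandas.api.types import infer_dtype


def is_string_column(series):
    """Return True if an object-dtype Series holds only str values (NaN ignored).

    Hypothetical helper: it inspects the values, not just the dtype label.
    """
    return series.dtype == object and infer_dtype(series, skipna=True) == "string"


# Both columns are forced to object dtype, but only "b" actually holds strings.
df = pd.DataFrame({
    "a": pd.Series([1, 2], dtype=object),
    "b": pd.Series(["x", "y"], dtype=object),
})

string_cols = [col for col in df.columns if is_string_column(df[col])]
print(string_cols)
```

Only the columns in `string_cols` would then be passed to the `.str` accessor, so a numeric column that happens to be labeled object is skipped instead of raising.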

@mortada
Contributor

mortada commented Jan 2, 2016

It seems odd to me to use .astype(np.object), since you are essentially trying to convert the integers into strings or bytes. I'd suggest something like .apply(str):

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'a': [1, 2]})

In [3]: print df.dtypes
a    int64
dtype: object

In [4]: df.a = df.a.apply(str)

In [5]: print df.dtypes
a    object
dtype: object

In [6]: df.a.str.decode('utf-8', errors='ignore').str.encode('utf-8')
Out[6]:
0    1
1    2
Name: a, dtype: object

This works for me with Python 2.7.x and pandas 0.17.1.

@jreback
Contributor

jreback commented Jan 2, 2016

@michaelbilow you shouldn't use object dtype unless the column holds only strings. Putting numbers and the like in object dtype, while possible, is non-performant.

This is behaving correctly, so closing.

@jreback closed this as completed Jan 2, 2016
@jreback added the Dtype Conversions, Usage Question, and Strings labels Jan 2, 2016
@michaelbilow
Author

Thanks for explaining. The bigger issue is that I'm trying to do the unicode conversion (for the purpose of writing to Excel) on a huge number of completely different dataframes, to the point that it would be a big headache to work out what the actual dtype of each column labeled object should be.

df.convert_objects() does a very good job of figuring out what is a string and what isn't, but with it being deprecated, I don't know of an efficient replacement.

@jreback
Contributor

jreback commented Jan 2, 2016

The replacements are pd.to_numeric/pd.to_datetime/pd.to_timedelta. The point is that .convert_objects would essentially convert too aggressively. You need to have some idea what the things are (or you are certainly welcome to try progressive conversions yourself, but it's too magical for pandas to do this directly).
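One way to read "progressive conversions" is to try each of the three converters in turn and keep the first one that succeeds. The following is only a sketch under that assumption, written for Python 3 and a recent pandas; the helper name `convert_progressively` and the order of the attempts are illustrative choices, not something the thread specifies:

```python
import warnings

import pandas as pd


def convert_progressively(series):
    """Try numeric, then datetime, then timedelta; fall back to the input.

    Hypothetical helper: only object-dtype Series are touched, and the first
    converter that does not raise wins.
    """
    if series.dtype != object:
        return series
    for converter in (pd.to_numeric, pd.to_datetime, pd.to_timedelta):
        try:
            with warnings.catch_warnings():
                # Silence format-inference warnings from failed parse attempts.
                warnings.simplefilter("ignore")
                return converter(series)
        except (ValueError, TypeError):
            continue
    return series


df = pd.DataFrame({
    "n": pd.Series([1, 2], dtype=object),
    "d": pd.Series(["2016-01-01", "2016-01-02"], dtype=object),
    "s": pd.Series(["x", "y"], dtype=object),
})
converted = df.apply(convert_progressively)
print(converted.dtypes)
```

Note that this inherits the aggressiveness concern raised above: a column of digit-only strings such as "1" and "2" would be converted to numbers, which may or may not be what the caller wants.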

@michaelbilow
Author

I see. For my purposes, perhaps I'm better off either converting all the unknown object columns using astype(str) or catching the AttributeErrors as they arise, since the only thing I really care about is dodging the UnicodeDecodeErrors that come up when writing to Excel.
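The first of those two options could look roughly like the sketch below, assuming Python 3 and a recent pandas; `stringify_object_columns` is a hypothetical helper name. One caveat worth stating: `astype(str)` also stringifies missing values, which in many pandas versions yields literal text such as "nan".

```python
import pandas as pd


def stringify_object_columns(df):
    """Return a copy of df with every object-dtype column cast to str.

    Hypothetical helper: columns with a specific dtype (int64, float64, ...)
    are left untouched.
    """
    out = df.copy()
    object_cols = [col for col in out.columns if out[col].dtype == object]
    for col in object_cols:
        out[col] = out[col].astype(str)
    return out


df = pd.DataFrame({
    "a": pd.Series([1, 2], dtype=object),  # object dtype, integer values
    "b": [3, 4],                           # plain int64 column
})
result = stringify_object_columns(df)

# The .str accessor no longer raises AttributeError on column "a".
padded = result["a"].str.zfill(3)
print(padded.tolist())
```

The trade-off is that numeric information in object columns is lost in the output, which is acceptable only if the sole goal is a clean write to Excel.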
