str accessor fails for object-typed data that is actually numeric #11939

michaelbilow · 2016-01-01T19:35:58Z

Now that convert_objects has been deprecated (#11221), one thing matters to me: removing it from my code triggers a bug, which I've reproduced in simplified form here:

import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [1, 2]})
print df.dtypes  #The dtype of a is an int64
df.a = df.a.astype(np.object)
print df.dtypes  #Now, the dtype of a is an object
df.a.str.decode('utf-8', errors='ignore').str.encode('utf-8')
## AttributeError: Can only use .str accessor with string values, 
## which use np.object_ dtype in pandas

In my code, I use convert_objects to convert every object column that can be converted to a more specific dtype, then I use each column's dtype to check whether it can be handled by the unicode decode-encode step.

mortada · 2016-01-02T01:04:00Z

seems odd to me to do .astype(np.object), as you are essentially trying to convert the integers into strings or bytes? I'd suggest something like .apply(str)

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'a': [1, 2]})

In [3]: print df.dtypes
a    int64
dtype: object

In [4]: df.a = df.a.apply(str)

In [5]: print df.dtypes
a    object
dtype: object

In [6]: df.a.str.decode('utf-8', errors='ignore').str.encode('utf-8')
Out[6]:
0    1
1    2
Name: a, dtype: object

this works for me with python2.7.x and pandas 0.17.1

jreback · 2016-01-02T01:25:50Z

@michaelbilow you shouldn't use object except if you have only strings. Putting numbers and such in object dtypes, while possible, is non-performant.

This is behaving correctly, so closing.

michaelbilow · 2016-01-02T01:43:55Z

Thanks for explaining. The bigger issue is that I'm trying to do the unicode conversion (for the purpose of writing to Excel) on a huge number of completely different dataframes, to the point that it'd be a big headache to figure out what the actual dtype of each column that ends up labeled object should be.

df.convert_objects() does a very good job of figuring out what is a string and what isn't, but with it being deprecated I don't know of an efficient replacement.

jreback · 2016-01-02T01:47:19Z

the replacements are pd.to_numeric/pd.to_datetime/pd.to_timedelta. The point is that .convert_objects essentially would convert too aggressively. You need to have some idea what the things are (or you are certainly welcome to try progressive conversions yourself, but it's too magical for pandas to do this directly).
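
For illustration, the "progressive conversions" idea could be sketched roughly as below. This is only a sketch under assumptions, not something from pandas or from this thread: convert_progressively is a hypothetical helper name, the order of attempts (numeric first, then datetime) is an arbitrary choice, and a column is replaced only when every value converts without error.

```python
import pandas as pd


def convert_progressively(df):
    """Try stricter dtypes on each object column (hypothetical helper).

    Not a pandas API. Each object column is passed to pd.to_numeric and
    then pd.to_datetime; the first conversion that succeeds for the whole
    column wins, and columns that fail both are left as object.
    """
    out = df.copy()
    for name in out.columns:
        col = out[name]
        if col.dtype != object:
            continue  # already has a specific dtype, leave it alone
        for converter in (pd.to_numeric, pd.to_datetime):
            try:
                # Both converters raise by default on unparseable input.
                out[name] = converter(col)
                break
            except (ValueError, TypeError):
                continue  # try the next converter
    return out
```

Whatever is still object after such a pass is a candidate for string handling. Because pd.to_datetime accepts many formats, a pass like this can still convert more than intended, so it is probably safest on columns whose likely contents are roughly known.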

michaelbilow · 2016-01-02T02:16:16Z

I see. For my purposes, perhaps I'm better off either converting all the unknown object types using astype(str) or catching the AttributeErrors as they arise, since the only thing I really care about is dodging the UnicodeDecodeErrors that come up when writing to Excel.
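
A minimal sketch of the astype(str) option, assuming Python 3 (where text values are str) rather than the Python 2.7 used earlier in this thread. stringify_mixed_objects is a hypothetical helper name, not a pandas function. It casts an object column to str only when the column holds at least one non-string value, so the .str accessor can be used on it afterwards.

```python
import pandas as pd


def stringify_mixed_objects(df):
    """Cast object columns holding non-string values to str (hypothetical helper)."""
    out = df.copy()
    for name in out.columns:
        col = out[name]
        if col.dtype != object:
            continue  # only object columns are ambiguous
        values = col.dropna()
        # Leave columns alone when every non-null value is already a string.
        if not all(isinstance(v, str) for v in values):
            out[name] = col.astype(str)
    return out


df = pd.DataFrame({"a": pd.Series([1, 2], dtype=object), "b": ["x", "y"]})
cleaned = stringify_mixed_objects(df)
# .str methods no longer raise AttributeError on column "a"
upper = cleaned["a"].str.upper()
```

The trade-off is that numeric information is lost in the converted columns, which may be acceptable when the only goal is a clean write to Excel.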

jreback closed this as completed Jan 2, 2016

jreback added the Dtype Conversions, Usage Question, and Strings labels Jan 2, 2016
