Support for numpy strings #5261
Well, it's a lot simpler. Fixed arrays are nice, but when changing elements you have to be very careful to avoid truncation. You would also have another type to support, as you always need object type (it's the top of the hierarchy). Most cython routines accept object already; these would need conversion both to and from. In my experience you need special handling when dealing with Unicode (so you would need another type, or just make these object). So it could be done, but it's a big project. |
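A minimal sketch of the truncation hazard mentioned above, using plain NumPy (nothing pandas-specific):

```python
import numpy as np

# A fixed-width numpy string array: '<U1' holds at most one character.
arr = np.array(['a', 'b', 'c'])
print(arr.dtype)  # <U1

# Assigning a longer string silently truncates to the fixed width.
arr[0] = 'hello'
print(arr[0])  # 'h'
```

No warning or error is raised; the extra characters are simply dropped, which is why element assignment needs care.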
That makes sense. I would be curious to know what the advantages of native
|
In theory, fixed array sizes should be much faster; however, another big issue is the lack of numpy nan support. |
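The missing-value problem can be seen directly in NumPy: a fixed-width string array has no sentinel for "missing", so `np.nan` gets stringified on assignment. A small sketch:

```python
import numpy as np

# Fixed-width string arrays have no missing-value support:
a = np.array(['abc', 'def'])   # dtype '<U3'
a[0] = np.nan                  # np.nan is converted via str() on assignment
print(a[0])                    # the literal string 'nan'
```

There is no way to distinguish a genuinely missing value from the three-character string 'nan', which is why pandas falls back to object dtype.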
See comments on #10351. In light of categoricals (and the lack of nan support), this is a non-starter for general strings. |
But I will leave this open as a specialized dtype option for a certain subset of analysis (e.g. it's probably useful to the bio guys). |
It is unfortunate to me that pandas cannot just do what I asked it to do. I know that I should exercise care using fixed-width strings, but they are much faster than Python objects. |
Another drawback of not having fixed-length numpy strings: when using pyarrow, the presence of columns with a dtype of 'object' prevents pyarrow from (de)serializing in a zero-copy fashion. As the user of the data, I can often determine the maximum length of my text columns, and the string routines can happily truncate my strings to the specified length (the same goes for assignment). |
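The opt-in truncation this comment describes is already available in plain NumPy when the maximum length is known, e.g. by casting to a narrower fixed width (a sketch; the width 10 is an arbitrary example):

```python
import numpy as np

# If the maximum useful length is known, truncate explicitly:
s = np.array(['short', 'a much longer string'])
clipped = s.astype('<U10')     # astype to a narrower width truncates
print(clipped[1])              # 'a much lon'
```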
I suspect that proper string support will land sometime soon, perhaps next year. In the meantime, you could write a simple extension array to convince pandas not to coerce your fixed-width ndarray of strings to object. You shouldn't take this exact approach, since it uses private methods, but this is the basic idea:

```python
In [39]: class NumPyString(pd.core.arrays.PandasArray):
    ...:     _typ = 'extensionarray'

In [40]: arr = np.array(['a', 'b', 'c'])

In [41]: s = pd.Series(NumPyString(arr))

In [42]: s
Out[42]:
0    a
1    b
2    c
dtype: str32

In [46]: s.array._ndarray
Out[46]: array(['a', 'b', 'c'], dtype='<U1')
```

Things like string methods won't work either. |
With StringDType, I think this issue can be closed. |
Agreed, thanks. xref #35169 for those interested in following along. |
This is more of a question than an issue:
why does pandas use object dtype rather than native fixed-length numpy strings?
For example:
but:
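The original code samples above were lost in extraction, but a minimal sketch of the behavior the question is asking about might look like this (the exact snippets here are illustrative, not the originals):

```python
import numpy as np
import pandas as pd

arr = np.array(['a', 'b', 'c'])
print(arr.dtype)    # numpy keeps a fixed-width string dtype: <U1

s = pd.Series(arr)  # pandas has historically coerced this to object dtype
print(s.dtype)
```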