ENH: Need API support and __repr__ to discover the storage used for strings #59342
Comments
@arnaudlegout thanks for opening the issue! First quick note: at the moment numpy 2.0 string dtype is not supported in the Then, the API to inspect and discover the storage is actually already available, as the
>>> import pandas as pd
>>> pd.options.future.infer_string = True
>>> ser = pd.Series(["a", "b"], dtype="str")
>>> ser.dtype
str
>>> ser.dtype.storage
'pyarrow'
So I think the main discussion is how the
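A minimal sketch of how the existing attribute could serve as a guard before expensive string work. The helper name `require_storage` is hypothetical and not a pandas API; it only assumes that the dtype object exposes a `storage` attribute as in the session above, and it treats any dtype without that attribute as having no known storage.

```python
def require_storage(dtype, expected):
    """Raise if `dtype` does not report the expected string storage.

    Hypothetical helper, not part of pandas. It relies only on the
    `storage` attribute shown in the session above.
    """
    # Dtypes that have no `storage` attribute yield None here.
    actual = getattr(dtype, "storage", None)
    if actual != expected:
        raise TypeError(f"expected string storage {expected!r}, got {actual!r}")
    return actual
```

With the Series from the session above, `require_storage(ser.dtype, "pyarrow")` should simply return the storage name, while a dtype reporting a different storage (or none) would raise before any time-consuming code runs.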
I think it makes sense to have the storage and na_value as part of the repr. While @jorisvandenbossche is correct that you can inspect this with attributes, that also assumes developers know in advance what those attributes are. Putting it into the repr instead makes it a little clearer to developers what they might need to consider.
@WillAyd right, I was not aware of the I did not find the |
I think the name "python" for the fallback storage option is not future proof? If I'm reading the PDEP-14 right, the fallback is a numpy array of python
Do we have a timeline on this? It seems like PDEP-10 will be reverted by PDEP-15, so pyarrow is going to stay an optional dependency. Forcing users who just want the vectorisation speed benefits (and nothing more) to install pyarrow will in practice lessen the importance of the numpy 2.0 string implementation, as they would have already moved to pyarrow in pandas 3.0. pandas 3.0 is a golden opportunity to incorporate the numpy 2.0 string dtype, as users who shift to a newer major version of pandas would most likely also shift to a newer major version of numpy. But if we still don't want to force numpy 2.0, we could have an intermediate fallback, no? Basically I'm saying we should fast-track the numpy 2.0 string implementation xD
I think a numpy 2.0 string data type would need its own PDEP. We already have a proliferation of string data types in pandas, so it needs some discussion to define what value we see in adding another, and to define what its semantics are.
xref #60305 The outcome of that discussion could impact what should be done here. If we want dtypes that differ only in storage to be considered equal, and want to hide that implementation detail from users, then I guess we would not want to update the repr to display the storage? If the dtypes for the array with numpy semantics and the array using pd.NA are not considered equal in equality checks, then it may be that the na_value should be included in the repr.
I really disagree with the willingness to "hide implementation details". The underlying implementation should not be considered a detail. It has strong performance implications (in both space and speed), and treating performance as a detail for a regular pandas user is, in my opinion, a mistake. The impact of the dtype implementations should be even better documented. Why is a numpy boolean 8 bits, a nullable numpy boolean 16 bits, and a nullable pyarrow boolean 2 bits? Why is changing a string in a large string Series fast with "string[python]" and slow with "string[pyarrow]"? Why does pyarrow produce a memory overflow in some cases? These implementation issues are not details and should not be hidden from the user (even the beginner).
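The per-value sizes quoted above can be reproduced with a little arithmetic. This is a rough sketch, not a measurement: it assumes a numpy boolean takes one byte, a masked nullable boolean stores one byte of data plus one byte of mask, and an Arrow boolean stores one bit of data plus one bit of validity. It ignores buffer padding and object overhead, and Arrow may omit the validity bitmap entirely when there are no nulls. The function name `bool_storage_bytes` is made up for illustration.

```python
import math


def bool_storage_bytes(n):
    """Approximate buffer sizes, in bytes, for n boolean values."""
    return {
        # numpy bool: one byte per value, no missing-value support
        "numpy bool": n,
        # masked nullable boolean: one data byte plus one mask byte per value
        "nullable (masked) boolean": 2 * n,
        # Arrow boolean: bit-packed values plus a bit-packed validity bitmap
        "pyarrow boolean": 2 * math.ceil(n / 8),
    }


n_values = 1_000_000
sizes = bool_storage_bytes(n_values)
for name, size in sizes.items():
    print(f"{name}: {size} bytes ({8 * size / n_values:g} bits per value)")
```

Under these assumptions, one million values come out at roughly 8, 16, and 2 bits per value respectively, which matches the figures in the comment.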
Originally raised in #58551 (comment)
Problem Description
With PDEP-14, developers need to be aware of the storage used for strings. Indeed, the storage can have a large impact on performance, for instance:
- pyarrow storage [...] ChunkedArray)
- python storage
- numpy 2.0 strings storage (I don't have a good knowledge of these new strings, and have never tested them)
Feature Description
I would like to have two ways to discover the storage:
- __repr__: its goal is to give information on the internals of an object; one option suggested by @jorisvandenbossche is to display <pandas.StringDtype(storage=...)> instead of string[storage]
- .get_storage: a method that returns the storage (not sure what is possible with the current implementation; a class would be best, otherwise a string). The API is useful to check that we have the correct storage before running time-consuming code.
Alternative Solutions
.
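To make the two requested forms concrete, here is a small stand-alone sketch in plain Python. `SketchStringDtype` is a hypothetical stand-in, not `pandas.StringDtype`, and the exact repr text (in particular how the `...` in `storage=...` is filled) is an assumption based on the suggestion quoted in the feature description.

```python
class SketchStringDtype:
    """Hypothetical stand-in for a string dtype; NOT pandas.StringDtype."""

    def __init__(self, storage):
        self.storage = storage

    def __str__(self):
        # Compact form mentioned in the issue: string[storage]
        return f"string[{self.storage}]"

    def __repr__(self):
        # More explicit form suggested in the issue (exact text is assumed)
        return f"<pandas.StringDtype(storage={self.storage!r})>"

    def get_storage(self):
        # Requested accessor; here it simply returns the storage name
        return self.storage


dtype = SketchStringDtype("pyarrow")
print(str(dtype))
print(repr(dtype))
print(dtype.get_storage())
```

Keeping the short `string[storage]` form for `str()` and the explicit form for `repr()` is one possible split; whether pandas would want that split is a design question for the maintainers.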
Additional Context
No response