ENH: Need API support and __repr__ to discover the storage used for strings #59342

Open
arnaudlegout opened this issue Jul 29, 2024 · 7 comments
Labels
Enhancement Needs Discussion Requires discussion from core team before further action Strings String extension data type and string data

@arnaudlegout
Contributor

arnaudlegout commented Jul 29, 2024

Originally raised in #58551 (comment)

Problem Description

With PDEP-14, developers need to be aware of the storage used for strings, because the storage can have a large impact on performance. For instance:

  • pyarrow storage
    • pros: compact (optimal memory footprint), fast (vectorization)
    • cons: immutable (so any modification creates a new string pyarrow ChunkedArray)
  • python storage
    • pros: mutable
    • cons: highest memory footprint (each string is a different Python object), slow (no vectorization)
  • numpy 2.0 strings storage (I don't have a good knowledge of these new strings, and never tested them)
    • pros: compact, vectorization, mutable (my understanding is that it takes more space and is slower than pyarrow strings)
    • cons: different representations depending on the string size, which makes understanding performance harder
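
The memory-footprint difference between the two storages listed above can be measured directly. The following is a minimal sketch, not a benchmark: the sample data and the `footprints` dictionary are made up for illustration, and the pyarrow measurement only runs when the optional pyarrow package is installed.

```python
import pandas as pd

# Hypothetical sample data, chosen only for illustration.
values = [f"item-{i}" for i in range(10_000)]

# "python" storage: a NumPy object array holding one Python str per element.
ser_python = pd.Series(values, dtype=pd.StringDtype("python"))
footprints = {"python": int(ser_python.memory_usage(index=False, deep=True))}

# "pyarrow" storage needs the optional pyarrow dependency.
try:
    ser_arrow = pd.Series(values, dtype=pd.StringDtype("pyarrow"))
    footprints["pyarrow"] = int(ser_arrow.memory_usage(index=False, deep=True))
except ImportError:
    pass  # pyarrow is not installed, so only the python storage is measured

for storage, nbytes in footprints.items():
    print(f"{storage}: {nbytes} bytes")
```

The actual numbers depend on the data, the pandas version, and the Python build, so none are quoted here.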

Feature Description

I would like to have two ways to discover the storage:

  • __repr__: its goal is to give information on the internals of an object. One option suggested by @jorisvandenbossche is to display <pandas.StringDtype(storage=...)> instead of string[storage]
  • .get_storage, which returns the storage (I am not sure what is possible with the current implementation; it would be best to return a class, otherwise a string). The API is useful to check that we have the correct storage before running time-consuming code.

Alternative Solutions

.

Additional Context

No response

@arnaudlegout arnaudlegout added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 29, 2024
@rhshadrach rhshadrach added Needs Discussion Requires discussion from core team before further action Strings String extension data type and string data and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 3, 2024
@jorisvandenbossche jorisvandenbossche added this to the 2.3 milestone Aug 21, 2024
@jorisvandenbossche
Member

@arnaudlegout thanks for opening the issue!

First quick note: at the moment the numpy 2.0 string dtype is not supported in pd.StringDtype (but it could be in the future), so right now the two options to consider are "pyarrow" and "python" (i.e. object-dtype).

Then, the API to inspect and discover the storage is actually already available, as the .storage attribute on the StringDtype instance:

>>> import pandas as pd
>>> pd.options.future.infer_string = True
>>> ser = pd.Series(["a", "b"], dtype="str")
>>> ser.dtype
str
>>> ser.dtype.storage
'pyarrow'

So I think the main discussion is what the __repr__ should look like.
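
The `.storage` attribute mentioned above can serve as a guard before running time-consuming code, which is the use case from the feature description. This is only a sketch: `get_string_storage` is a hypothetical helper, not a pandas API, and the example uses "python" storage so that it runs without pyarrow.

```python
import pandas as pd


def get_string_storage(ser):
    """Return the storage name of a StringDtype Series, or None otherwise.

    Hypothetical helper for illustration; it is not part of pandas.
    """
    dtype = ser.dtype
    if isinstance(dtype, pd.StringDtype):
        return dtype.storage
    return None


ser = pd.Series(["a", "b"], dtype=pd.StringDtype("python"))
storage = get_string_storage(ser)
if storage != "pyarrow":
    # Decide here whether to convert, warn, or proceed anyway.
    print(f"strings use {storage!r} storage")
```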

@WillAyd
Member

WillAyd commented Aug 22, 2024

I think it makes sense to have the storage and na_value as part of the repr. While @jorisvandenbossche is correct that you can inspect this with attributes, that also assumes developers know in advance what those attributes are. By putting it into the repr instead it becomes a little clearer to developers what they might need to consider
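
To make the suggestion concrete, a repr carrying both pieces of information could look roughly like the output of the sketch below. `verbose_dtype_repr` is a hypothetical formatter written for this illustration; it is not how pandas implements `__repr__`, and the exact format is an assumption based on the `<pandas.StringDtype(storage=...)>` form proposed in the issue.

```python
import pandas as pd


def verbose_dtype_repr(dtype):
    """Hypothetical formatter (not a pandas API) exposing storage and na_value."""
    return (
        f"<pandas.StringDtype(storage={dtype.storage!r}, "
        f"na_value={dtype.na_value!r})>"
    )


# "python" storage is used so the example does not require pyarrow.
dtype = pd.StringDtype("python")
print(verbose_dtype_repr(dtype))
```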

@arnaudlegout
Contributor Author

@WillAyd right, I was not aware of the .storage attribute and indeed getting information on the na_value is interesting.

I did not find .storage in the pandas documentation, so it would be great to also extend the documentation to show the attributes available for inspecting the storage properties.

@pantheraleo-7

pantheraleo-7 commented Sep 21, 2024

so right now the two options to consider are "pyarrow" and "python" (i.e. object-dtype)

I think the name "python" for the fallback storage option is not future proof? If I'm reading the PDEP-14 right, the fallback is a numpy array of python str objects. So the fallback storage option name should be "numpy".

  • once numpy 2.0 strings are implemented as a fallback, the name "python" won't make sense anymore
  • it doesn't really make sense even right now, because we are storing those objects in a numpy array anyway
  • also, the names "pyarrow" and "numpy" would complement each other better, I guess

numpy 2.0 string dtype is not supported in the pd.StringDtype at the moment (but could be in the future)

Do we have a timeline on this? It seems like PDEP-10 will be reverted by PDEP-15, so pyarrow is going to stay an optional dependency. Forcing users who just want the vectorisation speed benefits (and nothing more) to install pyarrow will in practice lessen the importance of a numpy 2.0 string implementation, as they would have already moved to pyarrow in pandas 3.0.

pandas 3.0 is a golden opportunity to incorporate the numpy 2.0 string dtype, as users who shift to a newer major version of pandas would most likely also shift to a newer major version of numpy.

but if we still don't want to force numpy 2.0, we could have an intermediate fallback no?
use pyarrow if installed >>> use numpy 2.0 str dtype if numpy>=2.0 is installed >>> use numpy object dtype

basically I'm saying we should fast track numpy 2.0 string implementation xD
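
The fallback order proposed in this comment can be written down as a small selection function. This is a sketch of the proposal only, not of what pandas does: `pick_string_backend` and the labels it returns are hypothetical names, and the inputs are passed in explicitly so the logic can be read without depending on the installed packages.

```python
def pick_string_backend(has_pyarrow, numpy_version):
    """Sketch of the proposed fallback order; all names are hypothetical.

    has_pyarrow: whether the optional pyarrow package is importable.
    numpy_version: the installed numpy version string, e.g. "2.0.1".
    """
    if has_pyarrow:
        return "pyarrow"
    major = int(numpy_version.split(".")[0])
    if major >= 2:
        # numpy 2.0 introduced a variable-width string dtype
        return "numpy-stringdtype"
    # last resort: numpy object array of Python str objects
    return "numpy-object"


print(pick_string_backend(False, "2.0.1"))
```

In real code the two arguments could come from `importlib.util.find_spec("pyarrow") is not None` and `numpy.__version__`.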

@WillAyd
Member

WillAyd commented Sep 21, 2024

I think a numpy 2.0 string data type would need its own PDEP. We already have a proliferation of string data types in pandas, so it needs some discussion to define what value we see in adding another, and to define what its semantics are.

@simonjayhawkins
Member

I think it makes sense to have the storage and na_value as part of the repr. While @jorisvandenbossche is correct that you can inspect this with attributes, that also assumes developers know in advance what those attributes are. By putting it into the repr instead it becomes a little clearer to developers what they might need to consider

xref #60305

The outcome of that discussion could impact what should be done here. If we want dtypes that differ only in storage to be considered equal, and want to hide that implementation detail from users, then I guess we would not want to update the repr to display the storage?

If the dtypes for the array with numpy semantics and the array using pd.NA are not considered equal in equality checks, then it may be that the na_value should be included in the repr.

@arnaudlegout
Contributor Author

I really disagree with the willingness to "hide implementation details". The underlying implementation should not be considered a detail. It has strong performance implications (both in space and speed), and treating performance as a detail for a regular pandas user is, in my opinion, a mistake.

The impact of the dtype implementations should be documented even more. Why is a numpy boolean 8 bits, a nullable numpy boolean 16 bits, and a nullable pyarrow boolean 2 bits? Why is changing a string in a large string Series fast with "string[python]" but slow with "string[pyarrow]"? Why does pyarrow produce a memory overflow in some cases? These implementation issues are not details and should not be hidden from the user (even the beginner).
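
The first two per-element sizes quoted above can be checked with a short sketch. The array length `n` is an arbitrary choice for illustration, and the pyarrow case is left out so that the example runs without the optional pyarrow dependency.

```python
import numpy as np
import pandas as pd

n = 1_000  # arbitrary length, for illustration only

# Plain numpy boolean: one byte per element.
numpy_bool = np.zeros(n, dtype=bool)

# Nullable pandas boolean: one byte of data plus one byte of mask per element.
nullable_bool = pd.array([True] * n, dtype="boolean")

bits_numpy = numpy_bool.nbytes * 8 / n
bits_nullable = nullable_bool.nbytes * 8 / n

print(f"numpy bool: {bits_numpy:.0f} bits per element")
print(f"nullable boolean: {bits_nullable:.0f} bits per element")
```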
