-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
API: EA interface - strictness of _from_sequence #33254
Comments
@TomAugspurger replied (#32586 (comment)) I think my preference is for |
@jbrockmendel replied (#32586 (comment))
I would rather avoid having adding another constructor (im already not wild about IIRC a big part of the reason why _from_sequence became less strict than originally intended is because DTA/TDA/PA _from_sequence took on a bunch of cases that the DTI/TDI/PI constructors handle, in particular
Maybe something like |
Well, see the option number 2 above, which proposes to defer this to the
I don't think this is necessarily required. The values for factorize can be completely internal to the array (not to be used by users). |
I am experimenting a bit with the this, and I am running into the following issue: Right now, Practical example where a Sparse[int] dtype array results in a Sparse[bool] array:
Now, if we pass the actual dtype instance to Slightly related, at #32586 (comment), @crepererum raises another point regarding the current
So in any case (also for the future liberal method), we should clarify those questions. Should there be a difference between whether a dtype is passed or not? And when should we internally pass a dtype or not? |
How often do we pass dtype in practice? If it is close-to-always, we might simplify the code by making it a required argument. |
Maybe not use |
A keyword is certainly an option (I wouldn't call it |
That's a good point, though a seperate method would have the same problem though (except if it routes to |
Yes, that's the idea (see the point 1. in the top post). |
Sorry for closing this, I pressed the wrong button. Are we close to a decision on this? The docs on text data says: "We recommend using StringDtype to store text data.", but there are problems in 1.0 and master like these below: >>> pd.Series([1, "x"], dtype="string")
ValueError: StringArray requires a sequence of strings or pandas.NA
>>> pd.Series([1, "x"]).astype("string")
ValueError: StringArray requires a sequence of strings or pandas.NA
# using str in general just works
>>> pd.Series([1, "x"], dtype=str) # works
>>> pd.Series([1, "x"], dtype=str).astype("string") # works
>>> pd.Series([1, "x"]).astype(str).astype("string") # works
# arbitrary conversion to StringDtype doesn't work
>>> pd.Series([1, 2]).astype("string")
ValueError: StringArray requires a sequence of strings or pandas.NA
>>> pd.Series([1, 2]).astype(str).astype("string") # works Those differences are IMO quite unintuitive for a user who expects StringDtype to be a "better str". I've made a PR on this in #33465. It lets go of the strictness of |
Gentle ping. |
Trying to tighten what DatetimeArray/TimedeltaArray
The A few options come to mind:
Thoughts @jorisvandenbossche ? |
I think many of those questions also circle back to the casting discussion at #22384. If we have a general way to do "from dtype -> to dtype" for all EA dtypes (eg extension dtypes can register such conversions), then the generic |
I am trying to revive my old branch for this. |
IIUC the issue is cases where the dtype for a sequence of scalars is amibiguous (despite them being valid for a strict _from_sequence). The options that come to mind are:
My inclination is to lean towards being as strict as possible, in part bc it is easier to subsequently relax than the other way around. Also not obvious how we would enforce it for 3rd party EAs. |
I've been poking at changing One option that would handle the .astype situation (and I think some construction situations) is to be strict when dtype is not specified and non-strict otherwise. Really at this point it is very clear we need something strict and something non-strict. |
Copying part of the discussion out of #32586 into a more specific issue.
Problem statement: currently,
_from_sequence
is not very explicit in what it should accept as scalars. In practice, this means that it is mostly very liberal, as it is also used under the hood when creating an array of any list-like of objects inpd.array(.., dtype)
.However, in some cases we need a more strict version that only accepts actual scalars of the array (meaning, the type of values you get from
array[0]
orarray.max()
in case it supports that kind of reductions). This causes some issues like #31108.So, what should
_from_sequence
accept? Should it only be sequences that are unambiguously this dtype?I think it will be useful to have a "strict" version that basically only accepts instances of ExtensionDtype.type or NA values. But we also still need a "liberal" method for the other use cases like
pd.array(.., dtype)
.The strict version would be used when, for some reason, we go through object dtype (or a list of scalars, or something equivalent). For example in groupby, where we assemble a list of scalars from the reductions into a new column.
From a testing point of view, that would mean we can test that
EA._the_strict_method(np.asarray(EA, dtype=object), dtype=EA.dtype)
andEA._the_strict_method(list(EA), dtype=EA.dtype)
can roundtrip faithfully.Assuming we agree that we need a strict version for certain use cases, I think there are two main options:
Keep
_from_sequence
as is, and add a new_from_scalars
method that is more strict (that in the base class can call_from_sequence
initially for backwards compatibility). We can use_from_scalars
in those cases where we need the strict version, and keep using_from_sequence
elsewhere (eg inpd.array(.., dtype=)
)Update the expectation in our spec that
_from_sequence
should only accept a sequence of scalars of the array's type (so make_from_sequence
the strict method), and use theastype
machinery for construction. Basically, the current flexible_from_sequence
would then be equivalent to casting an object ndarray (or generally any type) to your EA dtype.Are there preferences? (or other options?)
From a backwards compatibility point of view, I think both are similar (in both cases you need to update a method (
_from_scalars
or_from_sequence
), and in both cases initially the flexible version will still be used as fallback until the update is done).The second option of course requires an update to the astype machinery (#22384), which doesn't exist today, but on the other hand is also something we need to do at some point eventually (but a much bigger topic to solve).
The text was updated successfully, but these errors were encountered: