API: use "safe" casting by default in astype() / constructors #45588
I agree with this proposal. Nice write up.
Great write up. I agree in principle. I'll just play devil's advocate and suggest some scenarios which it might be worthwhile to think through:

- Float truncation, successful subsets: if only some of the values would be truncated by a cast, how is that handled?
- Composition of operations: will each conversion be treated individually, or is there generic structure that you are proposing to put in place, for custom datatypes also? For example, one may have a chained conversion where the float to int truncation is negated by the datetime to date truncation, which is quite natural.

In response to the above: is the idea to instruct on the truncation casts?
Thanks! That's always useful :)
Yes, the idea is that generally casting float -> int works, and only raises (by default) if truncation would happen.

The use case you bring up is indeed a typical one for which this new behaviour would work nicely IMO: you have a column with in theory integer values, but for some reason they are stored as floats (e.g. because of missing values).
I suppose the exact behaviour of each cast will be a case-by-case decision for the involved dtypes, but we should of course make sure we have some general guidelines or rules on what we consider safe or not (the top post tries to provide some basis for this), and try to ensure this gives a consistent behaviour for the different dtypes in pandas.
This case is not explicitly included in the top post, but I would say this is also not valid if truncation happens, to be consistent with the float -> int cast. If the above raises when truncation happens, that also solves the "problem" of being able to side-step truncation in a chained conversion.
Can you clarify this last bit? What do you mean by "instruct on the truncation casts"?
We discussed this a bit on the community call last week. Summarizing some take-aways / discussion points from that.

First, given that this still caused some confusion, I want to reiterate the difference with numpy's casting levels (the casting keyword in, for example, ndarray.astype, with possible values of "no", "equiv", "safe", "same_kind", "unsafe"). In numpy, those casting levels are purely based on the dtypes, while what I propose here is about behaviour that is based on the values that are being cast.

Concrete example: in numpy, casting int8 to int64 is a safe cast, and casting int64 to int8 is not (regardless of whether the actual values fit in the int8 range). That means that you can either ask for a safe cast and always get an error even if the values are in range, or ask for an unsafe cast and always get a silent overflow in case of out-of-range values:

# actual numpy behaviour
>>> np.array([10], dtype="int64").astype("int8", casting="safe")
TypeError: Cannot cast array data from dtype('int64') to dtype('int8') according to the rule 'safe'
>>> np.array([1000], dtype="int64").astype("int8", casting="unsafe")
array([-24], dtype=int8)

What you can't obtain with numpy's astype and casting levels (without manual checking) is to allow a cast from int64 to int8, but raise an error if there would be overflow. So with a hypothetical example:

# proposed behaviour for pandas
# 10 is within range for int8, so this cast works
>>> pd.Series([10], dtype="int64").astype("int8")
0    10
dtype: int8
# but 1000 would overflow, so we raise an error
>>> pd.Series([1000], dtype="int64").astype("int8")
ValueError: casting from int64 to int8 would overflow / value 1000 not in range for int8

So this kind of value-based behaviour is not part of numpy's "casting levels" concept.

(In addition, there are also some casts that numpy considers "safe" that are not safe at all, such as np.array([1_000_000_0000], dtype="datetime64[s]").astype("datetime64[ns]", casting="safe"), which converts s to ns resolution and actually overflows.)

All to say that what is proposed here in this issue is not an adaptation of numpy's casting levels in pandas.

One can argue that this gives "value-dependent behaviour", which is something we are trying to move away from in other contexts. This is true, but there are two reasons why I think in this case this is fine: 1) it's not the resulting shape or dtype that is value-dependent, but only whether it errors at runtime or not (while for example in concat we have cases where the resulting dtype depends on the values, which is something we want to avoid), and 2) we already have such value-dependent behaviour in casting to some extent.

For this second argument, take for example casting a string to float with current numpy or pandas:

>>> np.array(["1.2"]).astype("float64")
array([1.2])
>>> np.array(["A"]).astype("float64")
...
ValueError: could not convert string to float: 'A'

This already has the "raise ValueError if conversion cannot be done correctly" type of behaviour (so numpy also has this kind of behaviour in this case; it is only not impacted by the casting keyword). And basically this issue proposes to extend the number of cases where we raise such a ValueError (by default).

Some specific aspects that came up in the discussion:

- Do we have better terminology to talk about this / for naming keywords? Above I use the term "safe", but this can cause confusion given that numpy also has a casting level named "safe", while it means something different in practice. On the other hand, for people not familiar with numpy's casting levels, "safe" might actually be one of the best / clearest terms? Alternatives that were suggested were "strict", "intact" or "save_values".
- Could we actually re-use the casting keyword from numpy, and expand it in pandas' astype with additional options? (Personally, I wouldn't do this: it's a different concept (see the dtype vs value based explanation above), so combining them might not make it clearer. In addition, I don't recall much demand in pandas to actually expose the casting levels from numpy.)
- What is the relationship with the errors keyword? The default of errors="raise" could become the "safe" casting described here (raising an error if a conversion would fail), and we could have an errors="unsafe" option to get the faster, non-checking version.
- The keyword we would add to control this safety could take an Enum as value, to enable fine-grained control case by case (allow one case but not another, such as allowing float to int truncation but not int overflow). Such an enum could work like the re flags that can be combined with |. We could still have a value that can be passed to the keyword to disable or enable all cases for convenience (such as the proposed safe=True/False). See the sketch below.
- One additional case of "unsafe casting" that was mentioned and is not included in the examples in the top post, is casting to categorical dtype with values not present in the categories.
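To make the Enum idea more tangible, here is a minimal sketch of what such combinable flags could look like (all names here, including the safe= keyword, are hypothetical and purely illustrative; nothing of this exists in pandas):

import enum

class CastSafety(enum.Flag):
    # hypothetical flags, one per unsafe-cast case discussed in this issue
    INT_OVERFLOW = 1
    FLOAT_TRUNCATION = 2
    TIMESTAMP_OVERFLOW = 4
    TIMESTAMP_TRUNCATION = 8
    NA_CONVERSION = 16
    ALL = 31  # convenience alias combining all checks

# flags combine like re flags:
strict_except_truncation = CastSafety.ALL & ~CastSafety.FLOAT_TRUNCATION
# hypothetical usage: ser.astype("int8", safe=strict_except_truncation)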
I think this is probably the right direction, and I can see the utility of simply recasting an array of int64 to int8 when the data will fit. Would it be better to invent a new conversion type, something like "value_safe" or just "value", which would perform the check? The downside of always checking is that it could be expensive in large arrays.

Also, what about converting int64 -> double? While this is considered "safe", it isn't really, in the sense that it loses information:
f = np.array([2**62 - 2**32 - 4 - 2 - 1], dtype="i8")
g = f.astype(float).astype("i8", casting="unsafe")
# g != f: the round trip through float64 drops the low-order bits
Something like "value_safe" might include this case where the value
couldn't be safely round-tripped.
A related function or keyword would be an automatic version of astype that
would automatically cast integer values to the smallest type that can
represent the data. Might be out-of-scope, but this seems very useful when
trying to economize on memory.
Kevin
@bashtage thanks for taking a look at this! And sorry for the slow reply.
Indeed, such a conversion can also lose information for very large integers. This case is mentioned in the top post (see the "Float truncation" section in "Concrete examples"; I can't seem to link to it) in a section about float truncation (so float -> int). I should maybe make the int -> float case its own section as well, for visibility.
We already have this somewhat available in pd.to_numeric(.., downcast=..).
Is the way forward here to add a keyword without yet changing the default behavior?
As long as there is a way to disable it, I'm not opposed. This falls under int/float overflow, but I'll mention that string to int/float won't roundtrip for certain values.
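For instance (an illustrative example of the string roundtrip issue, not from the original comment):

>>> int(float("9007199254740993"))  # 2**53 + 1 has no exact float64 representation
9007199254740992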
(Note: this has been partly discussed as part of #22384, but I'm opening a dedicated issue here (it's also not limited to extension types). What follows is an attempt to summarize the discussion up to now and to provide some more context and examples.)
Context
In general, pandas currently can perform silent "unsafe" casting in several cases, both in the constructor (eg Series(.., dtype=..)) and in the explicit astype(..) call. One typical case is the silent integer overflow in the following example:
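>>> pd.Series([1000], dtype="int64").astype("int8")
0   -24
dtype: int8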
While I am using the terms "safe" and "unsafe" here, those are not exactly well defined. In the context of this issue, I mean "value / information preserving" or "roundtripping".

In that context, the cast from 1000 to -24 is clearly not value preserving or a roundtrippable conversion. In contrast, for example, a cast from the float 2.0 to the integer 2 is information preserving (except for the exact type) and roundtrippable. Also the conversion from Timestamp("2012-01-01") to the string "2012-01-01" can be considered as such (although those actual values don't evaluate equal).
There are a few cases of "unsafe" casting where you potentially can silently get wrong values. I currently think of the following cases (are there others in pandas?):

- Integer overflow
- Float truncation (and int -> float truncation)
- Timestamp overflow
- Timestamp truncation
- NA / NaN conversion
At the bottom of this post, I gave a concrete explanation and examples for each of those cases.
Numpy has a concept of "casting" levels for how permissive data conversions are allowed to be (eg the casting keyword in ndarray.astype), with possible values of "no", "equiv", "safe", "same_kind", "unsafe". However, I don't think that translates very well to pandas. In numpy, those casting levels are pre-defined for all combinations of data types, while the cases of unsafe casting I mention above depend on the actual values, not strictly the dtypes.
For example, casting int64 to int8 is considered "unsafe" in numpy ("same_kind" to be correct, but so not "safe"). But if all your int64 integers are actually within the int8 range, doing this cast is safe in practice (at runtime), so IMO we shouldn't raise an error about this by default.
On the other hand, casting int64 to float64 is considered "safe" by numpy, but in practice you can have very large integers that cannot actually be safely cast to float. Or similarly, casting datetime64[s] to datetime64[ns] is also considered safe by numpy, but you can have out-of-bounds values that won't fit in the nanosecond range in practice.
Therefore, I think for pandas it's more useful to look at the "safety at run-time" (i.e. don't decide upfront about safe vs unsafe casts based on the dtypes, but handle runtime errors (out-of-bound values, values that would overflow or get truncated, etc)). This way, I would only consider two cases:

- Casts that are not supported at all and always raise (e.g. pandas (in contrast to numpy) disallows casting datetime64 to timedelta64)
- Casts that are supported, where errors depend on whether the actual values can be cast faithfully
Note 1: this is basically the current situation in pandas, except that for the supported casts we don't have a consistent rule about cast safety and ways to deal with this (i.e. this is what this issue is about)
Note 2: we can also have a lot of discussion about which casts to allow and which not (eg do we want to support casting datetime to int? -> #45034). But let's keep those cases for separate issues, and focus the discussion here on the cast safety aspect for casts we clearly agree on are supported.
Proposal
The proposal is to move towards having safe casting by default in pandas, and to have this consistently in both the constructor and the explicit astype.

Quoting from @TomAugspurger (#22384 (comment)), he proposes to agree on a couple of principles, and work from those:

1. Consistent casting behaviour between Series(values, dtype=dtype) and values.astype(dtype=dtype).
2. Safe casting by default.

I am including the first point because this "safe casting or not" issue is relevant for both the constructors and astype. But I would keep the practical aspect of this point (how do we achieve this consistency in the code) for a separate discussion, and keep the focus here on the principle and on the second point about the default safety.

Some assorted general considerations / questions:
- Would a single keyword (eg safe=True/False in astype) be sufficient? Or do we want more fine-grained control? (eg case by case)
- How does this relate to conversions that can already raise for invalid values (eg pd.Series(["1.0", "A"]).astype(float))? I would keep this as a separate discussion (this already raises by default, and I don't think we want to change that), although the idea of adding an errors="coerce" option to the existing keyword could also be relevant for the unsafe casting cases. And it might be the question whether we want to combine this in a single keyword?

cc @pandas-dev/pandas-core
Concrete examples
Integer overflow
This can happen when casting to a different bit-width or signed-ness. Generally, in astype, we don't check for this and silently overflow (following numpy behaviour), while in the Series constructor we already added a deprecation warning about changing this in the future. Another example is casting a negative number to unsigned integer. Both astype cases are shown below (illustrative snippets):
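>>> pd.Series([1000], dtype="int64").astype("int8")  # silent overflow
0   -24
dtype: int8
>>> pd.Series([-1], dtype="int64").astype("uint8")   # negative value wraps to unsigned
0    255
dtype: uint8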
Float truncation
This typically happens when casting floats to integer when your floating-point numbers are not fully rounded. Following numpy, the behaviour of our astype or constructors is to truncate the floats.

Many might find this the expected behaviour, but I want to point out that it can actually be better to explicitly round/ceil/floor, as the "truncation" is not the same as rounding (which I think users would naively expect). For example, you get different numbers with round vs the astype truncation shown below.

In the constructor, when not starting from a numpy array, we actually already raised an error for float truncation in older versions (on master this seems to ignore the dtype and give float as result).
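For instance, with the illustrative values 0.5 and 1.5:

>>> pd.Series([0.5, 1.5]).astype("int64")  # truncates toward zero
0    0
1    1
dtype: int64
>>> pd.Series([0.5, 1.5]).round()          # rounds half to even
0    0.0
1    2.0
dtype: float64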
The truncation can also happen in the cast the other way around, from integer to float. Large integers cannot always be faithfully represented in the float range. For example:
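# 2**53 + 1 == 9007199254740993 has no exact float64 representation
>>> pd.Series([2**53 + 1], dtype="int64").astype("float64").astype("int64")
0    9007199254740992
dtype: int64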
Timestamp overflow
Numpy is known to silently overflow for out-of-bounds timestamps when casting to a different resolution, eg:
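>>> np.array(["2500-01-01"], dtype="datetime64[s]").astype("datetime64[ns]")
# silently wraps around to a wrong date, since the nanosecond range only reaches the year 2262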
We already check for this, and eg raise in the constructor:
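>>> pd.Series(np.array(["2500-01-01"], dtype="datetime64[s]"))
...
OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 2500-01-01 00:00:00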
When we support multiple resolutions, this will also apply to astype. (And the same also applies to timedelta data.)
Timestamp truncation
Related to the above, but now when going to a coarser resolution, you can lose information. Numpy will also silently truncate in this case:
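>>> np.array(["2022-01-01T00:00:00.123456789"], dtype="datetime64[ns]").astype("datetime64[s]")
array(['2022-01-01T00:00:00'], dtype='datetime64[s]')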
In pandas you can see a similar behaviour (the result is truncated, but still nanoseconds in the return value).
When we support multiple resolutions, this will become more relevant. And similar to the float truncation above, it might be more explicit to round/ceil/floor first.
(and the same also applies to timedelta data)
Sidenote: something similar can happen for Period data, but there we don't support rounding as an alternative.
NA / NaN conversion
One additional case that is somewhat pandas-specific (because not all dtypes support missing values) is casting data with missing values to integer dtype (not sure if there are actually other dtypes?).
Again, numpy silently gives wrong numbers:
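>>> np.array([1.0, np.nan]).astype("int64")
array([                   1, -9223372036854775808])
# the NaN silently becomes INT64_MIN (the exact value is platform-dependent)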
In pandas, in most cases, we actually already have safe casting for this case, and raise an error. For example:
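>>> pd.Series([1.0, np.nan]).astype("int64")
...
ValueError: Cannot convert non-finite values (NA or inf) to integer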
There are some cases, however, where we still silently convert the NaN / NaT to a number:
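For example (illustrative; newer versions may warn or raise here), datetime data containing NaT cast to int64:

>>> pd.Series([pd.Timestamp("2012-01-01"), pd.NaT]).astype("int64")
0    1325376000000000000
1   -9223372036854775808
dtype: int64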
Note that this actually is broader than NaN in the float -> int case, as we also have the same error when casting inf to int. So also for the nullable integer dtype, the "non-finite" values are still a relevant case.