API: use "safe" casting by default in astype() / constructors #45588

Open

jorisvandenbossche opened this issue Jan 24, 2022 · 8 comments

Labels: API Design · Astype · Constructors (Series/DataFrame/Index/pd.array constructors) · Needs Discussion (requires discussion from core team before further action)

@jorisvandenbossche (Member)

(Note: this has been partly discussed as part of #22384, but I'm opening a dedicated issue here since it's also not limited to extension types. What follows is an attempt to summarize the discussion up to now, providing some more context and examples.)

Context

In general, pandas can currently perform silent "unsafe" casting in several cases, both in the constructor (eg Series(.., dtype=..)) and in the explicit astype(..) call.
One typical case is the silent integer overflow in the following example:

>>> pd.Series([1000], dtype="int64").astype("int8")
0   -24
dtype: int8

While I am using the terms "safe" and "unsafe" here, those are not exactly well defined. In the context of this issue, I mean "value / information preserving" or "roundtripping".
In that context, the cast from 1000 to -24 is clearly not a value-preserving or roundtrippable conversion. In contrast, for example a cast from the float 2.0 to the integer 2 is information preserving (except for the exact type) and roundtrippable. Also the conversion from Timestamp("2012-01-01") to the string "2012-01-01" can be considered as such (although those actual values don't evaluate equal).
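
For illustration, this "roundtripping" notion can be checked with current pandas by casting back and comparing:

>>> s = pd.Series([1000], dtype="int64")
>>> (s.astype("int8").astype("int64") == s).all()
False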

There are a few cases of "unsafe" casting where you can potentially get wrong values silently. I currently think of the following cases (are there others in pandas?):

  • Integer overflow
  • Float truncation
  • Timestamp overflow and truncation
  • NA / NaN conversion

At the bottom of this post, I give a concrete explanation and examples for each of those cases.

Numpy has a concept of "casting" levels for how permissive data conversions are allowed to be (eg the casting keyword in ndarray.astype), with possible values of "no", "equiv", "safe", "same_kind", "unsafe".
However, I don't think that translates very well to pandas. In numpy, those casting levels are pre-defined for all combinations of data types, while the cases of unsafe casting I mention above depend on the actual values, not strictly the dtypes.

For example, casting int64 to int8 is considered "unsafe" in numpy ("same_kind" to be precise, so not "safe"). But if all your int64 integers are actually within the int8 range, doing this cast is safe in practice (at runtime), so IMO we shouldn't raise an error about this by default.
On the other hand, casting int64 to float64 is considered "safe" by numpy, but in practice you can have very large integers that cannot actually be safely cast to float. Or similarly, casting datetime64[s] to datetime64[ns] is also considered safe by numpy, but you can have out-of-bounds values that won't fit in the nanosecond range in practice.
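
For example, float64 has a 53-bit significand, so a large enough integer silently loses precision in this numpy-"safe" cast:

>>> np.array([2**53 + 1], dtype="int64").astype("float64").astype("int64")
array([9007199254740992])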

Therefore, I think for pandas it's more useful to look at "safety at run-time" (i.e. don't decide upfront about safe vs unsafe casts based on the dtypes, but handle runtime errors: out-of-bounds values, values that would overflow or get truncated, etc). This way, I would only consider two cases:

  1. Casts that are simply not supported and will directly raise a TypeError.
    (e.g. pandas (in contrast to numpy) disallows casting datetime64 to timedelta64)
  2. Casts that are generally supported, but could result in an unsafe cast / raise a ValueError during execution depending on the actual values.
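
As a rough sketch of what such a runtime check could look like for the second case (the checked_astype name and the cast-back-and-compare strategy here are hypothetical, only to illustrate the idea, not an actual pandas API):

import numpy as np

def checked_astype(arr: np.ndarray, dtype) -> np.ndarray:
    # Perform the cast, then verify it was value-preserving by
    # casting back and comparing with the original values.
    result = arr.astype(dtype)
    if not np.array_equal(result.astype(arr.dtype), arr):
        raise ValueError(f"unsafe cast from {arr.dtype} to {dtype}")
    return result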

Note 1: this is basically the current situation in pandas, except that for the supported casts we don't have a consistent rule about cast safety and ways to deal with this (which is what this issue is about)

Note 2: we can also have a lot of discussion about which casts to allow and which not (eg do we want to support casting datetime to int? -> #45034). But let's keep those cases for separate issues, and focus the discussion here on the cast safety aspect for casts we clearly agree are supported.


Proposal

The proposal is to move towards safe casting by default in pandas, and to have this consistently in both the constructor and the explicit astype.

Quoting from @TomAugspurger (#22384 (comment)), he proposes to agree on a couple principles, and work from those:

  1. pandas should be consistent with itself between Series(values, dtype=dtype) and values.astype(dtype=dtype).
  2. pandas should by default error at runtime when type casting causes loss of equality / information (integer overflow, float -> int truncation, ...), with the option to disable that check (since it will be expensive).

I am including the first point because this "safe casting or not" issue is relevant for both the constructors and astype. But I would keep the practical aspect of this point (how do we achieve this consistency in the code) for a separate discussion, and keep the focus here on the principle and on the second point about the default safety.

Some assorted general considerations / questions:

  • Do we agree on the list of "unsafe" cases? Are there other cases? Or would you leave out some cases?
  • For moving towards this, we will have to deprecate a bunch of silent unsafe cases first.
  • Will a single toggle (eg safe=True/False in astype) be sufficient? Or do we want more fine-grained control? (eg case by case)
  • Having safe casting by default has performance implications (see some example timings at #22384 (comment) to get an idea), but there will be a keyword to disable the checks if you don't care about the unsafe casts or are sure you don't have values that would result in unsafe casts.
  • All the unsafe cases discussed here are about casts that can be done (on the numpy array level) but can lose information or give wrong values. In addition, there are also "conversion errors" that never work for certain values, eg casting strings to float where one of the strings does not represent a float (pd.Series(["1.0", "A"]).astype(float)). I would keep this as a separate discussion (this already raises by default, and I don't think we want to change that), although the idea of adding an errors="coerce" option to the existing keyword could also be relevant for the unsafe casting cases (see the example after this list). And the question might arise whether we want to combine this in a single keyword?
  • If we make our casts safe by default, the question will also come up if we will follow this default in other contexts where a cast is done implicitly (eg when concatting, in operations, .. that involve data with different data types). But I would propose to keep those as separate, follow-up discussions (the issue description is already way too long :))
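
For reference on the errors keyword point above: the existing option in astype only controls those conversion errors (errors="ignore" returns the input unchanged on failure), it does not affect the silent unsafe casts discussed here:

>>> pd.Series(["1.0", "A"]).astype("float64", errors="ignore")
0    1.0
1      A
dtype: object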

cc @pandas-dev/pandas-core


Concrete examples

Integer overflow

This can happen when casting to a different bit-width or signedness. Generally, in astype, we don't check for this and silently overflow (following numpy behaviour):

>>> pd.Series([1000], dtype="int64").astype("int8")
0   -24
dtype: int8

In the Series constructor, we already added a deprecation warning about changing this in the future:

>>> pd.Series([1000], dtype="int8")
FutureWarning: Values are too large to be losslessly cast to int8. In a future version this
will raise OverflowError. To retain the old behavior, use pd.Series(values).astype(int8)
0   -24
dtype: int8

Another example casting a negative number to unsigned integer:

>>> pd.Series([-1000], dtype="int64").astype("uint64")
0    18446744073709550616
dtype: uint64

Float truncation

This typically happens when casting floats to integer when the floating point numbers are not whole numbers. Following numpy, the behaviour of our astype or constructors is to truncate the floats:

>>> pd.Series([0.5, 1.5], dtype="float64").astype("int64")
0    0
1    1
dtype: int64

Many might find this the expected behaviour, but I want to point out that it can actually be better to explicitly round/ceil/floor, as truncation is not the same as rounding (which I think users would naively expect). For example, you get different numbers here with round vs the astype shown above:

>>> pd.Series([0.5, 1.5], dtype="float64").round()
0    0.0
1    2.0
dtype: float64

In the constructor, when not starting from a numpy array, we actually already raised an error for float truncation in older versions (on master this seems to ignore the dtype and give a float result):

>>> pd.Series([1.0, 2.5], dtype="int64") 
...
ValueError: Trying to coerce float values to integers

The truncation can also happen in the cast the other way around, from integer to float. Large integers cannot always be faithfully represented in the float range. For example:

>>> pd.Series([1_100_100_100_100], dtype="int64").astype("float32")
0    1.100100e+12
dtype: float32

# the repr above is not clear about the truncation, but casting back to integer shows it
>>> pd.Series([1_100_100_100_100], dtype="int64").astype("float32").astype("int64")
0    1100100141056
dtype: int64

Timestamp overflow

Numpy is known to silently overflow for out-of-bounds timestamps when casting to a different resolution, eg:

>>> np.array(["2300-01-01"], dtype="datetime64[s]").astype("datetime64[ns]")
array(['1715-06-13T00:25:26.290448384'], dtype='datetime64[ns]')

We already check for this, and eg raise in the constructor:

>>> pd.Series(np.array(["2300-01-01"], dtype="datetime64[s]"), dtype="datetime64[ns]")
...
OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 2300-01-01 00:00:00

When we support multiple resolutions, this will also apply to astype.
(and the same also applies to timedelta data)

Timestamp truncation

Related to the above, but now going to a coarser resolution, where you can lose information. Numpy will also silently truncate in this case:

>>> np.array(["2022-01-01 00:00:00.01"], dtype="datetime64[ns]").astype("datetime64[s]")
array(['2022-01-01T00:00:00'], dtype='datetime64[s]')

In pandas you can see a similar behaviour (the result is truncated, but the return value is still in nanoseconds):

>>> pd.Series(["2022-01-01 00:00:00.01"], dtype="datetime64[ns]").astype("datetime64[s]")
0   2022-01-01
dtype: datetime64[ns]

When we support multiple resolutions, this will become more relevant. And similar to the float truncation above, it might be more explicit to round/ceil/floor first.
(and the same also applies to timedelta data)
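
For example, flooring explicitly to the target resolution first makes the truncation an intentional step:

>>> pd.Series(["2022-01-01 00:00:00.01"], dtype="datetime64[ns]").dt.floor("S")
0   2022-01-01
dtype: datetime64[ns]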

Sidenote: something similar can happen for Period data, but there we don't support rounding as an alternative.

NA / NaN conversion

One additional case that is somewhat pandas specific, because not all dtypes support missing values, is casting data with missing values to integer dtype (not sure if there are actually other dtypes affected?).

Again, numpy silently gives wrong numbers:

>>> np.array([1.0, np.nan], dtype="float64").astype("int64")
array([                   1, -9223372036854775808])

In pandas, in most cases, we actually already have safe casting for this case, and raise an error. For example:

>>> pd.Series([1.0, np.nan], dtype="float64").astype("int64") 
...
IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer

There are some cases, however, where we still silently convert the NaN / NaT to a number:

>>> pd.array(["2012-01-01", "NaT"], dtype="datetime64[ns]").astype("int64") 
array([ 1325376000000000000, -9223372036854775808])

Note that this actually is broader than NaN in the float->int case, as we also have the same error when casting inf to int. So also for the nullable integer dtype, the "non-finite" values are still a relevant case.
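
For comparison, casting to the nullable integer dtype already preserves NaN as a missing value instead of converting it to a number:

>>> pd.Series([1.0, np.nan], dtype="float64").astype("Int64")
0       1
1    <NA>
dtype: Int64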

@jorisvandenbossche added the API Design, Needs Discussion and Astype labels on Jan 24, 2022
@Dr-Irv (Contributor) commented Jan 24, 2022

  • pandas should be consistent with itself between Series(values, dtype=dtype) and values.astype(dtype=dtype).
  • pandas should by default error at runtime when type casting causes loss of equality / information (integer overflow, float -> int truncation, ...), with the option to disable that check (since it will be expensive).

I agree with this proposal. Nice write up.

@attack68 (Contributor)

Great write up. I agree in principle.

I'll just play devil's advocate and suggest some scenarios which it might be worthwhile to think through:

Float truncation successful subsets:

If Series([1.1, 2.2], dtype="float64").astype("int64") fails because it loses information, should Series([1.0, 2.0], dtype="float64").astype("int64") also fail as a wider part of the failing class, even though this particular subset does not lose information? Or should it still succeed, since there may be a common use case where integers are read in as floats and then conversion of this subset happens to be a particularly common operation?

Composition of operations:

Will each conversion be treated individually, or is there a generic structure that you are proposing to put in place, for custom datatypes also? For example:

One may have Series([1000.1], dtype="float64").astype("int64") failing as above, but one may have:

float_ = Series([1000.1], dtype="float64")  # is valid
ts_ = float_.astype("datetime")  # is valid
dt_ = ts_.astype("date")  # truncates the time of a datetime
int_ = dt_.astype("int") # is a direct conversion.

In this example the float to int truncation is negated by the datetime to date truncation, which is quite natural.

In response to the "is the safe=True/False toggle enough" question: perhaps an option could instruct on the truncation casts?

@jorisvandenbossche (Member, Author)

I'll just play devil's advocate

Thanks! That's always useful :)

Float truncation successful subsets:

If Series([1.1, 2.2], dtype="float64").astype("int64") fails because it loses information, should Series([1.0, 2.0], dtype="float64").astype("int64") also fail as a wider part of the failing class, even though this particular subset does not lose information? Or should it still succeed, since there may be a common use case where integers are read in as floats and then conversion of this subset happens to be a particularly common operation?

Yes, the idea is that generally casting float -> int works, and only raises (by default) if truncation would happen. In the case of Series([1.0, 2.0], dtype="float64").astype("int64"), no truncation would happen, so there would be no error.

The use case you bring up is indeed a typical one for which this new behaviour would work nicely IMO: you have a column with, in theory, integer values, but for some reason they are stored as floats (e.g. because of np.nan being present, which is a very common case in pandas I think), and you want to convert them to integers (eg after doing fillna()) while being sure you are not accidentally truncating actual float values. If you do want to truncate the float values, you can do that explicitly with round() (or with safe=False in astype).
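
For example, the intended workflow would then look like this, with the safe check guarding the final cast:

>>> s = pd.Series([1.0, np.nan], dtype="float64")
>>> s.fillna(0).astype("int64")
0    1
1    0
dtype: int64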

Composition of operations:

Will each conversion be treated individually or is there generic structure that you are proposing to put in place, for custom datatypes also.

I suppose the exact behaviour of each cast will be a case-by-case decision for the involved dtypes, but we should of course make sure we have some general guidelines or rules on what we consider safe or not (the top post tries to provide some basis for this), and try to ensure this gives a consistent behaviour for the different dtypes in pandas.

ts_ = float_.astype("datetime") # is valid

This case is not explicitly included in the top post, but I would say this is also not valid if truncation happens, to be consistent with the float -> int cast (basically, float -> datetime is a float -> int cast under the hood). Now, it's also an open question whether we want to allow this cast in the first place (see #45034 (comment) for this discussion).

If the above raises when truncation happens, that also solves the "problem" of being able to sidestep truncation in a float -> int cast by going through datetime.

In response to the is the safe=True/False toggle enough, perhaps an option could instruct on the truncation casts?

Can you clarify this last bit? What do you mean with "instruct on the truncation casts"?

@jorisvandenbossche (Member, Author)

We discussed this a bit on the community call last week. Summarizing some take-aways / discussion points from that.

First, given that this still caused some confusion, I want to reiterate the difference with numpy's casting levels (the casting keyword in, for example, ndarray.astype, with possible values of "no", "equiv", "safe", "same_kind", "unsafe").
In numpy, those casting levels are purely based on the dtypes, while what I propose here is about behaviour that is based on the values that are being cast.

Concrete example: in numpy, casting int8 to int64 is a safe cast, and casting int64 to int8 is not (regardless of whether the actual values fit in the int8 range). That means that you can either ask for a safe cast and always get an error even if the values are in range, or ask for an unsafe cast and always get a silent overflow in case of out-of-range values:

# actual numpy behaviour

>>> np.array([10], dtype="int64").astype("int8", casting="safe")
TypeError: Cannot cast array data from dtype('int64') to dtype('int8') according to the rule 'safe'
>>> np.array([1000], dtype="int64").astype("int8", casting="unsafe")
array([-24], dtype=int8)

What you can't obtain with numpy's astype and casting levels (without manual checking) is to allow a cast from int64 to int8, but raise an error if there would be overflow. So with a hypothetical example:

# proposed behaviour for pandas

# 10 is within range for int8, so this cast works
>>> pd.Series([10], dtype="int64").astype("int8")
0    10
dtype: int8
# but 1000 would overflow, so we raise an error
>>> pd.Series([1000], dtype="int64").astype("int8")
ValueError: casting from int64 to int8 would overflow / value 1000 not in range for int8

So this kind of value-based behaviour is not part of numpy's "casting levels" concept.

(in addition, there are also some casts that numpy considers "safe" that are not safe at all, such as np.array([1_000_000_0000], dtype="datetime64[s]").astype("datetime64[ns]", casting="safe"), which converts s to ns resolution and actually overflows)

All to say that what is proposed here in this issue is not an adaptation of numpy's casting levels in pandas.


One can argue that this gives "value-dependent behaviour", which is something we are trying to move away from in other contexts. This is true, but there are two reasons why I think this is fine in this case: 1) it's not the resulting shape or dtype that is value-dependent, but only whether it errors at runtime or not (while for example in concat we have cases where the resulting dtype depends on the values, which is something we want to avoid), and 2) we already have such value-dependent behaviour in casting to some extent.

For this second argument, take for example casting a string to float with current numpy or pandas:

>>> np.array(["1.2"]).astype("float64")
array([1.2])

>>> np.array(["A"]).astype("float64")
...
ValueError: could not convert string to float: 'A'

This already has the "raise ValueError if conversion cannot be done correctly" type of behaviour (so numpy also has this type of behaviour in this case; it is just not affected by the casting keyword). And basically this issue proposes to extend the number of cases where we raise such a ValueError (by default).


Some specific aspects that came up in the discussion:

  • Do we have better terminology to talk about this / for naming keywords?
    Above I use the term "safe", but this can cause confusion given that numpy also has a casting level named "safe", while it means something different in practice. On the other hand, for people not familiar with numpy's casting levels, "safe" might actually be one of the best / clearest terms?
    Alternatives that were suggested were "strict", "intact" or "save_values".
  • Could we actually re-use the casting keyword from numpy, and expand it in pandas' astype with additional options?
    (Personally, I wouldn't do this: it's a different concept (see the dtype vs value based explanation above), so combining it might not make it clearer I think. In addition, I don't recall much demand in pandas to actually expose the casting levels from numpy)
  • What is the relationship with the errors keyword?
    The default of errors="raise" could become the "safe" casting described here (raising an error if a conversion would fail), and we could have an errors="unsafe" option to get the faster, non-checking version.
  • The keyword we would add to control this safety could take an Enum as value, to enable fine-grained control case by case (allow one case but not another, such as allowing float to int truncation but not int overflow). Such an enum could work like the re flags that can be combined with |; see the sketch after this list. We could still have a value that can be passed to the keyword to disable or enable all cases for convenience (such as the proposed safe=True/False)
  • One additional case of "unsafe casting" that was mentioned and is not included in the examples in the top post, is casting to categorical dtype with values not present in the categories.
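
A minimal sketch of what such a combinable flag could look like (the CastSafety name and the safe= keyword are hypothetical; nothing like this exists in pandas today):

import enum

class CastSafety(enum.Flag):
    # one flag per unsafe-cast case from the top post
    INTEGER_OVERFLOW = enum.auto()
    FLOAT_TRUNCATION = enum.auto()
    TIMESTAMP_OVERFLOW = enum.auto()
    NA_CONVERSION = enum.auto()
    ALL = INTEGER_OVERFLOW | FLOAT_TRUNCATION | TIMESTAMP_OVERFLOW | NA_CONVERSION

# hypothetical usage: check for integer overflow, but allow float truncation
# s.astype("int8", safe=CastSafety.INTEGER_OVERFLOW)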

@bashtage (Contributor) commented Feb 17, 2022 via email

@jorisvandenbossche (Member, Author)

Would it be better to invent a new conversion type, something like "value_safe" or just "value", which would perform the check? The downside of always checking is that it could be expensive in large arrays.

@bashtage thanks for taking a look at this! And sorry for the slow reply.
Can you clarify the above statement a bit? What do you mean exactly with a "new conversion type"? (an additional value for numpy's casting keyword? Or a separate method?) And do you mean that you would rather see it opt-in, than make it (eventually) the default behaviour?

Also, what about converting int64 -> double? While this is considered "safe", it isn't really in the sense that it loses information.

Indeed, such a conversion can also lose information for very large integers. This case is mentioned in the top post (in the "Float truncation" section under "Concrete examples"; I can't seem to link to it directly), but I should maybe make the int -> float case its own section as well for visibility.

A related function or keyword would be an automatic version of astype that would automatically cast integer values to the smallest type that can represent the data.

We already have this somewhat available in to_numeric (eg pd.to_numeric(pd.Series([1, 2, 3], dtype="int64"), downcast="integer") returns a Series with int8 dtype), but that's a bit hidden / not convenient to use from a dataframe (while useful, I would agree this is out-of-scope for the current discussion).

@jbrockmendel (Member)

Is the way forward here to add a keyword without yet changing the default behavior?

@rhshadrach (Member)

As long as there is a way to disable it, I'm not opposed. This falls under int/float overflow, but I'll mention that string to int/float also won't roundtrip for certain values.
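
For illustration (this is just current behaviour, not part of the proposal), a value where the string -> float cast does not roundtrip:

>>> pd.Series(["9007199254740993"]).astype("float64").astype("int64")
0    9007199254740992
dtype: int64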
