API: use "safe" casting by default in astype() / constructors #45588

Open

jorisvandenbossche opened this issue Jan 24, 2022 · 8 comments

Labels: API Design · Astype · Constructors (Series/DataFrame/Index/pd.array constructors) · Needs Discussion (requires discussion from core team before further action)

@jorisvandenbossche (Member)

(Note: this has been partly discussed as part of #22384, but I'm opening a dedicated issue here since it's also not limited to extension types. What follows is an attempt to summarize the discussion up to now, providing some more context and examples.)

Context

In general, pandas can currently perform silent "unsafe" casting in several cases, both in the constructor (eg Series(.., dtype=..)) and in the explicit astype(..) call.
One typical case is the silent integer overflow in the following example:

>>> pd.Series([1000], dtype="int64").astype("int8")
0   -24
dtype: int8

While I am using the terms "safe" and "unsafe" here, those are not exactly well defined. In the context of this issue, I mean "value / information preserving" or "roundtripping".
In that context, the cast from 1000 to -24 is clearly not a value-preserving or roundtrippable conversion. In contrast, for example a cast from the float 2.0 to the integer 2 is information preserving (except for the exact type) and roundtrippable. Also the conversion from Timestamp("2012-01-01") to the string "2012-01-01" can be considered as such (although those actual values don't evaluate equal).
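
For illustration, this "roundtripping" notion can be checked with current pandas by casting back and comparing:

>>> s = pd.Series([1000], dtype="int64")
>>> (s.astype("int8").astype("int64") == s).all()
False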

There are a few cases of "unsafe" casting where you can potentially get wrong values silently. I currently think of the following cases (are there others in pandas?):

  • Integer overflow
  • Float truncation
  • Timestamp overflow and truncation
  • NA / NaN conversion

At the bottom of this post, I give a concrete explanation and examples for each of those cases.

Numpy has a concept of "casting" levels for how permissive data conversions are allowed to be (eg the casting keyword in ndarray.astype), with possible values of "no", "equiv", "safe", "same_kind", "unsafe".
However, I don't think that translates very well to pandas. In numpy, those casting levels are pre-defined for all combinations of data types, while the cases of unsafe casting I mention above depend on the actual values, not strictly the dtypes.

For example, casting int64 to int8 is considered "unsafe" in numpy ("same_kind" to be precise, so not "safe"). But if all your int64 integers are actually within the int8 range, doing this cast is safe in practice (at runtime), so IMO we shouldn't raise an error about this by default.
On the other hand, casting int64 to float64 is considered "safe" by numpy, but in practice you can have very large integers that cannot actually be safely cast to float. Or similarly, casting datetime64[s] to datetime64[ns] is also considered safe by numpy, but you can have out-of-bounds values that won't fit in the nanosecond range in practice.
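
For example, float64 has a 53-bit significand, so a large enough integer silently loses precision in this numpy-"safe" cast:

>>> np.array([2**53 + 1], dtype="int64").astype("float64").astype("int64")
array([9007199254740992])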

Therefore, I think for pandas it's more useful to look at "safety at run-time" (i.e. don't decide upfront about safe vs unsafe casts based on the dtypes, but handle runtime errors: out-of-bounds values, values that would overflow or get truncated, etc). This way, I would only consider two cases:

  1. Casts that are simply not supported and will directly raise a TypeError.
    (e.g. pandas (in contrast to numpy) disallows casting datetime64 to timedelta64)
  2. Casts that are generally supported, but could result in an unsafe cast / raise a ValueError during execution depending on the actual values.
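
As a rough sketch of what such a runtime check could look like for the second case (the checked_astype name and the cast-back-and-compare strategy here are hypothetical, only to illustrate the idea, not an actual pandas API):

import numpy as np

def checked_astype(arr: np.ndarray, dtype) -> np.ndarray:
    # Perform the cast, then verify it was value-preserving by
    # casting back and comparing with the original values.
    result = arr.astype(dtype)
    if not np.array_equal(result.astype(arr.dtype), arr):
        raise ValueError(f"unsafe cast from {arr.dtype} to {dtype}")
    return result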

Note 1: this is basically the current situation in pandas, except that for the supported casts we don't have a consistent rule about cast safety and ways to deal with this (which is what this issue is about)

Note 2: we can also have a lot of discussion about which casts to allow and which not (eg do we want to support casting datetime to int? -> #45034). But let's keep those cases for separate issues, and focus the discussion here on the cast safety aspect for casts we clearly agree are supported.


Proposal

The proposal is to move towards safe casting by default in pandas, and to have this consistently in both the constructor and the explicit astype.

Quoting from @TomAugspurger (#22384 (comment)), he proposes to agree on a couple principles, and work from those:

  1. pandas should be consistent with itself between Series(values, dtype=dtype) and values.astype(dtype=dtype).
  2. pandas should by default error at runtime when type casting causes loss of equality / information (integer overflow, float -> int truncation, ...), with the option to disable that check (since it will be expensive).

I am including the first point because this "safe casting or not" issue is relevant for both the constructors and astype. But I would keep the practical aspect of this point (how do we achieve this consistency in the code) for a separate discussion, and keep the focus here on the principle and on the second point about the default safety.

Some assorted general considerations / questions:

  • Do we agree on the list of "unsafe" cases? Are there other cases? Or would you leave out some cases?
  • For moving towards this, we will have to deprecate a bunch of silent unsafe cases first.
  • Will a single toggle (eg safe=True/False in astype) be sufficient? Or do we want more fine-grained control? (eg case by case)
  • Having safe casting by default has performance implications (see some example timings at #22384 (comment) to get an idea), but there will be a keyword to disable the checks if you don't care about the unsafe casts or are sure you don't have values that would result in unsafe casts.
  • All the unsafe cases discussed here are about casts that can be done (on the numpy array level) but can lose information or give wrong values. In addition, there are also "conversion errors" that never work for certain values, eg casting strings to float where one of the strings does not represent a float (pd.Series(["1.0", "A"]).astype(float)). I would keep this as a separate discussion (this already raises by default, and I don't think we want to change that), although the idea of adding an errors="coerce" option to the existing keyword could also be relevant for the unsafe casting cases (see the example after this list). And the question might arise whether we want to combine this in a single keyword?
  • If we make our casts safe by default, the question will also come up if we will follow this default in other contexts where a cast is done implicitly (eg when concatting, in operations, .. that involve data with different data types). But I would propose to keep those as separate, follow-up discussions (the issue description is already way too long :))
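
For reference on the errors keyword point above: the existing option in astype only controls those conversion errors (errors="ignore" returns the input unchanged on failure), it does not affect the silent unsafe casts discussed here:

>>> pd.Series(["1.0", "A"]).astype("float64", errors="ignore")
0    1.0
1      A
dtype: object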

cc @pandas-dev/pandas-core


Concrete examples

Integer overflow

This can happen when casting to a different bit-width or signedness. Generally, in astype, we don't check for this and silently overflow (following numpy behaviour):

>>> pd.Series([1000], dtype="int64").astype("int8")
0   -24
dtype: int8

In the Series constructor, we already added a deprecation warning about changing this in the future:

>>> pd.Series([1000], dtype="int8")
FutureWarning: Values are too large to be losslessly cast to int8. In a future version this
will raise OverflowError. To retain the old behavior, use pd.Series(values).astype(int8)
0   -24
dtype: int8

Another example casting a negative number to unsigned integer:

>>> pd.Series([-1000], dtype="int64").astype("uint64")
0    18446744073709550616
dtype: uint64

Float truncation

This typically happens when casting floats to integer when the floating point numbers are not whole numbers. Following numpy, the behaviour of our astype or constructors is to truncate the floats:

>>> pd.Series([0.5, 1.5], dtype="float64").astype("int64")
0    0
1    1
dtype: int64

Many might find this the expected behaviour, but I want to point out that it can actually be better to explicitly round/ceil/floor, as truncation is not the same as rounding (which I think users would naively expect). For example, you get different numbers here with round vs the astype shown above:

>>> pd.Series([0.5, 1.5], dtype="float64").round()
0    0.0
1    2.0
dtype: float64

In the constructor, when not starting from a numpy array, we actually already raised an error for float truncation in older versions (on master this seems to ignore the dtype and give a float result):

>>> pd.Series([1.0, 2.5], dtype="int64") 
...
ValueError: Trying to coerce float values to integers

The truncation can also happen in the cast the other way around, from integer to float. Large integers cannot always be faithfully represented in the float range. For example:

>>> pd.Series([1_100_100_100_100], dtype="int64").astype("float32")
0    1.100100e+12
dtype: float32

# the repr above is not clear about the truncation, but casting back to integer shows it
>>> pd.Series([1_100_100_100_100], dtype="int64").astype("float32").astype("int64")
0    1100100141056
dtype: int64

Timestamp overflow

Numpy is known to silently overflow for out-of-bounds timestamps when casting to a different resolution, eg:

>>> np.array(["2300-01-01"], dtype="datetime64[s]").astype("datetime64[ns]")
array(['1715-06-13T00:25:26.290448384'], dtype='datetime64[ns]')

We already check for this, and eg raise in the constructor:

>>> pd.Series(np.array(["2300-01-01"], dtype="datetime64[s]"), dtype="datetime64[ns]")
...
OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 2300-01-01 00:00:00

When we support multiple resolutions, this will also apply to astype.
(and the same also applies to timedelta data)

Timestamp truncation

Related to the above, but now going to a coarser resolution, where you can lose information. Numpy will also silently truncate in this case:

>>> np.array(["2022-01-01 00:00:00.01"], dtype="datetime64[ns]").astype("datetime64[s]")
array(['2022-01-01T00:00:00'], dtype='datetime64[s]')

In pandas you can see a similar behaviour (the result is truncated, but the return value is still in nanoseconds):

>>> pd.Series(["2022-01-01 00:00:00.01"], dtype="datetime64[ns]").astype("datetime64[s]")
0   2022-01-01
dtype: datetime64[ns]

When we support multiple resolutions, this will become more relevant. And similar to the float truncation above, it might be more explicit to round/ceil/floor first.
(and the same also applies to timedelta data)
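
For example, flooring explicitly to the target resolution first makes the truncation an intentional step:

>>> pd.Series(["2022-01-01 00:00:00.01"], dtype="datetime64[ns]").dt.floor("S")
0   2022-01-01
dtype: datetime64[ns]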

Sidenote: something similar can happen for Period data, but there we don't support rounding as an alternative.

NA / NaN conversion

One additional case that is somewhat pandas specific, because not all dtypes support missing values, is casting data with missing values to integer dtype (not sure if there are actually other dtypes affected?).

Again, numpy silently gives wrong numbers:

>>> np.array([1.0, np.nan], dtype="float64").astype("int64")
array([                   1, -9223372036854775808])

In pandas, in most cases, we actually already have safe casting for this case, and raise an error. For example:

>>> pd.Series([1.0, np.nan], dtype="float64").astype("int64") 
...
IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer

There are some cases, however, where we still silently convert the NaN / NaT to a number:

>>> pd.array(["2012-01-01", "NaT"], dtype="datetime64[ns]").astype("int64") 
array([ 1325376000000000000, -9223372036854775808])

Note that this actually is broader than NaN in the float->int case, as we also have the same error when casting inf to int. So also for the nullable integer dtype, the "non-finite" values are still a relevant case.
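
For comparison, casting to the nullable integer dtype already preserves NaN as a missing value instead of converting it to a number:

>>> pd.Series([1.0, np.nan], dtype="float64").astype("Int64")
0       1
1    <NA>
dtype: Int64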

@jorisvandenbossche added the API Design, Needs Discussion and Astype labels on Jan 24, 2022
@Dr-Irv (Contributor) commented Jan 24, 2022

  • pandas should be consistent with itself between Series(values, dtype=dtype) and values.astype(dtype=dtype).
  • pandas should by default error at runtime when type casting causes loss of equality / information (integer overflow, float -> int truncation, ...), with the option to disable that check (since it will be expensive).

I agree with this proposal. Nice write up.

@attack68 (Contributor)

Great write up. I agree in principle.

I'll just play devil's advocate and suggest some scenarios which it might be worthwhile to think through:

Float truncation successful subsets:

If Series([1.1, 2.2], dtype="float64").astype("int64") fails because it loses information, should Series([1.0, 2.0], dtype="float64").astype("int64") also fail as a wider part of the failing class, even though this particular subset does not lose information? Or should it still succeed, since there may be a common use case where integers are read in as floats and then conversion of this subset happens to be a particularly common operation?

Composition of operations:

Will each conversion be treated individually, or is there a generic structure that you are proposing to put in place, for custom datatypes also? For example:

One may have Series([1000.1], dtype="float64").astype("int64") failing as above, but one may have:

float_ = Series([1000.1], dtype="float64")  # is valid
ts_ = float_.astype("datetime")  # is valid
dt_ = ts_.astype("date")  # truncates the time of a datetime
int_ = dt_.astype("int") # is a direct conversion.

In this example the float to int truncation is negated by the datetime to date truncation, which is quite natural.

In response to the "is the safe=True/False toggle enough" question: perhaps an option could instruct on the truncation casts?

@jorisvandenbossche (Member, Author)

I'll just play devil's advocate

Thanks! That's always useful :)

Float truncation successful subsets:

If Series([1.1, 2.2], dtype="float64").astype("int64") fails because it loses information, should Series([1.0, 2.0], dtype="float64").astype("int64") also fail as a wider part of the failing class, even though this particular subset does not lose information? Or should it still succeed, since there may be a common use case where integers are read in as floats and then conversion of this subset happens to be a particularly common operation?

Yes, the idea is that generally casting float -> int works, and only raises (by default) if truncation would happen. In the case of Series([1.0, 2.0], dtype="float64").astype("int64"), no truncation would happen, so there would be no error.

The use case you bring up is indeed a typical one for which this new behaviour would work nicely IMO: you have a column with, in theory, integer values, but for some reason they are stored as floats (e.g. because of np.nan being present, which is a very common case in pandas I think), and you want to convert them to integers (eg after doing fillna()) while being sure you are not accidentally truncating actual float values. If you do want to truncate the float values, you can do that explicitly with round() (or with safe=False in astype).
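
For example, the intended workflow would then look like this, with the safe check guarding the final cast:

>>> s = pd.Series([1.0, np.nan], dtype="float64")
>>> s.fillna(0).astype("int64")
0    1
1    0
dtype: int64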

Composition of operations:

Will each conversion be treated individually or is there generic structure that you are proposing to put in place, for custom datatypes also.

I suppose the exact behaviour of each cast will be a case-by-case decision for the involved dtypes, but we should of course make sure we have some general guidelines or rules on what we consider safe or not (the top post tries to provide some basis for this), and try to ensure this gives a consistent behaviour for the different dtypes in pandas.

ts_ = float_.astype("datetime") # is valid

This case is not explicitly included in the top post, but I would say this is also not valid if truncation happens, to be consistent with the float -> int cast (basically, float -> datetime is a float -> int cast under the hood). Now, it's also an open question whether we want to allow this cast in the first place (see #45034 (comment) for this discussion).

If the above raises when truncation happens, that also solves the "problem" of being able to sidestep truncation in a float -> int cast by going through datetime.

In response to the is the safe=True/False toggle enough, perhaps an option could instruct on the truncation casts?

Can you clarify this last bit? What do you mean with "instruct on the truncation casts"?

@jorisvandenbossche (Member, Author)

We discussed this a bit on the community call last week. Summarizing some take-aways / discussion points from that.

First, given that this still caused some confusion, I want to reiterate the difference with numpy's casting levels (the casting keyword in, for example, ndarray.astype, with possible values of "no", "equiv", "safe", "same_kind", "unsafe").
In numpy, those casting levels are purely based on the dtypes, while what I propose here is about behaviour that is based on the values that are being cast.

Concrete example: in numpy, casting int8 to int64 is a safe cast, and casting int64 to int8 is not (regardless of whether the actual values fit in the int8 range). That means that you can either ask for a safe cast and always get an error even if the values are in range, or ask for an unsafe cast and always get a silent overflow in case of out-of-range values:

# actual numpy behaviour

>>> np.array([10], dtype="int64").astype("int8", casting="safe")
TypeError: Cannot cast array data from dtype('int64') to dtype('int8') according to the rule 'safe'
>>> np.array([1000], dtype="int64").astype("int8", casting="unsafe")
array([-24], dtype=int8)

What you can't obtain with numpy's astype and casting levels (without manual checking) is to allow a cast from int64 to int8, but raise an error if there would be overflow. So with a hypothetical example:

# proposed behaviour for pandas

# 10 is within range for int8, so this cast works
>>> pd.Series([10], dtype="int64").astype("int8")
0    10
dtype: int8
# but 1000 would overflow, so we raise an error
>>> pd.Series([1000], dtype="int64").astype("int8")
ValueError: casting from int64 to int8 would overflow / value 1000 not in range for int8

So this kind of value-based behaviour is not part of numpy's "casting levels" concept.

(in addition, there are also some casts that numpy considers "safe" that are not safe at all, such as np.array([1_000_000_0000], dtype="datetime64[s]").astype("datetime64[ns]", casting="safe"), which converts s to ns resolution and actually overflows)

All to say that what is proposed here in this issue is not an adaptation of numpy's casting levels in pandas.


One can argue that this gives "value-dependent behaviour", which is something we are trying to move away from in other contexts. This is true, but there are two reasons why I think this is fine in this case: 1) it's not the resulting shape or dtype that is value-dependent, but only whether it errors at runtime or not (while for example in concat we have cases where the resulting dtype depends on the values, which is something we want to avoid), and 2) we already have such value-dependent behaviour in casting to some extent.

For this second argument, take for example casting a string to float with current numpy or pandas:

>>> np.array(["1.2"]).astype("float64")
array([1.2])

>>> np.array(["A"]).astype("float64")
...
ValueError: could not convert string to float: 'A'

This already has the "raise ValueError if conversion cannot be done correctly" type of behaviour (so numpy also has this type of behaviour in this case; it is just not affected by the casting keyword). And basically this issue proposes to extend the number of cases where we raise such a ValueError (by default).


Some specific aspects that came up in the discussion:

  • Do we have better terminology to talk about this / for naming keywords?
    Above I use the term "safe", but this can cause confusion given that numpy also has a casting level named "safe", while it means something different in practice. On the other hand, for people not familiar with numpy's casting levels, "safe" might actually be one of the best / clearest terms?
    Alternatives that were suggested were "strict", "intact" or "save_values".
  • Could we actually re-use the casting keyword from numpy, and expand it in pandas' astype with additional options?
    (Personally, I wouldn't do this: it's a different concept (see the dtype vs value based explanation above), so combining it might not make it clearer I think. In addition, I don't recall much demand in pandas to actually expose the casting levels from numpy)
  • What is the relationship with the errors keyword?
    The default of errors="raise" could become the "safe" casting described here (raising an error if a conversion would fail), and we could have an errors="unsafe" option to get the faster, non-checking version.
  • The keyword we would add to control this safety could take an Enum as value, to enable fine-grained control case by case (allow one case but not another, such as allowing float to int truncation but not int overflow). Such an enum could work like the re flags that can be combined with |; see the sketch after this list. We could still have a value that can be passed to the keyword to disable or enable all cases for convenience (such as the proposed safe=True/False)
  • One additional case of "unsafe casting" that was mentioned and is not included in the examples in the top post, is casting to categorical dtype with values not present in the categories.
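
A minimal sketch of what such a combinable flag could look like (the CastSafety name and the safe= keyword are hypothetical; nothing like this exists in pandas today):

import enum

class CastSafety(enum.Flag):
    # one flag per unsafe-cast case from the top post
    INTEGER_OVERFLOW = enum.auto()
    FLOAT_TRUNCATION = enum.auto()
    TIMESTAMP_OVERFLOW = enum.auto()
    NA_CONVERSION = enum.auto()
    ALL = INTEGER_OVERFLOW | FLOAT_TRUNCATION | TIMESTAMP_OVERFLOW | NA_CONVERSION

# hypothetical usage: check for integer overflow, but allow float truncation
# s.astype("int8", safe=CastSafety.INTEGER_OVERFLOW)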

@bashtage (Contributor) commented Feb 17, 2022 via email

@jorisvandenbossche (Member, Author)

Would it be better to invent a new conversion type, something like "value_safe" or just "value", which would perform the check? The downside of always checking is that it could be expensive in large arrays.

@bashtage thanks for taking a look at this! And sorry for the slow reply.
Can you clarify the above statement a bit? What do you mean exactly with a "new conversion type"? (an additional value for numpy's casting keyword? Or a separate method?) And do you mean that you would rather see it opt-in, than make it (eventually) the default behaviour?

Also, what about converting int64 -> double? While this is considered "safe", it isn't really in the sense that it loses information.

Indeed, such a conversion can also lose information for very large integers. This case is mentioned in the top post (in the "Float truncation" section under "Concrete examples"; I can't seem to link to it directly), but I should maybe make the int -> float case its own section as well for visibility.

A related function or keyword would be an automatic version of astype that would automatically cast integer values to the smallest type that can represent the data.

We already have this somewhat available in to_numeric (eg pd.to_numeric(pd.Series([1, 2, 3], dtype="int64"), downcast="integer") returns a Series with int8 dtype), but that's a bit hidden / not convenient to use from a dataframe (while useful, I would agree this is out-of-scope for the current discussion).

@jbrockmendel (Member)

Is the way forward here to add a keyword without yet changing the default behavior?

@rhshadrach (Member)

As long as there is a way to disable it, I'm not opposed. This falls under int/float overflow, but I'll mention that string to int/float also won't roundtrip for certain values.
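
For illustration (this is just current behaviour, not part of the proposal), a value where the string -> float cast does not roundtrip:

>>> pd.Series(["9007199254740993"]).astype("float64").astype("int64")
0    9007199254740992
dtype: int64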
