GH-33321: [Python] Support converting to non-nano datetime64 for pandas >= 2.0 #35656
Conversation
I'm looking for early feedback to see if this is the right approach. There are many test cases that will need updating, but I didn't want to tackle them yet in case we take a different approach.
Yes, that approach looks good, and is actually simpler than I thought it would be, since we already control this with the single option switch (for the code; the tests will indeed get a bit messier).
I think one question is whether we want to make that option public through the to_pandas API, so people could still override it to get nanoseconds if they want (to get back the pre-pandas-2.0 behaviour).
I'll expose this! I agree it's best to allow continued use of the legacy behavior for a while.
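To sketch what exposing it could look like from the user side (an illustration on my part; the keyword name coerce_temporal_nanoseconds is the one discussed later in this thread):

import pyarrow as pa

table = pa.table({"ts": pa.array([1, 2, 3], pa.timestamp("us"))})

# With pandas >= 2.0 the default conversion keeps the microsecond unit.
df = table.to_pandas()
print(df["ts"].dtype)  # datetime64[us]

# Opting back into the pre-pandas-2.0 behaviour coerces to nanoseconds.
df_legacy = table.to_pandas(coerce_temporal_nanoseconds=True)
print(df_legacy["ts"].dtype)  # datetime64[ns]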
python/pyarrow/types.pxi
Outdated
_Type_DATE64: np.dtype('datetime64[ns]'),
_Type_TIMESTAMP: np.dtype('datetime64[ns]'),
_Type_DURATION: np.dtype('timedelta64[ns]'),
_Type_DATE32: np.dtype('datetime64[D]'),
pandas only supports the range of seconds to nanoseconds, so for dates we should maybe still default to datetime64[s]? (otherwise I assume this conversion would happen anyway on the pandas side)
Thank you! NumPy supports [D]ay, but pandas does not.
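A quick illustration of that unit-support gap (my own example; pandas 2.x coerces unsupported NumPy units to the closest supported one, as also noted later in this thread):

import numpy as np
import pandas as pd

days = np.arange("2017-01-01", "2017-01-04", dtype="datetime64[D]")
print(days.dtype)  # datetime64[D] -- a valid NumPy unit

s = pd.Series(days)
print(s.dtype)     # datetime64[s] with pandas >= 2.0: "D" is coerced
                   # to the closest supported resolution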
One thing I found is that Parquet only supports [ms], [us], and [ns]. So now several pyarrow dataset tests are failing because datasets with [D]ay units are being converted to [ms] units. I'm somewhat inclined to convert date32 to [ms] by default so we don't have to add a conversion from [ms] -> [s] when doing a Parquet roundtrip. Or we just let it happen and modify the tests.
This wasn't a problem before when everything was coerced to [ns], which parquet supports.
I'm somewhat inclined to convert date32 to [ms] by default so we don't have to add a conversion from [ms] -> [s] when doing a parquet roundtrip
Yes, that sounds like a good idea (then it also gives the same result for date32 vs date64)
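For illustration, assuming the [ms] default lands as discussed here, both date types would then come back with the same dtype (a sketch, not a guaranteed final behaviour):

import pyarrow as pa

table = pa.table({"d32": pa.array([0, 1], pa.date32()),
                  "d64": pa.array([0, 86400000], pa.date64())})

df = table.to_pandas(date_as_object=False)
print(df["d32"].dtype, df["d64"].dtype)  # datetime64[ms] datetime64[ms]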
several pyarrow dataset tests are failing because datasets with [D]ay units are being converted to [ms] units
Can you point to which test is failing? Because this is about conversion from pyarrow to pandas, right? (not arrow<->parquet roundtrip, which should be able to preserve our date32 type because we store the arrow schema)
The current tests manipulate the types so the test cases pass, but these were the tests that originally were failing (the tests in their current state are a bit of a mess right now; I need to go back and clean them up once the implementation actually appears to work properly):
FAILED pyarrow/tests/parquet/test_dataset.py::test_write_to_dataset_with_partitions[True] - AssertionError: Attributes of DataFrame.iloc[:, 4] (column name="date") are different
FAILED pyarrow/tests/parquet/test_dataset.py::test_write_to_dataset_with_partitions[False] - AssertionError: Attributes of DataFrame.iloc[:, 4] (column name="date") are different
FAILED pyarrow/tests/parquet/test_dataset.py::test_write_to_dataset_with_partitions_and_schema[True] - AssertionError: Attributes of DataFrame.iloc[:, 4] (column name="date") are different
FAILED pyarrow/tests/parquet/test_dataset.py::test_write_to_dataset_with_partitions_and_schema[False] - AssertionError: Attributes of DataFrame.iloc[:, 4] (column name="date") are different
FAILED pyarrow/tests/parquet/test_dataset.py::test_write_to_dataset_with_partitions_and_index_name[True] - AssertionError: Attributes of DataFrame.iloc[:, 4] (column name="date") are different
FAILED pyarrow/tests/parquet/test_dataset.py::test_write_to_dataset_with_partitions_and_index_name[False] - AssertionError: Attributes of DataFrame.iloc[:, 4] (column name="date") are different
FAILED pyarrow/tests/parquet/test_dataset.py::test_write_to_dataset_no_partitions[True] - AssertionError: Attributes of DataFrame.iloc[:, 3] (column name="date") are different
FAILED pyarrow/tests/parquet/test_dataset.py::test_write_to_dataset_no_partitions[False] - AssertionError: Attributes of DataFrame.iloc[:, 3] (column name="date") are different
@pytest.mark.filterwarnings("ignore:'ParquetDataset.schema:FutureWarning")
def _test_write_to_dataset_with_partitions(base_path,
                                           use_legacy_dataset=True,
                                           filesystem=None,
                                           schema=None,
                                           index_name=None):
    import pandas as pd
    import pandas.testing as tm

    import pyarrow.parquet as pq

    # ARROW-1400
    output_df = pd.DataFrame({'group1': list('aaabbbbccc'),
                              'group2': list('eefeffgeee'),
                              'num': list(range(10)),
                              'nan': [np.nan] * 10,
                              'date': np.arange('2017-01-01', '2017-01-11',
                                                dtype='datetime64[D]')})
    output_df["date"] = output_df["date"]
    cols = output_df.columns.tolist()
    partition_by = ['group1', 'group2']
    output_table = pa.Table.from_pandas(output_df, schema=schema, safe=False,
                                        preserve_index=False)
    pq.write_to_dataset(output_table, base_path, partition_by,
                        filesystem=filesystem,
                        use_legacy_dataset=use_legacy_dataset)
    metadata_path = os.path.join(str(base_path), '_common_metadata')

    if filesystem is not None:
        with filesystem.open(metadata_path, 'wb') as f:
            pq.write_metadata(output_table.schema, f)
    else:
        pq.write_metadata(output_table.schema, metadata_path)

    # ARROW-2891: Ensure the output_schema is preserved when writing a
    # partitioned dataset
    dataset = pq.ParquetDataset(base_path,
                                filesystem=filesystem,
                                validate_schema=True,
                                use_legacy_dataset=use_legacy_dataset)
    # ARROW-2209: Ensure the dataset schema also includes the partition columns
    if use_legacy_dataset:
        with pytest.warns(FutureWarning, match="'ParquetDataset.schema'"):
            dataset_cols = set(dataset.schema.to_arrow_schema().names)
    else:
        # NB schema property is an arrow and not parquet schema
        dataset_cols = set(dataset.schema.names)

    assert dataset_cols == set(output_table.schema.names)

    input_table = dataset.read(use_pandas_metadata=True)
    input_df = input_table.to_pandas()

    # Read data back in and compare with original DataFrame
    # Partitioned columns added to the end of the DataFrame when read
    input_df_cols = input_df.columns.tolist()
    assert partition_by == input_df_cols[-1 * len(partition_by):]

    input_df = input_df[cols]
    # Partitioned columns become 'categorical' dtypes
    for col in partition_by:
        output_df[col] = output_df[col].astype('category')

    # if schema is None and Version(pd.__version__) >= Version("2.0.0"):
    #     output_df['date'] = output_df['date'].astype('datetime64[ms]')
>   tm.assert_frame_equal(output_df, input_df)
E   AssertionError: Attributes of DataFrame.iloc[:, 4] (column name="date") are different
E
E   Attribute "dtype" are different
E   [left]:  datetime64[s]
E   [right]: datetime64[ms]
I think those test failures are related to the fact that, with our defaults, parquet doesn't support nanoseconds, and we actually don't try to preserve the unit when roundtripping from arrow<->parquet:
In [1]: table = pa.table({"col": pa.array([1, 2, 3], pa.timestamp("s")).cast(pa.timestamp("ns"))})
In [2]: import pyarrow.parquet as pq
In [3]: pq.write_table(table, "test_nanoseconds.parquet")
In [4]: pq.read_table("test_nanoseconds.parquet")
Out[4]:
pyarrow.Table
col: timestamp[us]
----
col: [[1970-01-01 00:00:01.000000,1970-01-01 00:00:02.000000,1970-01-01 00:00:03.000000]]
So starting with an arrow table with nanoseconds, the result has microseconds (even though we actually could preserve the original unit, because we store the original arrow schema in the parquet metadata; although that would not be a zero-copy restoration, in contrast to, for example, restoring the timezone, or restoring duration from int64, which is done in ApplyOriginalStorageMetadata).
So this means that whenever we start with nanoseconds, we get back microseconds after a roundtrip to parquet. And if the roundtrip actually started from pandas using nanoseconds, we now also get microseconds in the pandas result (while before we still got nanoseconds, since we forced using that in the arrow->pandas conversion step).
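As an aside (my own illustration, not part of this PR): the unit can survive the roundtrip if the file is written with a newer Parquet format version, since Parquet 2.6 added a nanosecond timestamp type:

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"col": pa.array([1, 2, 3], pa.timestamp("ns"))})

# Parquet format >= 2.6 has a nanosecond timestamp logical type, so the
# unit is preserved instead of being coerced to microseconds on write.
pq.write_table(table, "test_nanoseconds.parquet", version="2.6")
print(pq.read_table("test_nanoseconds.parquet").schema)  # col: timestamp[ns]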
Current update: the tests failing locally for me are 1) parquet dataset roundtrips where date32 days are converted to milliseconds instead of seconds, because seconds are not supported in parquet, and 2) all TZ-aware timestamps are defaulted to nanoseconds (i.e. I need to add support for other time units in C++). For (1), I mentioned in another comment that we can convert date32 to millisecond instead of second. For (2), I just need to add support, but it's going to grow this PR even larger, unfortunately. Edit: For (1), I think it's actually fine to keep date32 as [s]econd. It's a known limitation that parquet does not support this unit type.
If PR size is a concern, this is also something that could be done as a precursor. It's actually already an issue that shows up in conversion to numpy as well.
While this could also be perfectly zero-copy to microseconds in the case with a timezone (we just return the underlying UTC values anyway).
For the tz-aware update, that also influences the (currently untested) to_numpy behaviour, which you can test with the following change:

--- a/python/pyarrow/tests/test_array.py
+++ b/python/pyarrow/tests/test_array.py
@@ -211,9 +211,10 @@ def test_to_numpy_writable():
         arr.to_numpy(zero_copy_only=True, writable=True)


+@pytest.mark.parametrize('tz', [None, "UTC"])
 @pytest.mark.parametrize('unit', ['s', 'ms', 'us', 'ns'])
-def test_to_numpy_datetime64(unit):
-    arr = pa.array([1, 2, 3], pa.timestamp(unit))
+def test_to_numpy_datetime64(unit, tz):
+    arr = pa.array([1, 2, 3], pa.timestamp(unit, tz=tz))
     expected = np.array([1, 2, 3], dtype="datetime64[{}]".format(unit))
     np_arr = arr.to_numpy()
     np.testing.assert_array_equal(np_arr, expected)
Thank you! Updated and it passes out of the gate 🎉
int64_t* out_values = this->GetBlockColumnStart(rel_placement);
if (type == Type::DATE32) {
  // Convert from days since epoch to datetime64[ms]
  ConvertDatetimeLikeNanos<int32_t, 86400000L>(*data, out_values);
I am wondering whether, if we do such a naive multiplication here, we can get overflow errors for out-of-bounds timestamps (but this is already the case for the current code converting to nanoseconds as well, to be clear).
Technically, I think this specific multiplication is fine. INT32_MAX * 86400000 = 1.8554259e+17, while INT64_MAX = 9.223372e+18.
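A standalone sanity check of that arithmetic (my own snippet, not part of the PR):

# Worst case: the largest date32 value (days since epoch) scaled to ms.
worst_case = (2**31 - 1) * 86_400_000  # INT32_MAX days in milliseconds
int64_max = 2**63 - 1
assert worst_case < int64_max  # ~1.86e17 < ~9.22e18, so no int64 overflow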
Ah, yes, when milliseconds is the target this is probably fine. For the nanoseconds case, though, this gives wrong results. Opened #36084 about that.
@@ -176,8 +176,8 @@ def alltypes_sample(size=10000, seed=0, categorical=False):
     # TODO(wesm): Test other timestamp resolutions now that arrow supports
     # them
     'datetime': np.arange("2016-01-01T00:00:00.001", size,
-                          dtype='datetime64[ms]').astype('datetime64[ns]'),
+                          dtype='datetime64[ms]'),
     'timedelta': np.arange(0, size, dtype="timedelta64[ns]"),
We can maybe keep both the original ns and the new ms resolution? (to test both)
Quickly enabling it does open up a small can of worms:
FAILED pyarrow/tests/parquet/test_data_types.py::test_parquet_2_0_roundtrip[None-True] - pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data: 1451606400001000001
FAILED pyarrow/tests/parquet/test_data_types.py::test_parquet_2_0_roundtrip[None-False] - pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data: 1451606400001000001
FAILED pyarrow/tests/parquet/test_data_types.py::test_parquet_2_0_roundtrip[1000-True] - pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data: 1451606400001000001
FAILED pyarrow/tests/parquet/test_data_types.py::test_parquet_2_0_roundtrip[1000-False] - pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data: 1451606400001000001
FAILED pyarrow/tests/parquet/test_dataset.py::test_write_to_dataset_with_partitions_and_schema[True] - AssertionError: Attributes of DataFrame.iloc[:, 4] (column name="date") are different
FAILED pyarrow/tests/parquet/test_dataset.py::test_write_to_dataset_with_partitions_and_schema[False] - AssertionError: Attributes of DataFrame.iloc[:, 4] (column name="date") are different
FAILED pyarrow/tests/parquet/test_metadata.py::test_parquet_metadata_api - pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data: 1451606400001000001
FAILED pyarrow/tests/parquet/test_metadata.py::test_compare_schemas - pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data: 1451606400001000001
FAILED pyarrow/tests/parquet/test_pandas.py::test_pandas_parquet_custom_metadata - pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data: 1451606400001000001
FAILED pyarrow/tests/parquet/test_pandas.py::test_pandas_parquet_column_multiindex[True] - pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data: 1451606400001000001
FAILED pyarrow/tests/parquet/test_pandas.py::test_pandas_parquet_column_multiindex[False] - pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data: 1451606400001000001
FAILED pyarrow/tests/parquet/test_pandas.py::test_pandas_parquet_2_0_roundtrip_read_pandas_no_index_written[True] - pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data: 1451606400001000001
FAILED pyarrow/tests/parquet/test_pandas.py::test_pandas_parquet_2_0_roundtrip_read_pandas_no_index_written[False] - pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data: 1451606400001000001
FAILED pyarrow/tests/parquet/test_parquet_file.py::test_iter_batches_columns_reader[300] - pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data: 1451606400001000001
FAILED pyarrow/tests/parquet/test_parquet_file.py::test_iter_batches_columns_reader[1000] - pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data: 1451606400001000001
FAILED pyarrow/tests/parquet/test_parquet_file.py::test_iter_batches_columns_reader[1300] - pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data: 1451606400001000001
FAILED pyarrow/tests/parquet/test_parquet_file.py::test_iter_batches_reader[1000] - pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data: 1451606400001000001
Maybe better for a follow-up PR?
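For context, a minimal reproduction of the lossy-cast error those tests hit (my own sketch):

import pyarrow as pa

# A nanosecond timestamp with sub-millisecond precision, like the failing value.
arr = pa.array([1451606400001000001], pa.timestamp("ns"))

try:
    arr.cast(pa.timestamp("ms"))  # the default safe cast refuses to drop precision
except pa.ArrowInvalid as exc:
    print(exc)  # Casting from timestamp[ns] to timestamp[ms] would lose data: ...

truncated = arr.cast(pa.timestamp("ms"), safe=False)  # explicit opt-in truncates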
Ok, I am adding and fixing the tests. I just needed to remove the coercion to ms in the test cases.
Good suggestion!
python/pyarrow/tests/test_pandas.py
Outdated
@@ -1202,7 +1211,7 @@ def test_table_convert_date_as_object(self):
         df_datetime = table.to_pandas(date_as_object=False)
         df_object = table.to_pandas()

-        tm.assert_frame_equal(df.astype('datetime64[ns]'), df_datetime,
+        tm.assert_frame_equal(df.astype('datetime64[ms]'), df_datetime,
Do we have coverage for testing that it stays nanoseconds if you specify coerce_temporal_nanoseconds=True?
Not yet, will add!
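A sketch of what such a test could look like (hypothetical; simplified relative to whatever was eventually added):

import numpy as np
import pyarrow as pa

def test_coerce_temporal_nanoseconds():
    table = pa.table({"ts": pa.array([1, 2, 3], pa.timestamp("ms"))})
    df = table.to_pandas(coerce_temporal_nanoseconds=True)
    assert df["ts"].dtype == np.dtype("datetime64[ns]")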
@github-actions crossbow submit -g integration
Revision: 6ffb5e5. Submitted crossbow builds: ursacomputing/crossbow @ actions-40508e3899
# Arrow to Pandas v2 will convert date32 to [ms]. Pandas v1 will always
# silently coerce to [ns] due to non-[ns] support.
expected_date_type = 'datetime64[ms]'
This comment is not fully correct, I think (when converting the pandas dataframe to pyarrow, we actually don't have date32, but timestamp type). But then I also don't understand how this test is passing...
So what actually happens with pandas 2.x: when we create a DataFrame with datetime64[D], that gets converted to datetime64[s] (the closest supported resolution to "D"). Then roundtripping to parquet turns that into "ms" (because "s" is not supported by Parquet).
With older pandas this gets converted to datetime64[ns], comes back from Parquet as "us", and is converted back to "ns" when converting to pandas. But this astype("datetime64[ms]") essentially doesn't do anything, i.e. pandas preserves the "ns" because it doesn't support "ms", and hence the test also passes for older pandas.
Maybe it's simpler to just test with a DataFrame of nanoseconds, which now works the same with old and new pandas, and then we don't have to add any comment or astype.
Maybe it's simpler to just test with a DataFrame of nanoseconds, which now works the same with old and new pandas, and then we don't have to add any comment or astype.
Hmm, trying that out locally fails (but only with the non-legacy code path), and digging in, it seems that we are still writing Parquet v1 files with the dataset API...
Will open a separate issue and PR to quickly fix that separately.
-> #36538
I pushed a small clean-up on top of my PR to fix the Parquet version for the dataset writer. Further checked all changes to the parquet tests as well, and all looks good!
@github-actions crossbow submit -g integration
And with this PR and the Parquet v2.6 update combined, the failures in the dask builds are now much smaller (just one failure that was testing that a timestamp would overflow by being cast to nanoseconds).
Revision: a6487c2. Submitted crossbow builds: ursacomputing/crossbow @ actions-5c8f3f6cad
Thanks @danepitkin!
Thanks @jorisvandenbossche for the collaboration and support!
[pd.period_range("2012-01-01", periods=3, freq="D").array,
 pd.interval_range(1, 4).array])
@jorisvandenbossche @danepitkin We can't use pd here because pandas may not be available.
This causes an error in a "no pandas" environment: https://github.com/apache/arrow/actions/runs/5496447565/jobs/10016477233
This PR's CI succeeded because our "Without Pandas" job installed pandas implicitly. It has been fixed by #36542.
Could you open an issue for this and fix it?
Fixed -> #36586
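For reference, one common way to avoid module-level pd usage in parametrization (a sketch; not necessarily the approach taken in #36586):

import pytest

def test_to_pandas_extension_dtypes():
    # Import pandas lazily so test collection works when it is absent.
    pd = pytest.importorskip("pandas")
    arrays = [pd.period_range("2012-01-01", periods=3, freq="D").array,
              pd.interval_range(1, 4).array]
    for arr in arrays:
        ...  # exercise the conversion under test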
Conbench analyzed the 6 benchmark runs on this commit. There were 7 benchmark results indicating a performance regression.
The full Conbench report has more details.
Do not coerce temporal types to nanosecond when pandas >= 2.0 is imported, since pandas now supports s/ms/us time units.
This PR adds support for the following Arrow -> Pandas conversions, which previously all defaulted to datetime64[ns] or datetime64[ns, <TZ>]:
Rationale for this change
Pandas 2.0 introduces proper support for temporal types.
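For example, with pandas >= 2.0 a datetime Series can hold non-nanosecond resolutions (a minimal illustration):

import pandas as pd

# pandas >= 2.0 supports second/milli/microsecond resolutions natively.
s = pd.Series(pd.to_datetime(["2017-01-01"])).dt.as_unit("s")
print(s.dtype)  # datetime64[s]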
Are these changes tested?
Yes. Pytests added and updated.
Are there any user-facing changes?
Yes, arrow-to-pandas default conversion behavior will change when users have pandas >= 2.0, but a legacy option is exposed to provide backwards compatibility.
This PR includes breaking changes to public APIs.