[Python] Support converting to non-nano datetime64 for pandas >= 2.0 #33321
Can you provide some more detail about this paragraph? What does the "pandas metadata" refer to?
The "pandas metadata" is custom metadata that we store in the pyarrow schema whenever the data is created from a pandas.DataFrame: >>> df = pd.DataFrame({"col": pd.date_range("2012-01-01", periods=3, freq="D")})
>>> df
col
0 2012-01-01
1 2012-01-02
2 2012-01-03
>>> table = pa.table(df)
>>> table.schema
col: timestamp[ns]
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 413
```
# easier way to access this (and converted to a dict)
>>> table.schema.pandas_metadata
{'index_columns': [{'kind': 'range',
   'name': None,
   'start': 0,
   'stop': 3,
   'step': 1}],
 'column_indexes': [{'name': None,
   'field_name': None,
   'pandas_type': 'unicode',
   'numpy_type': 'object',
   'metadata': {'encoding': 'UTF-8'}}],
 'columns': [{'name': 'col',
   'field_name': 'col',
   'pandas_type': 'datetime',
   'numpy_type': 'datetime64[ns]',
   'metadata': None}],
 'creator': {'library': 'pyarrow', 'version': '13.0.0.dev106+gfbe5f641d'},
 'pandas_version': '2.1.0.dev0+484.g7187e67500'}
```

So this indicates that the original data in the pandas.DataFrame had "datetime64[ns]" type. In this case that matches the arrow type, but for example after a roundtrip through Parquet, this might no longer be the case:

```
>>> pq.write_table(table, "test.parquet")
>>> table2 = pq.read_table("test.parquet")
>>> table2.schema
col: timestamp[us]  # <--- now us instead of ns
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 413
```
>>> table2.schema.pandas_metadata
{ ...
'columns': [{'name': 'col',
'field_name': 'col',
'pandas_type': 'datetime',
'numpy_type': 'datetime64[ns]', # <--- but this still indicates ns
'metadata': None}],
... So the question is here what (note that this is also a consequence of the default parquet version we write not yet supporting nanoseconds, while we should probably bump that default version we write, and then the nanoseconds would be preserved in the Parquet roundtrip) Now, I am not sure if it would be easy to use the information of the pandas metadata to influence the conversion, as we typically only use the metadata after converting the actual data to finalize the resulting pandas DataFrame (eg set the index, cast the column names, ..). |
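To illustrate the Parquet-version point above: a minimal sketch, assuming the `version` option of `pyarrow.parquet.write_table`, showing that writing with Parquet format version 2.6 keeps nanosecond timestamps intact across the roundtrip:

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"col": pd.date_range("2012-01-01", periods=3, freq="D")})
table = pa.table(df)  # col: timestamp[ns]

# Parquet format version 2.6 supports nanosecond timestamps, so the
# ns resolution survives the roundtrip instead of being cast to us.
pq.write_table(table, "test_ns.parquet", version="2.6")
print(pq.read_table("test_ns.parquet").schema)  # col: timestamp[ns]
```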
Excellent write-up and example! I'd prefer supporting nanoseconds in Parquet and not converting based on the metadata, since it seems it's not guaranteed to be reliable. Filed #35746 to track Parquet support for ns.
This PR fixes parquet pytest failures, mostly working around two upstream issues:

1. pandas-dev/pandas#53345
2. apache/arrow#33321

This fixes 20 pytest failures. This PR:

```
= 231 failed, 95767 passed, 2045 skipped, 764 xfailed, 300 xpassed in 426.65s (0:07:06) =
```

On `pandas_2.0_feature_branch`:

```
= 251 failed, 95747 passed, 2045 skipped, 764 xfailed, 300 xpassed in 433.50s (0:07:13) =
```
…as >= 2.0 (#35656) Do not coerce temporal types to nanosecond when pandas >= 2.0 is imported, since pandas now supports s/ms/us time units. This PR adds support for the following Arrow -> Pandas conversions, which previously all defaulted to `datetime64[ns]` or `datetime64[ns, <TZ>]`: ``` date32 -> datetime64[ms] date64 -> datetime64[ms] datetime64[s] -> datetime64[s] datetime64[ms] -> datetime64[ms] datetime64[us] -> datetime64[us] datetime64[s, <TZ>] -> datetime64[s, <TZ>] datetime64[ms, <TZ>] -> datetime64[ms, <TZ>] datetime64[us, <TZ>] -> datetime64[us, <TZ>] ``` ### Rationale for this change Pandas 2.0 introduces proper support for temporal types. ### Are these changes tested? Yes. Pytests added and updated. ### Are there any user-facing changes? Yes, arrow-to-pandas default conversion behavior will change when users have pandas >= 2.0, but a legacy option is exposed to provide backwards compatibility. * Closes: #33321 Lead-authored-by: Dane Pitkin <dane@voltrondata.com> Co-authored-by: Dane Pitkin <48041712+danepitkin@users.noreply.github.com> Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
…36586)

See https://github.com/apache/arrow/pull/35656/files#r1257540254

#36542 fixed the no-pandas build by actually not having pandas installed, but some PRs merged at the same time introduced new failures for this build.

* Closes: #36541

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
…ges in arrow->pandas conversion (#36630) ### Rationale for this change Due to the changes on #33321 a dask test started failing. ### What changes are included in this PR? Skip the test in the meantime ### Are these changes tested? Yes, with crossbow ### Are there any user-facing changes? No * Closes: #36629 Authored-by: Raúl Cumplido <raulcumplido@gmail.com> Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
…ges in arrow->pandas conversion (#36630) ### Rationale for this change Due to the changes on #33321 a dask test started failing. ### What changes are included in this PR? Skip the test in the meantime ### Are these changes tested? Yes, with crossbow ### Are there any user-facing changes? No * Closes: #36629 Authored-by: Raúl Cumplido <raulcumplido@gmail.com> Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
…d changes in arrow->pandas conversion (apache#36630) ### Rationale for this change Due to the changes on apache#33321 a dask test started failing. ### What changes are included in this PR? Skip the test in the meantime ### Are these changes tested? Yes, with crossbow ### Are there any user-facing changes? No * Closes: apache#36629 Authored-by: Raúl Cumplido <raulcumplido@gmail.com> Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
…d changes in arrow->pandas conversion (apache#36630) ### Rationale for this change Due to the changes on apache#33321 a dask test started failing. ### What changes are included in this PR? Skip the test in the meantime ### Are these changes tested? Yes, with crossbow ### Are there any user-facing changes? No * Closes: apache#36629 Authored-by: Raúl Cumplido <raulcumplido@gmail.com> Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Lowered the pyarrow pin from 16 to 13, and added a skipif pytest marker to the `test_to_numpy_pyarrow_array_pyarrow_dtypes_string[string_view]` unit test so it does not run when pyarrow<16 is installed. The pin is set to pyarrow>=13 so that datetime64 unit preservation is handled, xref apache/arrow#33321.
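A minimal sketch of the kind of version-gated skip described above (the marker name, test body, and reason string are illustrative, not the project's actual code):

```python
import pyarrow as pa
import pytest
from packaging.version import Version

# skip string_view-related tests on pyarrow builds older than 16
requires_pyarrow_16 = pytest.mark.skipif(
    Version(pa.__version__) < Version("16.0"),
    reason="string_view dtype requires pyarrow>=16",
)

@requires_pyarrow_16
def test_string_view_roundtrip():
    arr = pa.array(["a", "b"], type=pa.string_view())
    assert arr.to_pylist() == ["a", "b"]
```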
Pandas is adding capabilities to store non-nanosecond datetime64 data. At the moment, however, we always convert to nanoseconds, regardless of the timestamp resolution of the arrow table (and regardless of the pandas metadata).
Using the development version of pandas:
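A minimal sketch of the behavior described, assuming a pandas 2.x development build and a pyarrow version from the time this issue was filed (the snippet is illustrative, not the issue's original example):

```python
import pandas as pd
import pyarrow as pa

# pandas itself can now hold second-resolution datetimes...
s = pd.Series(pd.date_range("2012-01-01", periods=3, freq="D")).astype("datetime64[s]")
print(s.dtype)  # datetime64[s]

# ...but the arrow->pandas conversion still coerces to nanoseconds
table = pa.table({"col": pa.array([1, 2, 3], type=pa.timestamp("s"))})
print(table.to_pandas().dtypes)  # col    datetime64[ns]  (at the time this issue was filed)
```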
This is because we have a `coerce_temporal_nanoseconds` conversion option which we hardcode to True for top-level columns (and to False for nested data). When users have pandas >= 2, we should support converting while preserving the resolution. We should certainly do so if the pandas metadata indicates which resolution was originally used (to ensure a correct roundtrip). A sketch of reading that resolution back follows below.
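For illustration, a sketch of how the originally-used resolution can be recovered from the pandas metadata, reusing the Parquet roundtrip from the discussion above:

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"col": pd.date_range("2012-01-01", periods=3, freq="D")})
pq.write_table(pa.table(df), "test.parquet")
table2 = pq.read_table("test.parquet")

# Even if the arrow column came back as timestamp[us], the pandas metadata
# still records the original pandas dtype for each column.
for col in table2.schema.pandas_metadata["columns"]:
    print(col["field_name"], "->", col["numpy_type"])  # col -> datetime64[ns]
```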
We could (and at some point should) also do that by default when there is no pandas metadata, though perhaps only later, depending on how stable this new feature is in pandas, since it is potentially a breaking change for users who e.g. use pyarrow to read a Parquet file.
Reporter: Joris Van den Bossche / @jorisvandenbossche
Note: This issue was originally created as ARROW-18124. Please see the migration documentation for further details.