[Python] Support converting to non-nano datetime64 for pandas >= 2.0 #33321

Closed
asfimport opened this issue Oct 21, 2022 · 3 comments · Fixed by #35656

asfimport commented Oct 21, 2022

Pandas is adding capabilities to store non-nanosecond datetime64 data. At the moment, however, we always convert to nanosecond resolution, regardless of the timestamp resolution of the Arrow table (and regardless of the pandas metadata).

Using the development version of pandas (with `numpy as np`, `pandas as pd`, and `pyarrow as pa` imported):

In [1]: df = pd.DataFrame({"col": np.arange("2012-01-01", 10, dtype="datetime64[s]")})

In [2]: df.dtypes
Out[2]: 
col    datetime64[s]
dtype: object

In [3]: table = pa.table(df)

In [4]: table.schema
Out[4]: 
col: timestamp[s]
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 423

In [6]: table.to_pandas().dtypes
Out[6]: 
col    datetime64[ns]
dtype: object

This is because we have a `coerce_temporal_nanoseconds` conversion option, which we hardcode to True for top-level columns (and to False for nested data).

When users have pandas >= 2, we should support converting while preserving the resolution. We should certainly do so if the pandas metadata indicates which resolution was originally used (to ensure a correct roundtrip).
We could (and at some point also should) do that by default as well when there is no pandas metadata, but maybe only later, depending on how stable this new feature is in pandas, as it is potentially a breaking change for our users (e.g. when using pyarrow to read a Parquet file).
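
To illustrate, here is a minimal sketch of the desired roundtrip, assuming pandas >= 2.0 and a pyarrow version that preserves the resolution (with the current hardcoded coercion, the last line still prints `datetime64[ns]`):

```python
# sketch of the desired behavior; assumes pandas >= 2.0 and a pyarrow
# build where to_pandas() preserves the timestamp resolution
import numpy as np
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"col": np.array(["2012-01-01", "2012-01-02"], dtype="datetime64[s]")})
table = pa.table(df)              # schema: col: timestamp[s]
result = table.to_pandas()        # desired: col stays datetime64[s]
print(result.dtypes)
```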

Reporter: Joris Van den Bossche / @jorisvandenbossche

Related issues:

Note: This issue was originally created as ARROW-18124. Please see the migration documentation for further details.

@danepitkin
Member

When users have pandas >= 2, we should support converting while preserving the resolution. We should certainly do so if the pandas metadata indicates which resolution was originally used (to ensure a correct roundtrip).
We could (and at some point also should) do that by default as well when there is no pandas metadata, but maybe only later, depending on how stable this new feature is in pandas, as it is potentially a breaking change for our users (e.g. when using pyarrow to read a Parquet file).

Can you provide some more detail about this paragraph? Does pandas metadata mean the pyarrow/array.pxi::PandasOptions class? When would pandas metadata not be present and why would that break parquet reading?

@jorisvandenbossche
Member

The "pandas metadata" is custom metadata that we store in the pyarrow schema whenever the data is created from a pandas.DataFrame:

>>> df = pd.DataFrame({"col": pd.date_range("2012-01-01", periods=3, freq="D")})
>>> df
         col
0 2012-01-01
1 2012-01-02
2 2012-01-03
>>> table = pa.table(df)
>>> table.schema
col: timestamp[ns]
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 413
# easier way to access this (and converted to a dict)
>>> table.schema.pandas_metadata
{'index_columns': [{'kind': 'range',
   'name': None,
   'start': 0,
   'stop': 3,
   'step': 1}],
 'column_indexes': [{'name': None,
   'field_name': None,
   'pandas_type': 'unicode',
   'numpy_type': 'object',
   'metadata': {'encoding': 'UTF-8'}}],
 'columns': [{'name': 'col',
   'field_name': 'col',
   'pandas_type': 'datetime',
   'numpy_type': 'datetime64[ns]',
   'metadata': None}],
 'creator': {'library': 'pyarrow', 'version': '13.0.0.dev106+gfbe5f641d'},
 'pandas_version': '2.1.0.dev0+484.g7187e67500'}

So this indicates that the original data in the pandas.DataFrame had "datetime64[ns]" type. In this case that matches the arrow type, but for example after a roundtrip through Parquet, this might no longer be the case:

>>> pq.write_table(table, "test.parquet")
>>> table2 = pq.read_table("test.parquet")
>>> table2.schema
col: timestamp[us]                 # <--- now us instead of ns
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 413
>>> table2.schema.pandas_metadata
{ ...
 'columns': [{'name': 'col',
   'field_name': 'col',
   'pandas_type': 'datetime',
   'numpy_type': 'datetime64[ns]',   # <--- but this still indicates ns
   'metadata': None}],
...

So the question here is what table2.to_pandas() should do: use the microsecond resolution of the data, or the nanosecond resolution of the metadata?

(Note that this is also a consequence of the default Parquet format version we write not yet supporting nanoseconds; we should probably bump that default version, and then the nanoseconds would be preserved in the Parquet roundtrip.)
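
For example, a sketch of that workaround (assuming a pyarrow version whose `pq.write_table` accepts `version="2.6"`, the Parquet format version that added nanosecond timestamps), reusing `table` from the snippet above:

```python
import pyarrow.parquet as pq

# format version "2.6" can store nanosecond timestamps, so the ns
# resolution survives the Parquet roundtrip instead of becoming us
pq.write_table(table, "test_ns.parquet", version="2.6")
print(pq.read_table("test_ns.parquet").schema)  # col: timestamp[ns]
```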

Now, I am not sure it would be easy to use the information in the pandas metadata to influence the conversion, as we typically only use the metadata after converting the actual data, to finalize the resulting pandas DataFrame (e.g. set the index, cast the column names, ...).
And I am also not fully sure it would actually be desirable to follow the pandas metadata, since that would involve an extra conversion step, and effectively all existing pandas metadata (e.g. in already-written Parquet files) will always say it is nanoseconds, since until recently that was the only resolution supported by pandas.

@danepitkin
Member

Excellent write-up and example! I'd prefer supporting nanoseconds in Parquet and not converting based on the metadata, since it seems like it's not guaranteed to be reliable. Filed #35746 to track Parquet support for ns.

galipremsagar added a commit to rapidsai/cudf that referenced this issue May 31, 2023
This PR fixes parquet pytest failures, mostly working around two upstream issues:

1. pandas-dev/pandas#53345
2. apache/arrow#33321

This fixes 20 pytest failures:
This PR:
```
= 231 failed, 95767 passed, 2045 skipped, 764 xfailed, 300 xpassed in 426.65s (0:07:06) =
```
On `pandas_2.0_feature_branch`:
```
= 251 failed, 95747 passed, 2045 skipped, 764 xfailed, 300 xpassed in 433.50s (0:07:13) =
```
@raulcd raulcd modified the milestones: 13.0.0, 14.0.0 Jul 7, 2023
jorisvandenbossche added a commit that referenced this issue Jul 7, 2023
…as >= 2.0 (#35656)

Do not coerce temporal types to nanosecond when pandas >= 2.0 is imported, since pandas now supports s/ms/us time units.

This PR adds support for the following Arrow -> pandas conversions, which previously all defaulted to `datetime64[ns]` or `datetime64[ns, <TZ>]`:
```
date32 -> datetime64[ms]
date64 -> datetime64[ms]
timestamp[s] -> datetime64[s]
timestamp[ms] -> datetime64[ms]
timestamp[us] -> datetime64[us]
timestamp[s, <TZ>] -> datetime64[s, <TZ>]
timestamp[ms, <TZ>] -> datetime64[ms, <TZ>]
timestamp[us, <TZ>] -> datetime64[us, <TZ>]
```
### Rationale for this change

Pandas 2.0 introduces proper support for non-nanosecond temporal types.

### Are these changes tested?

Yes. Pytests added and updated.

### Are there any user-facing changes?

Yes, arrow-to-pandas default conversion behavior will change when users have pandas >= 2.0, but a legacy option is exposed to provide backwards compatibility.
* Closes: #33321

Lead-authored-by: Dane Pitkin <dane@voltrondata.com>
Co-authored-by: Dane Pitkin <48041712+danepitkin@users.noreply.github.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
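
For reference, a rough sketch of what this change means in practice, assuming pyarrow >= 13.0 with this PR and pandas >= 2.0 (`coerce_temporal_nanoseconds` is the option discussed above, exposed here as a `to_pandas()` keyword for backwards compatibility):

```python
import pyarrow as pa

table = pa.table({"col": pa.array([0, 1, 2], type=pa.timestamp("s"))})

# new default with pandas >= 2.0: the second resolution is preserved
print(table.to_pandas().dtypes)                                   # col    datetime64[s]

# legacy behavior: coerce to nanoseconds, as before this change
print(table.to_pandas(coerce_temporal_nanoseconds=True).dtypes)   # col    datetime64[ns]
```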
@jorisvandenbossche jorisvandenbossche modified the milestones: 14.0.0, 13.0.0 Jul 7, 2023
jorisvandenbossche added a commit that referenced this issue Jul 10, 2023
…36586)

See https://github.com/apache/arrow/pull/35656/files#r1257540254

#36542 fixed the no-pandas build by actually not having pandas installed, but some PRs merged at the same time introduced new failures for this build.
* Closes: #36541

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
jorisvandenbossche pushed a commit that referenced this issue Jul 13, 2023
…ges in arrow->pandas conversion (#36630)

### Rationale for this change

Due to the changes on #33321 a dask test started failing.

### What changes are included in this PR?

Skip the test in the meantime

### Are these changes tested?

Yes, with crossbow

### Are there any user-facing changes?

No
* Closes: #36629

Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
@amoeba amoeba added the Breaking Change Includes a breaking change to the API label Nov 22, 2023
weiji14 added a commit to GenericMappingTools/pygmt that referenced this issue Nov 22, 2024
Lowered the pyarrow pin from 16 to 13, and added a skipif pytest marker to the `test_to_numpy_pyarrow_array_pyarrow_dtypes_string[string_view]` unit test so it does not run when pyarrow<16 is installed. The pin of pyarrow>=13 ensures that datetime64 unit preservation is handled, xref apache/arrow#33321.
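
A minimal sketch of that kind of guard (the marker name and reason string here are hypothetical, not taken from the pygmt commit):

```python
import pytest
import pyarrow as pa
from packaging.version import Version

# hypothetical marker: skip the test when the installed pyarrow is older than 16
skip_old_pyarrow = pytest.mark.skipif(
    Version(pa.__version__) < Version("16"),
    reason="string_view support requires pyarrow>=16",
)

@skip_old_pyarrow
def test_to_numpy_pyarrow_array_pyarrow_dtypes_string_view():
    ...
```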