
(fix): allow all extension array data types in pandas adapters #1

Open
wants to merge 88 commits into base: any-time-resolution-2 from ig/fix_extension_indexer

Conversation

@ilan-gold commented Oct 23, 2024

This probably still needs some work, but since pandas datetime handling often goes through extension arrays, I think these two issues are tightly linked.

cc: @shoyer @kmuehlbauer
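
As a minimal illustration of that link (a sketch on my part, assuming a recent pandas): pandas datetime data is itself backed by an `ExtensionArray` subclass, so the datetime and extension-array code paths overlap.

```python
import pandas as pd

# pd.array on datetime input returns a DatetimeArray, which is an
# ExtensionArray subclass -- datetime handling and extension-array
# handling share a code path.
arr = pd.array(pd.to_datetime(["2000-01-01", "2000-01-02"]))
print(type(arr).__name__)                                 # DatetimeArray
print(isinstance(arr, pd.api.extensions.ExtensionArray))  # True
```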

@ilan-gold (Author)

It's tough to do this without CI. I'll push a few examples of the datetime-related changes so you get a sense of them, but in general things are even cleaner (more datetimes are preserved).

@ilan-gold (Author)

I'm not sure we want to keep allowing pandas in-memory datetime machinery together with dask (59b03f2); maybe we should start blanket-converting extension arrays before handing data to dask?
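
A hedged sketch of what blanket conversion before dask could look like (the helper name is hypothetical, not an actual xarray code path):

```python
import dask.array as da
import numpy as np
import pandas as pd

def to_numpy_before_dask(values):
    # Hypothetical helper: materialize pandas extension arrays as
    # plain NumPy arrays before wrapping them in dask.
    if isinstance(values, pd.api.extensions.ExtensionArray):
        values = np.asarray(values)  # e.g. DatetimeArray -> datetime64[ns]
    return da.from_array(values, chunks="auto")

chunked = to_numpy_before_dask(pd.array(pd.to_datetime(["2000-01-01"])))
```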

@ilan-gold force-pushed the ig/fix_extension_indexer branch from a9c9386 to 7c32bd0 on October 24, 2024 07:25
@kmuehlbauer reopened this on Oct 24, 2024
@kmuehlbauer (Owner)

@ilan-gold I tried to activate CI, but it doesn't seem to work. I'm back at my desk next week and won't have time to check this before then.

@shoyer commented Oct 24, 2024

Can you open this up as a pull request against pydata/xarray?

ilan-gold and others added 16 commits October 25, 2024 08:40
Co-authored-by: Stephan Hoyer <shoyer@google.com>
Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
Co-authored-by: Spencer Clark <spencerkclark@gmail.com>
Co-authored-by: Spencer Clark <spencerkclark@gmail.com>
Co-authored-by: Stephan Hoyer <shoyer@google.com>
Co-authored-by: Spencer Clark <spencerkclark@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix scalar handling for timedelta based indexer

* remove stale error message and "ignore:Converting non-default" in testsuite

* add per review suggestions

* add/remove todo

* rename timeunit -> format

* return "ns" resolution per default for timedeltas, if not specified

* Be specific on types/dtypes

* add comment

* add suggestions from code review

* fix docs

* fix test which isn't run for numpy2 atm

* add notes on to_datetime section, update examples showing usage of 'as_unit'

* use np.timedelta64 for to_timedelta example, update as_unit example, update note

* remove note

* Apply suggestions from code review

Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>

* refactor timedelta decoding to _numbers_to_timedelta and re-use it within decode_cf_timedelta

* fix conventions test, add todo

* run times through pd.Timestamp to catch possible overflows (see the sketch after this commit list)

* fix tests for cftime_to_nptime

* fix cftime_to_nptime in cftimeindex

* introduce pd.Timestamp instance check

* warn if out-of-bounds datetimes are encoded with the standard calendar, fall back to cftime encoding, and fix a cftime issue where python datetimes are not encoded correctly with date2num.

* fix time-coding.rst, add reference to time-series.rst.

* try to fix typing, ignore one

* try to fix docs

* revert doc-changes

* Add a non-ns test for polyval, polyfit

* more doc cosmetics

* add whats-new.rst entry

* add/fix coder docstring

* add xr.date_range example as suggested per review

* Apply suggestions from code review

Co-authored-by: Spencer Clark <spencerkclark@gmail.com>

* Implement `time_unit` option for `decode_cf_timedelta` (pydata#3)

* Fix timedelta encoding overflow issue; always decode to ns resolution

* Implement time_unit for decode_cf_timedelta

* Reduce diff

* fix typing

* use nanmin/nanmax, catch numpy RuntimeWarnings

* Apply suggestions from code review

Co-authored-by: Kai Mühlbauer <kmuehlbauer@wradlib.org>

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Stephan Hoyer <shoyer@google.com>
Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
Co-authored-by: Spencer Clark <spencerkclark@gmail.com>
Co-authored-by: Deepak Cherian <deepak@cherian.net>
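
One commit above runs times through `pd.Timestamp` to catch possible overflows; a minimal sketch of that pattern (an illustration only, not xarray's exact code):

```python
import pandas as pd

def checked_timestamp(value) -> pd.Timestamp:
    # Constructing a Timestamp raises when the value cannot be
    # represented, which surfaces overflows early.
    try:
        return pd.Timestamp(value)
    except pd.errors.OutOfBoundsDatetime as err:
        raise ValueError(f"time {value!r} is out of bounds") from err

checked_timestamp("2262-04-11")  # still within the nanosecond range
```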
pre-commit-ci bot and others added 30 commits February 3, 2025 10:50
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
…10016)

* `map_over_datasets`: fix error message for wrong result type

* newline for result
…#10017)

When lazily encoding non-nanosecond times, the appropriate optimal integer encoding units are resolution-dependent. This PR updates our encoding pipeline accordingly.

Note that due to our internal reliance on pandas for date string parsing, we are still not able to round trip times outside the range -9999-01-01 to 9999-12-31 with pandas / NumPy, but this at least should pick more natural default units than nanoseconds for chunked arrays of non-nanosecond precision times. This gives users another way of addressing pydata#9154 (i.e. use non-nanosecond time arrays).
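
A sketch of picking a resolution-dependent default unit (the mapping below is an assumption for illustration, not the encoder's exact table):

```python
import numpy as np

_UNIT_NAMES = {"s": "seconds", "ms": "milliseconds",
               "us": "microseconds", "ns": "nanoseconds"}

def default_encoding_unit(values: np.ndarray) -> str:
    # np.datetime_data returns e.g. ("us", 1) for datetime64[us].
    unit, _ = np.datetime_data(values.dtype)
    return _UNIT_NAMES.get(unit, "nanoseconds")

times = np.array(["2000-01-01"], dtype="datetime64[us]")
print(default_encoding_unit(times))  # "microseconds"
```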
…pydata#10035)

* use mean of min/max years as offset in calculation of datetime64 mean (sketched below)

* reinstate _datetime_nanmin as it is used downstream in flox<0.10.0

* add whats-new.rst entry
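
A minimal sketch of the offset idea used above for an overflow-safe datetime64 mean (not xarray's exact implementation):

```python
import numpy as np

def datetime_mean(values: np.ndarray) -> np.datetime64:
    # Average the (small) deviations from a midpoint offset instead of
    # the raw int64 values, so the summation cannot overflow.
    offset = values.min() + (values.max() - values.min()) / 2
    return offset + (values - offset).mean()

times = np.array(["1700-01-01", "2200-01-01"], dtype="datetime64[ns]")
print(datetime_mean(times))  # the midpoint of the two times
```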
* Fix DataArray().drop_attrs(deep=False)

* Add DataArray().drop_attrs() tests

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Apply small cosmetics

* Add support for attrs to DataArray()._replace

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove testing relic

* Fix incompatible-types mypy error (first try)

* Fix incompatible-types mypy error (second try)

* Update whats-new

* Fix replacing simultaneously passed variable

* Add DataArray()._replace() tests

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Start splitting up `dataset.py`

Currently, `dataset.py` is 10956 lines long. This makes doing any work with current LLMs basically impossible — with Claude's tokenizer, the file is 104K tokens, or >2.5x the size of the _per-minute_ rate limit for basic accounts. Most of xarray touches it in some way, so you generally want to give it the file for context.

Even if you don't think "LLMs are the future, let's code with vibes!", the file is still really long and can be difficult to navigate (though, to be fair, it can be easy to just grep).

So I would propose:
- We start breaking it up, while also being cognizant that big changes can cause merge conflicts
- Start with the low-hanging fruit
  - For example, this PR moves code outside the class (but that's quite limited)
  - Then move some of the code from the big methods into functions in other files, like `curve_fit`
- Possibly (has tradeoffs; needs discussion) build some mixins so we can split up the class, if we want to have much smaller files
- We can also think about other files: `dataarray.py` is 7.5K lines. The tests are also huge (`test_dataset` is 7.5K lines), but unlike with the library code, we can copy out & in chunks of tests when developing.

(Note that I don't have strong views on exactly which code should go in which file; I made a quick guess and am very open to suggestions. It's also easy to change later, particularly since this code doesn't change much and so is less likely to cause conflicts.)

* .
Seems easy locally; maybe too easy and something will break in CI...
* add kwargs to map_over_datasets (similar to apply_ufunc), add test (usage sketched below).

* try to fix typing

* improve typing and simplify kwargs-handling per review suggestions

* apply changes to DataTree.map_over_datasets

* add whats-new.rst entry

* Update xarray/core/datatree_mapping.py

Co-authored-by: Mathias Hauser <mathause@users.noreply.github.com>

* add suggestions from review.

---------

Co-authored-by: Mathias Hauser <mathause@users.noreply.github.com>
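
A hedged usage sketch of the new `kwargs` support (assuming it mirrors `apply_ufunc`'s `kwargs` dict, as the commit message suggests):

```python
import xarray as xr

def scale(ds, *, factor=1.0):
    # Applied to every node's Dataset in the tree.
    return ds * factor

tree = xr.DataTree.from_dict({"/": xr.Dataset({"a": ("x", [1, 2, 3])})})
scaled = xr.map_over_datasets(scale, tree, kwargs={"factor": 2.0})
```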
* don't install `dask`

reason: `dask-expr` depends on `pyarrow`, which doesn't support python
3.13 yet

* don't install `pydap`

reason: depends on `webob`, which makes use of `cgi`, a stdlib module that
was removed in python 3.13

* run CI on python 3.13

* same for windows

* classifier

* whats-new

* fix bad merge

* try installing `dask` + `distributed`

* move the whats-new entry

* Update .github/workflows/ci.yaml

* explicitly install `pyarrow`

* install `numba` and packages depending on it

* Move to 3.13, prep for 3.14, bump all-but-dask to 3.12

* comment out sparse

* fix whats-new

---------

Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
Co-authored-by: Deepak Cherian <deepak@cherian.net>
* add Coordinates.from_xindex method (usage sketched below)

* doc: refactor Coordinates API reference

Make it more consistent with ``Dataset``, ``DataArray`` and
``DataTree``. The ``Coordinates`` class is second-order compared to the
former ones, but it is public API and useful (for creating coordinates
from indexes and merging coordinates together), so it deserves its own
(expanded) section plus summary tables in the API reference doc.

* add tests

* update what's new

* fix doc build?

* docstring tweaks

* doc (api): add missing Coordinates.sizes property

* update what's new (documentation)

* improve docstrings on Coordinates creation

* doc: move what's new entries after last release

---------

Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
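
A short usage sketch of `Coordinates.from_xindex` (assuming the public `PandasIndex` from `xarray.indexes` as the wrapped index):

```python
import pandas as pd
import xarray as xr
from xarray.indexes import PandasIndex

# Build an index first, then derive coordinates (and the coordinate
# variables they contain) from it.
index = PandasIndex(pd.Index([10, 20, 30]), dim="x")
coords = xr.Coordinates.from_xindex(index)
ds = xr.Dataset(coords=coords)
```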
* Add types stubs to optional dependencies

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Add coordinate transform classes from prototype

* lint, public API and docstrings

* missing import

* sel: convert inverse transform results to ints

* sel: add todo note about rounding decimal pos

* rename create_coordinates -> create_coords

This is more consistent with the rest of the Xarray API, where `coords`
is used everywhere.

* add a Coordinates.from_transform convenient method

* fix repr (extract subset values of any n-d array)

* Apply suggestions from code review

Co-authored-by: Max Jones <14077947+maxrjones@users.noreply.github.com>

* remove specific create coordinates methods

In favor of the more generic `Coordinates.from_xindex()`.

* fix more typing issues

* remove public imports: not ready yet for public use

* add experimental notice in docstrings

* add coordinate transform tests

* typing fixes

* update what's new

---------

Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
Co-authored-by: Max Jones <14077947+maxrjones@users.noreply.github.com>
* Upgrade mypy to 1.15

Mypy 1.15 includes a fix for <python/mypy#9031>,
allowing several "type: ignore" comments to be removed.

* Add type annotations to DataTree.pipe tests

* More precisely type `pipe` methods.

In addition, enhance mypy job configuration to support running it
locally via `act`.

Fixes pydata#9997

* Pin mypy to 1.15 in CI

* Revert mypy CI job changes

* Add pytest-mypy-plugin and typestub packages

* Add pytest-mypy-plugins to all conda env files

* Remove dup pandas-stubs dep

* Revert pre-commit config changes

* Place mypy tests behind pytest mypy marker

* Set default pytest numprocesses to 4

* Ignore pytest-mypy-plugins for min version check
* Default to phony_dims="access" in open_datatree for the h5netcdf backend; warn the user about the behaviour change (usage sketched below).

* relocate

* duplicate as needed in both places

* ignore warning

* conditionally warn users if phony_dims are found

* add test for warning

* add whats-new.rst entry

* remove unneeded assignment to fix typing

* Update doc/whats-new.rst

* use phony_dims="access" per default also in open_dataset for h5netcdf backend

* fix test

* fix whats-new.rst
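
A hedged example of the new default (the file path is illustrative); passing `phony_dims` explicitly should also silence the behaviour-change warning:

```python
import xarray as xr

# Both entry points now treat phony_dims="access" as the default for
# engine="h5netcdf"; passing it explicitly opts in without a warning.
ds = xr.open_dataset("file.h5", engine="h5netcdf", phony_dims="access")
tree = xr.open_datatree("file.h5", engine="h5netcdf", phony_dims="access")
```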
* Index.isel: more permissive return type

This allows an index to return a new index of another type, e.g., a
1-dimensional CoordinateTransformIndex returning a PandasIndex when a
new transform cannot be computed (selection at arbitrary locations);
see the sketch below.

* Index.equals: more permissive `other` type.

Xarray alignment logic is such that Xarray indexes are always compared
with other indexes of the same type. However, this is not necessarily
the case for "meta" indexes (i.e., indexes encapsulating one or more
index objects that may have another type) that are dispatching `equals`
to their wrapped indexes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.
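
An illustrative sketch of what the relaxed `Index.isel` return type permits (the `TransformIndex` class and its fallback logic are hypothetical):

```python
import pandas as pd
from xarray.indexes import Index, PandasIndex

class TransformIndex(Index):
    """Hypothetical 1-d index wrapping a coordinate transform."""

    def __init__(self, labels: pd.Index, dim: str):
        self.labels = labels
        self.dim = dim

    def isel(self, indexers):
        # Selection at arbitrary locations cannot preserve the
        # transform, so return an index of a *different* type,
        # which the more permissive signature now allows.
        return PandasIndex(self.labels[indexers[self.dim]], dim=self.dim)
```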