-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Speed up isel and __getitem__ #3375
Conversation
Thanks @crusaderky. Do you think the indexing benchmarks we have in https://github.com/pydata/xarray/blob/master/asv_bench/benchmarks/indexing.py are sufficient? Anything you think would be worth adding to cover performance regressions here? |
@jhamman hm. I'm looking at it now for the first time.
|
I see that all tests in benchmarks/indexing.py use arrays with 2~6 million points. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! Thanks a lot @crusaderky .
I don't know this code 100% so I may have missed bits, but reading through the code I think I can follow it all, and looks more efficient + better code now.
And I'm learning more mypy
!
I added one comment re a missing keyword. Seems odd that tests pass.
Would you be willing to add a few of these cases to the benchmarks? |
Co-Authored-By: Maximilian Roos <5635139+max-sixty@users.noreply.github.com>
Yes, in due time. It's out of scope for this PR though. |
commit b0c336f Author: Maximilian Roos <5635139+max-sixty@users.noreply.github.com> Date: Mon Oct 21 04:52:36 2019 -0400 Whatsnew for #3419 (#3422) * pyupgrade --py36-plus * Update xarray/core/nputils.py Co-Authored-By: crusaderky <crusaderky@gmail.com> * Update xarray/core/parallel.py Co-Authored-By: crusaderky <crusaderky@gmail.com> * Update xarray/tests/test_cftime_offsets.py Co-Authored-By: crusaderky <crusaderky@gmail.com> * Update xarray/tests/test_cftime_offsets.py Co-Authored-By: crusaderky <crusaderky@gmail.com> * whatsnew commit 2984415 Author: Rhys Doyle <48983334+rdoyle45@users.noreply.github.com> Date: Mon Oct 21 01:17:47 2019 +0100 Revert changes made in #3358 (#3411) * Revert #3358 * Revision * Code review * Merge from master * Obsolescence note commit 3c462b9 Author: Maximilian Roos <5635139+max-sixty@users.noreply.github.com> Date: Sun Oct 20 20:16:58 2019 -0400 Python3.6 idioms (#3419) * pyupgrade --py36-plus * Update xarray/core/nputils.py Co-Authored-By: crusaderky <crusaderky@gmail.com> * Update xarray/core/parallel.py Co-Authored-By: crusaderky <crusaderky@gmail.com> * Update xarray/tests/test_cftime_offsets.py Co-Authored-By: crusaderky <crusaderky@gmail.com> * Update xarray/tests/test_cftime_offsets.py Co-Authored-By: crusaderky <crusaderky@gmail.com> commit 9886e3c Author: crusaderky <crusaderky@gmail.com> Date: Sun Oct 20 23:42:36 2019 +0100 Temporarily mark pseudonetcdf-3.1 as incompatible (#3420) commit 0f7ab0e Author: Dan Nowacki <dnowacki@usgs.gov> Date: Thu Oct 17 14:13:44 2019 -0700 Fix and add test for groupby_bins() isnan TypeError. (#3405) * Fix and add test for groupby_bins() isnan TypeError. * Better testing * black commit 6cd50cc Author: pmallas <6051395+pmallas@users.noreply.github.com> Date: Thu Oct 17 10:15:47 2019 -0400 Update where docstring to make return value type more clear (#3408) * Update where docstring to make return value type more clear The where docstring was a little confusing to me. I misunderstood "Same type as caller' to mean the values in the xarray not the xarray itself. I think this small change will clean it up for most users. Thanks. * Update xarray/core/common.py Co-Authored-By: Maximilian Roos <5635139+max-sixty@users.noreply.github.com> commit 55b1ac0 Author: keewis <keewis@users.noreply.github.com> Date: Thu Oct 17 05:13:39 2019 +0200 tests for arrays with units (#3238) * create the empty test file * add tests for data array aggregation functions * include pint in the ci * ignore missing type annotations for pint * really skip the tests if pint is not available * remove the reason from the importorskip call * test that the dataarray constructor does not strip the unit * convert every unit stripped warning to an error * work around pint not implementing np.allclose yet * remove the now unnecessary filterwarnings decorator * xfail all tests that depend on pint having a __array_function__ * treat nans as equal * implement tests for simple arithmetic operations * use param's id argument to assign readable names * add tests for sel() and isel() * add more readable names for the unary arithmetics * xfail every test that is not yet xfailing These don't pass because the constructor raises a unit stripped warning - fixed in pint#764. * only xfail if pint is not the current development version This is test is not really reliable, but sufficient for now. * always use lists instead of tuples for indexing * add tests for loc and squeeze * black * add names and xfail marks to the parameters * add tests for interp and interp_like * implement tests for reindex * remove the xfail marks where it is not clear yet that pint is to blame * add tests for reindex_like * don't pass the new DataArray to a kwarg * xfail if not on pint dev * refactor the tests * add tests for univariate and bivariate ufuncs * black * xfail aggregation only if pint does not implement __array_function__ yet * remove the global filterwarnings mark apparently, this caused the tests to change behavior, resulting in different errors, or causing tests to pass that should actually fail. * add a test case for the repr * create a pytest mark that explicitly requires pint's __array_function__ * also check the string representation in addition to the repr * add helpers for creating method tests * check that simple aggregation methods work * use format() instead of format strings * make sure the repr of method calls is different from functions * merge the two aggregation tests * explicitly check whether pint supports __array_function__ relying on versions is somewhat fragile. * provide a fallback for the new base quantity * check that no warning is raised for both with and without coords * also check that the repr works both with and without coords * wrap all aggregation function calls * xfail every call that fails because of something outside xarray * xfail tests related to dimension coordinates and indexes * use the dimensions from the original array * allow passing arguments to the method on call * add tests for comparisons * add tests for detecting, filling and dropping missing values * mark the missing value tests as requiring pint to support duck arrays * add tests for isin, where and interpolate_na * reformat unit ids and add a test parameter for compatible units * remove an unnecessary xfail * add tests for the top-level replication functions (*_like) * check for whatever pint does with *_like functions * add tests for combine_first * xfail the bivariate ufunc tests * xfail the call to np.median * move the top-level function tests out of the DataArray namespace class * add cumsum and cumprod to the list of aggregation functions * add tests for the numpy methods * check for equal units directly after checking the magnitude * add tests for content manipulation methods * add tests for comparing DataArrays (equals, indentical) * add a test for broadcast_equals * refactor the comparison operation tests * rewrite the strip, attach and assert_equal functions and add extract * preserve multiindex in strip and attach * attach the unit from element "data" as fallback * fix some small typos * compare QuantityScalar and QuantitySequence based on their values * make the isel test more robust * add tests for reshaping and reordering * unify the structure of the tests * mark the remaining tests as requiring a recent pint version, too * explicitly handle quantities as parameters * change the repr of the function / method wrappers * check whether __init__ and repr / str handle units in data and coords * generalize array_attach_units * move the redefinition of DimensionalityError * identify quantities using isinstance * typo * skip tests with a pint version without __array_function__ * compare DataArrays where possible * mark only the compatible unit as xfailing * preserve the name of data arrays * also attach units to x_mm * Test in more CI environments; documentation * What's New * remove a stale function * use Quantity directly for instance tests * explicitly set roll_coords to silence a deprecation warning * skip the whole module if pint does not implement __array_function__ the advantage is that now forgetting to decorate a test case is not possible. * allow to attach units using the mapping from extract_units * add tests for computation methods resampling fails until I figure out how to use it with non-datetime coords. * add tests for grouped operations * add a test for rolling_exp * add a todo note for the module level skip on __array_function__ * add a test for dot * use attach_units instead of manually attaching * modify the resample test to actually work * add a test for to_unstacked_dataset * update whats-new.rst and installing.rst * reformat the whats-new.rst entry * What's New commit 1f81338 Author: keewis <keewis@users.noreply.github.com> Date: Wed Oct 16 20:54:27 2019 +0200 Fixes to the resample docs (#3400) * add a missing newline to make sphinx detect the code block * update the link to the pandas documentation * explicitly state that this only works with datetime dimensions * also put the datetime dim requirement into the function description * add Series.resample and DataFrame.resample as reference * add the changes to whats-new.rst * move references to the bottom of the docstring commit 3f9069b Author: Joseph Hamman <jhamman@ucar.edu> Date: Mon Oct 14 14:22:08 2019 -0700 Revert to dev version commit 62943e2 Author: Joseph Hamman <jhamman@ucar.edu> Date: Mon Oct 14 12:51:46 2019 -0700 Release v0.14.0 commit 30472ec Author: Joseph Hamman <jhamman@ucar.edu> Date: Mon Oct 14 12:24:05 2019 -0700 updates for 0.14 release [black only] commit 4519843 Author: Joseph Hamman <jhamman@ucar.edu> Date: Mon Oct 14 12:18:54 2019 -0700 updates for 0.14 release commit 4f5ca73 Author: Deepak Cherian <dcherian@users.noreply.github.com> Date: Mon Oct 14 18:06:53 2019 +0000 Make concat more forgiving with variables that are being merged. (#3364) * Make concat more forgiving with variables that are being merged. * rename test. * simplify test. * make diff smaller. commit ae1d8c7 Author: Crypto Jerônimo <36364045+crypto-jeronimo@users.noreply.github.com> Date: Sun Oct 13 15:38:36 2019 +0100 Fix documentation typos (#3396) commit 863e490 Author: Joe Hamman <jhamman@ucar.edu> Date: Sat Oct 12 17:33:33 2019 -0400 OrderedDict --> dict, some python3.5 cleanup too (#3389) * OrderedDict --> dict, some python3.5 cleanup too * respond to part of @shoyer's review * fix set attr syntax on netcdf4 vars * fix typing errors * update whats new and todo comments * Typing annotations * Typing annotations * Fix regression * More substantial changes * More polish * Typing annotations * Rerun notebooks commit 6851e3e Author: crusaderky <crusaderky@gmail.com> Date: Sat Oct 12 21:05:32 2019 +0100 Annotate LRUCache (#3395) commit 4c05d38 Author: Stephan Hoyer <shoyer@gmail.com> Date: Fri Oct 11 08:47:57 2019 -0700 BUG: overrides to a dimension coordinate do not get aligned (#3393) Fixes GH3377 commit 3f29551 Author: Deepak Cherian <dcherian@users.noreply.github.com> Date: Thu Oct 10 23:44:19 2019 +0000 map_blocks (#3276) * map_block attempt 2 * Address reviews: errors, args + kwargs support. * Works with datasets! * remove wrong comment. * Support chunks. * infer template. * cleanup * cleanup2 * api.rst * simple shape change error check. * Make test more complicated. * Fix for when user function doesn't set DataArray.name * Now _to_temp_dataset works. * Add whats-new * chunks kwarg makes no sense right now. * review feedback: 1. skip index graph nodes. 2. var → name 3. quicker dataarray creation. 4. Add restrictions to docstring. 5. rename chunk construction task. 6. error when non-xarray object is returned. 7. restore non-coord dims. review * Support nondim coords in make_meta. * Add Dataset.unify_chunks * doc updates. * minor. * update comment. * More complicated test dataset. Tests fail :X * Don't know why compute is needed. * work with DataArray nondim coords. * fastpath unify_chunks * comment. * much improved tests. * Change args, kwargs syntax. * Add dataset, dataarray methods. * api.rst * docstrings. * Fix unify_chunks. * Move assert_chunks_equal to xarray.testing. * minor changes. * Better error handling when inferring returned object * wip * wip * better to_array * Docstrings + nicer error message. * remove unify_chunks in map_blocks + better tests. * typing for unify_chunks * address more review comments. * more unify_chunks tests. * Just use dask.core.utils.meta_from_array * get tests working. assert_equal needs a lot of fixing. * more unify_chunks test. * assert_chunks_equal fixes. * copy over meta_from_array. * minor fixes. * raise chunks error earlier and test for map_blocks raising chunk error * fix. * Type annotations * py35 compat * make sure unify_chunks does not compute. * Make tests functional by call compute before assert_equal * Update whats-new * Work with attributes. * Support attrs and name changes. * more assert_equal * test changing coord attribute * fix whats new * rework tests to use fixtures (kind of) * more review changes. * cleanup * more review feedback. * fix unify_chunks. * read dask_array_compat :) * Dask 1.2.0 compat. * documentation polish * make_meta reflow * cosmetic * polish * Fix tests * isort * isort * Add func call to docstrings. commit 291cb80 Author: Deepak Cherian <dcherian@users.noreply.github.com> Date: Thu Oct 10 18:23:20 2019 +0000 Add groupby.dims & Fix groupby reduce for DataArray (#3338) * Fix groupby reduce for DataArray * bugfix. * another bugfix. * bugfix unique_and_monotonic for object indexes (uniqueness is enough) * Add groupby.dims property. * update reduce docstring to point to xarray.ALL_DIMS * test for object index dims. * test reduce dimensions error. * Add whats-new * fix docs build * sq whats-new * one more test. * fix test. * undo monotonic change. * Add dimensions to repr. * Raise error if no bins. * Raise nice error if no groups were formed. * Some more error raising and testing. * Add dataset tests. * update whats-new. * fix tests. * make dims a cached lazy property. * fix whats-new. * whitespace * fix whats-new commit 3f0049f Author: crusaderky <crusaderky@gmail.com> Date: Wed Oct 9 19:01:29 2019 +0100 Speed up isel and __getitem__ (#3375) * Variable.isel cleanup/speedup * Dataset.isel code cleanup * Speed up isel * What's New * Better error checks * Speedup * type annotations * Update doc/whats-new.rst Co-Authored-By: Maximilian Roos <5635139+max-sixty@users.noreply.github.com> * What's New * What's New * Always shallow-copy variables commit 132733a Author: Deepak Cherian <dcherian@users.noreply.github.com> Date: Tue Oct 8 22:13:47 2019 +0000 Fix concat bug when concatenating unlabeled dimensions. (#3362) * Fix concat bug when concatenating unlabeled dimensions. * Add whats-new * Add back older test. * fix test * Revert "fix test" This reverts commit c33ca34. * better fix commit 6fb272c Author: crusaderky <crusaderky@gmail.com> Date: Tue Oct 8 22:23:46 2019 +0100 Rolling minimum dependency versions policy (#3358) * - Downgrade numpy to 1.14, pandas to 0.20, scipy to 0.19 (24 months old) - Downgrade dask to 1.1 (6 months old) - Don't pin patch versions * Apply rolling policy (see #3222) * Automated tool to verify the minimum versions * Drop Python 3.5 * lint * Trivial cosmetic * Cosmetic * (temp) debug CI failure * Parallelize versions check script * Remove hacks for legacy dask * Documentation * Assorted cleanup * Assorted cleanup * Fix regression * Cleanup * type annotations upgraded to Python 3.6 * count_not_none backport * pd.Index.equals on legacy pandas returned False when comparing vs. a ndarray * Documentation * pathlib cleanup * Slide deprecations from 0.14 to 0.15 * More cleanups * More cleanups * Fix min_deps_check * Fix min_deps_check * Set policy of 12 months for pandas and scipy * Cleanup * Cleanup * Sphinx fix * Overhaul readthedocs environment * Fix test crash * Fix test crash * Prune readthedocs environment * Cleanup * Hack around versioneer bug on readthedocs CI * Code review * Prevent random timeouts in the readthedocs CI * What's New polish * Merge from Master * Trivial cosmetic * Reimplement pandas.core.common.count_not_none commit 3e2a754 Author: Alan D. Snow <alansnow21@gmail.com> Date: Tue Oct 8 09:36:52 2019 -0500 added geocube and rioxarray to related projects (#3383) commit 4254b4a Author: crusaderky <crusaderky@gmail.com> Date: Fri Oct 4 23:17:56 2019 +0100 Lint (#3373) * raise exception instance, not class * isort * isort * Bump mypy version commit 283b4fe Author: Deepak Cherian <dcherian@users.noreply.github.com> Date: Fri Oct 4 17:04:36 2019 +0000 Docs/more fixes (#2934) * Move netcdf to beginning of io.rst * Better indexing example. * Start de-emphasizing pandas * misc. * compute, load, persist docstrings + text. * split-apply-combine. * np.newaxis. * misc. * some dask stuff. * Little more dask. * undo index.rst changes. * link to dask docs on chunks * Fix io.rst. * small changes. * rollingupdate. * joe's review commit dfdeef7 Author: Stephan Hoyer <shoyer@gmail.com> Date: Thu Oct 3 21:42:50 2019 -0700 Explicitly keep track of indexes with merging (#3234) * Explicitly keep track of indexes in merge.py * Typing fixes * More tying fixes * more typing fixes * fixup commit 86fb71d Author: Deepak Cherian <dcherian@users.noreply.github.com> Date: Thu Oct 3 15:41:50 2019 +0000 groupby repr (#3344) * groupby repr * add test. * test datetime and nondim coord * rename test_da → repr_da * Add whats-new * Update doc/whats-new.rst commit dd2b803 Author: Ryan May <rmay@ucar.edu> Date: Wed Oct 2 15:43:44 2019 -0600 Remove setting of universal wheels (#3367) Universal wheels indicate that one wheel supports Python 2 and 3. This is no longer the case for xarray. This causes builds to generate files with names like xarray-0.13.0-py2.py3-none-any.whl, which can cause pip to incorrectly install the wheel when installing from a list of wheel files. commit 21705e6 Author: crusaderky <crusaderky@gmail.com> Date: Tue Oct 1 19:13:55 2019 +0100 Revisit # noqa annotations (#3359) commit fb575eb Author: crusaderky <crusaderky@gmail.com> Date: Tue Oct 1 15:11:21 2019 +0100 Fix codecov.io upload on Windows (#3360) commit 1ab2279 Author: Deepak Cherian <dcherian@users.noreply.github.com> Date: Mon Sep 30 21:12:22 2019 +0000 Add how do I ... section (#3357) * Add how do I ... section * Bbugfix. * Update doc/howdoi.rst Co-Authored-By: Maximilian Roos <5635139+max-sixty@users.noreply.github.com> * Update doc/howdoi.rst Co-Authored-By: Maximilian Roos <5635139+max-sixty@users.noreply.github.com> * small updates. * Add more. commit bd1069b Author: Gregory Gundersen <ggundersen@gmail.com> Date: Sun Sep 29 19:39:53 2019 -0400 Add glossary to documentation (#3352) * First draft at terminology glossary. * Made name matching rules more explicit and hopefully clearer. * Amended what's new. * Changes based on feedback. * More changed based on feedback. commit b51683f Author: Anderson Banihirwe <axbanihirwe@ualr.edu> Date: Sun Sep 29 07:50:21 2019 -0600 Documentation improvements (#3328) * Add examples for full_like, zeros_like, ones_like * Add examples for xr.align * Add examples for xr.merge * Update xr.where docstring * Update xr.dot docstring * Update xarray/core/common.py Co-Authored-By: Deepak Cherian <dcherian@users.noreply.github.com> * Update xarray/core/common.py Co-Authored-By: Deepak Cherian <dcherian@users.noreply.github.com> * Update xr.combine_by_coords docstring * Apply black formatting only * More black formatting * Remove unnecessary pandas bits * Fix indentation issues * Update assign and pipe * Update `Dataset.reindex` with examples * Update `Dataset.fillna` with examples * Address styling issues * Update docstring Co-Authored-By: Deepak Cherian <dcherian@users.noreply.github.com> commit f3c7da6 Author: Gregory Gundersen <ggundersen@gmail.com> Date: Sat Sep 28 15:57:36 2019 -0400 Remove `complex.nc` from built docs (#3353) * Rolling back to prevent a different issue from leaking into this one. * Amended what's new. commit 6ece6a1 Author: Tony Tung <ttung@chanzuckerberg.com> Date: Thu Sep 26 22:45:26 2019 -0700 Fix DataArray.to_netcdf type annotation (#3325) It calls DataSet.to_netcdf, which returns Union[bytes, "Delayed", None]. So this should as well. commit 16fdac9 Author: crusaderky <crusaderky@gmail.com> Date: Thu Sep 26 10:38:46 2019 +0100 CI test suites with pinned minimum dependencies (#3346) * CI test suites with pinned minimum dependencies * code review * Clarity re lxml commit ea101f5 Author: Tom Nicholas <35968931+TomNicholas@users.noreply.github.com> Date: Thu Sep 26 10:51:58 2019 +0200 Bugfix/plot accept coord dim (#3345) Bug in plot.line fixed by ensuring 1D coords are cast down to their associated dims. There was previously two particular cases where this would not happen. commit 85c9a40 Author: crusaderky <crusaderky@gmail.com> Date: Wed Sep 25 02:40:54 2019 +0100 CI environments overhaul (#3340) * Rationalize and align CI environments. Add many optional dependencies to individual CI suites. * pynio and cdms2 are not available on Windows * cfgrib causes Python interpreter crash on Windows * dtype of np.arange defaults to int64 on Linux and int32 on Windows * Suppress failure to delete file on Windows * Mark hypotesis tests as @slow
First iterative improvement for #2799.
Speed up Dataset.isel up to 33% and DataArray.isel up to 25% (when there are no indices and the numpy array is small). 15% speedup when there are indices.
Benchmarks can be found in #2799.