
Vectorized lazy indexing #1899

Merged: 37 commits into pydata:master on Mar 6, 2018

Conversation

@fujiisoup (Member) commented Feb 9, 2018

  • Closes Vectorized indexing with cache=False #1897
  • Tests added (for all bug fixes or enhancements)
  • Tests passed (for all non-documentation changes)
  • Fully documented, including whats-new.rst for all changes and api.rst for new API (remove if this change should not be visible to users, e.g., if it is an internal clean-up, or if this is part of a larger project that will be documented later)

I tried to support lazy vectorised indexing, inspired by #1897.
More tests would be necessary, but I want to decide first whether it is worth continuing.

My current implementation is:

  • For outer/basic indexers, we combine successive indexers (as we are doing now).
  • For vectorised indexers, we just store them as-is and index sequentially at evaluation time.

The implementation was simpler than I thought, but it has a clear limitation:
it requires loading the array before the vectorised indexing is applied (i.e., at evaluation time).
If we perform vectorised indexing on a large array, performance drops significantly, and this is not noticeable until evaluation time.

I appreciate any suggestions.
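
For illustration, here is a minimal sketch of the record-then-apply idea described above (not this PR's actual classes; xarray's real implementation also merges successive outer/basic indexers into a single key):

import numpy as np

class LazilyIndexedArraySketch:
    """Record indexers on __getitem__; apply them only at evaluation time."""

    def __init__(self, array, keys=()):
        self.array = array
        self.keys = keys  # the indexers recorded so far, in application order

    def __getitem__(self, key):
        # store the indexer instead of applying it immediately
        return type(self)(self.array, self.keys + (key,))

    def __array__(self, dtype=None):
        # evaluation time: only now is the wrapped array actually indexed
        out = self.array
        for key in self.keys:
            out = out[key]
        return np.asarray(out, dtype=dtype)

lazy = LazilyIndexedArraySketch(np.arange(100).reshape(10, 10))
view = lazy[2:6][np.array([0, 1]), np.array([2, 3])]  # nothing loaded yet
print(np.asarray(view))  # indexing happens only here: [22 33]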

@fujiisoup (Member Author):

I noticed that the lazy vectorized indexing can sometimes be optimized by decomposing a vectorized indexer into successive outer and vectorized indexers, so that the size of the array to be loaded into memory is minimized.

@shoyer (Member) commented Feb 9, 2018

I figured out how to consolidate two vectorized indexers, as long as they don't include any slice objects:

import numpy as np

def index_vectorized_indexer(old_indexer, applied_indexer):
    # broadcast the old key's arrays, then index each one with the new key
    return tuple(o[applied_indexer] for o in np.broadcast_arrays(*old_indexer))

for x, old, applied in [
    (np.arange(10), (np.arange(2, 7),), (np.array([3, 2, 1]),)),
    (np.arange(10), (np.arange(6).reshape(2, 3),), (np.arange(2), np.arange(1, 3))),
    (-np.arange(1, 21).reshape(4, 5), (np.arange(3)[:, None], np.arange(4)[None, :]), (np.arange(3), np.arange(3))),
]:
    new_key = index_vectorized_indexer(old, applied)
    np.testing.assert_array_equal(x[old][applied], x[new_key])

We could probably make this work with VectorizedIndexer if we converted the slice objects to arrays. I think we might even already have some code to do that conversion somewhere. So another option would be to convert BasicIndexer and OuterIndexer -> VectorizedIndexer if necessary and then use this path.
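
A slice-to-array conversion along those lines might look like the following (a sketch rather than xarray's actual helper; it assumes the array elements of the key already broadcast against each other, and it gives each slice its own new trailing dimension):

import numpy as np

def arrayize_vectorized_key(key, shape):
    # replace each slice in a vectorized key with the equivalent integer
    # array, reshaped so that everything still broadcasts correctly
    slices = [k for k in key if isinstance(k, slice)]
    if not slices:
        return key
    arrays = [k for k in key if isinstance(k, np.ndarray)]
    n_dim = arrays[0].ndim if arrays else 0
    new_key, i_dim = [], 0
    for k, size in zip(key, shape):
        if isinstance(k, np.ndarray):
            # one new trailing axis per slice, so the arrays broadcast
            # against the arange arrays created below
            new_key.append(k.reshape(k.shape + (1,) * len(slices)))
        else:
            target = ((1,) * (n_dim + i_dim) + (-1,) +
                      (1,) * (len(slices) - i_dim - 1))
            new_key.append(np.arange(*k.indices(size)).reshape(target))
            i_dim += 1
    return tuple(new_key)

x = np.arange(12).reshape(3, 4)
key = (np.array([0, 2]), slice(1, 3))
np.testing.assert_array_equal(x[[0, 2], 1:3],
                              x[arrayize_vectorized_key(key, x.shape)])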

@fujiisoup (Member Author):

Thanks, @shoyer.
Do you think it is possible to consolidate transpose as well?
We need it to keep our logic in Variable._broadcast_indexing.

I am wondering which computation cost we want to avoid with lazy indexing:

  1. The indexing itself is expensive, so we want to minimize the number of indexing operations?
  2. The original data is too large to fit into memory, and we want lazy indexing to load only the smallest subset of the original array?

If reason 2 is the common case, I think it is not a good idea to consolidate all the lazy indexing into a VectorizedIndexer, since most backends do not support vectorized indexing, which would mean we need to load the whole array into memory before any indexing operation.
(But it would still be valuable to consolidate all the indexers after the first vectorized indexer, since we can decompose any VectorizedIndexer into a successive pair of an outer indexer and a smaller vectorized indexer.)

I am also wondering whether, as pointed out in #1725, what I am doing now is already implemented in dask.

@shoyer (Member) commented Feb 9, 2018

Reason 2 is the primary one. We want to load the minimum amount of data possible into memory, mostly because pulling data from disk is slow.

@shoyer (Member) commented Feb 9, 2018

I think the design choice here really comes down to whether we want to enable VectorizedIndexing on arbitrary data on disk or not:

Is it better to:

  1. Always allow vectorized indexing by means of (lazily) loading all indexed data into memory as a single chunk. This could potentially be very expensive for IO or memory in hard-to-predict ways.
  2. Or only allow vectorized indexing if a backend supports it directly. This ensures that when vectorized indexing works, it works efficiently. Vectorized indexing is still possible, but you have to explicitly write .compute()/.load() first.

I think I slightly prefer option (2) but I can see the merits in either decision.

@fujiisoup (Member Author):

I am inclined toward option 1, as there is some benefit even for backends without vectorized-indexing support.
For example, suppose we want to get the three diagonal elements (1, 1), (2, 2), (3, 3) from a 1000x1000 array. What we want is
array[[1, 2, 3], [1, 2, 3]].
It can be decomposed into
array[1:4, 1:4][[0, 1, 2], [0, 1, 2]].
We only need to load the 3x3 part of the 1000x1000 array, then take its diagonal elements.

A drawback is that it is difficult for users to predict how much memory will be necessary.
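
Sketched with plain NumPy, the decomposition described above is:

import numpy as np

array = np.arange(1000 * 1000).reshape(1000, 1000)
vindex = (np.array([1, 2, 3]), np.array([1, 2, 3]))

# an outer indexer covering the bounding box of the requested indices,
# followed by a small vectorized indexer into the loaded block
oindex = tuple(slice(k.min(), k.max() + 1) for k in vindex)
vindex_small = tuple(k - k.min() for k in vindex)

loaded = array[oindex]  # only a 3x3 block needs to be loaded
np.testing.assert_array_equal(array[vindex], loaded[vindex_small])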

@shoyer (Member) commented Feb 10, 2018

Suppose we want to get the three diagonal elements (1, 1), (2, 2), (3, 3) from a 1000x1000 array. What we want is
array[[1, 2, 3], [1, 2, 3]].
It can be decomposed into
array[1:4, 1:4][[0, 1, 2], [0, 1, 2]].
We only need to load the 3x3 part of the 1000x1000 array, then take its diagonal elements.

OK, this is pretty clever.

There are some obvious failure cases, e.g., if they want to pull out indices array[[1, -1], [1, -1]], in which case the entire array needs to be sliced. I wonder if we should try to detect these with some heuristics, e.g., only decomposing if the size of the result is much (maybe 10x or 100x) smaller than the size of the sliced arrays.

Also, we would want to avoid separating basic/vectorized for backends that support efficient vectorized indexing (scipy and zarr).

@fujiisoup (Member Author):

There are some obvious fail cases, e.g., if they want to pull out indices array[[1, -1], [1, -1]], in which case the entire array needs to be sliced.

If the backend supports orthogonal indexing (not only basic indexing),
we can do array[[1, -1]][:, [1, -1]], load the 2x2 array, then apply the vectorized indexing [[0, 1], [0, 1]].

But if we want a full diagonal, we need a full slice anyway...

Also, we would want to avoid separating basic/vectorized for backends that support efficient vectorized indexing (scipy and zarr).

OK. Agreed. We may need a flag that can be accessed from the array wrapper.

@fujiisoup (Member Author) commented Feb 11, 2018

Based on the suggestion, I implemented the lazy vectorized indexing with index consolidation.

Now, every backend is virtually compatible with all the indexer types, i.e. basic, outer, and vectorized indexers.

It sometimes consumes a large amount of memory if the indexer cannot be decomposed efficiently, but it is always better than loading the full slice. The drawback is that it is hard to predict how much data will be loaded.

@jhamman (Member) left a comment:

@fujiisoup - this is impressive. Thanks for putting this together.

The failing test here seems to be dask related and I don't think you need to worry about it right now.

My comments are mostly cosmetic. I wish I had reviewed your prior indexing PRs so I had better context for what you are doing, but I generally like the approach you've taken. The memory issue is potentially a problem, but my thought is that consistency between backends and in-memory datasets is more important right now. Am I correct in my understanding that certain backends could/do support vectorized indexing directly, and that in those cases the memory issue wouldn't be as prevalent (depending on how the specific backend did its implementation)?

        return type(self)(self.array, key)

    def __setitem__(self, key, value):
        raise NotImplementedError
Member:

let's add a more informative error message here. Even NotImplementedError('__setitem__ has not yet been implemented for LazilyVectorizedIndexedArray') would be okay.

shape : tuple
    Shape of the array subject to the indexing.

Returns
-------
tuple
    Tuple suitable for use to index a NumPy array with vectorized indexing.
    Each element is an integer or array: broadcasting them together gives
    the shape of the result.
    Each element is array: broadcasting them together gives the shape of
Member:
Each element is an array

def _decompose_vectorized_indexer(indexer):
    """ Decompose vectorized indexer to outer and vectorized indexers,
    array[indexer] == array[oindex][vindex]
    such that array[oindex].shape becomes smallest.
Member:
becomes smallest what?

Just trying to make sure I understand the intent here. Is this algorithm trying to optimize (minimize) the size of array[oindex]?

@fujiisoup (Member Author) left a comment:

Now, every backend uses the most efficient indexing method it supports for any given indexers.

"""
Decompose vectorized indexer to the successive two indexers, where the
first indexer will be used to index backend arrays, while the second one
is used to index loaded on-memory np.ndarray.
@fujiisoup (Member Author):

A detailed description of decompose_indexer is given here.

@@ -41,8 +41,10 @@ def shape(self):
        return self._shape

    def __getitem__(self, key):
        key = indexing.unwrap_explicit_indexer(
            key, self, allow=(indexing.BasicIndexer, indexing.OuterIndexer))
        # TODO support indexing.decompose_indexer
@fujiisoup (Member Author):

I think I need someone's help to update indexing scheme for rasterio backend.

Contributor:

rasterio strictly uses a slicing approach for fetching data, or pulling out individual data points. In this case, you want to avoid converting the slices into numpy arrays of indexes. Maybe it would make sense to have a function that could re-compose numpy arrays of indexes into groups of slices? So, if we were indexing [1, 3, 4, 5, 2, 6], we would get back slice(1, 7)? Maybe even make it smart enough to take chunk hints, so that indexes like [1, 4, 5] are kept as slice(1, 6) rather than (slice(1, 2), slice(4, 6))? (A sketch of such a helper follows.)
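
One simple version of the grouping part of such a helper might look like this (a sketch; the gap parameter stands in for a chunk hint, and picking the requested elements back out of the slices in their original order would still need a follow-up indexer):

import numpy as np

def indexes_to_slices(indexes, gap=1):
    # re-compose a 1-D integer index array into covering slices, merging
    # runs whose jumps are at most `gap` (a stand-in for chunk hints)
    indexes = np.sort(np.asarray(indexes))
    slices = []
    start = prev = indexes[0]
    for i in indexes[1:]:
        if i - prev > gap:  # jump too large: close the current run
            slices.append(slice(int(start), int(prev) + 1))
            start = i
        prev = i
    slices.append(slice(int(start), int(prev) + 1))
    return slices

print(indexes_to_slices([1, 3, 4, 5, 2, 6]))  # [slice(1, 7, None)]
print(indexes_to_slices([1, 4, 5]))           # [slice(1, 2, None), slice(4, 6, None)]
print(indexes_to_slices([1, 4, 5], gap=3))    # [slice(1, 6, None)]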

@fujiisoup (Member Author) commented Feb 14, 2018

@WeatherGod Thanks for the review.

Maybe it would make sense to have a function that could re-compose numpy arrays of indexes into groups of slices?

Agreed, it would be an improvement. But I am wondering whether the number of combinations will explode as the number of dimensions increases. Some heuristics would also be necessary to decide how many chunks to use.
I do not yet have a clear solution, and I am open to any ideas for improvement.

EDIT: the chunking would also improve efficiency if we want to get a full diagonal from a large 2d array. In my current implementation we need to load a full slice, but that could be reduced by a factor of 2 if we use 2x2 chunks.

Contributor:

Yeah, we certainly have time to think about it. This feature doesn't seem to change anything for indexing that worked before, so I don't think we have to worry about regressions. Improvements to indexing can be thought of holistically as a separate task.

For our immediate needs with rasterio, though, we should probably just create a fallback indexer that loads up the entire dimension and indexes it as given whenever the original indexer is anything but scalar integers or slices.

@fujiisoup (Member Author):

@fmaussion can you help me with this?
What I want to do here is two-staged indexing:

  1. index the backend array with the 1st indexer;
  2. index the loaded np.ndarray with the 2nd (vectorized) indexer.

These two indexers (for the backend and for the np.ndarray) can be computed by

key, np_inds = indexing.decompose_indexer(
    key, self.shape, indexing.IndexingSupport.OUTER)

I am not familiar with the rasterio data structure (I do not yet understand what is going on below...), so I would appreciate it if you could instruct me.

Member:

Rasterio (as wrapped by xarray) supports a strange form of indexing:

  • Outer indexing for the first indexer element only.
  • Basic indexing for remaining two indexer elements (every rasterio array is 3-dimensional).

I think the simplest fix would be to extend our rasterio wrapper to support arbitrary outer indexing. This would entail changing the final clause below, where rasterio currently raises an error:

else:
    k = np.asarray(k)
    start = k[0]
    stop = k[-1] + 1
    ids = np.arange(start, stop)
    if not ((k.shape == ids.shape) and np.all(k == ids)):
        raise IndexError(_ERROR_MSG)

Instead, we would want to create a slice object based on the minimum/maximum elements, and then append another indexer to np_inds to pull out the listed elements (after subtracting the start of the slice). See the sketch below.
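
Concretely, that final clause might become something like the following (a sketch only; band_key and the np_inds list are assumptions about how the surrounding wrapper code is organized):

else:
    k = np.asarray(k)
    start, stop = int(k.min()), int(k.max()) + 1
    band_key = slice(start, stop)  # read one contiguous range from rasterio
    # then pick the listed elements back out of the loaded block in memory
    np_inds.append(k - start)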

@WeatherGod (Contributor):

Oh, wow... this worked like a charm for the netcdf4 backend! I have a ~13GB (uncompressed) 4-D netcdf4 variable that was giving me trouble when slicing a 2D surface out of it. Here is a snippet where I am grabbing data at random indices in the last dimension: first for a specific latitude, then for the entire domain.

>>> CD_subset = rough['CD'][0]
>>> wind_inds_decorated
<xarray.DataArray (latitude: 3501, longitude: 7001)>
array([[33, 15, 25, ..., 52, 66, 35],
       [ 6,  8, 55, ..., 59,  6, 50],
       [54,  2, 40, ..., 32, 19,  9],
       ..., 
       [53, 18, 23, ..., 19,  3, 43],
       [ 9, 11, 66, ..., 51, 39, 58],
       [21, 54, 37, ...,  3,  0, 65]])
Dimensions without coordinates: latitude, longitude
>>> foo = CD_subset.isel(latitude=0, wind_direction=wind_inds_decorated[0])
>>> foo
<xarray.DataArray 'CD' (longitude: 7001)>
array([ 0.004052,  0.005915,  0.002771, ...,  0.005604,  0.004715,  0.002756], dtype=float32)
Coordinates:
    scales          int16 60
    latitude        float64 54.99
  * longitude       (longitude) float64 -130.0 -130.0 -130.0 -130.0 -130.0 ...
    wind_direction  (longitude) int16 165 75 125 5 235 345 315 175 85 35 290 ...
>>> foo = CD_subset.isel(wind_direction=wind_inds_decorated)
>>> foo
<xarray.DataArray 'CD' (latitude: 3501, longitude: 7001)>
[24510501 values with dtype=float32]
Coordinates:
    scales          int16 60
  * latitude        (latitude) float64 54.99 54.98 54.97 54.96 54.95 54.95 ...
  * longitude       (longitude) float64 -130.0 -130.0 -130.0 -130.0 -130.0 ...
    wind_direction  (latitude, longitude) int64 165 75 125 5 235 345 315 175 ...

All previous attempts at this would result in having to load the entire 13GB array into memory just to get 93.5 MB out. Or, I would try to fetch each individual point, which took way too long. This worked faster than loading the entire thing into memory, and it used less memory, too (I think I maxed out at about 1.2GB of total usage, which is totally acceptable for my use case).

I will try out similar things with the pynio and rasterio backends, and get back to you. Thanks for this work!

@WeatherGod (Contributor) left a comment:

Cool beans! Going to continue testing this stuff out. Would be nice to get rasterio support going, but this is certainly an improvement over the pre-existing situation.

        return np.broadcast(*self.key.tuple).shape

    def __array__(self, dtype=None):
        return np.asarray(self.array[self.key], dtype=None)
Contributor:
Shouldn't this be np.asanyarray()? Could this interfere with dask arrays? Pint support?

Member:

This should be okay; neither xarray nor dask supports using masked arrays under the hood. See #1194.

mode: str, one of ['basic' | 'outer' | 'outer_1vector' | 'vectorized']
    basic: for backends that support onle basic indexer
    outer: for backends that support basic / outer indexer
    outer_1vector: for backends that support outer indexer inluding at most
Contributor:

"inluding" --> "including"


backend_indexer = []
np_indexer = []
# posify
Contributor:
"posify"???

Member:
I'll take the blame for this one! I think I used this made-up word elsewhere :).

for k in indexer:
    if isinstance(k, slice):
        backend_indexer.append(k)
        np_indexer.append(slice(None))
Contributor:
Might help future readers to add a bit of prose to each of these. Like: "If it is a slice, then we will slice it as-is (k) in the backend, and then use all of it (slice(None)) for the in-memory portion." (A sketch along these lines follows.)
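
For example, the decomposition cases might read like this with that kind of commentary (a sketch of the idea, not this PR's exact code):

import numpy as np

def decompose_outer_sketch(indexer, shape):
    backend_indexer, np_indexer = [], []
    for k, size in zip(indexer, shape):
        if isinstance(k, slice):
            # a slice is applied as-is (k) in the backend; the in-memory
            # pass then keeps everything from that dimension (slice(None))
            backend_indexer.append(k)
            np_indexer.append(slice(None))
        elif isinstance(k, (int, np.integer)):
            # an integer is resolved entirely in the backend; the dimension
            # disappears, so the in-memory pass needs nothing for it
            backend_indexer.append(k)
        else:
            # an integer array: load the covering range in the backend, then
            # pick the requested elements at their block-local positions
            # (ignoring negative indices and the other corner cases above)
            k = np.asarray(k)
            backend_indexer.append(slice(int(k.min()), int(k.max()) + 1))
            np_indexer.append(k - int(k.min()))
    return tuple(backend_indexer), tuple(np_indexer)

x = np.arange(20).reshape(4, 5)
bk, npk = decompose_outer_sketch((slice(None), np.array([1, 4, 2])), x.shape)
np.testing.assert_array_equal(x[:, [1, 4, 2]], x[bk][npk])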

Contributor:
Good job with the prose above, but the other spots are going to need similar prose as well.

"""
Decompose ouer indexer to the successive two indexers, where the
first indexer will be used to index backend arrays, while the second one
is ised to index the loaded on-memory np.ndarray.
Contributor:
"ised" --> "used"


def _decompose_outer_indexer(indexer, shape, mode):
"""
Decompose ouer indexer to the successive two indexers, where the
Contributor:
"ouer" --> "outer"

Also, the phrasing is odd. I would go with "Decompose outer indexer into two indexers" instead, as it is more concise.

>>> array = array[backend_indexer] # load subslice of the array
>>> np_indexer = OuterIndexer([0, 2, 1], [0, 1, 0])
>>> array[np_indexer] # outer indexing for on-memory np.ndarray.
"""
Contributor:
The modes aren't documented. Maybe a reference to a list of them?

@WeatherGod (Contributor):

Hmm, came across a bug with the pynio backend. Working on making a reproducible example, but just for your own inspection, here is some logging output:

<xarray.Dataset>
Dimensions:    (xgrid_0: 1799, ygrid_0: 1059)
Coordinates:
    lv_HYBL0   float32 8.0
    longitude  (ygrid_0, xgrid_0) float32 ...
    latitude   (ygrid_0, xgrid_0) float32 ...
Dimensions without coordinates: xgrid_0, ygrid_0
Data variables:
    UGRD       (ygrid_0, xgrid_0) float32 ...
    VGRD       (ygrid_0, xgrid_0) float32 ...
DEBUG:hiresWind.downscale:shape of a data: (50, 1059, 1799)

The first bit is the repr of my Dataset. The last line is the output of ds['UGRD'].values.shape. It is supposed to be 2D, not 3D.

If I revert back to v0.10.0, then the shape is (1059, 1799), just as expected.

@WeatherGod (Contributor):

I can also confirm that the shape comes out correctly using master, so this is definitely isolated to this PR.

@WeatherGod (Contributor) commented Feb 14, 2018

Ah, interesting... so, this dataset was created by doing an isel() on the original:

>>> ds['UGRD_P0_L105_GLC0']
<xarray.DataArray 'UGRD_P0_L105_GLC0' (lv_HYBL0: 50, ygrid_0: 1059, xgrid_0: 1799)>
[95257050 values with dtype=float32]
Coordinates:
  * lv_HYBL0   (lv_HYBL0) float32 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 ...
    gridlat_0  (ygrid_0, xgrid_0) float32 ...
    gridlon_0  (ygrid_0, xgrid_0) float32 ...
Dimensions without coordinates: ygrid_0, xgrid_0

So, the original data has a 50x1059x1799 grid, and the new indexer isn't being properly composed so that it fetches [7, slice(None), slice(None)] when I grab its .values.

key = indexing.unwrap_explicit_indexer(
    key, target=self, allow=indexing.BasicIndexer)
key, np_inds = indexing.decompose_indexer(key, self.shape,
                                          mode='outer')
Contributor:
PyNIO actually has a very robust indexing interface. We might be over-simplifying here, and this could be leading to bugs. I am not exactly sure we even need to do all of this indirection. PyNIO allows for the creation of a "selection object" that can be passed to the pynio array objects. Perhaps we should be creating that object instead of forcing the pynio array object into a numpy index adapter?

@fujiisoup (Member Author):

Thanks for the suggestion.
I am not familiar with pynio, but it looks like their selection object is very specific to pynio...
I want to leave this improvement to another PR and focus on the common features in this PR.

Member:

I'm not sure how selection objects would really help us here. I guess they can be used to do a form of outer indexing, but putting integer indices into a string seems very inefficient. I agree that we should save it for later.

Are we actually sure that PyNIO even supports "outer" indexing? I had it marked down as only supporting basic indexing before. "Outer" indexing would include indexing with 1D arrays like array[[0, 1, 2], :], but I don't see that support documented anywhere. (Maybe it does work but just isn't documented.)

----------
indexer: VectorizedIndexer
mode: str, one of ['basic' | 'outer' | 'outer_1vector' | 'vectorized']
basic: for backends that support onle basic indexer
Contributor:

"onle" --> "only"?

@WeatherGod (Contributor):

Just did some more debugging, putting in some debug statements within NioArrayWrapper.__getitem__():

diff --git a/xarray/backends/pynio_.py b/xarray/backends/pynio_.py
index c7e0ddf..b9f7151 100644
--- a/xarray/backends/pynio_.py
+++ b/xarray/backends/pynio_.py
@@ -27,16 +27,24 @@ class NioArrayWrapper(BackendArray):
         return self.datastore.ds.variables[self.variable_name]
 
     def __getitem__(self, key):
+        import logging
+        logger = logging.getLogger(__name__)
+        logger.addHandler(logging.NullHandler())
+        logger.debug("initial key: %s", key)
         key, np_inds = indexing.decompose_indexer(key, self.shape,
                                                   mode='outer')
+        logger.debug("Decomposed indexers:\n%s\n%s", key, np_inds)
 
         with self.datastore.ensure_open(autoclose=True):
             array = self.get_array()
+            logger.debug("initial array: %r", array)
             if key == () and self.ndim == 0:
                 return array.get_value()
 
             for ind in np_inds:
+                logger.debug("indexer: %s", ind)
                 array = indexing.NumpyIndexingAdapter(array)[ind]
+                logger.debug("intermediate array: %r", array)
 
             return array
 

And here is the test script (data not included):

import logging
import xarray as xr
logging.basicConfig(level=logging.DEBUG)
fname1 = '../hrrr.t12z.wrfnatf02.grib2'
ds = xr.open_dataset(fname1, engine='pynio')
subset_isel = ds.isel(lv_HYBL0=7)
sp = subset_isel['UGRD_P0_L105_GLC0'].values.shape

And here is the relevant output:

DEBUG:xarray.backends.pynio_:initial key: BasicIndexer((slice(None, None, None),))
DEBUG:xarray.backends.pynio_:Decomposed indexers:
BasicIndexer((slice(None, None, None),))
()
DEBUG:xarray.backends.pynio_:initial array: <Nio.NioVariable object at 0x7f0f3c339210>
DEBUG:xarray.backends.pynio_:initial key: BasicIndexer((slice(None, None, None),))
DEBUG:xarray.backends.pynio_:Decomposed indexers:
BasicIndexer((slice(None, None, None),))
()
DEBUG:xarray.backends.pynio_:initial array: <Nio.NioVariable object at 0x7f0f3c339b90>
DEBUG:xarray.backends.pynio_:initial key: BasicIndexer((slice(None, None, None),))
DEBUG:xarray.backends.pynio_:Decomposed indexers:
BasicIndexer((slice(None, None, None),))
()
DEBUG:xarray.backends.pynio_:initial array: <Nio.NioVariable object at 0x7f0f3c339d50>
DEBUG:xarray.backends.pynio_:initial key: BasicIndexer((slice(None, None, None),))
DEBUG:xarray.backends.pynio_:Decomposed indexers:
BasicIndexer((slice(None, None, None),))
()
DEBUG:xarray.backends.pynio_:initial array: <Nio.NioVariable object at 0x7f0f3c339d90>
DEBUG:xarray.backends.pynio_:initial key: BasicIndexer((7, slice(None, None, None), slice(None, None, None)))
DEBUG:xarray.backends.pynio_:Decomposed indexers:
BasicIndexer((7, slice(None, None, None), slice(None, None, None)))
()
DEBUG:xarray.backends.pynio_:initial array: <Nio.NioVariable object at 0x7f0f3c339190>
DEBUG:xarray.backends.pynio_:initial key: BasicIndexer((7, slice(None, None, None), slice(None, None, None)))
DEBUG:xarray.backends.pynio_:Decomposed indexers:
BasicIndexer((7, slice(None, None, None), slice(None, None, None)))
()
DEBUG:xarray.backends.pynio_:initial array: <Nio.NioVariable object at 0x7f0f3c339190>
(50, 1059, 1799)

So, the BasicIndexer((7, slice(None, None, None), slice(None, None, None))) isn't getting decomposed correctly, it looks like?

for ind in np_inds:
    array = indexing.NumpyIndexingAdapter(array)[ind]

return array
Contributor:
Ah-hah! I see the bug! key never gets used!

Contributor:

So, this should probably be something like:

array = array[key.tuple]
for ind in np_inds:
    array = indexing.NumpyIndexingAdapter(array)[ind]

return array

@fujiisoup (Member Author):

Thanks for debugging this!!
We certainly need test coverage for this.

 with self.datastore.ensure_open(autoclose=True):
-    return self.get_array()[key]
+    array = np.asarray(self.get_array()[key], dtype=self.dtype)
Contributor:

What's the reasoning behind doing asarray here, but not for other backends?

@@ -543,6 +543,10 @@ def _updated_key(self, new_key):
        return _combine_indexers(self.key, self.shape, new_key)

    def __getitem__(self, indexer):
        # load if the indexed array becomes a scalar
Member:

Is this actually a good idea? I think it's a little surprising to load data immediately just because the result is a scalar.

@shoyer mentioned this pull request on Feb 19, 2018
@fujiisoup (Member Author):

It looks like some backends do not support negative-step slices.
I'm going to wrap this up, maybe this weekend.

@WeatherGod (Contributor):

I did some more investigation into the memory usage problem I was having. I had assumed that the vectorized-indexed result of a lazily indexed data array would be an in-memory array. So, when I then started to use the result, it did a read of all the data at once, resulting in a near-complete load of the data into memory.

I have adjusted my code to chunk out the indexing in order to keep the memory usage under control, at a reasonable performance penalty (a sketch of the idea follows). I haven't looked into identifying the ideal chunking scheme for an arbitrary data array and indexer; perhaps we can make that a task for another day. At this point, I am satisfied with the features (negative step sizes aside, of course).
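
Such a chunking workaround might look roughly like this (a hypothetical helper, not code from this PR; the name, arguments, and the 1-D indexer restriction are all assumptions for illustration):

import xarray as xr

def chunked_vectorized_isel(da, dim, indexer, chunk_size=100000):
    # apply a large vectorized indexer in pieces, so that only one chunk's
    # worth of the lazily indexed array is in memory at any time
    (indexer_dim,) = indexer.dims  # keep it simple: 1-D indexer only
    pieces = []
    for start in range(0, indexer.sizes[indexer_dim], chunk_size):
        part = indexer.isel({indexer_dim: slice(start, start + chunk_size)})
        pieces.append(da.isel({dim: part}).load())
    return xr.concat(pieces, dim=indexer_dim)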

@jhamman (Member) commented Feb 26, 2018

@fujiisoup - is this ready for a final review? I see you have all the tests passing 💯 !

@fujiisoup (Member Author):

I think it's ready :)

@@ -483,13 +466,6 @@ def __init__(self, array, key=None):
        self.key = key

Member:

Maybe we should rename the original LazilyIndexedArray to LazilyOuterIndexedArray? Also, should we add a transpose method?

shape : tuple
Shape of the array subject to the indexing.

Returns
-------
tuple
VectorizedInexer
Member:
VectorizedIndexer

return (OuterIndexer(tuple(backend_indexer)),
        OuterIndexer(tuple(np_indexer)))

# basic
Member:
Let's add an explicit error check or at least an assertion here, e.g., assert indexing_support == IndexingSupport.BASIC.

OuterIndexer(tuple(np_indexer)))


def _arrayize_vectorized_indexer(indexer, shape):
Member:
Oops -- I wrote a basically identical helper function xarray.backends.zarr._replace_slices_with_arrays along with some unit tests.

Can you confirm that we can remove the zarr helper function, and that all its unit tests work with this function instead? I don't have a preference on which implementation to use, assuming that all tests pass with both.

@@ -277,7 +279,7 @@ def open(cls, filename, mode='r', format='NETCDF4', group=None,
     def open_store_variable(self, name, var):
         with self.ensure_open(autoclose=False):
             dimensions = var.dimensions
-            data = indexing.LazilyIndexedArray(NetCDF4ArrayWrapper(name, self))
+            data = indexing.LazilyOuterIndexedArray(NetCDF4ArrayWrapper(name, self))
Contributor:
E501 line too long (84 > 79 characters)

        return _combine_indexers(self.key, self.shape, new_key)

    def __getitem__(self, indexer):
        # If the indexed array becomes a scalar, return LazilyOuterIndexedArray.
Contributor:
E501 line too long (80 > 79 characters)

@@ -77,7 +77,7 @@ def get_variables(self):
     def lazy_inaccessible(k, v):
         if k in self._indexvars:
             return v
-        data = indexing.LazilyIndexedArray(InaccessibleArray(v.values))
+        data = indexing.LazilyOuterIndexedArray(InaccessibleArray(v.values))
Contributor:
E501 line too long (80 > 79 characters)

@@ -16,7 +16,7 @@
 from xarray.core import indexing
 from xarray.core.common import full_like, ones_like, zeros_like
 from xarray.core.indexing import (
-    BasicIndexer, CopyOnWriteArray, DaskIndexingAdapter, LazilyIndexedArray,
+    BasicIndexer, CopyOnWriteArray, DaskIndexingAdapter, LazilyOuterIndexedArray,
Contributor:
E501 line too long (81 > 79 characters)

@@ -1798,7 +1798,7 @@ def test_rolling_window(self):

 class TestAsCompatibleData(TestCase):
     def test_unchanged_types(self):
-        types = (np.asarray, PandasIndexAdapter, indexing.LazilyIndexedArray)
+        types = (np.asarray, PandasIndexAdapter, indexing.LazilyOuterIndexedArray)
Contributor:
E501 line too long (82 > 79 characters)

 # doubly wrapping
 v = Variable(dims=('x', 'y'),
-             data=LazilyIndexedArray(LazilyIndexedArray(self.d)))
+             data=LazilyOuterIndexedArray(LazilyOuterIndexedArray(self.d)))
Contributor:
E501 line too long (83 > 79 characters)

 self.check_orthogonal_indexing(v)
 # hierarchical wrapping
 v = Variable(dims=('x', 'y'),
-             data=LazilyIndexedArray(NumpyIndexingAdapter(self.d)))
+             data=LazilyOuterIndexedArray(NumpyIndexingAdapter(self.d)))
Contributor:
E501 line too long (80 > 79 characters)

@fujiisoup (Member Author):

All done :)

@shoyer (Member) left a comment:

I think this is finally ready! Lazy transpose is a nice bonus :).

Let's wait a day or two before merging in case anyone has other comments to add.

@shoyer shoyer merged commit 54468e1 into pydata:master Mar 6, 2018
@shoyer (Member) commented Mar 6, 2018

OK, in it goes. Thanks @fujiisoup !

@fujiisoup (Member Author):

Thanks, @WeatherGod , for your feedback.
This is finally merged!

@WeatherGod (Contributor):

🎉

@dopplershift (Contributor):

So did this remove/rename LazilyIndexedArray in 0.10.2? I'm asking because I'm getting an attribute error in the custom xarray backend I wrote: https://github.com/Unidata/siphon/blob/master/siphon/cdmr/xarray_support.py

I don't mind updating, but I wanted to make sure this was intentional.

@fujiisoup (Member Author):

Yes: LazilyIndexedArray was renamed to LazilyOuterIndexedArray, and LazilyVectorizedIndexedArray was newly added.
Which of these two backend arrays is used is selected depending on what kind of indexer is applied.

shoyer pushed a commit that referenced this pull request Jun 1, 2018
* Added PNC backend to xarray

PNC is used for GEOS-Chem, CAMx, CMAQ and other atmospheric data formats
that have their own file formats and meta-data conventions. It can provide a CF compliant netCDF-like interface.

* Added whats-new documentation

* Updating pnc_ to remove DunderArrayMixin dependency

* Adding basic tests for pnc

Right now, pnc is simply being tested as a reader for NetCDF3 files

* Updating for flake8 compliance

* flake does not like unused e

* Updating pnc to PseudoNetCDF

* Remove outer except

* Updating pnc to PseudoNetCDF

* Added open and updated init

Based on shoyer review

* Updated indexing and test fix

Indexing supports #1899

* Added PseudoNetCDF to doc/io.rst

* Changing test subtype

* Changing test subtype
removing pdb

* pnc test case requires netcdf3only

For now, pnc is only supporting the classic data model

* adding backend_kwargs default as dict

This ensures **mapping is possible.

* Upgrading tests to CFEncodedDataTest

Some tests are bypassed. PseudoNetCDF string treatment is not currently
compatible with xarray. This will be addressed soon.

* Not currently supporting autoclose

I do not fully understand the usecase, so I have not implemented these tests.

* Minor updates for flake8

* Explicit skipping

Using pytest.mark.skip to skip unsupported tests

* removing trailing whitespace from pytest skip

* Adding pip support

* Addressing comments

* Bypassing pickle, mask/scale, and object

These tests cause errors that do not affect desired backend performance.

* Added uamiv test

PseudoNetCDF reads other formats. This adds a test
of uamiv to the standard test for a backend

and skips mask/scale, object, and boolean tests

* Adding support for autoclose

ensure open must be called before accessing variable data

* Adding backend_kwargs to all backends

Most backends currently take no keywords, so an empty dictionary is appropriate.

* Small tweaks to PNC backend

* remove warning and update whats-new

* Separating install and io pnc docs and updating what's new

* fixing line length in test

* Tests now use non-netcdf files

* Removing unknown meta-data netcdf support.

* flake8 cleanup

* Using python 2 and 3 compat testing

* Disabling mask_and_scale by default

prevents inadvertent double scaling in PNC formats

* consistent with 3.0.0

Updates in 3.0.1 will fix close in uamiv.

* Updating readers and line length

* Updating readers and line length

* Updating readers and line length

* Adding open_mfdataset test

Testing by opening same file twice and stacking it.

* Using conda version of PseudoNetCDF

* Removing xfail for netcdf

Mask and scale with PseudoNetCDF and NetCDF4 is not supported, but
not prevented.

* Moving pseudonetcdf to v0.15

* Updating what's new

* Fixing open_dataarray CF options

mask_and_scale is None (diagnosed by open_dataset) and decode_cf should be True
@fujiisoup fujiisoup deleted the vectorized_lazy_indexing branch June 8, 2018 01:21