
Vectorized lazy indexing #1899

Merged 37 commits on Mar 6, 2018. Changes shown below are from 9 commits.
Commits (37):

- dceb298  Start working (fujiisoup, Feb 9, 2018)
- c1b4b60  First support of lazy vectorized indexing. (fujiisoup, Feb 9, 2018)
- d989a15  Some optimization. (fujiisoup, Feb 9, 2018)
- 218763c  Use unique to decompose vectorized indexing. (fujiisoup, Feb 10, 2018)
- 541fca3  Consolidate vectorizedIndexing (fujiisoup, Feb 10, 2018)
- 3e05a16  Support vectorized_indexing in h5py (fujiisoup, Feb 11, 2018)
- 030a2c4  Refactoring backend array. Added indexing.decompose_indexers. Drop un… (fujiisoup, Feb 13, 2018)
- b9f97b4  typo (fujiisoup, Feb 14, 2018)
- 850f29c  bugfix and typo (fujiisoup, Feb 14, 2018)
- 943ec78  Fix based on @WeatherGod comments. (fujiisoup, Feb 14, 2018)
- 91aae64  Use enum-like object to indicate indexing-support types. (fujiisoup, Feb 15, 2018)
- 991c1da  Update test_decompose_indexers. (fujiisoup, Feb 15, 2018)
- 936954a  Bugfix and benchmarks. (fujiisoup, Feb 15, 2018)
- 9144965  fix: support outer/basic indexer in LazilyVectorizedIndexedArray (fujiisoup, Feb 16, 2018)
- d1cb976  Merge branch 'master' into vectorized_lazy_indexing (fujiisoup, Feb 16, 2018)
- 95e1f1c  More comments. (fujiisoup, Feb 16, 2018)
- 180c4f5  Fixing style errors. (stickler-ci, Feb 16, 2018)
- dbbe531  Remove unintended dupicate (fujiisoup, Feb 16, 2018)
- c2e61ad  combine indexers for on-memory np.ndarray. (fujiisoup, Feb 16, 2018)
- b545c3e  fix whats new (fujiisoup, Feb 16, 2018)
- 872de73  fix pydap (fujiisoup, Feb 16, 2018)
- bb5d1f6  Update comments. (fujiisoup, Feb 16, 2018)
- cfe29bb  Support VectorizedIndexing for rasterio. Some bugfix. (fujiisoup, Feb 17, 2018)
- 17a7dac  Merge branch 'master' into vectorized_lazy_indexing (fujiisoup, Feb 18, 2018)
- 2dff278  flake8 (fujiisoup, Feb 18, 2018)
- ead6327  More tests (fujiisoup, Feb 18, 2018)
- a90ac05  Use LazilyIndexedArray for scalar array instead of loading. (fujiisoup, Feb 18, 2018)
- 73f4958  Support negative step slice in rasterio. (fujiisoup, Feb 18, 2018)
- fd04966  Make slice-step always positive (fujiisoup, Feb 18, 2018)
- b3c3d80  Bugfix in slice-slice (fujiisoup, Feb 25, 2018)
- 259f36c  Add pydap support. (fujiisoup, Feb 25, 2018)
- 0e7eb2e  Merge branch 'master' into vectorized_lazy_indexing (fujiisoup, Feb 26, 2018)
- 0c2e31b  Merge branch 'master' into vectorized_lazy_indexing (fujiisoup, Mar 1, 2018)
- d8421a5  Merge branch 'master' into vectorized_lazy_indexing (fujiisoup, Mar 1, 2018)
- 7e0959c  Rename LazilyIndexedArray -> LazilyOuterIndexedArray. Remove duplicat… (fujiisoup, Mar 2, 2018)
- 4fccdee  flake8 (fujiisoup, Mar 2, 2018)
- 8e96710  Added transpose to LazilyOuterIndexedArray (fujiisoup, Mar 3, 2018)
4 changes: 4 additions & 0 deletions doc/whats-new.rst
@@ -43,6 +43,10 @@ Documentation

Enhancements
~~~~~~~~~~~~
- Support lazy vectorized indexing. After this change, flexible indexing such
  as orthogonal/vectorized indexing becomes possible for all backend
  arrays. (:issue:`1897`)
  By `Keisuke Fujii <https://github.com/fujiisoup>`_.
- Reduce methods such as :py:func:`DataArray.sum()` now accept a ``dtype``
  argument. (:issue:`1838`)
  By `Keisuke Fujii <https://github.com/fujiisoup>`_.
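The first enhancement above concerns how indexing requests reach backend arrays; the user-facing operation it makes lazy looks like the following minimal in-memory sketch (variable names are illustrative):

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.arange(12).reshape(3, 4), dims=("x", "y"))

# Vectorized (pointwise) indexing: select the points (x=0, y=1) and
# (x=2, y=3) along a new "points" dimension.
ix = xr.DataArray([0, 2], dims="points")
iy = xr.DataArray([1, 3], dims="points")
picked = da.isel(x=ix, y=iy)
print(picked.values)  # [ 1 11]
```

With this PR, the same selection on a dataset opened from a backend no longer requires loading the whole array into memory first.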
17 changes: 11 additions & 6 deletions xarray/backends/h5netcdf_.py
@@ -17,15 +17,20 @@

 class H5NetCDFArrayWrapper(BaseNetCDF4Array):
     def __getitem__(self, key):
-        key = indexing.unwrap_explicit_indexer(
-            key, self, allow=(indexing.BasicIndexer, indexing.OuterIndexer))
+        key, np_inds = indexing.decompose_indexer(key, self.shape,
+                                                  mode='outer_1vector')

         # h5py requires using lists for fancy indexing:
         # https://github.com/h5py/h5py/issues/992
         # OuterIndexer only holds 1D integer ndarrays, so it's safe to convert
         # them to lists.
-        key = tuple(list(k) if isinstance(k, np.ndarray) else k for k in key)
+        key = tuple(list(k) if isinstance(k, np.ndarray) else k for k in
+                    key.tuple)
         with self.datastore.ensure_open(autoclose=True):
-            return self.get_array()[key]
+            array = np.asarray(self.get_array()[key], dtype=self.dtype)
Contributor comment:

> What's the reasoning behind doing asarray here, but not for other backends?
+        for ind in np_inds:
+            array = indexing.NumpyIndexingAdapter(array)[ind]
+
+        return array
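The decompose-and-reapply pattern in this hunk (let the backend do the indexing it supports, then finish with NumPy on the loaded result) can be illustrated without any backend. The helper below is a sketch, not xarray's API; the real `decompose_indexer` works with `ExplicitIndexer` objects rather than raw arrays:

```python
import numpy as np

def outer_then_fancy(arr, rows, cols):
    """Emulate pointwise (vectorized) indexing on a store that only
    supports outer (orthogonal) indexing, in two stages."""
    # Stage 1: load the minimal outer-product block covering all
    # requested points (the part a backend like h5py can execute).
    urows, row_inv = np.unique(rows, return_inverse=True)
    ucols, col_inv = np.unique(cols, return_inverse=True)
    block = arr[np.ix_(urows, ucols)]
    # Stage 2: pointwise indexing on the in-memory block (NumPy's job).
    return block[row_inv, col_inv]

a = np.arange(16).reshape(4, 4)
result = outer_then_fancy(a, np.array([0, 3, 3]), np.array([1, 2, 1]))
print(result)  # [ 1 14 13], i.e. a[0, 1], a[3, 2], a[3, 1]
```

Using `np.unique` to shrink the stage-1 load mirrors commit 218763c ("Use unique to decompose vectorized indexing").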


def maybe_decode_bytes(txt):
12 changes: 7 additions & 5 deletions xarray/backends/netCDF4_.py
@@ -50,17 +50,16 @@ def get_array(self):

 class NetCDF4ArrayWrapper(BaseNetCDF4Array):
     def __getitem__(self, key):
-        key = indexing.unwrap_explicit_indexer(
-            key, self, allow=(indexing.BasicIndexer, indexing.OuterIndexer))
-
+        key, np_inds = indexing.decompose_indexer(key, self.shape,
+                                                  mode='outer')
         if self.datastore.is_remote:  # pragma: no cover
             getitem = functools.partial(robust_getitem, catch=RuntimeError)
         else:
             getitem = operator.getitem

         with self.datastore.ensure_open(autoclose=True):
             try:
-                data = getitem(self.get_array(), key)
+                array = getitem(self.get_array(), key.tuple)
             except IndexError:
                 # Catch IndexError in netCDF4 and return a more informative
                 # error message. This is most often called when an unsorted

@@ -73,7 +72,10 @@ def __getitem__(self, key):
                 msg += '\n\nOriginal traceback:\n' + traceback.format_exc()
                 raise IndexError(msg)

-        return data
+        for ind in np_inds:
+            array = indexing.NumpyIndexingAdapter(array)[ind]
+
+        return array
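`robust_getitem`, used above for remote stores, retries transient failures with backoff. A self-contained sketch of that retry behavior (the function name, parameters, and defaults here are illustrative, not xarray's exact signature):

```python
import time

def robust_getitem_sketch(array, key, catch=Exception,
                          max_retries=5, initial_delay=0.01):
    """Index `array` with `key`, retrying exceptions of type `catch`
    with exponential backoff before giving up."""
    for n in range(max_retries + 1):
        try:
            return array[key]
        except catch:
            if n == max_retries:
                raise
            time.sleep(initial_delay * 2 ** n)

class FlakyStore:
    """Stand-in for a remote dataset: fails twice, then succeeds."""
    def __init__(self):
        self.calls = 0
    def __getitem__(self, key):
        self.calls += 1
        if self.calls < 3:
            raise RuntimeError("transient server error")
        return "data"

store = FlakyStore()
print(robust_getitem_sketch(store, (0,), catch=RuntimeError))  # data
```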


def _encode_nc4_variable(var):
12 changes: 8 additions & 4 deletions xarray/backends/pydap_.py
@@ -24,18 +24,22 @@ def dtype(self):
         return self.array.dtype

     def __getitem__(self, key):
-        key = indexing.unwrap_explicit_indexer(
-            key, target=self, allow=indexing.BasicIndexer)
+        key, np_inds = indexing.decompose_indexer(key, self.shape,
+                                                  mode='basic')

         # pull the data from the array attribute if possible, to avoid
         # downloading coordinate data twice
         array = getattr(self.array, 'array', self.array)
-        result = robust_getitem(array, key, catch=ValueError)
+        result = robust_getitem(array, key.tuple, catch=ValueError)
         # pydap doesn't squeeze axes automatically like numpy
-        axis = tuple(n for n, k in enumerate(key)
+        axis = tuple(n for n, k in enumerate(key.tuple)
                      if isinstance(k, integer_types))
         if len(axis) > 0:
             result = np.squeeze(result, axis)

+        for ind in np_inds:
+            result = indexing.NumpyIndexingAdapter(result)[ind]
Contributor comment:

> F821 undefined name 'ind'
return result
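The squeeze step above compensates for pydap keeping length-1 axes where NumPy's basic indexing would drop them. A standalone sketch of that logic (the helper name is illustrative):

```python
import numpy as np
from numbers import Integral

def squeeze_like_numpy(result, key):
    """Drop the axes that were indexed by a scalar integer, matching
    what NumPy basic indexing would have returned."""
    axis = tuple(n for n, k in enumerate(key) if isinstance(k, Integral))
    if len(axis) > 0:
        result = np.squeeze(result, axis)
    return result

# Suppose a store returned shape (1, 3, 1) for the key (0, slice(None), 2):
raw = np.arange(3).reshape(1, 3, 1)
out = squeeze_like_numpy(raw, (0, slice(None), 2))
print(out.shape)  # (3,)
```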


10 changes: 7 additions & 3 deletions xarray/backends/pynio_.py
@@ -27,14 +27,18 @@ def get_array(self):
return self.datastore.ds.variables[self.variable_name]

     def __getitem__(self, key):
-        key = indexing.unwrap_explicit_indexer(
-            key, target=self, allow=indexing.BasicIndexer)
+        key, np_inds = indexing.decompose_indexer(key, self.shape,
+                                                  mode='outer')
Contributor comment:

> PyNIO actually has a very robust indexing interface. We might be over-simplifying here, and this could be leading to bugs. I am not exactly sure we even need all of this indirection. PyNIO allows for the creation of a "selection object" that can be passed to the pynio array objects. Perhaps we should be creating that object instead of forcing the pynio array object into a NumPy index adapter?

Member Author comment:

> Thanks for the suggestion.
> I am not familiar with pynio, but it looks like their selection object is very specific to pynio.
> I want to leave this improvement to another PR and focus on the common features in this one.

Member comment:

> I'm not sure how selection objects would really help us here. I guess they can be used to do a form of outer indexing, but putting integer indices into a string seems very inefficient. I agree that we should save it for later.
>
> Are we actually sure that PyNIO even supports "outer" indexing? I had it marked down as only supporting basic indexing before. "Outer" indexing would include indexing with 1D arrays like `array[[0, 1, 2], :]`, but I don't see that support documented anywhere. (Maybe it does work, but just isn't documented.)
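The outer-vs-basic question above hinges on indexing semantics that NumPy itself can demonstrate. With two 1-D index arrays, "vectorized" indexing (NumPy's native fancy indexing) picks points, while "outer" indexing takes the cross product of the per-axis indexes:

```python
import numpy as np

a = np.arange(12).reshape(3, 4)
rows = np.array([0, 2])
cols = np.array([1, 3])

vectorized = a[rows, cols]      # pointwise: a[0, 1] and a[2, 3] -> shape (2,)
outer = a[np.ix_(rows, cols)]   # cross product -> shape (2, 2)

print(vectorized)  # [ 1 11]
print(outer)       # [[ 1  3]
                   #  [ 9 11]]
```

Basic indexing, by contrast, allows only integers and slices, which is what PyNIO had previously been assumed to support.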


         with self.datastore.ensure_open(autoclose=True):
             array = self.get_array()
             if key == () and self.ndim == 0:
                 return array.get_value()
-            return array[key]
+
+        for ind in np_inds:
+            array = indexing.NumpyIndexingAdapter(array)[ind]
+
+        return array
Contributor comment:

> Ah-hah! I see the bug! `key` never gets used!

Contributor comment:

> So, this should probably be something like:
>
>     array = array[key.tuple]
>     for ind in np_inds:
>         array = indexing.NumpyIndexingAdapter(array)[ind]
>
>     return array

Member Author comment:

> Thanks for debugging this!!
> We certainly need test coverage for this.



class NioDataStore(AbstractDataStore, DataStorePickleMixin):
6 changes: 4 additions & 2 deletions xarray/backends/rasterio_.py
@@ -41,8 +41,10 @@ def shape(self):
return self._shape

     def __getitem__(self, key):
-        key = indexing.unwrap_explicit_indexer(
-            key, self, allow=(indexing.BasicIndexer, indexing.OuterIndexer))
+        # TODO support indexing.decompose_indexer
Member Author comment:

> I think I need someone's help to update the indexing scheme for the rasterio backend.

Contributor comment:

> rasterio is strictly a slicing approach for fetching data, or pulling out individual data points. In this case, you want to avoid converting the slices into numpy arrays of indexes. Maybe it would make sense to have a function that could re-compose numpy arrays of indexes into groups of slices? So, if we were indexing [1, 3, 4, 5, 2, 6], we would get back slice(1, 7)? Maybe even make it smart enough to take the chunk hints, so that indexes like [1, 4, 5] are kept as slice(1, 6) rather than (slice(1, 2), slice(4, 6))?
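A minimal sketch of the re-composition that comment suggests, with a simple gap heuristic (the function and its parameters are hypothetical, not part of the PR):

```python
import numpy as np

def indexes_to_slice(idx, max_gap=0):
    """Re-compose a 1-D integer index array into a single covering slice
    plus a relative index that restores the requested order.

    Returns None when the sorted indexes have gaps larger than
    `max_gap`, i.e. when one slice would over-read too much.
    """
    idx = np.asarray(idx)
    gaps = np.diff(np.unique(idx))
    if gaps.size and gaps.max() - 1 > max_gap:
        return None
    start = int(idx.min())
    return slice(start, int(idx.max()) + 1), idx - start

a = np.arange(10) * 10
s, remap = indexes_to_slice([1, 3, 4, 5, 2, 6])
print(s)            # slice(1, 7), as in the example above
print(a[s][remap])  # same values as a[[1, 3, 4, 5, 2, 6]]
```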

Member Author comment (fujiisoup, Feb 14, 2018):

> @WeatherGod Thanks for the review.
>
> > Maybe it would make sense to have a function that could re-compose numpy arrays of indexes into groups of slices?
>
> Agreed, it would be an improvement. But I am wondering whether the number of combinations will explode as the dimension increases. Some heuristics would also be necessary for deciding how many chunks to use.
> I do not yet have a clear solution, and I am open to any ideas for improvement.
>
> EDIT: the chunking would also improve efficiency if we want to get a full diagonal of a large 2d-array. In my current implementation we need to load a full slice, but the amount loaded could be reduced by a factor of 2 if we use 2x2 chunks.

Contributor comment:

> Yeah, we certainly have time to think about it. This feature doesn't seem to change anything for indexing that worked before, so I don't think we have to worry about regressions. Improvements to indexing can be thought of holistically as a separate task.
>
> For our immediate needs with rasterio, though, we should probably just create a fallback indexer that loads the entire dimension and indexes it as given whenever the original indexer is anything but scalar integers or slices.

Member Author comment:

> @fmaussion can you help me with this?
> What I want to do here is two-staged indexing:
>
> 1. Index the backend array by the 1st indexer.
> 2. Index the loaded np.ndarray by the 2nd (vectorized) indexer.
>
> These two indexers (for the backend and for the np.ndarray) can be computed by
>
>     key, np_inds = indexing.decompose_indexer(
>         key, self.shape, indexing.IndexingSupport.OUTER)
>
> I am not familiar with the rasterio data structure (I do not yet understand what is going on below), so I would appreciate it if you could instruct me.

Member comment:

> Rasterio (as wrapped by xarray) supports a strange form of indexing:
>
> - Outer indexing for the first indexer element only.
> - Basic indexing for the remaining two indexer elements (every rasterio array is 3-dimensional).
>
> I think the simplest fix would be to extend our rasterio wrapper to support arbitrary outer indexing. This would entail changing the final clause below, where rasterio currently raises an error:
>
>     else:
>         k = np.asarray(k)
>         start = k[0]
>         stop = k[-1] + 1
>         ids = np.arange(start, stop)
>         if not ((k.shape == ids.shape) and np.all(k == ids)):
>             raise IndexError(_ERROR_MSG)
>
> Instead, we would want to create a slice object based on the minimum/maximum elements, and then add another indexer to np_inds to pull out the listed elements (after subtracting the start of the slice).
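That suggestion (read one min-to-max window, then pick out the listed elements after subtracting the slice start) can be sketched as follows; the helper name is illustrative and this is not the PR's code:

```python
import numpy as np

def outer_to_window(k):
    """Convert an arbitrary 1-D integer index into a (slice, rel) pair:
    the slice is the min..max window to request from the backend, and
    rel re-extracts the requested elements from the in-memory window."""
    k = np.asarray(k)
    start = int(k.min())
    return slice(start, int(k.max()) + 1), k - start

window, rel = outer_to_window([5, 2, 2, 7])
a = np.arange(10) ** 2
print(a[window][rel])  # equals a[[5, 2, 2, 7]]
```

Unlike the gap-aware variant discussed earlier, this always reads a single window, trading extra I/O for simplicity.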

        if isinstance(key, indexing.VectorizedIndexer):
            raise NotImplementedError
        key = key.tuple

        # bands cannot be windowed but they can be listed
        band_key = key[0]