Documentation about multi_index and query #347

Open
michael-imbeault opened this issue Jun 23, 2020 · 2 comments

@michael-imbeault

I can't find any mention of multi_index or the query() method in the official docs. I've been using multi_index, but it outputs a lot more information than I need (positions in the array, then the values themselves). Is there a parameter to output just a list of results containing only the values, following the order of the slices? And what is the purpose of .query? Is there any more to it than just another way to read results instead of using A[:]?

Is there any efficiency benefit to using multi_index vs. looping through intervals using query or the normal [] read mode?

@ihnorton
Member

Hi @michael-imbeault, I will be taking a pass through the API docs this week to add some missing items, as well as fix a rendering issue preventing some docstrings from displaying. We also have documentation of multi_index specifically at https://docs.tiledb.com/main/api-usage/reading-arrays/multi-range-subarrays

Here is a summary for multi_index and query:

multi_index:

  • supports multiple sub-range queries per dimension and returns the cross-product of the specified ranges. Here is an example from the doc link above:
# slice subarrays [1,2]x[1,4] and [4,4]x[1,4]
A.multi_index[ [slice(1,2), 4], 1:4 ]
  • to expand on this: multi_index accepts a range (start:end), slice(start,end), a scalar index, or a list of slice objects and scalar indexes. For example:
A.multi_index[ [slice(1,2), 4], [slice(3,4), slice(5,6), 8] ]
  • multi_index operates over the full, inclusive domain of the array
  • ...results are endpoint inclusive, like TileDB core -- and unlike standard Python slicing (TileDB arrays may be defined with dimensions that have arbitrary float or int start/end-points, and multi_index allows querying such intervals)
  • ...this also means that there is no wrap-around for negative indexes to access the "last" element in the array
  • multi_index returns result coordinates for all dimensions, as separate named arrays (corresponding to the Dimension name)
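
To make the cross-product and inclusive-endpoint behaviour described above concrete, here is a small pure-Python model. It does not call TileDB at all; the helper names `to_inclusive_ranges` and `cross_product` are made up for illustration and are not part of the TileDB-Py API. It only shows how per-dimension selections could be read as inclusive (start, end) ranges and combined into subarrays.

```python
from itertools import product


def to_inclusive_ranges(selection):
    """Normalize one dimension's selection into inclusive (start, end) pairs.

    Hypothetical helper for illustration only, not a TileDB-Py function.
    """
    items = selection if isinstance(selection, list) else [selection]
    ranges = []
    for item in items:
        if isinstance(item, slice):
            # Both endpoints are kept: slice(1, 2) stands for 1 and 2.
            ranges.append((item.start, item.stop))
        else:
            # A scalar index stands for the single-point range [item, item].
            ranges.append((item, item))
    return ranges


def cross_product(*selections):
    """Return every subarray implied by the per-dimension selections."""
    return list(product(*(to_inclusive_ranges(s) for s in selections)))


# Models A.multi_index[ [slice(1,2), 4], 1:4 ]
subarrays = cross_product([slice(1, 2), 4], slice(1, 4))
# subarrays == [((1, 2), (1, 4)), ((4, 4), (1, 4))]

# Models A.multi_index[ [slice(1,2), 4], [slice(3,4), slice(5,6), 8] ]
more = cross_product([slice(1, 2), 4], [slice(3, 4), slice(5, 6), 8])
# Two row ranges times three column ranges gives six subarrays.
```

The first result matches the comment in the doc example: subarrays [1,2]x[1,4] and [4,4]x[1,4]. Note that, unlike `range(1, 2)` in plain Python, the range (1, 2) here covers both 1 and 2.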

.query:

  • the main purpose is to allow sub-selection on attributes: you pass a list of attributes, and only those attributes are queried. For example, this query will only return values for a and b, excluding any other attributes:
A.query(attrs=['a','b']).multi_index[...]

Is there any efficiency benefit to using multi_index vs. looping through intervals using query or the normal [] read mode?

For large multi-ranged queries, there can be a significant benefit to using multi_index, because TileDB is designed to efficiently fulfill such a query even for a very large number of ranges (parallelizing operations across multiple threads; storing range bounding boxes for tiles to optimize retrieval; selectively decompressing tiles; and other optimizations).

There can also be an efficiency benefit to using .query if you know that some attribute results will not be needed, because core TileDB will not retrieve data for those attributes at all, which reduces I/O and memory usage.

@michael-imbeault
Author

Ok, that's helpful. I did find https://docs.tiledb.com/main/api-usage/reading-arrays/multi-range-subarrays but it's a little barebones at the moment, and there is no mention of either multi_index or query in https://tiledb-inc-tiledb-py.readthedocs-hosted.com/en/stable/python-api.html.

I'll be using multi_index. My initial expectation was that it would return a list of numpy arrays corresponding to the slices, not a dict with a single array encompassing all the slices, which I have to parse using the coordinate arrays. Are there plans to include a simple, already-parsed output? The current behaviour makes sense for sparse arrays but seems suboptimal for dense arrays: creating those (potentially very large) coord arrays and keeping them in memory seems wasteful for some use cases.
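
For readers in the same situation, the parsing step described above might look roughly like the sketch below. It is pure Python with plain lists standing in for the NumPy arrays, the helper `split_by_row_ranges` is hypothetical, and the flattened `rows`/`values` data is invented for illustration; the exact layout of a real multi_index result is an assumption here.

```python
def split_by_row_ranges(rows, values, row_ranges):
    """Group values by the inclusive row range their coordinate falls in.

    Hypothetical helper, not part of TileDB-Py. `rows` plays the role of
    the coordinate array for one dimension, `values` the attribute array.
    """
    pieces = [[] for _ in row_ranges]
    for row, value in zip(rows, values):
        for i, (start, end) in enumerate(row_ranges):
            if start <= row <= end:  # endpoints are inclusive
                pieces[i].append(value)
                break
    return pieces


# Invented flattened result for row ranges [1,2] and [4,4], columns [1,2].
rows = [1, 1, 2, 2, 4, 4]
values = [10, 11, 20, 21, 40, 41]

pieces = split_by_row_ranges(rows, values, [(1, 2), (4, 4)])
# pieces == [[10, 11, 20, 21], [40, 41]]
```

This assumes the requested ranges do not overlap; with overlapping ranges, each value would land only in the first matching range.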

@ihnorton ihnorton self-assigned this Jun 25, 2020
ihnorton added a commit that referenced this issue Jun 26, 2020
- Don't return coords for dense multi_index by default (#347)
- Fix and test coords exclusion for sparse array queries
antalakas pushed a commit that referenced this issue Jul 6, 2020
- Don't return coords for dense multi_index by default (#347)
- Fix and test coords exclusion for sparse array queries
@ihnorton ihnorton added this to the 0.6.6 milestone Jul 21, 2020