Sub-array views? #545

Open
Hoeze opened this issue Apr 17, 2021 · 3 comments

Comments

@Hoeze

Hoeze commented Apr 17, 2021

Hi, is there some way to get a subarray view of a TileDB store?

My use case would be the following:

import tiledb as tdb
A = tdb.open("path")
print(A.domain)
# Domain(
#     Dim(name='chrom', domain=(None, None), tile=1, dtype=np.bytes_),
#     Dim(name='start', domain=(0, 18446744073709551614), tile=100000, dtype='uint64'),
#     Dim(name='gene_start', domain=(0, 18446744073709551614), tile=10000000, dtype='uint64'),
# )
# hypothetical method: this is the API being asked for, it does not exist
sub = A.method_that_returns_subarray(chrom="chr13", start=slice(0, 1000000), gene_start=slice(0, 1000000))

# now get unique dimension labels in the subarray:
chrom_idx = sub.unique_dim_values("chrom")
start_idx = sub.unique_dim_values("start")
gene_idx = sub.unique_dim_values("gene_start")
@ihnorton
Member

Hi @Hoeze, if I understand correctly, you want an object that can be indexed (e.g. with multi_index) only within the specified range(s), and on which you can call other tiledb.Array methods like unique_dim_values? Would you expect the selected data to be read into memory when this object is created, or only when it is indexed? The situation here is different from NumPy views, because NumPy arrays are already in memory.

I'm trying to understand the goal/use-case here, in order to prioritize.

@Hoeze
Author

Hoeze commented Apr 23, 2021

I would like to avoid loading the array into memory.
My main goal is to get existing coordinates inside a range and load the data in this range on-demand with dask.

@ihnorton
Member

> get existing coordinates inside a range and load the data in this range on-demand with dask.

If I am understanding correctly, you can do the first part (get only coords) with A.query(attrs=[]).multi_index[<your ranges>] which will only return the coordinates of data matching the index ranges (excludes all attribute data). Then you'll have to partition those coordinates, and do full reads for each partition on each node.
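
A rough sketch of that two-step approach, under some assumptions: the helper names (`partition_ranges`, `read_coords`, `read_partition`) are made up for illustration, the array is assumed to be sparse with the three dimensions shown in the first post, and the partitioning is done along `start` only. `multi_index` slices are inclusive on both ends, so the partition bounds are kept inclusive as well.

```python
from typing import List, Sequence, Tuple


def partition_ranges(values: Sequence[int], n_parts: int) -> List[Tuple[int, int]]:
    """Split the distinct coordinate values into at most n_parts contiguous,
    inclusive (low, high) ranges. Hypothetical helper, not part of tiledb."""
    uniq = sorted(set(values))
    if not uniq:
        return []
    n_parts = max(1, min(n_parts, len(uniq)))
    size = -(-len(uniq) // n_parts)  # ceiling division
    return [
        (uniq[i], uniq[min(i + size, len(uniq)) - 1])
        for i in range(0, len(uniq), size)
    ]


def read_coords(uri: str, chrom: str, start: Tuple[int, int], gene_start: Tuple[int, int]):
    """Step 1: fetch only the coordinates in the requested ranges (no attribute data)."""
    import tiledb  # imported lazily so the helper above works without tiledb installed

    with tiledb.open(uri) as A:
        return A.query(attrs=[]).multi_index[
            chrom, slice(*start), slice(*gene_start)
        ]


def read_partition(uri: str, chrom: str, start: Tuple[int, int], gene_start: Tuple[int, int]):
    """Step 2: full read (coordinates plus attributes) for one partition.
    Intended to run on a worker; the array is opened there, not shipped to it."""
    import tiledb

    with tiledb.open(uri) as A:
        return A.multi_index[chrom, slice(*start), slice(*gene_start)]


# Possible usage (not executed here; "path" is the placeholder URI from the first post):
#   coords = read_coords("path", "chr13", (0, 1000000), (0, 1000000))
#   parts = partition_ranges(coords["start"].tolist(), n_parts=8)
#   results = [read_partition("path", "chr13", p, (0, 1000000)) for p in parts]
```

Each `read_partition` call is independent of the others, so it should be possible to wrap the calls in something like `dask.delayed` instead of the plain list comprehension; that part is untested here.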
