
Writing sparse arrays with variable length attributes bug #494

Open · lunaroverlord opened this issue Mar 7, 2021 · 8 comments

lunaroverlord commented Mar 7, 2021

Consider this:

array_name = "test"
ctx = tiledb.Ctx()
dom = tiledb.Domain(
    tiledb.Dim(name="id", domain=(0, 10), dtype=np.int64),
    ctx=ctx
)
attr = tiledb.Attr(name="val", var=True, dtype=np.int64, ctx=ctx)
schema = tiledb.ArraySchema(domain=dom, sparse=True, attrs=[attr], ctx=ctx)
tiledb.SparseArray.create(array_name, schema)

vals = np.array([
    np.array([1, 2, 9], dtype=np.int64), 
    np.array([3, 4, 5], dtype=np.int64)
], dtype='O')

with tiledb.open(array_name, "w") as array:
    array[[1, 2]] = dict(val=vals)

>>> ValueError: value length (6) does not match coordinate length (2)

This only happens when the subarrays in vals all have the same length, so that their dimensions form a block shape. There's no issue with either of the following:

vals = np.array([
    np.array([1, 2], dtype=np.int64), 
    np.array([3, 4, 5], dtype=np.int64)
], dtype='O')


vals = np.array([
    np.array([1, 2, 9, 3], dtype=np.int64), 
    np.array([3, 4, 5], dtype=np.int64)
], dtype='O')

I think it's because NumPy coalesces object arrays whose subarrays are homogeneous (all the same length) into a single multi-dimensional block:

vals_hetero = np.array([
    np.array([1, 2], dtype=np.int64),
    np.array([3, 4, 5], dtype=np.int64)
], dtype='O')

vals_homo = np.array([
    np.array([1, 2, 9], dtype=np.int64),
    np.array([3, 4, 5], dtype=np.int64)
], dtype='O')

print(vals_hetero)
>>> [array([1, 2]) array([3, 4, 5])]

print(vals_homo)
>>> [[1 2 9]
     [3 4 5]]

print(vals_hetero.size, vals_homo.size) 
>>> 2 6

The exception is raised because TileDB relies on an attr_val.size check in libtiledb.pyx#L5241: the coalesced array reports a size of 6 rather than 2, which no longer matches the coordinate length.

Is there a workaround or an alternative way of constructing the object?

nguyenv self-assigned this Mar 7, 2021

nguyenv (Collaborator) commented Mar 10, 2021

Hi @lunaroverlord,

Apologies for the delayed reply. For now, a workaround that prevents the NumPy array from automatically coalescing into a multi-dimensional array is to append None (or an empty or non-homogeneous array) at the end:

vals = np.array(
    [np.array([1, 2, 9], dtype=np.int64), np.array([3, 4, 5], dtype=np.int64), None],
    dtype="O",
)

Then slice the last element out when writing to the TileDB array:

with tiledb.open(array_name, "w") as array:
    array[[1, 2]] = dict(val=vals[:-1])

We are going to see if we can add better support for this in the future so that we don't have to use this workaround.

Please let us know if you have any questions or comments.
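
An alternative construction (an editorial sketch, not from this thread, and not verified against TileDB's size check) sidesteps the coalescing without a sentinel element: pre-allocate the object array and assign its elements individually, which NumPy never coalesces.

import numpy as np

# Pre-allocating fixes the 1-D object shape up front, so element-wise
# assignment cannot coalesce the equal-length subarrays into a 2-D block.
vals = np.empty(2, dtype="O")
vals[0] = np.array([1, 2, 9], dtype=np.int64)
vals[1] = np.array([3, 4, 5], dtype=np.int64)

print(vals.size)  # 2, matching the coordinate length

In principle this should satisfy the attr_val.size check, since vals.size stays equal to the number of coordinates.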

nguyenv closed this as completed Mar 12, 2021
KalyanPalepu commented Jan 19, 2024

I'm encountering this bug now in 2024. Do you have a sense of whether it will be fixed soon?

nguyenv (Collaborator) commented Jan 19, 2024

This has not been a high priority since there is a workaround, as noted above. However, we can bump the priority given that a few users have now run into the problem.

llDev-Rootll commented

Friendly bump, +1. Also experiencing this bug, @nguyenv.

ihnorton reopened this Jan 28, 2024
ihnorton (Member) commented

Re-opening, although I can't give a timeline for an alternative solution. AFAICT there's no way to handle this through NumPy (because of the coalescing), so we'll probably need to provide some other input mechanism.

llDev-Rootll commented

I'm trying to write multi-attribute data to TileDB for TensorFlow model training. The model input/output combines variable-size sequential data with fixed-size image data. Because of the coalescing issue, the only approach that currently works is to store every modality in a separate TileDB array, which makes creating a TensorflowTileDBDataset slow. Do you have any other suggestions?

I cannot force my data to be of dtype object, as its generation is not under my control.

ihnorton (Member) commented Feb 2, 2024

Are you able to use this workaround?

llDev-Rootll commented

@ihnorton I am not, as I do not control the dataset generation process, and my dataset is too large to pre-process, since it also includes image data, which is likewise homogeneous.
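
For readers with the same constraint, one possible pre-write step (an editorial sketch; the helper name as_object_rows is hypothetical and this is untested against TileDB) is to re-box each row of the already-coalesced array into a 1-D object array just before writing:

import numpy as np

def as_object_rows(data_2d):
    # Re-box each row of a coalesced 2-D array into a 1-D object array,
    # so .size reports the row count instead of rows * columns.
    out = np.empty(data_2d.shape[0], dtype="O")
    for i in range(data_2d.shape[0]):
        out[i] = data_2d[i].copy()
    return out

# Stand-in for fixed-size data produced by an external pipeline.
images = np.arange(12, dtype=np.int64).reshape(2, 6)
vals = as_object_rows(images)
print(vals.size)  # 2

This touches the data only at write time, so the generation pipeline itself stays unchanged; the cost is one copy per row.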
