
Writing sparse arrays with variable length attributes bug #494

Open · lunaroverlord opened this issue Mar 7, 2021 · 8 comments

lunaroverlord commented Mar 7, 2021

Consider this:

array_name = "test"
ctx = tiledb.Ctx()
dom = tiledb.Domain(
    tiledb.Dim(name="id", domain=(0, 10), dtype=np.int64),
    ctx=ctx
)
attr = tiledb.Attr(name="val", var=True, dtype=np.int64, ctx=ctx)
schema = tiledb.ArraySchema(domain=dom, sparse=True, attrs=[attr], ctx=ctx)
tiledb.SparseArray.create(array_name, schema)

vals = np.array([
    np.array([1, 2, 9], dtype=np.int64), 
    np.array([3, 4, 5], dtype=np.int64)
], dtype='O')

with tiledb.open(array_name, "w") as array:
    array[[1, 2]] = dict(val=vals)

>>> ValueError: value length (6) does not match coordinate length (2)

This only happens when the subarrays in vals all have the same length, so that their dimensions form a block shape. There's no issue with either of the following:

vals = np.array([
    np.array([1, 2], dtype=np.int64), 
    np.array([3, 4, 5], dtype=np.int64)
], dtype='O')


vals = np.array([
    np.array([1, 2, 9, 3], dtype=np.int64), 
    np.array([3, 4, 5], dtype=np.int64)
], dtype='O')

I think it's because NumPy coalesces object arrays whose subarrays are homogeneous (all the same length) into a single multi-dimensional block:

vals_hetero = np.array([
    np.array([1, 2], dtype=np.int64),
    np.array([3, 4, 5], dtype=np.int64)
], dtype='O')

vals_homo = np.array([
    np.array([1, 2, 9], dtype=np.int64),
    np.array([3, 4, 5], dtype=np.int64)
], dtype='O')

print(vals_hetero)
>>> [array([1, 2]) array([3, 4, 5])]

print(vals_homo)
>>> [[1 2 9]
     [3 4 5]]

print(vals_hetero.size, vals_homo.size) 
>>> 2 6

The exception is raised because TileDB relies on an attr_val.size check in libtiledb.pyx#L5241: the coalesced array reports a size of 6 rather than 2, which no longer matches the coordinate length.

Is there a workaround or an alternative way of constructing the object?

nguyenv self-assigned this Mar 7, 2021

nguyenv (Collaborator) commented Mar 10, 2021

Hi @lunaroverlord,

Apologies for the delayed reply. For now, a workaround that prevents the NumPy array from automatically coalescing into a multi-dimensional array is to append None (or an empty or non-homogeneous array) at the end:

vals = np.array(
    [np.array([1, 2, 9], dtype=np.int64), np.array([3, 4, 5], dtype=np.int64), None],
    dtype="O",
)

Then slice the last element out when writing to the TileDB array:

with tiledb.open(array_name, "w") as array:
    array[[1, 2]] = dict(val=vals[:-1])

We are going to see if we can add better support for this in the future so that we don't have to use this workaround.

Please let us know if you have any questions or comments.
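
An alternative construction (an editorial sketch, not from this thread, and not verified against TileDB's size check) sidesteps the coalescing without a sentinel element: pre-allocate the object array and assign its elements individually, which NumPy never coalesces.

import numpy as np

# Pre-allocating fixes the 1-D object shape up front, so element-wise
# assignment cannot coalesce the equal-length subarrays into a 2-D block.
vals = np.empty(2, dtype="O")
vals[0] = np.array([1, 2, 9], dtype=np.int64)
vals[1] = np.array([3, 4, 5], dtype=np.int64)

print(vals.size)  # 2, matching the coordinate length

In principle this should satisfy the attr_val.size check, since vals.size stays equal to the number of coordinates.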

nguyenv closed this as completed Mar 12, 2021
KalyanPalepu commented Jan 19, 2024

I'm encountering this bug now in 2024. Do you have a sense of whether it will be fixed soon?

nguyenv (Collaborator) commented Jan 19, 2024

This has not been a high priority since there is a workaround, as noted above. However, we can bump the priority given that a few users have now run into the problem.

llDev-Rootll commented

Friendly bump, +1. Also experiencing this bug, @nguyenv.

ihnorton reopened this Jan 28, 2024
ihnorton (Member) commented

Re-opening, although I can't give a timeline for an alternative solution. AFAICT there's no way to handle this through NumPy (because of the coalescing), so we'll probably need to provide some other input mechanism.

llDev-Rootll commented

I'm trying to write multi-attribute data to TileDB for TensorFlow model training. The model input/output combines variable-size sequential data with fixed-size image data. Because of the coalescing issue, the only approach that currently works is to store every modality in a separate TileDB array, which makes creating a TensorflowTileDBDataset slow. Do you have any other suggestions?

I cannot force my data to be of dtype object, as its generation is not under my control.

ihnorton (Member) commented Feb 2, 2024

Are you able to use this workaround?

llDev-Rootll commented

@ihnorton I am not, as I do not control the dataset generation process, and my dataset is too large to pre-process, since it also includes image data, which is likewise homogeneous.
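
For readers with the same constraint, one possible pre-write step (an editorial sketch; the helper name as_object_rows is hypothetical and this is untested against TileDB) is to re-box each row of the already-coalesced array into a 1-D object array just before writing:

import numpy as np

def as_object_rows(data_2d):
    # Re-box each row of a coalesced 2-D array into a 1-D object array,
    # so .size reports the row count instead of rows * columns.
    out = np.empty(data_2d.shape[0], dtype="O")
    for i in range(data_2d.shape[0]):
        out[i] = data_2d[i].copy()
    return out

# Stand-in for fixed-size data produced by an external pipeline.
images = np.arange(12, dtype=np.int64).reshape(2, 6)
vals = as_object_rows(images)
print(vals.size)  # 2

This touches the data only at write time, so the generation pipeline itself stays unchanged; the cost is one copy per row.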
