Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[python/r] Nullability tracking (parent) #2858

Open
johnkerl opened this issue Aug 8, 2024 · 7 comments
Open

[python/r] Nullability tracking (parent) #2858

johnkerl opened this issue Aug 8, 2024 · 7 comments
Assignees
Labels

Comments

@johnkerl
Copy link
Member

johnkerl commented Aug 8, 2024

Context

This is split out from #2822. #2822 had a couple questions: one was answered there conclusivel, and the other turns out to be multi-faceted. This issue tracks the second.

Also note nullability for all attribute/column types is well-supported in TileDB Core; bugs here are strictly at the TileDB-SOMA level.

Purpose

Characterize and isolate nullability-related issues within TileDB-SOMA.

Individual issues will be split out, prioritized, assigned, and scheduled.

Coverage matrix

What does "null" mean in source data:

Surfaces to check:

  • Are we creating TileDB core schemas with attribute-level nullable=True in all cases where we should
  • Are we doing the right thing with writing data values

Who writes, and with from what source formats:

Column types:

References

@mdylan2
Copy link

mdylan2 commented Aug 8, 2024

One question related to the case where the Arrow schema is provided. How would one specify nullable=True for an attribute?

It doesn't seem like the Python API implementation for PlatformConfig supports nullable: https://github.com/single-cell-data/TileDB-SOMA/tree/main/apis/python#platform_config-format.

Repro script:

import tiledbsoma as soma
import tiledb
import os
import shutil
import pyarrow as pa

obs_schema = pa.schema([("soma_joinid", pa.int64()), ("barcode", pa.large_string()) ])
platform_config = {
    "tiledb": {
        "create": {
            "attrs": {
                "barcode": {
                    "filters": [{"_type": "ZstdFilter", "level": 9}],
                    "nullable": True # also tried string "true"                }
            }
        }
    }
}
exp_path = "./test"
exp = soma.Experiment.create(exp_path)
exp.add_new_dataframe(
    "obs",
    schema=obs_schema,
    index_column_names=["soma_joinid"],
    platform_config=platform_config
)
exp.close()

with tiledb.open(os.path.join(exp_path, "obs")) as arr:
    print(arr.schema)

shutil.rmtree(exp_path)

prints

ArraySchema(
  domain=Domain(*[
    Dim(name='soma_joinid', domain=(0, 2147483646), tile=2048, dtype='int64', filters=FilterList([ZstdFilter(level=3), ])),
  ]),
  attrs=[
    Attr(name='barcode', dtype='<U0', var=True, nullable=False, enum_label=None, filters=FilterList([ZstdFilter(level=9), ])),
  ],
  cell_order='row-major',
  tile_order='row-major',
  capacity=100000,
  sparse=True,
  allows_duplicates=False,
)

@johnkerl
Copy link
Member Author

johnkerl commented Aug 8, 2024

Hi @mdylan2 ! Great question! At the moment I'm typing up issues for the Python tiledbsoma.io cases. I don't have a write-up yet for the add_new_dataframe / "bring your own Arrow schema" case, but I do know there is something extra going on there. I've verified that (in a certain case) the core schema created does have the nullability flag set, and I believe there's just an intermediate miswiring. I'll be sure to directly address your case in an upcoming child issue umbrellaed under this parent task. And if we need to surface nullability at the PlatformConfig level, we'll have docs on that as well. And also we need clearer guidance to users on setting nullability at the Arrow-schema level.

@johnkerl
Copy link
Member Author

johnkerl commented Aug 8, 2024

A bit more info @mdylan2 : re https://gist.github.com/johnkerl/3a7473dc24974bcc47f7b8257a19bbdb

  • This is in Python
  • Although I haven't gisted it, I've verified the same in R using R's arrow package
  • It turns out in Python & R, fields are nullable by default -- you have to explicitly say nullable=False (and ☝️ you haven't)

So this is a bug of ours -- I'll isolate it -- thanks again for the repro script!

@johnkerl
Copy link
Member Author

johnkerl commented Aug 8, 2024

@mdylan2 found it -- this will be a quick fix -- more tomorrow!

@johnkerl
Copy link
Member Author

johnkerl commented Aug 9, 2024

@mdylan2 the issue is #2869 with PR #2868. This fix will go out with TileDB-SOMA 1.13.1 (if we do one) or else 1.14.0.

I've now established that the workaround for now is to set metadata like

pa.schema(
    [
        pa.field("x", pa.int32()),
    ],
    metadata={
        "x": "nullable",
    }
)

Please let me know if this resolves everything for you in your add_new_dataframe use-case.

@mdylan2
Copy link

mdylan2 commented Aug 13, 2024

That worked, thank you @johnkerl!

@johnkerl
Copy link
Member Author

johnkerl commented Aug 14, 2024

@mdylan2 -- update at #2857 (comment) -- regarding how to set up nullable booleans (same goes for ints/floats too I believe, & will test explicitly) -- at the point in time when you set up your obs/var Pandas dataframes for using tiledbsoma.io.. This is because the Pandas DataFrame objects we get as input data need to have been constructed following Pandas nullability conventions, and NumPy array-of-int has its own non-masked and masked versions.

For the path using tiledbsoma.Experiment.add_new_dataframe, where you bring your own Arrow schema and Arrow table, there of course the input data we get follows Arrow's nullability conventions.

Also, there's more work to do here -- which I'll track on this current issue -- even for things which are not bugs, involving:

  • More unit-test coverage for things which we are doing correctly
  • More document/notebook example material around nullability-handling for TileDB-SOMA's Python and R APIs

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants