
Cannot append Pandas dataframe to existing array #592

Open
Hoeze opened this issue Jun 11, 2021 · 12 comments
Comments

@Hoeze

Hoeze commented Jun 11, 2021

Hi, I'm trying to write an array like this:

# +
import json

import tiledb
import numpy as np
import pandas as pd
import random
# -

test_df = pd.DataFrame.from_records(json.loads('{"chrom":{"0":"chr1","1":"chr1","2":"chr1","3":"chr1","4":"chr1","5":"chr1","8":"chr1","9":"chr1"},"log10_len":{"0":1,"1":1,"2":1,"3":1,"4":1,"5":1,"8":0,"9":0},"start":{"0":10108,"1":10108,"2":10108,"3":10108,"4":10108,"5":10108,"8":10143,"9":10143},"end":{"0":10114,"1":10114,"2":10114,"3":10114,"4":10114,"5":10114,"8":10144,"9":10144},"ref":{"0":"AACCCT","1":"AACCCT","2":"AACCCT","3":"AACCCT","4":"AACCCT","5":"AACCCT","8":"T","9":"T"},"alt":{"0":"A","1":"A","2":"A","3":"A","4":"A","5":"A","8":"C","9":"C"},"sample_id":{"0":"A","1":"B","2":"C","3":"D","4":"E","5":"F","8":"A","9":"B"},"GT":{"0":1,"1":1,"2":1,"3":1,"4":1,"5":1,"8":1,"9":1},"GQ":{"0":79,"1":39,"2":60,"3":99,"4":26,"5":62,"8":22,"9":65},"DP":{"0":12,"1":9,"2":39,"3":26,"4":9,"5":9,"8":35,"9":34}}'))
test_df

output_path="test.tdb"

ctx = tiledb.default_ctx()
ctx

# +
genotype_domain = tiledb.Domain(
    tiledb.Dim(name="chrom", domain=(None,None), tile=1, dtype=np.bytes_, ctx=ctx),
    tiledb.Dim(name="log10_len", domain=(0, np.iinfo(np.int8).max), tile=1, dtype=np.int8, ctx=ctx),
    tiledb.Dim(name="start", domain=(0, np.iinfo(np.int32).max), tile=100000, dtype=np.int32, ctx=ctx),
    tiledb.Dim(name="alt", domain=(None,None), tile=None, dtype=np.bytes_, ctx=ctx),
#     tiledb.Dim(name="end", domain=(1, np.iinfo(np.int32).max), dtype=np.int32, ctx=ctx),
    tiledb.Dim(name="sample_id", domain=(None,None), tile=None, dtype=np.bytes_, ctx=ctx),
    ctx=ctx,
)

string_filters = tiledb.FilterList([tiledb.ZstdFilter(level=-1),])
int_filters = tiledb.FilterList([tiledb.ByteShuffleFilter(), tiledb.ZstdFilter(level=-1),])
attrs = [
    tiledb.Attr(name='end', dtype='int32', var=False, nullable=False, filters=int_filters),
    tiledb.Attr(name='ref', dtype='S', nullable=False, filters=string_filters),
    tiledb.Attr(name='GT', dtype='int8', var=False, nullable=False, filters=int_filters),
    tiledb.Attr(name='GQ', dtype='int32', var=False, nullable=True, filters=int_filters),
    tiledb.Attr(name='DP', dtype='int32', var=False, nullable=True, filters=int_filters),
]
# -

schema = tiledb.ArraySchema(
    domain=genotype_domain,
    attrs=attrs,
    sparse=True,
    cell_order="hilbert",
#     capacity=10000,
    ctx=ctx,
)
schema

if not tiledb.array_exists(output_path):
    print("Creating array at '%s'..." % output_path)
    tiledb.array.SparseArray.create(output_path, schema, ctx=ctx)

tiledb.from_dataframe(output_path, test, sparse=True, mode="append")

However, the last line causes the following error:

---------------------------------------------------------------------------
TileDBError                               Traceback (most recent call last)
<ipython-input-84-d6af3de39a7d> in <module>
----> 1 tiledb.from_dataframe(output_path, test, sparse=True, mode="append")

/opt/anaconda/envs/tiledb/lib/python3.8/site-packages/tiledb/dataframe_.py in from_dataframe(uri, dataframe, **kwargs)
    485     )
    486 
--> 487     from_pandas(uri, dataframe, **kwargs)
    488 
    489 

/opt/anaconda/envs/tiledb/lib/python3.8/site-packages/tiledb/dataframe_.py in from_pandas(uri, dataframe, **kwargs)
    575                 dataframe, column_infos, tiledb_args.get("fillna")
    576             )
--> 577             _write_array(
    578                 uri,
    579                 dataframe,

/opt/anaconda/envs/tiledb/lib/python3.8/site-packages/tiledb/dataframe_.py in _write_array(uri, df, write_dict, nullmaps, create_array, row_start_idx, timestamp)
    649                     coords.append(df.index.get_level_values(k))
    650             # TODO ensure correct col/dim ordering
--> 651             libtiledb._setitem_impl_sparse(A, tuple(coords), write_dict, nullmaps)
    652 
    653         else:

tiledb/libtiledb.pyx in tiledb.libtiledb._setitem_impl_sparse()

tiledb/libtiledb.pyx in tiledb.libtiledb._write_array()

tiledb/libtiledb.pyx in tiledb.libtiledb._raise_ctx_err()

tiledb/libtiledb.pyx in tiledb.libtiledb._raise_tiledb_error()

TileDBError: [TileDB::Writer] Error: Cannot set buffer; Input attribute/dimension 'GQ' is nullable

Is there some mistake in my code?

PS: I had to set sparse=True in from_dataframe to be able to write, although the schema is already present.

@ihnorton
Member

Hi Florian, currently, nullable attributes require using Pandas nullable types for input columns. The following diff works for me:

git diff py592.py.orig py592.py
diff --git a/py592.py.orig b/py592.py
index 42c86b4..dacabb5 100644
--- a/py592.py.orig
+++ b/py592.py
@@ -10,6 +10,8 @@ import random
 test_df = pd.DataFrame.from_records(json.loads('{"chrom":{"0":"chr1","1":"chr1","2":"chr1","3":"chr1","4":"chr1","5":"chr1","8":"chr1","9":"chr1"},"log10_len":{"0":1,"1":1,"2":1,"3":1,"4":1,"5":1,"8":0,"9":0},"start":{"0":10108,"1":10108,"2":10108,"3":10108,"4":10108,"5":10108,"8":10143,"9":10143},"end":{"0":10114,"1":10114,"2":10114,"3":10114,"4":10114,"5":10114,"8":10144,"9":10144},"ref":{"0":"AACCCT","1":"AACCCT","2":"AACCCT","3":"AACCCT","4":"AACCCT","5":"AACCCT","8":"T","9":"T"},"alt":{"0":"A","1":"A","2":"A","3":"A","4":"A","5":"A","8":"C","9":"C"},"sample_id":{"0":"A","1":"B","2":"C","3":"D","4":"E","5":"F","8":"A","9":"B"},"GT":{"0":1,"1":1,"2":1,"3":1,"4":1,"5":1,"8":1,"9":1},"GQ":{"0":79,"1":39,"2":60,"3":99,"4":26,"5":62,"8":22,"9":65},"DP":{"0":12,"1":9,"2":39,"3":26,"4":9,"5":9,"8":35,"9":34}}'))
 test_df

+test_df = test_df.astype({'GQ': pd.Int64Dtype(), 'DP': pd.Int64Dtype()})
+
 output_path="test.tdb"

 ctx = tiledb.default_ctx()
@@ -51,4 +53,4 @@ if not tiledb.array_exists(output_path):
     print("Creating array at '%s'..." % output_path)
     tiledb.array.SparseArray.create(output_path, schema, ctx=ctx)

-tiledb.from_dataframe(output_path, test, sparse=True, mode="append")
+tiledb.from_pandas(output_path, test_df, sparse=True, mode="append")

(I've looked at numpy masked arrays a bit as well, but they don't seem to be widely used, so we probably won't support them unless there's a strong use-case)

> I had to set sparse=True in from_dataframe to be able to write, although the schema is already present.

Thanks for pointing this out, will fix.

@ihnorton
Member

> I had to set sparse=True in from_dataframe to be able to write, although the schema is already present.

Will be fixed by #593.

@Hoeze
Author

Hoeze commented Jun 11, 2021

Ah, thanks for the hint!
Would it be possible to have automatic dtype conversion?
Including a df.astype(dtypes_from_tiledb_schema) call in from_dataframe would be a huge comfort gain here 😄
Otherwise, a hint in the error message would be useful:
`Input attribute/dimension 'GQ' is nullable but 'np.int32' is not!`
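To illustrate the suggestion, here is a hedged sketch of what such a conversion helper could look like. The names `dtypes_from_schema` and `attr_specs` are hypothetical and not part of the TileDB API; the idea is just to map nullable integer attributes (as one might read them from an ArraySchema) to the corresponding pandas extension dtypes before writing:

```python
import numpy as np
import pandas as pd

# Map numpy integer dtypes to the pandas nullable extension dtypes.
_NULLABLE_INTS = {
    np.dtype("int8"): "Int8",
    np.dtype("int16"): "Int16",
    np.dtype("int32"): "Int32",
    np.dtype("int64"): "Int64",
}

def dtypes_from_schema(attr_specs):
    """Hypothetical helper. attr_specs: dict of column name ->
    (numpy dtype, nullable flag), as could be derived from a schema."""
    out = {}
    for name, (dtype, nullable) in attr_specs.items():
        if nullable and dtype in _NULLABLE_INTS:
            out[name] = _NULLABLE_INTS[dtype]  # pandas nullable dtype
        else:
            out[name] = dtype
    return out

# Example: GQ and DP are declared int32/nullable in the schema above.
specs = {"GQ": (np.dtype("int32"), True), "DP": (np.dtype("int32"), True)}
df = pd.DataFrame({"GQ": [79, 39], "DP": [12, 9]})
df = df.astype(dtypes_from_schema(specs))
```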

> (I've looked at numpy masked arrays a bit as well, but they don't seem to be widely-used, so we probably won't support unless there's a strong use-case)

Yes, they will be refactored anyway after NEP 47 is done.
(For now, they're still useful when you don't want to carry a separate mask array around, because pandas does not provide multidimensional arrays.)

@ihnorton
Member

ihnorton commented Jun 11, 2021

Yes, I will improve the error message at the very least.

> Would it be possible to have automatic dtype conversion?
> Some df.astype(dtypes_from_tiledb_schema) included in from_dataframe would give a huge comfort gain here

These typically induce a copy, which can be expensive (memory-wise) for large input dataframes. I'm curious: what is the use-case for storing a plain int64 array in a nullable attribute?

> Yes, they will anyway be refactored after NEP 47 is done.

Thanks, I don't see any discussion of nullability/mask functionality in that document, so hopefully it's not too much of an afterthought (I haven't read all the links, though).

@ihnorton
Member

> plain int64 array

(in other words, no way to represent nullability)

@Hoeze
Author

Hoeze commented Jun 11, 2021

> These typically induce a copy, which can be expensive (memory wise) for large input dataframes. I'm curious what is the use-case for storing a plain int64 array in a nullable attribute?

In this very concrete example, we might have variant data without a corresponding genotype quality assigned.
This means I have to somehow represent a missing value, by either:

  • storing a float NA instead of an int
  • using a special integer that I interpret as "missing" in my code, e.g. -1
  • adding a separate "GT_missing" column

None of those solutions is really nice.
For example, I always go for the third solution, but it requires special handling everywhere in my code.
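For illustration, the trade-offs between these representations and a pandas nullable integer column can be sketched like this (plain pandas, not tied to TileDB):

```python
import pandas as pd

# Option 1: float NA -- the missing value silently changes the dtype.
gq_float = pd.Series([79, None, 60])  # dtype becomes float64

# Option 2: integer sentinel -- dtype stays int, but -1 needs special
# handling everywhere downstream.
gq_sentinel = pd.Series([79, -1, 60], dtype="int32")

# pandas nullable integers keep the integer dtype *and* the missing value.
gq_nullable = pd.Series([79, None, 60], dtype="Int32")
```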

Another example from xarray:
pydata/xarray#1194

Bottom line, I very rarely end up with a dataset that has no missing values at all.
People just keep implicitly storing those as "None", "float.NA" or similar because nullable types are not yet well supported in Python, in contrast to e.g. Spark.

That's also why people with "Null" as a last name might have a bad time 😁
https://www.wired.com/2015/11/null/

@ihnorton
Member

Thanks for the explanation, I agree it's a tricky issue. What I'm still unclear about is why to set the TileDB attribute as nullable if the input is always going to be np.int64 (because that's the only representation available, for the reasons you listed). If you are always casting to/from np.int64 then the nullability of this attribute/column is a no-op because the null/validity status will just be discarded.

@Hoeze
Author

Hoeze commented Jun 11, 2021

> What I'm still unclear about is why to set the TileDB attribute as nullable if the input is always going to be np.int64 (because that's the only representation available, for the reasons you listed).

Hm, I'm not sure I got you right:
If I had pd.Series([1, 2, 3, 4, None], dtype="Int32", name="GQ"), I would be able to write it to the array, right?

Or is your point about how to store plain numpy arrays in conjunction with a boolean mask?

> If you are always casting to/from np.int64 then the nullability of this attribute/column is a no-op because the null/validity status will just be discarded.

Exactly, when the dtype already matches, it's not doing anything.
That's why an implicit call to .astype() inside TileDB would make sense IMO.

@Hoeze
Author

Hoeze commented Jun 11, 2021

In general, I believe you should aim for first-class TileDB interop with Apache Arrow.
If you have comprehensive interop with it, people can figure out how to represent their data with Arrow types themselves.
=> no fiddling with nullable numpy types.
Also, you get multi-language support for free.

For example, a very big advantage of Parquet is that you can read/write the same dataframe in literally every language that supports Apache Arrow. Now imagine replacing Parquet with TileDB 😁

@ihnorton
Member

What I'm specifically trying to understand is why you want to create Attr(name="GQ", nullable=True) when the input is (only?) np.int64. How do you expect us to read it back? As an np.int64 array - then why set nullable=True? Currently we read back nullable attributes as Pandas nullable types, because otherwise it would be a (potentially) lossy read, dropping the semantic nulls (if any).


Re Arrow, yes, agreed. We have to/from support for Arrow buffers in TileDB core, which we use internally by default in TileDB-Py for operations creating a Pandas dataframe (Array.open_dataframe / Array.df[]). We are working on exposing this in general (to_arrow/from_arrow in Python). There are some types we can't directly represent right now (e.g. lists and structs), although there are ways to work around that pretty efficiently once the lower-level buffers are exposed; and we will continually expose more features as they are added in TileDB core.

@Hoeze
Author

Hoeze commented Jun 11, 2021

> What I'm specifically trying to understand is why you want to create Attr(name="GQ", nullable=True) when the input is (only?) np.int64. How do you expect us to read it back? As an np.int64 array - then why set nullable=True? Currently we read back nullable attributes as Pandas nullable types, because otherwise it would be a (potentially) lossy read, dropping the semantic nulls (if any).

As I mentioned previously, "GQ" can be missing depending on my data source.
JSON is schema-free, so in my code example Pandas happens to infer int64 as the data type for GQ.
E.g. if one "GQ" value were missing, the dtype of "GQ" would be inferred as float:

import json
test_df = pd.DataFrame.from_records(json.loads('{"chrom":{"0":"chr1","1":"chr1","2":"chr1","3":"chr1","4":"chr1","5":"chr1","8":"chr1","9":"chr1"},"log10_len":{"0":1,"1":1,"2":1,"3":1,"4":1,"5":1,"8":0,"9":0},"start":{"0":10108,"1":10108,"2":10108,"3":10108,"4":10108,"5":10108,"8":10143,"9":10143},"end":{"0":10114,"1":10114,"2":10114,"3":10114,"4":10114,"5":10114,"8":10144,"9":10144},"ref":{"0":"AACCCT","1":"AACCCT","2":"AACCCT","3":"AACCCT","4":"AACCCT","5":"AACCCT","8":"T","9":"T"},"alt":{"0":"A","1":"A","2":"A","3":"A","4":"A","5":"A","8":"C","9":"C"},"sample_id":{"0":"A","1":"B","2":"C","3":"D","4":"E","5":"F","8":"A","9":"B"},"GT":{"0":1,"1":1,"2":1,"3":1,"4":1,"5":1,"8":1,"9":1},"GQ":{"0":79,"2":60,"3":99,"4":26,"5":62,"8":22,"9":65},"DP":{"0":12,"1":9,"2":39,"3":26,"4":9,"5":9,"8":35,"9":34}}'))
test_df

(screenshot: test_df with the GQ column inferred as float64)

However, I defined in the TileDB schema that the column should be int32 (nullable=True).
That's why I'd expect to be able to write anything to the array that can be cast to pd.Series(dtype="Int32").
When I read it back, I expect to also obtain a pd.Series(dtype="Int32") in the dataframe case.
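The float64 column that pandas infers from such JSON can be round-tripped to the nullable integer type with a plain astype call. A minimal sketch (the TileDB write itself is not shown):

```python
import pandas as pd

# GQ inferred as float64 because one value is missing in the JSON input.
gq = pd.Series([79.0, None, 60.0, 99.0], name="GQ")

# Cast to the pandas nullable integer dtype expected by a
# nullable int32 TileDB attribute; NaN becomes pd.NA.
gq_int = gq.astype("Int32")
```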

With numpy, it's trickier. There you need some special handling, e.g. returning a Tuple[array[int32], array[bool]].
Otherwise, I could also work with pyarrow.array() as a return type :)
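A minimal sketch of that Tuple[array[int32], array[bool]] representation in plain numpy (purely illustrative, not TileDB's actual return type):

```python
import numpy as np

def to_values_and_mask(data):
    """Split a list with None entries into (values, validity) arrays.
    Missing slots are filled with 0; validity marks which slots are real."""
    validity = np.array([x is not None for x in data], dtype=bool)
    values = np.array([x if x is not None else 0 for x in data],
                      dtype=np.int32)
    return values, validity

values, validity = to_values_and_mask([79, None, 60])
```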

@ihnorton
Member

Got it! Apologies for belaboring the point, and I appreciate the explanation.
