Objects created by the R or Python bindings should have identical metadata, but currently the R and Python packages tag SOMA objects with different and incompatible metadata values for dataset_type, soma_encoding_version and soma_object_type.
The Python package creates objects with Unicode strings (e.g., "dataset_type": "soma")
The R package creates objects with byte arrays (e.g., "dataset_type": b"soma")
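To illustrate why this difference matters to a Python reader, here is a minimal sketch using only the two example values quoted above. The `is_soma` helper is hypothetical and stands in for any end-user check on the tag; it is not part of tiledbsoma.

```python
# Metadata as each binding is reported to write it (example values from this issue).
python_written = {"dataset_type": "soma"}
r_written = {"dataset_type": b"soma"}

# In Python 3, str and bytes never compare equal, even with identical content.
same_value = python_written["dataset_type"] == r_written["dataset_type"]
print(same_value)  # False


def is_soma(meta):
    """Hypothetical end-user check: is this object tagged as a SOMA dataset?"""
    return meta.get("dataset_type") == "soma"


print(is_soma(python_written))  # True
print(is_soma(r_written))  # False
```

The same naive check passes for the Python-written object and fails for the R-written one, even though both are meant to carry the same tag.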
I also checked directly reading from S3, i.e., not using the tiledb:// URI, and the result is the same.
Whether a "string" or a "byte array" is right, I think it is reasonably clear that there is a bug here - the mandatory metadata tags should be identical no matter which ingestion system is used, and which package is used to read them back.
Side note: the current Python package seems to have a work-around for this, as it detects byte array metadata and decodes it as utf-8. This is nice, but doesn't seem like the right answer, as it requires any other user of that metadata (e.g., end-user code) to do the same encoding/decoding step for any/all metadata values.
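For reference, the extra step that every other consumer of the metadata would need looks roughly like the sketch below. It is an assumption about the general shape of such a work-around, not the actual tiledbsoma code, and `decode_metadata` is a hypothetical name.

```python
from typing import Any, Mapping


def decode_metadata(meta: Mapping[str, Any]) -> dict[str, Any]:
    """Return a copy of meta with any bytes values decoded as UTF-8 strings."""
    decoded = {}
    for key, value in meta.items():
        if isinstance(value, (bytes, bytearray)):
            # Byte-array values (as written by R in this issue) become str.
            value = bytes(value).decode("utf-8")
        decoded[key] = value
    return decoded


r_written = {"dataset_type": b"soma"}
normalized = decode_metadata(r_written)
print(normalized)  # {'dataset_type': 'soma'}
```

Values that are already strings pass through unchanged, so the helper is safe to apply to objects written by either package.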
In my opinion, we should be using utf-8 everywhere (and document that in the SOMA spec), but at a minimum, we should have common behavior across all reader/writer code.
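One way to pin down "common behavior" is a small conformance check that could be run against objects written by each binding. The sketch below assumes the metadata has already been read into a plain dict; `non_string_tags` is a hypothetical helper, and the values shown for soma_encoding_version and soma_object_type are placeholders, not real values.

```python
# The three mandatory tags named in this issue.
MANDATORY_TAGS = ("dataset_type", "soma_encoding_version", "soma_object_type")


def non_string_tags(meta):
    """Return the mandatory tags that are missing or not stored as str."""
    return [tag for tag in MANDATORY_TAGS if not isinstance(meta.get(tag), str)]


# Placeholder values for illustration only.
all_strings = {
    "dataset_type": "soma",
    "soma_encoding_version": "<version>",
    "soma_object_type": "<object type>",
}
mixed = dict(all_strings, dataset_type=b"soma")

print(non_string_tags(all_strings))  # []
print(non_string_tags(mixed))  # ['dataset_type']
```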
tiledbsoma.__version__ 1.11.4
TileDB-Py version 0.29.0
TileDB core version (tiledb) 2.23.0
TileDB core version (libtiledbsoma) 2.23.0
python version 3.11.9.final.0
OS version Linux 6.8.0-76060800daily20240311-generic
johnkerl changed the title from "R and Python create groups with incompatible SOMA metadata" to "[r/python] R and Python create groups with incompatible SOMA metadata" on Aug 15, 2024
Using TileDB-Py to inspect two arrays:
When the array is created by Python (array info):
When the array is created by R (array info):