Wrong type conversion for columns of DaskDataFrame in Xenium #25

Closed
LucaMarconato opened this issue Mar 15, 2023 · 1 comment · Fixed by #40
Comments


LucaMarconato commented Mar 15, 2023

Working on a xenium notebook I noticed this:

xe_rep1_sdata.points['transcripts']['feature_name'].cat.as_known().compute()

gives

0                              b'RUNX1'
1                              b'ITGAX'
2                               b'TCIM'
3                               b'TCIM'
4                                b'LUM'
                        ...            
126857385                      b'HMGA1'
126857386                      b'CLIC6'
126857387                       b'SQLE'
126857388                        b'DST'
126857389    b'NegControlCodeword_0529'
Name: feature_name, Length: 126857390, dtype: category
Categories (541, object): ['b'ABCC11'', 'b'ACTA2'', 'b'ACTG2'', 'b'ADAM9'', ..., 'b'WARS'', 'b'ZEB1'', 'b'ZEB2'', 'b'ZNF562'']

I haven't checked, but I suspect that the xenium() function in this repo converts the categories from bytes objects to strings with a plain string conversion, which keeps the b'...' wrapper in the text. We should call .decode('utf-8') or something similar instead.

The bug could also be in the io when writing/reading to/from parquet. We need to check.


LucaMarconato commented May 20, 2023

The problem is in this conversion to str of a column containing bytes values (from the points parser):

if feature_key is not None:
    feature_categ = dd.from_pandas(
        data[feature_key].astype(str).astype("category"),
        **kwargs,
    )  # type: ignore[attr-defined]
    table[feature_key] = feature_categ
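
To illustrate why the conversion above produces names such as b'RUNX1', here is a minimal sketch in plain Python. It assumes that .astype(str) on an object column applies str() to each value, which is consistent with the output in the first comment; the variable names are only for illustration.

```python
# A single gene name as it arrives from the Xenium file: a bytes object.
raw = b"RUNX1"

# str() on bytes returns the printable representation, including the b'...' wrapper.
# Assumption: .astype(str) on an object column does this for every value.
as_text = str(raw)

# Decoding interprets the bytes as UTF-8 text and yields the actual name.
decoded = raw.decode("utf-8")

print(as_text)  # b'RUNX1'
print(decoded)  # RUNX1
```

The first result is a 8-character string that merely looks like a bytes literal, which is why the categories end up as 'b'ABCC11'' and so on.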

I will fix this by decoding the bytes in spatialdata-io (in _get_points() from xenium.py), so that the parser already receives strings.
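
A rough sketch of what decoding before the parser could look like, using pandas only. This is not the code merged in #40, which is not shown in this thread; decode_bytes_column is a hypothetical helper and the small transcripts frame is made-up sample data reusing names from the output above.

```python
import pandas as pd


def decode_bytes_column(df: pd.DataFrame, column: str, encoding: str = "utf-8") -> pd.DataFrame:
    """Hypothetical helper: return a copy of df with bytes values in `column` decoded to str."""
    out = df.copy()
    # Decode only values that are bytes; leave anything else (e.g. already str) untouched.
    out[column] = out[column].map(
        lambda v: v.decode(encoding) if isinstance(v, bytes) else v
    )
    return out


# Made-up sample standing in for the transcripts table read from the Xenium output.
transcripts = pd.DataFrame({"feature_name": [b"RUNX1", b"ITGAX", b"TCIM", b"TCIM"]})

decoded = decode_bytes_column(transcripts, "feature_name")

# The same conversion the points parser applies; it is now harmless because
# the values are already strings.
feature_categ = decoded["feature_name"].astype(str).astype("category")
print(list(feature_categ.cat.categories))
```

With the values decoded first, the parser's .astype(str) leaves them unchanged and the categories no longer carry the b'...' wrapper.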
