Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🎨 Self-contained definition of CellxGene schema #2412

Merged
merged 6 commits into from
Feb 1, 2025
Merged

Conversation

falexwolf
Copy link
Member

@falexwolf falexwolf commented Feb 1, 2025

Is part of a sequence of PRs that refactors the curators:


This PR simplifies the definition of constraints to validate against the CxG schema:

Before After
It was implicitly defined by the state of another database instance passed via the using_key parameter to AnnDataCurator. It is now self-contained in the CellxGeneAnnDataCurator class.

It also brings down the run time of the CxG test further down from 50s to 20s:

============================= slowest 50 durations =============================
37.28s call     tests/curators/test_pert_curator.py::test_pert_curator
34.08s call     tests/curators/test_cat_curators.py::test_spatialdata_curator
19.95s call     tests/curators/test_cxg_curator.py::test_cxg_curator
6.86s call     tests/curators/test_cat_curators.py::test_df_curator
6.01s call     tests/curators/test_cat_curators.py::test_soma_curator
5.71s call     tests/curators/test_cat_curators.py::test_soma_curator_genes_columns

Collateral:

  • I'm removing the _check_valid_keys() utils function because throwing an error upon Curator init upon missing columns/features is problematic: we do want to be able to validate datasets with missing columns and then receive this as a SchemaError during .validate(). This is a common type of error.
  • Fixed the signatures for Feature, ULabel & Schema to now include type and is_type.

@falexwolf falexwolf changed the title 🎨 Eliminate the need to pass laminlabs/cellxene via using_key to Cell… 🎨 Eliminate the need to pass laminlabs/cellxene via using_key to CellxGene Curator Feb 1, 2025
@falexwolf falexwolf changed the title 🎨 Eliminate the need to pass laminlabs/cellxene via using_key to CellxGene Curator 🎨 Eliminate the need to pass laminlabs/cellxgene via using_key to CellxGene Curator Feb 1, 2025
)
len(nonval_keys)
# throwing an error here is problematic because my dataset might have missing columns
# and I still want to define sources mappings on the Curator level
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know who wrote this, so FYI @Zethson @sunnyosun -- I don't think this makes sense for the reason defined in the comment

We definitely want to be able to pass datasets with missing columns to the Curator constructor without raising an error

@falexwolf falexwolf changed the title 🎨 Eliminate the need to pass laminlabs/cellxgene via using_key to CellxGene Curator 🎨 Self-contained definition of CellxGene schema / validation constraints Feb 1, 2025
@falexwolf falexwolf merged commit d7069ae into main Feb 1, 2025
14 of 15 checks passed
@falexwolf falexwolf deleted the cxgcurator branch February 1, 2025 11:54
@falexwolf falexwolf changed the title 🎨 Self-contained definition of CellxGene schema / validation constraints 🎨 Self-contained definition of CellxGene schema Feb 1, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant