Add `sublayers` #707

adamgayoso · 2022-02-11T17:28:59Z

Tl;dr -- add adata.sublayers, which acts just like layers, except that each adata.sublayers["my_sublayer"] covers only a subset of adata.var_names.

Intermediate cached matrices

In a lot of workflows, there are intermediate matrices that are used for one-time applications. As an example:

scanpy.pp.scale(adata) standard-scales the var dimension, but the transformed data is only used for PCA.
scale creates a heavy dense matrix, but PCA almost always uses a subset of vars as input, so there is no need to keep all dimensions of the scale output.
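To make the memory argument concrete, here is a minimal NumPy-only sketch. The `standard_scale` helper and the variance-based `hvg_mask` are simplified stand-ins invented for illustration, not scanpy's actual implementation of scaling or highly variable gene selection.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_vars = 100, 2000
X = rng.poisson(1.0, size=(n_obs, n_vars)).astype(np.float32)

# Hypothetical HVG mask: keep the 200 highest-variance columns.
hvg_mask = np.zeros(n_vars, dtype=bool)
hvg_mask[np.argsort(X.var(axis=0))[-200:]] = True


def standard_scale(M):
    """Zero-mean, unit-variance scaling per column (var)."""
    mean = M.mean(axis=0)
    std = M.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero on constant columns
    return (M - mean) / std


scaled_all = standard_scale(X)               # dense, n_obs x n_vars
scaled_hvg = standard_scale(X[:, hvg_mask])  # dense, n_obs x 200

# Bytes held by each dense result.
print(scaled_all.nbytes, scaled_hvg.nbytes)
```

Because scaling is per column, scaling only the subset gives the same values as scaling everything and then subsetting, at a fraction of the memory.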

Multimodal

We also have a common use case where we measure two aspects of our data (e.g., protein and RNA measurements). These measurements are made for every cell. Currently:

Paired secondary modalities are put in .obsm out of convenience
.obsm is not supported for plotting by scanpy

With sublayers, users can explicitly choose which adata var dimensions should be used for certain tasks like PCA. This effectively caches a subset of the data (along var dimension).

API example (with imagined scanpy api changes):

# hvg_genes is a list/mask/etc representing a subset of `adata.var_names`
adata.add_sublayer(name="hvg_scaled", layer="unfiltered_unscaled", var_names=hvg_genes)
adata.sublayers["hvg_scaled"] # the actual data as dense/sparse/pandas etc
sc.pp.scale(adata, layer="hvg_scaled") # no name clashes are allowed between layers and sublayers
sc.tl.pca(adata, layer="hvg_scaled")
adata.sublayers.var_names["hvg_scaled"] # pd.Index for sublayer associated var names
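The semantics implied by the example above can be sketched without anndata. Everything below is hypothetical: the `Sublayers` class, its `add` method, and the validation rules are one guess at how the proposal could behave, not an existing anndata API. Plain lists stand in for the pd.Index mentioned above.

```python
import numpy as np


class Sublayers:
    """Hypothetical container mapping a name to (matrix, var_names subset)."""

    def __init__(self, n_obs, parent_var_names, layer_names=()):
        self._n_obs = n_obs
        self._parent_vars = set(parent_var_names)
        self._layer_names = set(layer_names)
        self._data = {}
        self.var_names = {}  # name -> list of var names for that sublayer

    def add(self, name, matrix, var_names):
        var_names = list(var_names)
        # No name clashes are allowed between layers and sublayers.
        if name in self._layer_names:
            raise ValueError(f"{name!r} clashes with an existing layer")
        # A sublayer may only use var names that exist in the parent.
        unknown = set(var_names) - self._parent_vars
        if unknown:
            raise ValueError(f"not in parent var_names: {sorted(unknown)}")
        # Sublayers share the obs dimension with the parent.
        if matrix.shape != (self._n_obs, len(var_names)):
            raise ValueError("matrix shape must be (n_obs, len(var_names))")
        self._data[name] = matrix
        self.var_names[name] = var_names

    def __getitem__(self, name):
        return self._data[name]


parent_vars = ["g0", "g1", "g2", "g3"]
X = np.arange(12, dtype=float).reshape(3, 4)
hvg = ["g1", "g3"]
cols = [parent_vars.index(g) for g in hvg]

sub = Sublayers(n_obs=3, parent_var_names=parent_vars, layer_names=["counts"])
sub.add("hvg_scaled", X[:, cols], hvg)
print(sub["hvg_scaled"].shape, sub.var_names["hvg_scaled"])
```

The key constraint is that the obs dimension is shared with the parent while the var dimension is free to shrink, which is what separates a sublayer from both a layer and an .obsm entry.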

This could also provide a way out of adata.raw. In the future, read_h5ad could compare len(var_names) with len(raw.var_names) and then make adata.raw an alias to the appropriate (sub)layer (most likely a layer, as raw is often longer in scanpy workflows).

For scvi-tools, this would be great, as we only want the count data on a subset of var_names, but don't want users creating a whole new anndata just to work with us.

Just use MuData?

On the other hand, we could defer to MuData, though there are probably ways to make that API smoother when we know we have these constraints. Related: scverse/mudata#13


Munfred · 2022-02-11T19:41:21Z

I think this would be useful. The same way it is suggested there should be a var sublayer, there could be an obs sublayer. My feeling is that obs and var should be symmetrical.

I would suggest that in adata.add_sublayer you have to specify obs or var when creating a sublayer (var by default), or that there be two distinct kinds of sublayers (e.g., adata.sobs and adata.svar for obs sublayers and var sublayers).

For my workflow in particular having obs sublayers would make it handy to access subsets of cells. Frequently what I end up doing is performing a query to select groups of interest and then storing the results as True/False in another obs column. Having obs sublayers would be another way to do this without cluttering the obs dataframe itself.
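As a rough illustration of that workflow, the sketch below keeps named cell subsets in a separate mapping rather than as extra True/False columns. The `obs_subsets` dict and the toy `cell_type` / `n_counts` arrays are made up for the example and are not part of anndata.

```python
import numpy as np

# Toy per-cell annotations standing in for columns of adata.obs.
cell_type = np.array(["T", "B", "T", "NK", "B"])
n_counts = np.array([1200, 300, 4500, 800, 2500])

# Toy data matrix: 5 cells x 3 vars.
X = np.arange(15).reshape(5, 3)

# Hypothetical store for named obs subsets, kept outside the obs table.
obs_subsets = {}
obs_subsets["t_cells_high_counts"] = (cell_type == "T") & (n_counts > 1000)

# Accessing the subset later is a single boolean-mask lookup.
X_sub = X[obs_subsets["t_cells_high_counts"]]
print(X_sub.shape)
```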

vitkl · 2022-02-12T17:26:55Z

I also find this proposal useful for two reasons:

Handling multimodal data in scvi-tools.
I find that most collaborators don't store the raw integer counts for all genes (or any raw counts at all) because the current workflow requires all layers to have the same number of features and encourages storing normalised/scaled data in .raw. If I understand correctly, this proposal also solves that long-standing issue of keeping different feature subsets in the same object.

adamgayoso · 2022-05-09T15:38:34Z

Just adding to this, sublayers could also be generated lazily on access instead of being cached, e.g. via .add_sublayer(..., cache=False).

With lazy generation, this could be seen as a way to better store relevant subsets of data for users when, e.g., plotting, as @Munfred said.
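One way to picture the cached versus lazy distinction is the sketch below. The `LazySublayer` class, its `get` method, and the `n_materializations` counter are all hypothetical names used only to show the trade-off; nothing like this exists in anndata today.

```python
import numpy as np


class LazySublayer:
    """Hypothetical sublayer that stores column indices, not data."""

    def __init__(self, parent, col_idx, cache=False):
        self._parent = parent
        self._col_idx = np.asarray(col_idx)
        self._cache = cache
        self._cached = None
        self.n_materializations = 0  # counts how often the subset is built

    def get(self):
        if self._cache and self._cached is not None:
            return self._cached
        self.n_materializations += 1
        data = self._parent[:, self._col_idx]  # integer indexing copies
        if self._cache:
            self._cached = data
        return data


parent = np.arange(20.0).reshape(4, 5)
lazy = LazySublayer(parent, [0, 2], cache=False)
cached = LazySublayer(parent, [0, 2], cache=True)

for _ in range(2):
    lazy.get()
    cached.get()

print(lazy.n_materializations, cached.n_materializations)
```

In this sketch the lazy variant costs no extra memory between accesses and always reflects the current parent matrix, while the cached variant pays the memory once and keeps returning the copy it made on first access.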

Sublayers could also be spun out into new anndatas:

bdata = adata.sublayer_to_adata(sublayer_key) # new anndata appropriately subset according to sublayer

github-actions bot added the stale label Jun 21, 2023

flying-sheep added enhancement and removed stale labels Jun 22, 2023

scverse deleted a comment from github-actions bot Jun 22, 2023

flying-sheep added the topic: api label Jun 22, 2023
