Add sublayers #707

Open
adamgayoso opened this issue Feb 11, 2022 · 3 comments

adamgayoso commented Feb 11, 2022

TL;DR: add adata.sublayers, which acts just like layers, except that each adata.sublayers["my_sublayer"] covers only a subset of adata.var_names.

Intermediate cached matrices

In a lot of workflows, there are intermediate matrices that are used for one-time applications. As an example:

  • scanpy.pp.scale(adata) standardizes each var (zero mean, unit variance), but the transformed data is only used for PCA
  • scale creates a large dense matrix, but PCA almost always takes a subset of vars as input, so there is no need to keep every var dimension of the scale output
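
To make the cost concrete, here is a minimal NumPy sketch (not scanpy's implementation) that standardizes only a subset of vars. The toy array sizes, the hvg_mask name, and the scale_subset helper are assumptions made for illustration.

```python
import numpy as np

def scale_subset(X, var_mask):
    """Standardize only the selected var columns (hypothetical helper)."""
    sub = np.asarray(X[:, var_mask], dtype=np.float64)
    mean = sub.mean(axis=0)
    std = sub.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    return (sub - mean) / std

rng = np.random.default_rng(0)
# Toy counts: 500 obs x 2000 vars (sizes are arbitrary).
X = rng.poisson(1.0, size=(500, 2000)).astype(np.float32)

# Assumption: pretend the first 200 vars are the highly variable ones.
hvg_mask = np.zeros(X.shape[1], dtype=bool)
hvg_mask[:200] = True

scaled = scale_subset(X, hvg_mask)
full_dense_bytes = X.shape[0] * X.shape[1] * 8  # a float64 copy of every var

print(scaled.shape)                      # (500, 200)
print(scaled.nbytes / full_dense_bytes)  # 0.1
```

Only the 200 selected columns are ever densified, so the scaled matrix takes a tenth of the memory a full float64 copy would need in this toy setup.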

Multimodal

We also have a common use case where we measure two aspects of our data (e.g., protein and RNA measurements). These measurements are made for every cell. Currently:

  • Paired secondary modalities are put in .obsm out of convenience
  • .obsm is not supported for plotting by scanpy

With sublayers, users can explicitly choose which adata var dimensions should be used for certain tasks like PCA. This effectively caches a subset of the data (along var dimension).

API example (with imagined scanpy api changes):

# hvg_genes is a list/mask/etc representing a subset of `adata.var_names`
adata.add_sublayer(name="hvg_scaled", layer="unfiltered_unscaled", var_names=hvg_genes)
adata.sublayers["hvg_scaled"] # the actual data as dense/sparse/pandas etc
sc.pp.scale(adata, layer="hvg_scaled") # no name clashes are allowed between layers and sublayers
sc.tl.pca(adata, layer="hvg_scaled")
adata.sublayers.var_names["hvg_scaled"] # pd.Index for sublayer associated var names
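
As a rough illustration of the behaviour described above, the following sketch uses a plain NumPy stand-in rather than anndata. SubLayerStore and sublayer_var_names are hypothetical names, not part of any existing API, and var names are kept as a NumPy array instead of a pd.Index to keep the example dependency-light.

```python
import numpy as np

class SubLayerStore:
    """Hypothetical stand-in for the proposed adata.sublayers (not an anndata API)."""

    def __init__(self, var_names, layers):
        self.var_names = np.asarray(var_names)
        self.layers = dict(layers)   # layer name -> (n_obs, n_vars) array
        self._data = {}              # sublayer name -> (n_obs, n_subset) array
        self._var_names = {}         # sublayer name -> subset of var names

    def add_sublayer(self, name, layer, var_names):
        # No name clashes are allowed between layers and sublayers.
        if name in self.layers or name in self._data:
            raise KeyError(f"{name!r} clashes with an existing layer or sublayer")
        pos = {v: i for i, v in enumerate(self.var_names.tolist())}
        wanted = list(var_names)
        missing = [v for v in wanted if v not in pos]
        if missing:
            raise KeyError(f"not in var_names: {missing}")
        # Column positions of the requested names, in the requested order.
        idx = np.array([pos[v] for v in wanted], dtype=int)
        self._data[name] = self.layers[layer][:, idx].copy()
        self._var_names[name] = np.asarray(wanted)

    def __getitem__(self, name):
        return self._data[name]

    def sublayer_var_names(self, name):
        return self._var_names[name]

store = SubLayerStore(
    ["g1", "g2", "g3", "g4"],
    {"unfiltered_unscaled": np.arange(12.0).reshape(3, 4)},
)
store.add_sublayer("hvg", layer="unfiltered_unscaled", var_names=["g4", "g2"])
print(store["hvg"].tolist())            # [[3.0, 1.0], [7.0, 5.0], [11.0, 9.0]]
print(store.sublayer_var_names("hvg"))  # ['g4' 'g2']
```

Copying the subset at creation time is one possible design; a real implementation might instead hold a view or an index into the parent layer.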

This could effectively also provide a way out of adata.raw. In the future, read_h5ad could compare len(var_names) with len(raw.var_names) and then have adata.raw be an alias to the correct (sub)layer (most likely a layer, as raw is often the longer of the two in scanpy workflows).

For scvi-tools, this would be great, as we only want the count data on a subset of var_names, but don't want users creating a whole new anndata just to work with us.

Just use MuData?

On the other hand, we could defer to MuData, though there are probably ways to make that API smoother when we know we have these constraints. Related: scverse/mudata#13

Munfred commented Feb 11, 2022

I think this would be useful. The same way it is suggested there should be a var sublayer, there could be an obs sublayer. My feeling is that obs and var should be symmetrical.

I would suggest that in adata.add_sublayer you have to specify obs or var when creating a sublayer (var by default), or that there be two distinct kinds of sublayers (e.g., adata.sobs and adata.svar for obs sublayers and var sublayers).

For my workflow in particular having obs sublayers would make it handy to access subsets of cells. Frequently what I end up doing is performing a query to select groups of interest and then storing the results as True/False in another obs column. Having obs sublayers would be another way to do this without cluttering the obs dataframe itself.
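
A small sketch of what the symmetric obs/var variant might look like, again with plain NumPy. The take_sublayer function and its axis argument are hypothetical and only illustrate the idea of choosing the axis when the sublayer is created.

```python
import numpy as np

def take_sublayer(X, obs_names, var_names, subset, axis="var"):
    """Subset X by name along one axis: "obs" selects rows, "var" selects columns."""
    if axis not in ("obs", "var"):
        raise ValueError("axis must be 'obs' or 'var'")
    names = obs_names if axis == "obs" else var_names
    pos = {n: i for i, n in enumerate(names)}
    idx = np.array([pos[n] for n in subset], dtype=int)
    return X[idx, :] if axis == "obs" else X[:, idx]

X = np.arange(6).reshape(3, 2)
obs_names = ["cell_a", "cell_b", "cell_c"]
var_names = ["g1", "g2"]

# An obs sublayer: a named group of cells, stored without adding an obs column.
group = take_sublayer(X, obs_names, var_names, ["cell_a", "cell_c"], axis="obs")
print(group.tolist())  # [[0, 1], [4, 5]]

# A var sublayer, the default.
genes = take_sublayer(X, obs_names, var_names, ["g2"])
print(genes.tolist())  # [[1], [3], [5]]
```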

vitkl commented Feb 12, 2022

I also find this proposal useful for two reasons:

  1. Handling multimodal data in scvi-tools.

  2. I find that most collaborators don't store the raw integer counts for all genes (or any raw counts at all) because the current workflow requires all layers to have the same number of features and encourages storing normalised/scaled data in .raw. If I understand correctly, this proposal also solves that long-standing issue of keeping different feature subsets in the same object.

adamgayoso commented May 9, 2022

Just adding to this, sublayers could also be generated lazily when they are accessed, instead of being cached, e.g., via .add_sublayer(..., cache=False).

With lazy generation, this could be seen as a way to better store relevant subsets of data for users, e.g., when plotting, as @Munfred said.
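
One way the lazy versus cached distinction could work is sketched below. LazySubLayer, its get method, and the n_materializations counter are hypothetical names for illustration; only the cache flag comes from the comment above.

```python
import numpy as np

class LazySubLayer:
    """Hypothetical lazy sublayer: stores column indices, materializes on access."""

    def __init__(self, parent, col_idx, cache=False):
        self._parent = parent                       # full (n_obs, n_vars) array, not copied
        self._col_idx = np.asarray(col_idx, dtype=int)
        self._cache = cache
        self._cached = None
        self.n_materializations = 0                 # bookkeeping for illustration only

    def get(self):
        if self._cache and self._cached is not None:
            return self._cached                     # reuse the stored subset
        self.n_materializations += 1
        data = self._parent[:, self._col_idx]       # integer-array indexing makes a copy
        if self._cache:
            self._cached = data
        return data

parent = np.arange(12).reshape(3, 4)

lazy = LazySubLayer(parent, [0, 2], cache=False)
lazy.get()
lazy.get()

cached = LazySubLayer(parent, [0, 2], cache=True)
cached.get()
cached.get()

print(lazy.n_materializations, cached.n_materializations)  # 2 1
print(cached.get().tolist())                               # [[0, 2], [4, 6], [8, 10]]
```

With cache=False only the indices are held between accesses, which trades recomputation for memory; with cache=True the subset is built once and kept.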

Sublayers could also be spun out into new anndatas:

bdata = adata.sublayer_to_adata(sublayer_key) # new anndata appropriately subset according to sublayer
