Add sublayers
#707
Comments
I think this would be useful. The same way it is suggested there should be a I would suggest that in For my workflow in particular having obs sublayers would make it handy to access subsets of cells. Frequently what I end up doing is performing a query to select groups of interest and then storing the results as True/False in another obs column. Having obs sublayers would be another way to do this without cluttering the obs dataframe itself. |
I also find this proposal useful for two reasons:
|
Just adding to this: sublayers could also be lazily generated when they are accessed, or cached. With lazy generation, this could be seen as a way to better store relevant subsets of data for users when, e.g., plotting, as @Munfred said. Sublayers could also be spun out into new anndatas.
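To make the lazy/cached idea a bit more concrete, here is a minimal sketch. It uses plain Python lists instead of a real AnnData, and `LazySublayers`, `register`, and `to_new_container` are hypothetical names made up for illustration; none of them are anndata API.

```python
from typing import Dict, List

Matrix = List[List[float]]  # rows are obs, columns are vars


class LazySublayers:
    """Hypothetical container: sublayers are built on first access, then cached."""

    def __init__(self, X: Matrix, var_names: List[str]) -> None:
        self.X = X
        self.var_names = list(var_names)
        self._specs: Dict[str, List[str]] = {}  # sublayer name -> var subset
        self._cache: Dict[str, Matrix] = {}     # sublayer name -> materialized matrix
        self.n_computed = 0                      # counts materializations (illustration only)

    def register(self, name: str, var_subset: List[str]) -> None:
        """Declare a sublayer without computing anything yet."""
        missing = [v for v in var_subset if v not in self.var_names]
        if missing:
            raise KeyError(f"not in var_names: {missing}")
        self._specs[name] = list(var_subset)
        self._cache.pop(name, None)  # drop any stale cached copy

    def __getitem__(self, name: str) -> Matrix:
        # Materialize the column subset only when it is first requested.
        if name not in self._cache:
            cols = [self.var_names.index(v) for v in self._specs[name]]
            self._cache[name] = [[row[c] for c in cols] for row in self.X]
            self.n_computed += 1
        return self._cache[name]

    def to_new_container(self, name: str) -> "LazySublayers":
        """Spin a sublayer out into a standalone object with its own var_names."""
        return LazySublayers([row[:] for row in self[name]], self._specs[name])


X = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
sub = LazySublayers(X, ["g1", "g2", "g3"])
sub.register("hvg", ["g3", "g1"])
first = sub["hvg"]    # computed here
second = sub["hvg"]   # served from the cache
spun_out = sub.to_new_container("hvg")
```

Registering a sublayer is cheap because only the var subset is stored; the matrix is built once on first access and reused afterwards.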
|
Tl;dr -- add adata.sublayers that acts just like layers, but each sublayer["my_sublayer"] uses a subset of the var names of adata.var_names.
Intermediate cached matrices
In a lot of workflows, there are intermediate matrices that are used for one-time applications. As an example:
- scanpy.pp.scale(adata) standard scales the var dimension, but this transformed data is only used for PCA.
- scale creates a heavy dense matrix, but PCA almost always uses a subset of vars as input, so there's no need to keep all dimensions of the scale output.
Multimodal
We also have a common use case where we measure two aspects of our data (i.e., protein and RNA measurements). These measurements are made for every cell. Currently:
- .obsm is used out of convenience.
- .obsm is not supported for plotting by scanpy.
With sublayers, users can explicitly choose which adata var dimensions should be used for certain tasks like PCA. This effectively caches a subset of the data (along var dimension).
API example (with imagined scanpy api changes):
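A rough sketch of how this might look, under loud assumptions: `ToyAnnData`, `add_sublayer`, the `key_added` argument on `scale`, and the `use_sublayer` argument are all hypothetical and exist in neither anndata nor scanpy today. Plain lists stand in for matrices, and the population standard deviation is used for simplicity.

```python
from statistics import mean, pstdev
from typing import Dict, List, Optional, Tuple

Matrix = List[List[float]]  # rows are obs, columns are vars


class ToyAnnData:
    """Hypothetical stand-in for AnnData with a `sublayers` mapping."""

    def __init__(self, X: Matrix, var_names: List[str]) -> None:
        self.X = X
        self.var_names = list(var_names)
        # name -> (var names of the sublayer, matrix with one column per name)
        self.sublayers: Dict[str, Tuple[List[str], Matrix]] = {}

    def add_sublayer(self, name: str, var_subset: List[str], matrix: Matrix) -> None:
        if not set(var_subset) <= set(self.var_names):
            raise ValueError("sublayer vars must be a subset of var_names")
        if len(matrix) != len(self.X) or any(len(r) != len(var_subset) for r in matrix):
            raise ValueError("sublayer shape must be (n_obs, len(var_subset))")
        self.sublayers[name] = (list(var_subset), matrix)


def scale(adata: ToyAnnData, var_subset: List[str], key_added: str) -> None:
    """Standard-scale only the chosen vars and store the result as a sublayer."""
    scaled_cols = []
    for v in var_subset:
        c = adata.var_names.index(v)
        col = [row[c] for row in adata.X]
        mu, sd = mean(col), pstdev(col)
        scaled_cols.append([(x - mu) / sd if sd > 0 else 0.0 for x in col])
    # Transpose the per-var columns back into an obs-by-var matrix.
    matrix = [list(r) for r in zip(*scaled_cols)]
    adata.add_sublayer(key_added, var_subset, matrix)


def input_for(adata: ToyAnnData, use_sublayer: Optional[str] = None) -> Tuple[List[str], Matrix]:
    """Return the vars and matrix a downstream step (e.g. PCA) would consume."""
    if use_sublayer is None:
        return adata.var_names, adata.X
    return adata.sublayers[use_sublayer]


adata = ToyAnnData([[1.0, 10.0, 5.0], [3.0, 10.0, 7.0]], ["rna_a", "rna_b", "prot_a"])
scale(adata, var_subset=["rna_a", "prot_a"], key_added="scaled_subset")
vars_used, M = input_for(adata, use_sublayer="scaled_subset")
```

The full X stays untouched, and the dense scaled matrix only ever has as many columns as the subset that the downstream step needs.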
This could effectively also provide a way out of adata.raw. In the future, read_h5ad could check max(len(var_names), len(raw.var_names)) and then have adata.raw be an alias to the correct (sub)layer (most likely a layer, as raw is often longer in scanpy workflows).
For scvi-tools, this would be great, as we only want the count data on a subset of var_names, but don't want users creating a whole new anndata just to work with us.
Just use MuData?
On the other hand, we could defer to mudata, though there are probably ways to make that API smoother when we know we have these constraints. Related: scverse/mudata#13