Implement var- and obs- aligned multidimensional arrays (obsm, varm) #3

Closed
ambrosejcarr opened this issue Dec 18, 2021 · 8 comments

Comments

@ambrosejcarr
Member

ambrosejcarr commented Dec 18, 2021

Additionally, explore whether there are opportunities to specify metadata standards for derived analysis results (e.g. reduced dimensionality representations)

From @LTLA

Suggest developing some metadata standards on the auxiliaries so that downstream tools can meaningfully use them. Otherwise e.g. if someone sees an array in there they need to ask "is this reducedDims? Or gene signature scores? Or FACS data?" etc. We have several visualization tools that poke around in the SCE's reducedDims to figure out what to plot.
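One way to picture such a standard is a small "kind" tag attached to every auxiliary array. The following is only a sketch: the `AuxKind` enum, the `aux_arrays` layout, the array names, and the shapes are all hypothetical and are not part of any proposed specification.

```python
from enum import Enum


class AuxKind(Enum):
    """Hypothetical vocabulary for what an auxiliary array represents."""
    REDUCED_DIMS = "reduced_dims"
    SIGNATURE_SCORES = "gene_signature_scores"
    FACS = "facs"


# Hypothetical auxiliary arrays; only metadata is shown, shapes are made up.
aux_arrays = {
    "umap": {"kind": AuxKind.REDUCED_DIMS, "shape": (1000, 2)},
    "signature_scores": {"kind": AuxKind.SIGNATURE_SCORES, "shape": (1000, 50)},
    "facs_panel": {"kind": AuxKind.FACS, "shape": (1000, 12)},
}


def plottable_embeddings(arrays):
    """Return the names of arrays tagged as reduced-dimension representations."""
    return [
        name
        for name, meta in arrays.items()
        if meta["kind"] is AuxKind.REDUCED_DIMS
    ]


print(plottable_embeddings(aux_arrays))
```

With a tag like this, a visualization tool would not need to guess which arrays are embeddings; it could filter on the tag instead.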

From "open questions" in the gdoc version:

Auxiliary arrays need more thought, as they are commonly indexed by one of the "var" or "obs" labels. Do we need the API to enforce this? (That probably depends on whether any features depend on it.) Use case: if a UMAP embedding is stored as an aux array, it should be indexed by the same labels as "obs".
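A minimal sketch of what API-level enforcement could look like, assuming alignment simply means "one row per obs label, in obs order". The `AlignedStore` class and its methods are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class AlignedStore:
    """Hypothetical container: obs labels plus obs-aligned auxiliary arrays."""
    obs_labels: list
    obsm: dict = field(default_factory=dict)

    def add_obsm(self, name, rows):
        # Enforce alignment: exactly one row per obs label, in obs order.
        if len(rows) != len(self.obs_labels):
            raise ValueError(
                f"{name!r} has {len(rows)} rows, expected {len(self.obs_labels)}"
            )
        self.obsm[name] = rows

    def lookup(self, name, label):
        # Because rows are obs-aligned, an obs label resolves to a row index.
        return self.obsm[name][self.obs_labels.index(label)]


store = AlignedStore(obs_labels=["cell_a", "cell_b", "cell_c"])
store.add_obsm("umap", [[0.1, 0.2], [1.0, 1.1], [2.0, 2.1]])
print(store.lookup("umap", "cell_b"))
```

Whether the check belongs in the API or is left to convention is exactly the open question; the sketch only shows that the check itself is cheap.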

@ambrosejcarr
Member Author

ambrosejcarr commented Mar 7, 2022

Problem

I think there is agreement that obsm and varm are needed to store multidimensional annotations.

Multidimensional annotations may be created from an individual sc_group or via multiplex methods that leverage more than one group. Most examples (e.g. Figure B in the Muon publication) use two modalities, but poly-modal assays will arise, so I recommend we use examples with three modalities to drive decision making.

Consider an experiment with three data modalities:

sc_dataset
- sc_group (RNA)
- sc_group (ATAC)
- sc_group (Protein)

A common analysis that stores results in obsm is the estimation of a dimension-reduced latent space. Methods can estimate latent spaces based on one or more modalities. Here, there are several options:

- RNA
- ATAC
- Protein
- RNA + ATAC
- RNA + Protein
- ATAC + Protein
- RNA + ATAC + Protein
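The seven options above are the non-empty subsets of the three modalities (2^3 - 1 = 7). A short sketch that enumerates them; the variable names are illustrative only.

```python
from itertools import combinations

modalities = ["RNA", "ATAC", "Protein"]

# Every non-empty subset of modalities a latent space could be estimated from.
latent_space_inputs = [
    combo
    for size in range(1, len(modalities) + 1)
    for combo in combinations(modalities, size)
]

for combo in latent_space_inputs:
    print(" + ".join(combo))
```

The count grows as 2^n - 1 with the number of modalities, which is one reason to reason about three modalities rather than two.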

I see two requirements:

  1. Each type of latent space above must be storable
  2. Users must be able to understand how each latent space should be used, and how it should not be used.

Possible Solutions

I see two, but am interested if clever individuals can do better.

1. Create obsm and varm slots for sc_dataset and sc_group.

obsm associated with sc_groups can be inferred to have been built exclusively from data within that group, but no guarantees can be made (i.e. the format will not validate that additional information has not been included). obsm associated with the sc_dataset can be inferred to have been built from more than one group, but again, this will not be validated. Missing values are permitted.

Cons: There would be no way to infer whether an sc_dataset obsm was built using data from RNA + Protein or RNA + ATAC + Protein. A string "description" field could be associated to provide additional context to make up for this.

2. Create obsm and varm only at sc_dataset level

Like above, but eliminate need for inference, and lean into use of a string "description" field to describe how the obsm were generated.
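As a rough illustration of Option 2, the sketch below keeps every obsm at the dataset level and attaches a free-text description to each one. The `ObsmEntry` class, the entry names, and the description strings are hypothetical; nothing here validates that a description is accurate.

```python
from dataclasses import dataclass


@dataclass
class ObsmEntry:
    """Hypothetical dataset-level obsm entry (Option 2)."""
    values: list        # one row per observation
    description: str    # free text describing how the array was generated


# All names and descriptions below are made up for illustration.
dataset_obsm = {
    "X_pca_rna": ObsmEntry([[0.1, 0.2], [0.3, 0.4]], "PCA on RNA only"),
    "X_joint": ObsmEntry(
        [[1.0, 1.1], [1.2, 1.3]],
        "Joint embedding on RNA + ATAC + Protein",
    ),
}


def describe(obsm):
    """Map each obsm name to its human-readable description."""
    return {name: entry.description for name, entry in obsm.items()}


for name, text in describe(dataset_obsm).items():
    print(f"{name}: {text}")
```

The same description field would also cover the con listed under Option 1, since it can distinguish an RNA + Protein embedding from an RNA + ATAC + Protein one.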


I favor option 2. @falexwolf @aaronwolen @gtca I'm interested in your perspectives and other ideas you might have.

@vjcitn do you know who the right people from bioconductor would be to get feedback on this question?

@falexwolf
Contributor

falexwolf commented Mar 8, 2022

Thank you for laying out these options, @ambrosejcarr!

I agree with all the points you make. However, I overall arrive at favoring option 1 for the following reasons:

  1. I think that understanding information encoded in single data modalities will - at least for the foreseeable future - remain of high value and I'd offer users intuitive slots for it. In cases they are not needed, sc_group-level obsm and varm slots can remain empty without any downside.
  2. I think the points re data provenance made above are fully legit, but I also think they need to be resolved using schema/syntax decisions around data provenance tracking and tooling for it. I thought this was beyond the scope of this repository (see footnote 1).
  3. Deserializing into AnnData/MuData will be much less convoluted if one keeps the sc_group-level slots. I expect similarly increased ease for deserializing into SingleCellExperiment/Seurat.
  4. I expect user intuition will expect sc_group-level slots. If they aren't present, usage patterns will arise that hack ways to store such information (see 1.). I think it's better to offer a simple structure that is potentially unused, rather than open up ways for hacks.

I'm looking forward to hearing more opinions!

Footnotes

  1. As I understood from our last meeting, in this repository, we're deciding on schema questions of data representation, and not on syntax questions regarding attributes of how data was generated. I think that by following the AnnData/MuData/SingleCellExperiment/Seurat layout of data, we're restricting ourselves to the most "basic properties of the data itself": data contiguity in storage and a matrix layout following the semantics of the terms "observation" and "variable". I think the rationales for this are rooted in mere efficiency reasoning for data access (contiguity) and the basic "learning use case" of extracting information from data via stats, which happen to be the same. Stats & machine learning converged over decades on organizing data for learning as an observations-variables matrix, for which Wickham initiated comprehensive tooling via tidydata. We tried to summarize this process in the anndata preprint. I'd be happy to update my understanding if someone feels differently!

@ivirshup

ivirshup commented Mar 8, 2022

Alternative solution

I think there's a third option here, which is to not have sc_dataset-level obsm and varm at all. Instead you could just have more sc_groups. This solution would directly address the main con of option one.

I'm thinking of a structure like:

sc_dataset {Cells, (RNA, ATAC)}
    sc_group {Cells, RNA}
        X
            counts
            normalized
    sc_group {Cells, ATAC}
        X
            counts
            normalized
    sc_group {Cells, RNA + ATAC}
        obsm
            embedding

This could be expressed in a flexible way by defining subsets of observations and variables at the sc_dataset level.

Something I like about this is you could also specifically express transcript to protein connections with:

sc_group {Cells, RNA + Prot}
    varp
        rna_to_protein

I would also agree that complete provenance seems out of scope for the storage schema. Maybe what we're trying to do is express "associated with", not "derived from"?
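A minimal sketch of this third option, assuming each sc_group can be keyed by the set of modalities it spans. The dictionary layout and the `groups_associated_with` helper are hypothetical, and the two structures shown above are merged into one dataset purely for illustration.

```python
# Hypothetical layout: each sc_group is keyed by the modalities it spans.
sc_dataset = {
    frozenset({"RNA"}): {"X": ["counts", "normalized"]},
    frozenset({"ATAC"}): {"X": ["counts", "normalized"]},
    frozenset({"RNA", "ATAC"}): {"obsm": ["embedding"]},
    frozenset({"RNA", "Prot"}): {"varp": ["rna_to_protein"]},
}


def groups_associated_with(dataset, modality):
    """Return the modality lists of every group that spans `modality`.

    This expresses "associated with" only; it says nothing about which
    data a stored result was actually derived from.
    """
    matches = [sorted(key) for key in dataset if modality in key]
    return sorted(matches, key=lambda mods: (len(mods), mods))


print(groups_associated_with(sc_dataset, "RNA"))
```

Because the key already names the modalities, an embedding stored under the RNA + ATAC group cannot be mistaken for one that also spans Protein.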

ambrosejcarr added a commit to falexwolf/matrix-api that referenced this issue Mar 9, 2022
@ambrosejcarr
Member Author

ambrosejcarr commented Mar 9, 2022

Thanks very much for the thoughts, and @ivirshup for identifying the missing option. I was hoping there would be more ideas I missed.

As I understood from our last meeting, in this repository, we're deciding on schema questions of data representation, and not on syntax questions regarding attributes of how data was generated.

I would also agree that complete provenance seems out of scope for the storage schema. Maybe what we're trying to do is express "associated with", not "derived from"?

Thanks both for picking up on this, and apologies for the lack of clarity in my initial write-up. What I was trying to express is an anticipated analysis toolchain use case, wherein toolchain developers would want to explain the provenance of objects in order to communicate to their end-users how those objects should be used.

@falexwolf I find your points about creating flexibility to enable a downstream syntax discussion compelling. Option 1 is indeed constraining. Extending that line of thought, @ivirshup I wonder if your Option 3 has the same flaws as Option 1: it constrains use prematurely.

I'm tentatively comfortable with Option 1 (dataset & group level obsm), because I think @falexwolf 's statement is correct:

I think that understanding information encoded in single data modalities will - at least for the foreseeable future - remain of high value and I'd offer users intuitive slots for it. In cases they are not needed, sc_group-level obsm and varm slots can remain empty without any downside.

... and Option 1 would enable the organization you suggest @ivirshup. What do you two think?

@joshua-d-campbell

Hi all, sorry I am late to the party, but this is a really interesting discussion. One question that I have is whether sc_groups are also supposed to be used to differentiate between different subsets of cells and features in addition to different modalities. If so, then it may be the case that we often don’t get an embedding at the whole dataset level as many cells/barcodes may be excluded. Here is an example with two modalities where cells get filtered and the joint embedding gets made on the subset of cells:

sc_dataset {Cells, (RNA, ATAC)}
    sc_group {Cells [obs_subset=all], RNA}
        X
            counts
    sc_group {Cells [obs_subset=filtered], RNA}
        X
            counts
            normalized
            scaled
        obsm
            embedding
    sc_group {Cells [obs_subset=all], ATAC}
        X
            counts
    sc_group {Cells [obs_subset=filtered], ATAC}
        X
            counts
            normalized
            scaled
        obsm
            embedding
    sc_group {Cells [obs_subset=filtered], RNA + ATAC}
        obsm
            embedding

Thus, having @ivirshup's option 3 available would be nice. Note that options 1 and 3 are not necessarily mutually exclusive either; we could still have a dataset-wide obsm in addition to sc_group-specific obsms. Also, one question: will each sc_group have its own obs and var? That may be a different issue though.
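To make the filtered-subset case concrete, here is a small sketch in which named obs subsets live at the dataset level and a group's obsm is aligned to its own subset rather than to all cells. The cell labels, the dictionary layout, and `embedding_by_cell` are hypothetical.

```python
# Hypothetical dataset-level obs subsets; labels are made up.
obs_subsets = {
    "all": ["c1", "c2", "c3", "c4"],
    "filtered": ["c1", "c3"],  # e.g. cells that passed filtering
}

# A joint RNA + ATAC group defined only on the filtered cells.
joint_group = {
    "obs_subset": "filtered",
    "modalities": ("RNA", "ATAC"),
    "obsm": {"embedding": [[0.0, 1.0], [2.0, 3.0]]},
}


def embedding_by_cell(group, subsets, name="embedding"):
    """Pair each row of a group's obsm with the obs label it belongs to."""
    labels = subsets[group["obs_subset"]]
    rows = group["obsm"][name]
    if len(labels) != len(rows):
        raise ValueError("obsm rows are not aligned with the group's obs subset")
    return dict(zip(labels, rows))


print(embedding_by_cell(joint_group, obs_subsets))
```

Cells excluded by filtering simply have no row in the group's embedding, so no missing-value padding is needed at the group level.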

@ivirshup

One question that I have is whether sc_groups are also supposed to be used to differentiate between different subsets of cells and features in addition to different modalities.

@joshua-d-campbell, this is a case I was thinking of as well.

Also, one question: will each sc_group have its own obs and var?

To me, yes. Some statistics/annotations will only be meaningful for some modalities, as you probably don't want to mix modalities when calculating "mean".


@ambrosejcarr option 1 could definitely be a superset of option 3.

One big question I have about option 1 is: does the top level obsm or varm have to correspond to the union of observations and variables? What if you have a joint decomposition on just RNA and ATAC, but protein level info is available. Would the feature loadings be stored in the top level obsm?

I would also really like to hear @gtca's thoughts on this. My understanding is that both MuData and MultiAssayExperiment have gone with something like option 1, but Danila would have more context here.

@ambrosejcarr
Member Author

One big question I have about option 1 is: does the top level obsm or varm have to correspond to the union of observations and variables? What if you have a joint decomposition on just RNA and ATAC, but protein level info is available. Would the feature loadings be stored in the top level obsm?

I couldn't understand the bolded part of your use case description. Could you add a bit more detail?

Also, one question: will each sc_group have its own obs and var?

To me, yes. Some statistics/annotations will only be meaningful for some modalities, as you probably don't want to mix modalities when calculating "mean".

We are imagining a specification that supports this, but also enables operations at the dataset level. I think this might be an answer to your question above, but I'm not sure and would like to hear more about the use case to know if there are gaps here. I think you might be highlighting one.

  1. sc_group obs and var MAY overlap the obs and var of other sc_groups
  2. sc_dataset obs and var are made up of the union of obs and var from sc_groups it contains.
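The two rules above can be sketched as follows, assuming labels are plain strings and that the dataset-level order is first-seen order (an assumption; the thread does not specify ordering). The group contents and helper names are hypothetical.

```python
# Hypothetical obs labels per sc_group; the groups overlap on c2 and c3.
group_obs = {
    "RNA": ["c1", "c2", "c3"],
    "ATAC": ["c2", "c3", "c4"],
}


def dataset_labels(per_group):
    """Ordered union of labels across groups (first occurrence wins)."""
    seen = {}
    for labels in per_group.values():
        for label in labels:
            seen.setdefault(label, None)
    return list(seen)


def shared_labels(per_group):
    """Labels present in every group, in dataset order."""
    label_sets = [set(labels) for labels in per_group.values()]
    return [
        label
        for label in dataset_labels(per_group)
        if all(label in labels for labels in label_sets)
    ]


print(dataset_labels(group_obs))
print(shared_labels(group_obs))
```

The same helpers would apply to var labels; only the inputs change.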

@johnkerl
Member

johnkerl commented Feb 2, 2023

See https://github.com/single-cell-data/SOMA/blob/main/abstract_specification.md for the current structure -- Experiment and Measurement were introduced precisely to address the issues raised here.


6 participants