Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Begin Outputting AnnData file format #151

Open
ilan-gold opened this issue Dec 4, 2024 · 16 comments
Open

Begin Outputting AnnData file format #151

ilan-gold opened this issue Dec 4, 2024 · 16 comments

Comments

@ilan-gold
Copy link

I'm not sure if here or in pyroe this issue belongs, but I think (at least from my limited understanding), it would be cool if the sparse matrix output of alevin could be AnnData on disk. It seems like https://docs.rs/anndata/latest/anndata/ would be a good candidate for the i/o.

It seems like infer and quant would be the two candidates for them. This issue is mostly about opening a discussion about this feature, not being prescriptive or anything! I'm new to this ecosystem :)

@rob-p
Copy link
Contributor

rob-p commented Dec 13, 2024

Hi @ilan-gold,

Indeed, this would be great! This would be trivial to do in pyroe, as that is in Python and has immediate access to all relevant scVerse packages.

That said, it would be cool to avoid a conversion program at all and be able to output AnnData directly (or auto convert to it if desired from within Rust). Do you know if there is an AnnData crate for rust? Otherwise, I know there’s and HDF5 crate, though that I think that pulls in the (kinda flakey?) C-based dependency.

—Rob

@ilan-gold
Copy link
Author

Hi @rob-p I'm back from vacation now. Yes, there is some code in rust: https://github.com/kaizhang/anndata-rs/tree/main. Even though it does not provide installation instructions, it appears to be released as a package: https://docs.rs/anndata/latest/anndata/

@kaizhang Do you have any insight here?

@kaizhang
Copy link

kaizhang commented Jan 8, 2025

@ilan-gold @rob-p Please see here for an example of how to use anndata Rust package to output a cellxgene count matrix: https://github.com/regulatory-genomics/precellar/blob/a8d2fca438baf596a2be7cc6432364c7ef3534fe/precellar/src/transcript/quantification.rs#L80. Note the latest version use zstd compression under the hood. When you open the h5ad file in python, you may need to import hdf5plugin.

@ilan-gold
Copy link
Author

ilan-gold commented Jan 8, 2025

@kaizhang Thanks. I guess with zarr the issue is that zarrs is not feature complete with all of the python codecs for string handling (although I think it is v2 compliant otherwise). Is there an option for uncompressed hdf5? I think we use that regularly without issue.

@rob-p
Copy link
Contributor

rob-p commented Jan 9, 2025

Thanks @kaizhang & @ilan-gold. I'm thinking an ideal place for such a conversion may be simpleaf, but in the mean time, I'm trying to get a feel for the anndata-rs package.

It's interesting because we use sprs internally in alevin-fry, and though that relies on nalgebra_sparse it turns out they have different underlying structs for their CSR matrices (though they are trivially convertible).

Anyway, I think I almost have a CSR to unspliced, spliced, ambiguous AnnData conversion function, but I am curious about one issue. I'd like the resulting AnnData object to have 3 layers ("unspliced", "spliced" and "ambiguous"). However, I don't have a need for a default "X" layer. However, if I set the layers without the X, I get the following error reading it into Python AnnData:

----> 1 x = anndata.read('foo.anndata.new')

File ~/miniconda3/lib/python3.8/site-packages/anndata/_io/h5ad.py:254, in read_h5ad(filename, backed, as_sparse, as_sparse_fmt, chunk_size)
    250         raise ValueError()
    252 _clean_uns(d)  # backwards compat
--> 254 return AnnData(**d)

File ~/miniconda3/lib/python3.8/site-packages/anndata/_core/anndata.py:291, in AnnData.__init__(self, X, obs, var, uns, obsm, varm, layers, raw, dtype, shape, filename, filemode, asview, obsp, varp, oidx, vidx)
    289     self._init_as_view(X, oidx, vidx)
    290 else:
--> 291     self._init_as_actual(
    292         X=X,
    293         obs=obs,
    294         var=var,
    295         uns=uns,
    296         obsm=obsm,
    297         varm=varm,
    298         raw=raw,
    299         layers=layers,
    300         dtype=dtype,
    301         shape=shape,
    302         obsp=obsp,
    303         varp=varp,
    304         filename=filename,
    305         filemode=filemode,
    306     )

File ~/miniconda3/lib/python3.8/site-packages/anndata/_core/anndata.py:541, in AnnData._init_as_actual(self, X, obs, var, uns, obsm, varm, varp, obsp, raw, layers, dtype, shape, filename, filemode)
    538 self._clean_up_old_format(uns)
    540 # layers
--> 541 self._layers = Layers(self, layers)

File ~/miniconda3/lib/python3.8/site-packages/anndata/_core/aligned_mapping.py:278, in Layers.__init__(self, parent, vals)
    276 self._data = dict()
    277 if vals is not None:
--> 278     self.update(vals)

File ~/miniconda3/lib/python3.8/_collections_abc.py:832, in MutableMapping.update(self, other, **kwds)
    830 if isinstance(other, Mapping):
    831     for key in other:
--> 832         self[key] = other[key]
    833 elif hasattr(other, "keys"):
    834     for key in other.keys():

File ~/miniconda3/lib/python3.8/site-packages/anndata/_core/aligned_mapping.py:151, in AlignedActualMixin.__setitem__(self, key, value)
    150 def __setitem__(self, key: str, value: V):
--> 151     value = self._validate_value(value, key)
    152     self._data[key] = value

File ~/miniconda3/lib/python3.8/site-packages/anndata/_core/aligned_mapping.py:52, in AlignedMapping._validate_value(self, val, key)
     50     if self.parent.shape[axis] != val.shape[i]:
     51         right_shape = tuple(self.parent.shape[a] for a in self.axes)
---> 52         raise ValueError(
     53             f"Value passed for key {key!r} is of incorrect shape. "
     54             f"Values of {self.attrname} must match dimensions "
     55             f"{self.axes} of parent. Value had shape {val.shape} while "
     56             f"it should have had {right_shape}."
     57         )
     58 if not self._allow_df and isinstance(val, pd.DataFrame):
     59     name = self.attrname.title().rstrip("s")

ValueError: Value passed for key 'ambiguous' is of incorrect shape. Values of layers must match dimensions (0, 1) of parent. Value had shape (190461, 33696) while it should have had (0, 0).

which presumably is because X is its parent? Maybe I'm misusing the nomeclature, but is there a way to accomplish this (i.e. to have several layers without the default X)?

@ilan-gold
Copy link
Author

ilan-gold commented Jan 9, 2025

Hmmm @rob-p This could be an edge case that anndata-rs is not handling. If you send the file I can have a look to try to see.

This might be an issue with AnnData. I'll look into it a bit

@ilan-gold
Copy link
Author

Ok so here are some (maybe informative) code snippets:

import anndata as ad
import numpy as np

ad.AnnData(X=None, layers={"foo": np.arange(4).reshape((2, 2)), "bar": np.arange(4).reshape((2, 2))}) # errors, probably erroneously
import anndata as ad
import numpy as np

adata = ad.AnnData(X=np.arange(4).reshape((2, 2)), layers={"foo": np.arange(4).reshape((2, 2)), "bar": np.arange(4).reshape((2, 2))})
del adata.X
adata.write_h5ad("foo.h5ad")
ad.read_h5ad("foo.h5ad").X == None

so if you're having trouble reading the data back in, it's tough to say without the file. So I stand by my original point, please send.

@rob-p
Copy link
Contributor

rob-p commented Jan 9, 2025

Hi @ilan-gold,

So here's my quick and dirty test to convert a USA matrix market format output from alevin-fry into an AnnData object. Right now, I get around the above issue by just setting the X layer to be an empty matrix so that it takes essentially no space. Obviously this code needs to be made much cleaner and pulled out into it's own function / crate, but it can be used to create files to test for downstream compatibility --- currently I can load up what it produces in python using anndata.read, so that's a good start!

@ilan-gold
Copy link
Author

Nice, the goal here would be to do this directly from the in-memory representation in alevin-fry? Or would the intermediate mtx format be needed even here "in production?"

@rob-p
Copy link
Contributor

rob-p commented Jan 9, 2025

I think both paths would be possible. My thought on the migration path and testing it out would be the following.

Phase 1:

  • build a new crate (current proposed name af-anndata) that has the improved version of this conversion function.
  • pull this crate into simpleaf, and have simpleaf take a flag (or just alter the default behavior) to convert the alevin-fry output to the anndata format after quantification is complete. This requires no changes to the current alevin-fry, but removes the user's need to perform an extra conversion step manually as long as they are using simpleaf. We could optionally choose to retain or remove the original CSR representation, though for compatibility with other downstream ecosystems, we'd want to make sure that any relevant packages are also updated to accept the anndata input (e.g. fishpond cc @DongzeHE, @mikelove).

Phase 2:

  • If phase 1 goes well, and we iron out any usability or performance wrinkles, then we could move the af-anndata dependency from simpleaf to alevin-fry itself, and output the anndata representation from alevin-fry directly, completely sidestepping the conversion. We'd probably want to deprecate and remove the CSR output at the same time to avoid having to support and maintain 2 separate output formats.
  • I don't think we're currently dealing with individual datasets this large, but one thing worth considering is how challenging (if at all) it would be to do chunked output in the anndata format. Specifically, one of the benefits of CSR is that we consider each cell barcode as it's own row, and CSR is amenable to row-by-row construction and output. This means we never need to hold the whole output matrix in memory and instead only have to hold a number of cells proportional to the number of threads the user is using for quantification. Such a scheme would also be useful if it could be done with the anndata object as well.

These are my initial thoughts, but I'm open to discussion or feedback on any of these points!

@ilan-gold
Copy link
Author

I don't think we're currently dealing with individual datasets this large, but one thing worth considering is how challenging (if at all) it would be to do chunked output in the anndata format. Specifically, one of the benefits of CSR is that we consider each cell barcode as it's own row, and CSR is amenable to row-by-row construction and output. This means we never need to hold the whole output matrix in memory and instead only have to hold a number of cells proportional to the number of threads the user is using for quantification. Such a scheme would also be useful if it could be done with the anndata object as well.

This would require custom code + handling via zarr/hdf5 but I think should be doable (if not so performant on zarr). But that's a long way off, and that's just a first impression.

Otherwise I like the plan. Seems like a good way to gradually introduce the functionality.

@rob-p
Copy link
Contributor

rob-p commented Jan 9, 2025

Otherwise I like the plan. Seems like a good way to gradually introduce the functionality.

Awesome! Then I'll get started on this (i.e. grab the af-anndata crate name, move the test code to the corresponding repository and begin improving and robustifying the implementation). I'll tag you in that repo once it's up with some basic code, and when the first version of that is complete, I'll start integrating the relevant functionality into simpleaf, which should be trivial once the af-anndata crate is ready!

@mikelove
Copy link

mikelove commented Jan 9, 2025

any relevant packages are also updated to accept the anndata input

This should be fine as long as it's covered by a CRAN package e.g.:

https://cran.r-project.org/web/packages/anndata/readme/README.html

But GitHub only packages wouldn't work as dependencies.

@ilan-gold
Copy link
Author

optionally choose to retain or remove the original CSR representation, though for compatibility with other downstream ecosystems, we'd want to make sure that any relevant packages are also updated to accept the anndata input

I definitely think putting this behind a flag is a good way to start, but yea, these AnnData objects should be extremely simple, at least at first. So I would guess there are minimal changes needed for compat downstream.

@rob-p
Copy link
Contributor

rob-p commented Jan 9, 2025

Hi @ilan-gold,

I've started implementing this in a library. That repo is [af-anndata]. In the mean time, if you want to take a look at a converted output, you can find one I made and uploaded here. This has layers for the spliced, unspliced and ambiguous counts (and corresponding varm entries for the gene symbol IDs as well). Feel free to poke around and:

(1) let me know if you have any suggestion on how to avoid the "empty X layer" hack.
(2) let me know if you have any other suggestions on output format, naming in the object, etc.

We can continue discussion here or, if you'd prefer, we can pick it up in the af-anndata repo!

@ilan-gold
Copy link
Author

Hi @rob-p I'd be still interested in what caused #151 (comment). You shouldn't have to make an empty CSR matrix so it would be good to understand why the file could not be read back in (maybe the same issue I raised in AnnData: scverse/anndata#1816, even likely). Or you could wait for me to fix that issue and then try it out later. If we need to go back and re-do, then that's what it is.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants