[proposal] Changing the backend to xarray #32

Closed
Hoeze opened this issue Jun 16, 2018 · 9 comments

@Hoeze

Hoeze commented Jun 16, 2018

Xarray has a lot of advantages, e.g.:

  • named dimensions
  • Dask integration for multi-file datasets and chunked calculations for data not fitting into memory
  • Interoperability with numpy / pandas
  • NetCDF4 support, which would remove the need to design custom HDF formats

The only big problem currently is the missing sparse data support, but this will hopefully change in the near future:
pydata/xarray#1375
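
As a rough sketch (not from this thread) of what the named-dimension and Dask points could look like in practice, assuming an AnnData-like cells-by-genes matrix with made-up `obs`/`var` names and sizes:

```python
import numpy as np
import xarray as xr

# Toy dense "cells x genes" matrix; names and sizes are made up.
X = np.random.poisson(1.0, size=(100, 2000))

ds = xr.Dataset(
    {"X": (("obs", "var"), X)},
    coords={
        "obs": [f"cell_{i}" for i in range(X.shape[0])],
        "var": [f"gene_{j}" for j in range(X.shape[1])],
    },
)

# Named-dimension indexing instead of remembering axis positions:
subset = ds.sel(obs=["cell_0", "cell_1"])

# Dask-backed chunks for out-of-core work, then plain NetCDF4 on disk:
ds.chunk({"obs": 50}).to_netcdf("example.nc")
```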

@falexwolf
Member

Sorry for the late response, I was on holidays.

I looked into xarray in the beginning and decided against it because of the missing sparse data support and the plain fact that things like scikit-learn only accept numpy arrays and sparse matrices as input.

These days, we're putting a lot of thought into improving the backed infrastructure of anndata for chunked calculations. We might return to xarray for that reason. I can also keep you posted here on the benchmarks soon.

@flying-sheep
Member

We now support zarr, which is feature-comparable, so I guess this can be closed.

@ivirshup
Member

@Hoeze, do you have a sense of how sparse data could be handled with netCDF or if anyone is working on it? I saw you had mentioned this on the xarray sparse issue, but haven't been able to find out too much myself.

If we could conform more to a standard like netCDF, that could help with interchange as mentioned here: ivirshup/sc-interchange#5.

@Hoeze
Author

Hoeze commented Jul 10, 2019

@ivirshup Yes, there are some things ongoing.
The best bet for native sparse array support in xarray will be pydata/sparse.
However, you should talk to @shoyer about the native integration into xarray.
It would be awesome if someone would push this!

This solution will likely only support COO format for some time until pydata/sparse supports CSD (see pydata/sparse#258).
However, a lot of frameworks like TileDB or Tensorflow support only COO anyway.

In the meantime you can still save the data in sparse format and wrap it yourself.
I.e. take the coordinate index and the data array from your sparse matrix and save this as NetCDF4.
This of course requires some wrapping inside AnnData or any other framework you want to use.
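
A minimal sketch of that manual approach, assuming scipy and xarray; the `row`/`col`/`data` variable names and the `n_obs`/`n_var` attributes are an ad-hoc choice, not an established convention:

```python
import scipy.sparse as sp
import xarray as xr

# Toy sparse matrix standing in for an expression matrix.
X = sp.random(100, 2000, density=0.05, format="coo")

# Store the COO triplets as plain 1-D variables plus the dense shape.
ds = xr.Dataset(
    {
        "row": ("nnz", X.row),
        "col": ("nnz", X.col),
        "data": ("nnz", X.data),
    },
    attrs={"n_obs": X.shape[0], "n_var": X.shape[1]},
)
ds.to_netcdf("sparse_X.nc")
```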


IMHO, if possible I would prefer a dense matrix over a sparse one.
Everything with a sparsity ratio lower than 90-95% will very likely cost more processing power to decode than you can theoretically save, especially in cases where you have to convert it to dense format anyway.
Also, compression algorithms can save comparable amounts of storage.
In any case, you save a lot of engineering effort.
However, @falexwolf might have another opinion, as he did a lot of benchmarking on this.

@ivirshup
Member

Thanks for the feedback!

There were some very cool PRs over the weekend that make this seem closer to reality, like pydata/sparse#261.

> However, a lot of frameworks like TileDB or Tensorflow support only COO anyway.

I think this is fine. On-the-fly conversion from COO to CSR or CSC should be easy enough. The main issue with COO right now is that scipy.sparse's version doesn't have subsetting, which makes it a pain to use here.
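
For illustration (a toy sketch, not anndata code), the conversion is a one-liner, while scipy's COO format offers no subsetting of its own:

```python
import numpy as np
import scipy.sparse as sp

X_coo = sp.random(100, 2000, density=0.05, format="coo")

rows = np.array([0, 3, 7])
# scipy's COO matrix cannot be indexed directly, so convert first:
X_sub = X_coo.tocsr()[rows, :]
```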

> I.e. take the coordinate index and the data array from your sparse matrix and save this as NetCDF4.
> This of course requires some wrapping inside AnnData or any other framework you want to use.

I'm not entirely sure what this entails. Will I be able to have a COO array and dense array with shared coordinates in a netcdf file? Or is that the wrapping you were referring to?

> IMHO, if possible I would prefer a dense matrix over a sparse one.

I don't think one is unequivocally better than the other for all operations. In my experience, reading the whole matrix into memory is much faster when it's sparse on disk. This may be less of an issue with more modern compression algorithms, but support is limited with hdf5.

To me, the main pain points with sparse representation are random access along non-compressed dimensions, library support (though this is fairly good for in-memory data), and chunking.

@Hoeze
Author

Hoeze commented Jul 15, 2019

> There were some very cool PRs over the weekend that make this seem closer to reality, like pydata/sparse#261.

Yes, with pydata/xarray#3117 this could finally happen soon!

> I.e. take the coordinate index and the data array from your sparse matrix and save this as NetCDF4.
> This of course requires some wrapping inside AnnData or any other framework you want to use.

> I'm not entirely sure what this entails. Will I be able to have a COO array and dense array with shared coordinates in a netcdf file? Or is that the wrapping you were referring to?

Yes, that's the wrapping problem:
NetCDF does not have (as far as I know) any conventions about storing sparse structures, trees, etc.
This means you have to store e.g. a sparse COO matrix as a coordinate matrix and a value vector.
When reading this data, you then have to wrap it with e.g. pydata/sparse, scipy.sparse or another language-dependent library.
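
Sticking with the ad-hoc `row`/`col`/`data` layout from the write-side sketch above (again purely illustrative, not a defined convention), the read-side wrapping with pydata/sparse could look like this:

```python
import numpy as np
import sparse  # pydata/sparse
import xarray as xr

ds = xr.open_dataset("sparse_X.nc")

# Rebuild the in-memory sparse array from the stored triplets and shape.
X = sparse.COO(
    coords=np.stack([ds["row"].values, ds["col"].values]),
    data=ds["data"].values,
    shape=(int(ds.attrs["n_obs"]), int(ds.attrs["n_var"])),
)
```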

However, as soon as xarray fully supports sparse arrays, it should handle this wrapping by itself.

> IMHO, if possible I would prefer a dense matrix over a sparse one.

> I don't think one is unequivocally better than the other for all operations. In my experience, reading the whole matrix into memory is much faster when it's sparse on disk. This may be less of an issue with more modern compression algorithms, but support is limited with hdf5.

> To me, the main pain points with sparse representation are random access along non-compressed dimensions, library support (though this is fairly good for in-memory data), and chunking.

TileDB would be very useful in this case. It is multithreaded and stores data in chunks, i.e. even non-compressed dimension lookups should be quite fast.
Unfortunately, TileDB's Python and R libraries are still in their infancy.

@shoyer

shoyer commented Jul 15, 2019

> However, as soon as xarray fully supports sparse arrays, it should handle this wrapping by itself.

Well, to be clear: it could handle the wrapping by itself. We would need to define a metadata convention (but this should be pretty simple/straightforward).
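
Purely as a hypothetical illustration of what such a convention might encode (nothing like this is defined by xarray or netCDF), attributes could declare which variables make up a sparse array:

```python
import scipy.sparse as sp
import xarray as xr

X = sp.random(100, 2000, density=0.05, format="coo")

ds = xr.Dataset(
    {
        "X_row": ("X_nnz", X.row),
        "X_col": ("X_nnz", X.col),
        "X_data": ("X_nnz", X.data),
    },
    attrs={
        # Hypothetical convention: declare "X" as sparse COO and point at
        # the variables holding its coordinates, values, and dense shape.
        "X_sparse_format": "COO",
        "X_sparse_shape": [100, 2000],
        "X_sparse_variables": "X_row X_col X_data",
    },
)
ds.to_netcdf("convention_example.nc")
```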

@ivirshup
Member

@shoyer, what would the goals of a netCDF-storable sparse array be for xarray? Would you just want to target reading the whole array into memory at once via xarray?

I see how that would be straightforward. If partial/chunked access for dask or keeping the data compatible with netCDF libraries are goals, I think it gets more complicated. Are these cases in scope for an xarray solution, or would this have to happen downstream?

@shoyer

shoyer commented Jul 17, 2019

Reading whole sparse arrays from a netCDF file at once seems like a good start, and something that could easily be done in xarray.

Eventually, it would probably be nice to have chunked/partial access, but that does seem much more complicated. I'm not sure netCDF is the right file format in that case, since you probably want a more intelligent (tree-like) on-disk indexing scheme, and netCDF's compression filters are not very flexible. Maybe this could be done more easily with zarr? Either way, xarray could wrap a third-party library that implements sparse arrays on disk.
