[proposal] Changing the backend to xarray #32

Closed
Hoeze opened this issue Jun 16, 2018 · 9 comments

@Hoeze

Hoeze commented Jun 16, 2018

Xarray has a lot of advantages, e.g.:

  • named dimensions
  • Dask integration for multi-file datasets and chunked calculations for data not fitting into memory
  • Interoperability with numpy / pandas
  • NetCDF4 support, which would remove the need to design custom HDF formats

The only big problem currently is the missing sparse data support, but this will hopefully change in the near future:
pydata/xarray#1375
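
As a rough sketch (not from this thread) of what the named-dimension and Dask points could look like in practice, assuming an AnnData-like cells-by-genes matrix with made-up `obs`/`var` names and sizes:

```python
import numpy as np
import xarray as xr

# Toy dense "cells x genes" matrix; names and sizes are made up.
X = np.random.poisson(1.0, size=(100, 2000))

ds = xr.Dataset(
    {"X": (("obs", "var"), X)},
    coords={
        "obs": [f"cell_{i}" for i in range(X.shape[0])],
        "var": [f"gene_{j}" for j in range(X.shape[1])],
    },
)

# Named-dimension indexing instead of remembering axis positions:
subset = ds.sel(obs=["cell_0", "cell_1"])

# Dask-backed chunks for out-of-core work, then plain NetCDF4 on disk:
ds.chunk({"obs": 50}).to_netcdf("example.nc")
```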

@falexwolf
Member

Sorry for the late response, I was on holidays.

I looked into xarray in the beginning and decided against it because of the missing sparse data support and the plain fact that things like scikit-learn only accept numpy arrays and sparse matrices as input.

These days, we're putting a lot of thought into improving the backed infrastructure of anndata for chunked calculations. We might return to xarray for that reason. I can also keep you posted here on the benchmarks soon.

@flying-sheep
Member

We now support zarr, which is feature-comparable, so I guess this can be closed.

@ivirshup
Member

@Hoeze, do you have a sense of how sparse data could be handled with netCDF or if anyone is working on it? I saw you had mentioned this on the xarray sparse issue, but haven't been able to find out too much myself.

If we could conform more to a standard like netCDF, that could help with interchange as mentioned here: ivirshup/sc-interchange#5.

@Hoeze
Author

Hoeze commented Jul 10, 2019

@ivirshup Yes, there are some things ongoing.
The best bet for native sparse array support in xarray will be pydata/sparse.
However, you should talk to @shoyer about the native integration into xarray.
It would be awesome if someone would push this!

This solution will likely only support COO format for some time until pydata/sparse supports CSD (see pydata/sparse#258).
However, a lot of frameworks like TileDB or Tensorflow support only COO anyway.

In the meantime you can still save the data in sparse format and wrap it yourself.
I.e. take the coordinate index and the data array from your sparse matrix and save this as NetCDF4.
This of course requires some wrapping inside AnnData or any other framework you want to use.
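
A minimal sketch of that manual approach, assuming scipy and xarray; the `row`/`col`/`data` variable names and the `n_obs`/`n_var` attributes are an ad-hoc choice, not an established convention:

```python
import scipy.sparse as sp
import xarray as xr

# Toy sparse matrix standing in for an expression matrix.
X = sp.random(100, 2000, density=0.05, format="coo")

# Store the COO triplets as plain 1-D variables plus the dense shape.
ds = xr.Dataset(
    {
        "row": ("nnz", X.row),
        "col": ("nnz", X.col),
        "data": ("nnz", X.data),
    },
    attrs={"n_obs": X.shape[0], "n_var": X.shape[1]},
)
ds.to_netcdf("sparse_X.nc")
```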


IMHO, if possible I would prefer a dense matrix over a sparse one.
Everything with a sparsity ratio lower than 90-95% will very likely cost more processing power to decode than you can theoretically save, especially in cases where you have to convert it to dense format anyway.
Also, compression algorithms can save comparable amounts of storage.
In any case, you save a lot of engineering effort.
However, @falexwolf might have another opinion, as he did a lot of benchmarking on this.

@ivirshup
Member

Thanks for the feedback!

There were some very cool PRs over the weekend that make this seem closer to reality, like pydata/sparse#261.

> However, a lot of frameworks like TileDB or Tensorflow support only COO anyway.

I think this is fine. On-the-fly conversion from COO to CSR or CSC should be easy enough. The main issue with COO right now is that scipy.sparse's version doesn't have subsetting, which makes it a pain to use here.
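
For illustration (a toy sketch, not anndata code), the conversion is a one-liner, while scipy's COO format offers no subsetting of its own:

```python
import numpy as np
import scipy.sparse as sp

X_coo = sp.random(100, 2000, density=0.05, format="coo")

rows = np.array([0, 3, 7])
# scipy's COO matrix cannot be indexed directly, so convert first:
X_sub = X_coo.tocsr()[rows, :]
```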

> I.e. take the coordinate index and the data array from your sparse matrix and save this as NetCDF4.
> This of course requires some wrapping inside AnnData or any other framework you want to use.

I'm not entirely sure what this entails. Will I be able to have a COO array and dense array with shared coordinates in a netcdf file? Or is that the wrapping you were referring to?

> IMHO, if possible I would prefer a dense matrix over a sparse one.

I don't think one is unequivocally better than the other for all operations. In my experience, reading the whole matrix into memory is much faster when it's sparse on disk. This may be less of an issue with more modern compression algorithms, but support is limited with hdf5.

To me, the main pain points with sparse representation are random access along non-compressed dimensions, library support (though this is fairly good for in-memory data), and chunking.

@Hoeze
Author

Hoeze commented Jul 15, 2019

> There were some very cool PRs over the weekend that make this seem closer to reality, like pydata/sparse#261.

Yes, with pydata/xarray#3117 this could finally happen soon!

> I.e. take the coordinate index and the data array from your sparse matrix and save this as NetCDF4.
> This of course requires some wrapping inside AnnData or any other framework you want to use.

> I'm not entirely sure what this entails. Will I be able to have a COO array and dense array with shared coordinates in a netcdf file? Or is that the wrapping you were referring to?

Yes, that's the wrapping problem:
NetCDF does not have (as far as I know) any conventions about storing sparse structures, trees, etc.
This means you have to store e.g. a sparse COO matrix as a coordinate matrix and a value vector.
When reading this data, you then have to wrap it with e.g. pydata/sparse, scipy.sparse or another language-dependent library.
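
Sticking with the ad-hoc `row`/`col`/`data` layout from the write-side sketch above (again purely illustrative, not a defined convention), the read-side wrapping with pydata/sparse could look like this:

```python
import numpy as np
import sparse  # pydata/sparse
import xarray as xr

ds = xr.open_dataset("sparse_X.nc")

# Rebuild the in-memory sparse array from the stored triplets and shape.
X = sparse.COO(
    coords=np.stack([ds["row"].values, ds["col"].values]),
    data=ds["data"].values,
    shape=(int(ds.attrs["n_obs"]), int(ds.attrs["n_var"])),
)
```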

However, as soon as xarray fully supports sparse arrays, it should handle this wrapping by itself.

> IMHO, if possible I would prefer a dense matrix over a sparse one.

> I don't think one is unequivocally better than the other for all operations. In my experience, reading the whole matrix into memory is much faster when it's sparse on disk. This may be less of an issue with more modern compression algorithms, but support is limited with hdf5.

> To me, the main pain points with sparse representation are random access along non-compressed dimensions, library support (though this is fairly good for in-memory data), and chunking.

TileDB would be very useful in this case. It is multithreaded and stores data in chunks, i.e. even non-compressed dimension lookups should be quite fast.
Unfortunately, TileDB's Python and R libraries are still in their infancy.

@shoyer

shoyer commented Jul 15, 2019

> However, as soon as xarray fully supports sparse arrays, it should handle this wrapping by itself.

Well, to be clear: it could handle the wrapping by itself. We would need to define a metadata convention (but this should be pretty simple/straightforward).
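
Purely as a hypothetical illustration of what such a convention might encode (nothing like this is defined by xarray or netCDF), attributes could declare which variables make up a sparse array:

```python
import scipy.sparse as sp
import xarray as xr

X = sp.random(100, 2000, density=0.05, format="coo")

ds = xr.Dataset(
    {
        "X_row": ("X_nnz", X.row),
        "X_col": ("X_nnz", X.col),
        "X_data": ("X_nnz", X.data),
    },
    attrs={
        # Hypothetical convention: declare "X" as sparse COO and point at
        # the variables holding its coordinates, values, and dense shape.
        "X_sparse_format": "COO",
        "X_sparse_shape": [100, 2000],
        "X_sparse_variables": "X_row X_col X_data",
    },
)
ds.to_netcdf("convention_example.nc")
```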

@ivirshup
Member

@shoyer, what would the goals of a netCDF-storable sparse array be for xarray? Would you just want to target reading the whole array into memory at once via xarray?

I see how that would be straightforward. If partial/chunked access for dask or keeping the data compatible with netCDF libraries are goals, I think it gets more complicated. Are these cases in scope for an xarray solution, or would this have to happen downstream?

@shoyer

shoyer commented Jul 17, 2019

Reading whole sparse arrays from a netCDF file at once seems like a good start, and something that could easily be done in xarray.

Eventually, it would probably be nice to have chunked/partial access, but that does seem much more complicated. I'm not sure netCDF is the right file format in that case, since you probably want a more intelligent (tree-like) on-disk indexing scheme, and netCDF's compression filters are not very flexible. Maybe this could be done more easily with zarr? Either way, xarray could wrap a third-party library that implements sparse arrays on disk.
