[proposal] Changing the backend to xarray #32
Sorry for the late response, I was on holidays. I looked into xarray in the beginning and decided against it because of the missing sparse-data support and the plain fact that things like scikit-learn only accept numpy arrays and sparse matrices as input. These days, we're putting a lot of thought into improving the backed infrastructure of anndata for chunked calculations. We might return to xarray for that reason. I can also keep you posted on the benchmarks here, soon.
We now support zarr, which is feature-comparable, so I guess this can be closed.
@Hoeze, do you have a sense of how sparse data could be handled with netCDF, or if anyone is working on it? I saw you had mentioned this on the xarray sparse issue, but haven't been able to find out too much myself. If we could conform more to a standard like netCDF, that could help with interchange as mentioned here: ivirshup/sc-interchange#5.
@ivirshup Yes, there are some things ongoing. This solution will likely only support the COO format for some time, until pydata/sparse supports CSD (see pydata/sparse#258). In the meantime you can still save the data in sparse format and wrap it yourself. IMHO, if possible, I would prefer a dense matrix over a sparse one.
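For illustration, "wrapping it yourself" could look like the following sketch (not code from this thread; names are illustrative): store the COO triplets of a scipy matrix as three plain 1-D arrays plus a shape attribute, which any netCDF/HDF5 container can hold as ordinary variables, and rebuild the matrix on read.

```python
import numpy as np
from scipy import sparse

# A small sparse matrix standing in for an expression matrix.
X = sparse.coo_matrix(
    (np.array([1.0, 2.0, 3.0]),      # non-zero values
     (np.array([0, 1, 2]),           # row indices
      np.array([2, 0, 1]))),         # column indices
    shape=(3, 3),
)

# "Wrap it yourself": keep only dense 1-D arrays plus the shape,
# which would go into plain netCDF variables and attributes.
stored = {
    "data": X.data,
    "row": X.row,
    "col": X.col,
    "shape": X.shape,
}

# Rebuild on read.
X2 = sparse.coo_matrix(
    (stored["data"], (stored["row"], stored["col"])),
    shape=stored["shape"],
)
assert (X2 != X).nnz == 0  # round-trip preserves every entry
```

The metadata convention mentioned later in the thread would essentially standardize which variable names and attributes mark such a group as a sparse array.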
Thanks for the feedback! There were some very cool PRs over the weekend that make this seem closer to reality, like pydata/sparse#261.
I think this is fine. On the fly conversion from COO to CSR or CSC should be easy enough. The main issue with COO right now is that
I'm not entirely sure what this entails. Will I be able to have a COO array and a dense array with shared coordinates in a netCDF file? Or is that the wrapping you were referring to?
I don't think one is unequivocally better than the other for all operations. In my experience, reading the whole matrix into memory is much faster when it's sparse on disk. This may be less of an issue with more modern compression algorithms, but support is limited with hdf5. To me, the main pain points with sparse representations are random access along non-compressed dimensions, library support (though this is fairly good for in-memory data), and chunking.
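Both points above can be illustrated with scipy (a sketch, not the thread authors' code): COO converts to CSR on the fly with a single pass over the triplets, and access along the compressed dimension (rows, for CSR) is index-assisted, while access along the non-compressed dimension must scan the index arrays of every row.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
# A random sparse matrix in COO format.
coo = sparse.random(1000, 500, density=0.01, format="coo", random_state=rng)

# On-the-fly conversion from COO: cheap, one pass over the triplets.
csr = coo.tocsr()

# Row access is fast: indptr points straight at the row's slice.
row = csr[42]

# Column access hits the non-compressed dimension and must scan
# the column indices of every row -- the random-access pain point.
col = csr[:, 7]

assert row.shape == (1, 500)
assert col.shape == (1000, 1)
```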
Yes, with pydata/xarray#3117 this could finally happen soon!
Yes, that's the wrapping problem. However, as soon as xarray fully supports sparse arrays, it should handle this wrapping by itself.
TileDB will be very useful in this case. It is multithreaded and stores data in chunks, i.e. even non-compressed dimension lookups should be quite fast.
Well, to be clear -- it could handle the wrapping by itself. We would need to define a metadata convention (but this should be pretty simple/straightforward).
@shoyer, what would the goals of a netCDF-storable sparse array be for xarray? Would you just want to target reading the whole array into memory at once via xarray? I see how this would be straightforward. If partial/chunked access for dask, or keeping the data compatible with netCDF libraries, are goals, I think it gets more complicated. Are these cases in scope for an xarray solution, or would this have to happen downstream?
Reading whole sparse arrays from a netCDF file at once seems like a good start, and something that could easily be done in xarray. Eventually, it would probably be nice to have chunked/partial access, but that does seem much more complicated. I'm not sure netCDF is the right file format in that case, since you probably want a more intelligent (tree like) on-disk indexing scheme and netCDF's compression filters are not very flexible. Maybe this could be done more easily with zarr? Either way, xarray could wrap a third-party library that implements sparse arrays on disk. |
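A minimal sketch of why partial access needs a smarter indexing scheme (pure numpy; zarr or netCDF would supply the actual chunked storage, and the layout here is an assumption): if the COO triplets are stored sorted by row, a binary search over the row array locates any row range without reading the whole matrix.

```python
import numpy as np

# COO triplets, assumed stored sorted by row on disk.
row = np.array([0, 0, 2, 2, 2, 5, 7])
col = np.array([1, 3, 0, 2, 4, 1, 3])
data = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])

def read_rows(lo, hi):
    """Return the triplets for rows lo..hi-1 via binary search,
    so only the needed slice of each 1-D array must be loaded."""
    start = np.searchsorted(row, lo, side="left")
    stop = np.searchsorted(row, hi, side="left")
    return row[start:stop], col[start:stop], data[start:stop]

r, c, d = read_rows(2, 3)  # only row 2
assert list(d) == [3.0, 4.0, 5.0]
```

A tree-like on-disk index generalizes this idea to ranges along more than one dimension, which is what plain netCDF compression filters don't provide.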
Xarray has a lot of advantages, e.g.:
The only big problem currently is the missing sparse-data support, but this will change (hopefully in the near future):
pydata/xarray#1375