Usage from R #18

Open
alimanfoo opened this issue Jul 20, 2018 · 51 comments

Comments

@alimanfoo
Member

It would be great to be able to use zarr format data from R. This issue is intended for discussing options for enabling/supporting usage from R.

@alimanfoo
Member Author

One option might be to use the zarr python package from R via reticulate. It would be good to try this out and find out if there are any interoperability issues. One way of doing this could be to try to run all the code examples from the zarr tutorial but from R via reticulate. Some benchmarking would probably also be useful, to identify any areas where performance is affected by having to move or translate data between R and python.

If it is a workable option, it might then be cool to write a version of the zarr tutorial but for R users, which could be based off the current zarr python tutorial but include any specific information that R users might need to be aware of.

@alimanfoo
Member Author

Another option could be to write R bindings for the z5 C++ library, e.g., via Rcpp. This would be more work but might offer better performance by avoiding the data transformations or copies that may be required when using reticulate.

@alimanfoo
Member Author

A technical point of interest: R arrays use column-major (Fortran) memory layout. Zarr can use either row-major (C) or column-major (F) layout for data within chunks, and the same layout is used when retrieving all or part of a zarr array into a numpy array. E.g.:

In [20]: z = zarr.zeros((100, 100), order='F')

In [21]: a = z[:]

In [22]: a.flags
Out[22]: 
  C_CONTIGUOUS : False
  F_CONTIGUOUS : True
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  UPDATEIFCOPY : False

So when using zarr from R, using order='F' should be more natural and give better performance.
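
To make the two layouts concrete, here is a minimal pure-Python sketch of how a 2-D index maps to a flat buffer offset under row-major (C) and column-major (F) order. The function name flat_index is hypothetical and is not part of zarr or numpy; it only illustrates the index arithmetic.

```python
def flat_index(index, shape, order="C"):
    """Return the offset of `index` in a flat buffer holding an array of `shape`."""
    if order not in ("C", "F"):
        raise ValueError("order must be 'C' or 'F'")
    pairs = list(zip(index, shape))
    if order == "C":
        # In C order the last axis varies fastest, so start from it.
        pairs.reverse()
    offset, stride = 0, 1
    for i, n in pairs:
        offset += i * stride
        stride *= n
    return offset


shape = (2, 3)
for order in ("C", "F"):
    # List the (row, column) pairs in the order they appear in memory.
    layout = sorted(
        ((r, c) for r in range(shape[0]) for c in range(shape[1])),
        key=lambda rc: flat_index(rc, shape, order),
    )
    print(order, layout)
```

With order "F", elements of the same column sit next to each other in the buffer, which matches how R lays out its matrices.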

@mikejiang

mikejiang commented Sep 17, 2018

I could be wrong, but from my understanding, isn't zarr more of a software library that uses a key-value store and chunked, compressed storage to provide an efficient on-disk array solution? That is to say, being able to load zarr data in R is far from having a full-fledged R package that accesses a zarr backend as efficiently as the current Python library (even if R bindings for the z5 library are implemented).
Could you provide more insight into the amount of software engineering effort required to bring zarr to R without reticulate?

@alimanfoo
Member Author

alimanfoo commented Sep 18, 2018 via email

@alimanfoo alimanfoo transferred this issue from zarr-developers/zarr-python Jul 3, 2019
@gdkrmr

gdkrmr commented Nov 12, 2019

I took a stab at wrapping z5 from R. It currently compiles, but it is not functional yet and there is still quite a lot of work to do, so don't judge me :-).
I am sharing this to avoid duplicated effort; anyone who wants to can join development.

https://github.com/gdkrmr/zarr-R

@constantinpape

I took a stab at wrapping z5 from R, it currently compiles, but it is not functional yet and there is still quite a lot of work to do, so don't judge me

Let me know if you have any questions or need any help from the z5 side.

@alimanfoo
Member Author

alimanfoo commented Nov 12, 2019 via email

@gdkrmr

gdkrmr commented Nov 12, 2019

Let me know if you have any questions or need any help from the z5 side.

Thanks, what is the ETA for v2.0.0?

@constantinpape

Thanks, what is the ETA for v2.0.0?

The API redesign is done, I just need to test it a bit more.
Initially, my plan was to wait for the implementation of the S3 backend.
I have put in a bit of work into this, but it's not quite done yet and I don't have time to finish it right now.
(I was initially hoping for some external contributions to the S3 part, but this hasn't happened yet).

Anyway, I think I will release 2.0.0 without S3 or other cloud backends and push this to 2.1.0.
I can probably do it next week. I will let you know once it's there.

@jakirkham
Member

@gdkrmr, it would be great if you could hop on one of our meetings ( #1 ), am sure others would be interested in hearing about your work and how we can help you.

@gdkrmr

gdkrmr commented Jan 24, 2020

  • I have something basic working for z5 v2, currently only Int32 and Float64 (the built-in types of R).
  • I still need to think how to deal with the other data types (e.g. Booleans).
  • It needs a nice and simple high level interface, currently everything follows the z5 design pretty closely, which is not really suitable for the average R user.
  • It produces an .so file that is almost 40 MB large :-)
  • I will be happy to join your next meeting!

@constantinpape

* I still need to think how to deal with the other data types (e.g. Booleans).

Just fyi, I don't support bool right now in z5.
@alimanfoo @jakirkham Are there any optimisations when zarr stores bools or is it storing a bool as one byte?

* It produces an .so file that is almost 40 MB large :-)

Interesting, for the Python bindings the .so is quite a bit smaller, ~2.5 MB (built on Ubuntu 18 with gcc 7 and a Release build).

@jakirkham
Member

I don't think we are doing anything special. Though could imagine one implementing a bit packing codec.
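
As a rough illustration of the bit-packing idea, the sketch below stores eight booleans per byte instead of one. It is plain Python, the names pack_bools and unpack_bools are hypothetical, and the bit order (most significant bit first) is an arbitrary assumption rather than the format of any existing codec.

```python
def pack_bools(flags):
    """Pack booleans into bytes, eight per byte, most significant bit first."""
    out = bytearray((len(flags) + 7) // 8)
    for i, flag in enumerate(flags):
        if flag:
            out[i // 8] |= 0x80 >> (i % 8)
    return bytes(out)


def unpack_bools(data, count):
    """Recover `count` booleans from bytes produced by pack_bools."""
    return [bool(data[i // 8] & (0x80 >> (i % 8))) for i in range(count)]


flags = [True, False, True, True, False, False, False, False, True]
packed = pack_bools(flags)
print(len(flags), "booleans stored in", len(packed), "bytes")
print(unpack_bools(packed, len(flags)) == flags)
```

The element count has to be kept alongside the packed bytes, because the last byte may contain padding bits.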

Maybe there are some compiler flags that can help?

@constantinpape

Maybe there are some compiler flags that can help?

Probably yes.

@gdkrmr What operating system are you using and which compiler?
Are you using CMake? If so, maybe try compiling with Release or with MinSizeRel.

@alimanfoo
Member Author

alimanfoo commented Jan 27, 2020 via email

@gdkrmr

gdkrmr commented Jan 28, 2020

* I still need to think how to deal with the other data types (e.g. Booleans).

Just fyi, I don't support bool right now in z5.

@alimanfoo @jakirkham Are there any optimisations when zarr stores bools or is it storing a bool as one byte?

R stores bools as bytes (EDIT: no, they are stored as int32), because there is also an NA for bools. So I guess the way to go is to add an argument when reading that converts the data into either R integers or bools.

Maybe there are some compiler flags that can help?

Probably yes.

@gdkrmr What operating system are you using and which compiler?
Are you using CMake? If so, maybe try compiling with Release or with MinSizeRel.

Ubuntu 16.04 and I have to use the R build system, which uses Makefiles.

* It produces an .so file that is almost 40 MB large :-)

Interesting, for the Python bindings the .so is quite a bit smaller, ~2.5 MB (built on Ubuntu 18 with gcc 7 and a Release build).

I can get the size of the .so down to < 1MB if I strip debug symbols or use link time optimization.
I asked on the R developers mailing list: the CRAN (the official R package repository) policy is quite restrictive about these kinds of flags, so we have to live with it. Ironically, their checker throws a warning if the .so gets too large :-). I just found this a curious fact, nothing to really worry about.

EDIT: R stores logicals as int32, not uint8.
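
To illustrate the conversion discussed above, here is a minimal sketch in plain Python that widens one-byte booleans into the 32-bit integers R uses for logical vectors. R marks NA in logical vectors with the smallest 32-bit integer. The function name, the na_mask argument, and the little-endian byte order are assumptions made for the example; zarr booleans carry no missing-value marker of their own.

```python
import struct

# R represents NA in logical (and integer) vectors as the smallest int32.
NA_LOGICAL = -2**31


def bool_bytes_to_r_logical(raw, na_mask=None):
    """Widen one-byte booleans to little-endian int32 values (0, 1, or NA)."""
    values = [1 if b else 0 for b in raw]
    if na_mask is not None:
        # Hypothetical mask: True marks an element that should become NA.
        values = [NA_LOGICAL if na else v for v, na in zip(values, na_mask)]
    return struct.pack(f"<{len(values)}i", *values)


raw = bytes([1, 0, 1])
widened = bool_bytes_to_r_logical(raw, na_mask=[False, False, True])
print(len(raw), "bytes in,", len(widened), "bytes out")
print(struct.unpack("<3i", widened))
```

The fourfold increase in size is the cost of R's representation; reading booleans as R integers instead would take the same amount of memory.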

@gdkrmr

gdkrmr commented Jul 31, 2020

What is the state of Zarr support in R? I haven't looked at my package for a while and wonder whether someone else has worked on this in the meantime or is planning to.

@LTLA

LTLA commented Dec 11, 2020

I'm late to the party, but Googling most permutations of "zarr for R" gives this thread as the top hit, then @gdkrmr's repo, and Bioconductor's ZarrExperiment (I'll get to this later). So I'd guess your stuff is still the best we've got right now.

If you're planning to keep working on your zarr R package, I'd be willing to test it out on some genome-scale data. I've been eyeing some alternatives to HDF5 for a while and would be very interested in building on top of whatever you make.

Our current approach in ZarrExperiment just does the simple thing of dispatching to the Python library via reticulate. A native port would be much preferred if it is feasible. If your package gets more mature, we would use it to create a DelayedArray backend for zarr that would work in all analysis pipelines as a plug-and-play replacement for HDF5.

(Maybe you should call the package zarrr, ho ho ho.)

@ocefpaf

ocefpaf commented Apr 28, 2021

I'm late to the party, but Googling most permutations of "zarr for R" gives this thread as the top hit, then @gdkrmr's repo, and Bioconductor's ZarrExperiment (I'll get to this later). So I'd guess your stuff is still the best we've got right now.

Same here. I'll be teaching a workshop for R users soon and I was wondering about zarr support. So far I got it via nczarr. See the last cell of this notebook. But it would be nice to add alternatives that don't require a netcdf installation.

@joshmoore
Member

Would it help to get zarrrrrr interested parties together at the next community meeting (May 5th) to discuss a path forward?

From my side, I'd love to see one (or more?) R implementation in https://github.com/zarr-developers/zarr_implementations/

cc: @gdkrmr @keller-mark @ocefpaf @LTLA (@dominikl? @jkh1?)

@jkh1

jkh1 commented Apr 29, 2021

Count me in. As a regular R user, this is something I've been thinking about recently. I'd favour the C++/Rcpp path over the reticulate approach as I've had issues with reticulate before (in my experience, R doesn't always play well with the various python envs/conda).

@davidbrochart

Maybe we could provide R bindings of xtensor-zarr? We already do that for xtensor, and there exists an R package for xtensor already. We could improve this package so that it allows Zarr access, and users could use the same package for array processing. The package would then be equivalent to something like Zarr + NumPy.

@gdkrmr

gdkrmr commented Apr 29, 2021

The netcdf-c library has added support for zarr files. netcdf-c is the basis for the R package ncdf4. There is a discussion on how to get it working in R at Unidata/netcdf-c#1982.

@keller-mark

This sounds great! I started an extremely rough pure R function for producing a single Zarr chunk from an R matrix here https://github.com/vitessce/vitessce-r/blob/keller-mark/zarr/R/zarr.R#L215 in case anyone is interested. Unfortunately I cannot attend at 2pm eastern time on May 5th due to a conflict but perhaps @ilan-gold @manzt @th789 @mccalluc are interested
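
For readers unfamiliar with the on-disk layout, the following is a minimal sketch of what a single-chunk, uncompressed Zarr v2 array looks like in a directory. It is written in Python with only the standard library rather than in R, it is not the vitessce-r function linked above, and the function name and example path are hypothetical.

```python
import json
import struct
import tempfile
from pathlib import Path


def write_single_chunk_array(root, rows):
    """Write a 2-D float64 matrix as an uncompressed, single-chunk Zarr v2 array."""
    n_rows, n_cols = len(rows), len(rows[0])
    meta = {
        "zarr_format": 2,
        "shape": [n_rows, n_cols],
        "chunks": [n_rows, n_cols],  # one chunk covering the whole array
        "dtype": "<f8",              # little-endian float64
        "compressor": None,          # raw bytes, no compression
        "fill_value": 0.0,
        "order": "C",                # row-major within the chunk
        "filters": None,
    }
    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)
    (root / ".zarray").write_text(json.dumps(meta))
    # Flatten row by row to match order "C", then store as chunk "0.0".
    flat = [value for row in rows for value in row]
    (root / "0.0").write_bytes(struct.pack(f"<{len(flat)}d", *flat))
    return root


with tempfile.TemporaryDirectory() as tmp:
    path = write_single_chunk_array(Path(tmp) / "example.zarr", [[1.0, 2.0], [3.0, 4.0]])
    print(sorted(p.name for p in path.iterdir()))
```

An R matrix is stored column by column, so an R implementation would either transpose before writing with order "C" or declare order "F" in the metadata.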

@gdkrmr

gdkrmr commented Apr 30, 2021

I will try to attend but cannot make any promises.

@joshmoore
Member

Looks like the time slot didn't work out for R folks. No worries. Note that the 19th is cancelled; we'll be back on the regular zoom on the 2nd though. If a different time slot would be better, feel free to say the word.

@ocefpaf

ocefpaf commented May 5, 2021

The netcdf-c library has added support for zarr files. netcdf-c is the basis for the R package ncdf4. There is a discussion on how to get it working in R at Unidata/netcdf-c#1982.

In a way that already works. See the last cell of https://nbviewer.jupyter.org/gist/ocefpaf/4a078b19db4fd5507d2d21691abaa689

But nczarr is not exactly the same as zarr. I'm not well versed in the details but maybe a core zarr (c/c++/rust, whatever) that we can wrap in Python and R is still needed?

@joshmoore
Member

@ocefpaf : I only know what's on the docs and what I've tested on the CLI, but my understanding was that nczarr has a mode to work with pure Zarr that may be of interest. I'd defer to @DennisHeimbigner whether a portion of the library could be used as a core.

@DennisHeimbigner

Josh is correct. We support pure zarr read/write, so as long as you are willing to live
with the restricted meta-data of pure zarr you can use netcdf-c for pure V2 zarr.
The next netcdf-c release (version 4.8.1) will also add support for the Xarray convention
for named dimensions.
As for pulling out pieces, that is doable. As is usual, the documentation could be improved.
Much of the code uses the netcdf internal data structures for implementing the netcdf-C
API. But at least these parts might be usable.

  1. the code for caching and reading/writing chunks
  2. the code that reads/writes zarr metadata
  3. the code that wraps access to the underlying storage, e.g. files, zip files, and S3

@DennisHeimbigner

BTW you could try this experiment with R wrapping netcdf-4.8.0

  1. Take a simple R program that creates a netcdf .nc file, call it simple.nc
  2. Modify the program so that, instead of calling the R equivalent of nc_create("simple.nc"...), it calls the equivalent of nc_create("file://simple.zarr#mode=zarr,file",...)

This should create a directory called "simple.zarr" that contains a pure zarr container.
The name "simple.zarr" is not special, you can call it whatever you want.
If you try this then let me know what happens.
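
The URL form used in the experiment above can be assembled and taken apart with a few lines of string handling. This is only a sketch: nczarr_url and split_nczarr_url are hypothetical helper names, netcdf-c does its own parsing, and only the single mode= fragment shown in this thread is handled.

```python
def nczarr_url(path, modes=("zarr", "file")):
    """Build a URL of the form file://<path>#mode=<comma-separated modes>."""
    return f"file://{path}#mode={','.join(modes)}"


def split_nczarr_url(url):
    """Split a URL into its location and the list of modes in its fragment."""
    location, _, fragment = url.partition("#")
    key, _, value = fragment.partition("=")
    modes = value.split(",") if key == "mode" else []
    return location, modes


print(nczarr_url("simple.zarr"))
print(split_nczarr_url("file:///Users/me/image.ome.zarr#mode=nczarr,zarr"))
```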

@jkh1

jkh1 commented Sep 6, 2021

@schienstockd

I have a beginner's question about opening zarr files with netcdf.

I have built netcdf-c with zarr support (I think) and then built the R package ncdf4.
I am not sure how to open the dataset from there. I have an OME-ZARR file generated with bioformats2raw and tried to open the file like this:

library(ncdf4)

# open file
ncin <- nc_open(
  "file:///Users/me/image.ome.zarr#mode=nczarr,zarr"
)
ncin
# Error in R_nc4_inq: NetCDF: Invalid argument
# Error in nc_get_grp_info(gids[ib], root_group$fqgn, format) : 
#   nc_get_grp_info: R_nc4_inq returned error on group id 524289

@gdkrmr

gdkrmr commented Oct 30, 2021

I couldn't get it to work either, see Unidata/netcdf-c#1982

@DennisHeimbigner

A couple of things.

  1. It appears you are using netcdf-c version 4.8.1, correct?
  2. Try this command to rule out any R interference:
    ncdump -h "file:///Users/me/image.ome.zarr#mode=nczarr,zarr"

@DennisHeimbigner

BTW what operating system are you using?

@schienstockd

MacOS Catalina 10.15.5

It is a multiscale OME-zarr where the image is at the path /0/0 within the file.
When I try to read from the root, the file is not found.
When I try to read from the dataset path directly, the header seems to be empty:

> ncdump -h "file:///Users/me/ccidImage.ome.zarr#mode=nczarr,zarr"
ncdump: file:///Users/me/ccidImage.ome.zarr#mode=nczarr,zarr: No such file or directory

> ncdump -h "file:///Users/me/ccidImage.ome.zarr/0/0#mode=nczarr,zarr"
netcdf \0 {
}

> nc-config --all

This netCDF 4.8.1-development has been built with the following features: 

  --cc            -> /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/cc
  --cflags        -> -I/usr/local/include
  --libs          -> -L/usr/local/lib -lnetcdf
  --static        -> -lhdf5_hl -lhdf5 -lsz -lz -ldl -lm -lsz -lcurl -lzip

  --has-c++       -> no
  --cxx           -> 

  --has-c++4      -> yes
  --cxx4          -> /usr/local/Homebrew/Library/Homebrew/shims/mac/super/clang++
  --cxx4flags     -> -I/usr/local/Cellar/osgeo-netcdf/4.7.4/include
  --cxx4libs      -> -L/usr/local/Cellar/osgeo-netcdf/4.7.4/lib -lnetcdf-cxx4 -lnetcdf

  --has-fortran   -> yes
  --fc            -> /usr/local/bin/gfortran
  --fflags        -> /usr/local/Cellar/osgeo-netcdf/4.7.4/include
  --flibs         -> -L/usr/local/Cellar/osgeo-netcdf/4.7.4/lib
  --has-f90       -> TRUE
  --has-f03       -> FALSE

  --has-dap       -> yes
  --has-dap2      -> yes
  --has-dap4      -> yes
  --has-nc2       -> yes
  --has-nc4       -> yes
  --has-hdf5      -> yes
  --has-hdf4      -> no
  --has-logging   -> no
  --has-pnetcdf   -> no
  --has-szlib     -> yes
  --has-cdf5      -> yes
  --has-parallel4 -> no
  --has-parallel  -> no
  --has-nczarr    -> yes

  --prefix        -> /usr/local
  --includedir    -> /usr/local/include
  --libdir        -> /usr/local/lib
  --version       -> netCDF 4.8.1-development

@joshmoore
Member

When I try to read from the dataset path directly the header seems to be empty:

Could this be related to the "dimension_separator" metadata, @DennisHeimbigner ? @schienstockd , can you show us the content of 0/0/.zarray?

@DennisHeimbigner

I think I see the problem. I use a heuristic to break a key into the variable key
and the chunk index/key. The heuristic says to get the longest suffix of integers
as the chunk index. So, in this case it is eating up too much of the key as the chunk index.
I can fix it, but out of curiosity, why do you have a variable named "0"?
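
To show why this goes wrong for a group and array both named "0", here is a small Python sketch of the heuristic as described above: take the longest run of trailing all-integer segments as the chunk index. It is an illustration of the description only, not the netcdf-c code, and the function name split_key is hypothetical.

```python
def split_key(key, separator="/"):
    """Split a store key into (variable path, chunk index segments).

    The chunk index is taken to be the longest run of trailing segments
    that consist only of digits.
    """
    parts = key.split(separator)
    cut = len(parts)
    while cut > 0 and parts[cut - 1].isdigit():
        cut -= 1
    return separator.join(parts[:cut]), parts[cut:]


# A named variable is separated from its chunk index as intended.
print(split_key("temperature/0/1"))
# With group "0" and array "0", the names are absorbed into the chunk index.
print(split_key("0/0/3/1/2/0/0"))
```

In the second call the variable path comes back empty, which is consistent with the empty header reported above.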

@schienstockd

@joshmoore

0/0/.zarray

{
  "chunks" : [
    1,
    1,
    1,
    512,
    512
  ],
  "compressor" : {
    "clevel" : 5,
    "blocksize" : 0,
    "shuffle" : 1,
    "cname" : "lz4",
    "id" : "blosc"
  },
  "dtype" : ">u2",
  "fill_value" : 0,
  "filters" : null,
  "order" : "C",
  "shape" : [
    180,
    4,
    8,
    512,
    512
  ],
  "zarr_format" : 2,
  "dimension_separator" : "/"
}

I am not sure where the '0' variable comes from. I used bioformats2raw to convert the image.
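
For reference, the metadata above determines how many chunks the array has and which store key holds a given element. The sketch below works that out in plain Python from a subset of the fields shown; chunk_grid and chunk_key are hypothetical helper names, and the fallback separator "." is the Zarr v2 default when "dimension_separator" is absent.

```python
import json
import math

# Subset of the .zarray fields posted above.
zarray = json.loads("""
{"chunks": [1, 1, 1, 512, 512],
 "shape": [180, 4, 8, 512, 512],
 "dtype": ">u2",
 "dimension_separator": "/"}
""")


def chunk_grid(meta):
    """Number of chunks along each dimension."""
    return [math.ceil(s / c) for s, c in zip(meta["shape"], meta["chunks"])]


def chunk_key(meta, index):
    """Store key, relative to the array, of the chunk holding element `index`."""
    sep = meta.get("dimension_separator", ".")
    return sep.join(str(i // c) for i, c in zip(index, meta["chunks"]))


grid = chunk_grid(zarray)
print(grid, math.prod(grid))
print(chunk_key(zarray, (179, 3, 7, 100, 100)))
```

Because the separator is "/", every chunk key is itself a nested path of integers below the array path 0/0.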

@jakirkham
Member

Looks like there is a very rough Zarr implementation in R

https://github.com/keller-mark/pizzarr

cc @keller-mark (hopefully I've clarified that correctly; please feel free to correct me if not)

@keller-mark

Yes very rough indeed. Of course open to contributions or more detailed feature requests / issues.

@joshmoore
Member

See discussion post under zarr-developers/zarr-python#1088

cc: @mike-lawrence

@bart1

bart1 commented Oct 10, 2022

The stars package seems to have an implementation (I did not test it): https://r-spatial.org/r/2022/09/13/zarr.html

@mike-lawrence

The stars package seems to have an implementation (I did not test it): https://r-spatial.org/r/2022/09/13/zarr.html

I think stars only provides read access, no write.

@mike-lawrence

Seems to be solid progress here

@jkh1

jkh1 commented Apr 30, 2023

The Rarr package is now on Bioconductor. The repository is here. It's written in C and writing is supported although for now limited to double and string types.

@mike-lawrence

The Rarr package is now on Bioconductor. The repository is here. It's written in C and writing is supported although for now limited to double and string types.

Cool! I always forget to check Bioconductor for packages 🤦‍♂️

@keller-mark

Hi all, update on pizzarr: some things are working now!

  • Reading/writing of integer and float arrays (v2 spec)
  • 3 stores: MemoryStore, DirectoryStore, and HttpStore.
  • 2 types of compression: LZ4 and Zstd
  • Convenience functions for arrays/groups
  • R-like (one-based) and Python-like (zero-based) slicing
  • List of some remaining features to implement at https://github.com/keller-mark/pizzarr/issues
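
The difference between the two slicing conventions mentioned in the list above comes down to a small index shift. The following Python sketch converts an R-style range (one-based, inclusive at both ends) into a Python slice (zero-based, exclusive at the end). The function name is hypothetical and this is not pizzarr's API.

```python
def r_range_to_slice(start, stop):
    """Convert an R-style range start:stop (one-based, inclusive) to a Python slice."""
    if start < 1 or stop < start:
        raise ValueError("expected 1 <= start <= stop")
    # Shift the start down by one; the inclusive stop equals the exclusive end.
    return slice(start - 1, stop)


data = list(range(10, 20))
# R's data[1:3] selects the first three elements.
print(data[r_range_to_slice(1, 3)])
```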

I have updated the docs a bit, with a simple OME-NGFF demo at https://keller-mark.github.io/pizzarr/articles/ome-ngff.html

@sanketverma1704
Member

Thanks for working on Pizzarr and updating us, @keller-mark.

May I add this to our website (https://zarr.dev/implementations/)?

@keller-mark

@MSanKeys963 Yes feel free to add! Thanks!
