obstore-based Store implementation #1661

Open
wants to merge 93 commits into
base: main

Conversation

kylebarron

@kylebarron kylebarron commented Feb 8, 2024

A Zarr store based on obstore, which is a Python library that uses the Rust object_store crate under the hood.

object-store is a Rust crate for interoperating with remote object stores like S3, GCS, Azure, etc. See the highlights section of its docs.

obstore maps async Rust functions to async Python functions and can stream GET and LIST requests, both of which make it a good candidate for use with the Zarr v3 Store protocol.

You should be able to test this branch with the latest pre-release version of obstore:

pip install --pre --upgrade obstore

TODO:

  • Examples
  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/tutorial.rst
  • Changes documented in docs/release.rst
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@jhamman
Member

jhamman commented Feb 8, 2024

Amazing @kylebarron! I'll spend some time playing with this today.

@kylebarron
Author

With roeap/object-store-python#9 it should be possible to fetch multiple ranges within a file concurrently with range coalescing (using get_ranges_async). Note that this object-store API accepts multiple ranges within one object, so it is still not fully aligned with Zarr's get_partial_values, which allows fetches across multiple objects.
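
To illustrate the mismatch, here is a minimal sketch of how requests that span several objects could be grouped per object and then coalesced before being handed to a per-object ranges call. The helper names (`coalesce`, `group_by_key`), the `max_gap` parameter, and the `(key, start, end)` request shape with an exclusive `end` are assumptions made for illustration; they are not taken from obstore, object-store, or Zarr.

```python
from collections import defaultdict


def coalesce(ranges, max_gap=0):
    """Merge (start, end) byte ranges (end exclusive) separated by at most max_gap bytes."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous range: extend it instead of adding a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def group_by_key(requests):
    """Group hypothetical (key, start, end) requests by object key."""
    grouped = defaultdict(list)
    for key, start, end in requests:
        grouped[key].append((start, end))
    return dict(grouped)


# Requests across two objects, as a multi-object partial read might produce.
requests = [("c/0", 0, 10), ("c/1", 5, 20), ("c/0", 12, 30), ("c/0", 100, 110)]

# One coalesced list of ranges per object; each list could feed one per-object call.
plan = {key: coalesce(r, max_gap=4) for key, r in group_by_key(requests).items()}
```

Each entry of `plan` then corresponds to one single-object request, and the per-object requests could be issued concurrently.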

That PR also adds a get_opts function which now supports "offset" and "suffix" ranges, of the sort Range:N- and Range:-N, which would allow removing the raise NotImplementedError on line 37.
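
For reference, the three kinds of range map onto standard HTTP `Range` header values roughly as sketched below. The function name `range_header` and its keyword arguments are hypothetical, and treating `end` as exclusive is an assumption of this sketch; it does not reflect the `get_opts` signature.

```python
def range_header(start=None, end=None, suffix=None):
    """Build an HTTP Range header value for a bounded, offset, or suffix range.

    `end` is treated as exclusive here (an assumption); HTTP byte ranges are inclusive.
    """
    if suffix is not None:
        if start is not None or end is not None:
            raise ValueError("suffix cannot be combined with start or end")
        return f"bytes=-{suffix}"      # last `suffix` bytes of the object
    if start is None:
        raise ValueError("start is required unless suffix is given")
    if end is None:
        return f"bytes={start}-"       # from `start` to the end of the object
    return f"bytes={start}-{end - 1}"  # bounded range, converted to inclusive


bounded = range_header(start=0, end=10)
offset = range_header(start=10)
suffix = range_header(suffix=5)
```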

@martindurant
Member

martindurant/rfsspec#3

@normanrz
Member

Great work @kylebarron!
What are everybody's thoughts on having this in zarr-python vs. spinning it out as a separate package?

@martindurant
Member

What are everybody's thoughts on having this in zarr-python vs. spinning it out as a separate package?

I suggest we see whether it makes any improvements first, so it's author's choice for now.

@kylebarron
Author

While @rabernat has seen some impressive perf improvements in some settings when making many requests with Rust's tokio runtime, which would possibly also trickle down to a Python binding, the biggest advantage I see is improved ease of use in installation.

A common hurdle I've seen is handling dependency management, especially around boto3, aioboto3, etc dependencies. Versions need to be compatible at runtime with any other libraries the user also has in their environment. And Python doesn't allow multiple versions of the same dependency at the same time in one environment. With a Python library wrapping a statically-linked Rust binary, you can remove all Python dependencies and remove this class of hardship.

The underlying Rust object-store crate is stable and under open governance via the Apache Arrow project. We'll just have to wait on some discussion in object-store-python for exactly where that should live.

I don't have an opinion myself on where this should live, but it should be on the order of 100 lines of code wherever it is (unless the v3 store api changes dramatically)

@jhamman
Member

jhamman commented Feb 12, 2024

I suggest we see whether it makes any improvements first, so it's author's choice for now.

👍

What are everybody's thoughts on having this in zarr-python vs. spinning it out as a separate package?

I want to keep an open mind about what the core stores provided by Zarr-Python are. My current thinking is that we should just do a MemoryStore and a LocalFilesystemStore. Everything else can be opt-in by installing a 3rd party package. That said, I like having a few additional stores in the mix as we develop the store interface since it helps us think about the design more broadly.

@martindurant
Member

A common hurdle I've seen is handling dependency management, especially around boto3, aioboto3, etc dependencies.

This is no longer an issue, s3fs has much more relaxed deps than it used to. Furthermore, it's very likely to be already part of an installation environment.

@normanrz
Member

I want to keep an open mind about what the core stores provided by Zarr-Python are. My current thinking is that we should just do a MemoryStore and a LocalFilesystemStore. Everything else can be opt-in by installing a 3rd party package.

I agree with that. I think it is beneficial to keep the number of dependencies of core zarr-python small. But I am open to discussion.

That said, I like having a few additional stores in the mix as we develop the store interface since it helps us think about the design more broadly.

Sure! That is certainly useful.

@jhamman jhamman added the V3 label Feb 13, 2024
@itsgifnotjiff

This is awesome work, thank you all!!!

Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
@kylebarron
Author

The object-store-python package is not very well maintained (see roeap/object-store-python#24), so I took a few days to implement my own wrapper around the Rust object_store crate: https://github.com/developmentseed/object-store-rs

I'd like to update this PR soonish to use that library instead.

@martindurant
Member

If the zarr group prefers object-store-rs, we can move it into the zarr-developers org, if you like. I would like to be involved in developing it, particularly if it can grow more explicit fsspec-compatible functionality.

@kylebarron
Author

kylebarron commented Oct 22, 2024

I have a few questions because the Store API has changed a bit since the spring.

  • There's a new BufferPrototype object. Is the BufferPrototype chosen by the store implementation or the caller? It would be very nice if this prototype could be chosen by the store implementation, because then we could return a RustBuffer object that implements the Python buffer protocol, but doesn't need to copy the buffer into Python memory.
  • Similarly for puts, is Buffer guaranteed to implement the buffer protocol? Unlike fetching, we can't do zero-copy puts right now with object-store.
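
As background on why the buffer protocol matters here, the following sketch contrasts a zero-copy view with a copy using only the standard library. The `bytearray` stands in for memory owned by another library (such as the RustBuffer mentioned above); nothing here uses Zarr's Buffer or BufferPrototype classes.

```python
# Memory owned by "someone else"; a bytearray is only a stand-in for this sketch.
backing = bytearray(b"chunk-bytes")

view = memoryview(backing)  # buffer protocol: shares the same memory, no copy
copied = bytes(backing)     # explicit copy into a new Python bytes object

# Mutate the backing memory in place (same length, so no resize is needed).
backing[0:5] = b"CHUNK"

view_sees_change = bytes(view[:5]) == b"CHUNK"  # the view reflects the mutation
copy_sees_change = copied[:5] == b"CHUNK"       # the copy kept the old contents
```

A consumer that accepts any object supporting the buffer protocol can work on `view` directly, which is what would make returning a Rust-owned buffer without copying possible in principle.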

I like that list now returns an AsyncGenerator. That aligns well with the underlying object-store Rust API, but for technical reasons we can't expose that as an async iterable to Python yet (apache/arrow-rs#6587), even though we do expose the readable stream to Python as an async iterable.
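
A rough sketch of what an async-generator-style list looks like from the caller's side, using only asyncio. `fetch_page` is a hypothetical stand-in for a paged LIST request against an in-memory list of keys; it is not an obstore or Zarr function, and the paging scheme is an assumption.

```python
import asyncio
from collections.abc import AsyncGenerator


async def fetch_page(keys, token, page_size):
    """Hypothetical paged LIST call: return one page of keys and the next token."""
    await asyncio.sleep(0)  # yield to the event loop, as a network call would
    page = keys[token : token + page_size]
    next_token = token + page_size if token + page_size < len(keys) else None
    return page, next_token


async def list_keys(keys, page_size=2) -> AsyncGenerator[str, None]:
    """Yield keys one at a time while pages are fetched lazily."""
    token = 0
    while token is not None:
        page, token = await fetch_page(keys, token, page_size)
        for key in page:
            yield key


async def collect(keys):
    # Callers consume the listing with `async for`, without waiting for all pages.
    return [key async for key in list_keys(keys)]


result = asyncio.run(collect(["a", "b", "c", "d", "e"]))
```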

@TomAugspurger
Contributor

Is the BufferPrototype chosen by the store implementation or the caller? It would be very nice if this prototype could be chosen by the store implementation, because then we could return a RustBuffer object that implements the Python buffer protocol, but doesn't need to copy the buffer into Python memory.

This came up in the discussion at https://github.com/zarr-developers/zarr-python/pull/2426/files/5e0ffe80d039d9261517d96ce87220ce8d48e4f2#diff-bb6bb03f87fe9491ef78156256160d798369749b4b35c06d4f275425bdb6c4ad. By default, it's passed as default_buffer_prototype, though I think the user can override it at the call site or globally.

Does it look compatible with what you need?

@kylebarron kylebarron changed the title object-store-based Store implementation obstore-based Store implementation Feb 6, 2025
@kylebarron
Author

It looks like this is now passing all tests, with just the Read the Docs and codecov targets not hit.

What are the final steps for this PR? Where should we write documentation?

@jhamman
Member

jhamman commented Feb 7, 2025

It looks like this is now passing all tests, with just the Read the Docs and codecov targets not hit.

🎉

What are the final steps for this PR? Where should we write documentation?

@kylebarron
Author

kylebarron commented Feb 10, 2025

obstore 0.4.0 was released and this PR was updated to use that latest version.

Optional dependency

From comment above

  1. We want to include this in Zarr-Python so long as obstore is an optional dependency

CI had been failing because (I think) obstore was marked as an optional dependency but then unconditionally exported from zarr.storage.

I removed the re-export from zarr.storage and renamed _object.py to obstore.py. So anyone who wishes to use the obstore-backed class needs to run

import zarr.storage.obstore

themselves. When that import is run, obstore will be imported, and so obstore must exist in the environment at that time.

Alternatively

We could keep the re-export but ensure that obstore.py doesn't import obstore in any public scope. Thoughts?

Future follow up PRs:

@jhamman
Member

jhamman commented Feb 12, 2025

Regarding the import of obstore, I suggest moving all imports from obstore into the class itself. Users should get an ImportError if they attempt to instantiate the Store without the required dependency.
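
The suggestion could look roughly like the sketch below, where the dependency is imported only when the class is instantiated. `OptionalDepStore` and its `dependency` argument are hypothetical names for this illustration, and the standard-library `json` module stands in for obstore so that the example runs without it installed.

```python
import importlib


class OptionalDepStore:
    """Hypothetical store whose backend is imported at instantiation time."""

    def __init__(self, dependency: str):
        try:
            # Import lazily, so importing this module never requires the dependency.
            importlib.import_module(dependency)
        except ImportError as exc:
            raise ImportError(
                f"{dependency!r} is required to use {type(self).__name__}"
            ) from exc
        # Keep only the name; the module itself is not stored on the instance.
        self.dependency = dependency


store = OptionalDepStore("json")  # `json` is a stand-in for the real dependency

try:
    OptionalDepStore("surely_not_an_installed_package_xyz")
    missing_error = None
except ImportError as exc:
    missing_error = str(exc)
```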

@martindurant
Member

You can have

import obstore
self.obs = obstore

in __init__ to avoid having imports all over the place.

@kylebarron
Author

We can move the imports into the class; that's fine.

You can have

import obstore
self.obs = obstore

in __init__ to avoid having imports all over the place.

It's not clear to me how the typing would work there? How would you define the type hint for obs so that self.obs.get would correctly access the same type hint as obstore.get? I'd rather import obstore a bunch than lose that type safety.
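
One common pattern that keeps type hints while deferring the runtime import is a `typing.TYPE_CHECKING` guard, sketched below. The standard-library `decimal` module stands in for obstore, and the `Parser` class is hypothetical; whether this pattern fits the Store class in this PR is not established here.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Seen only by type checkers such as mypy; never executed at runtime.
    from decimal import Decimal


class Parser:
    """Hypothetical class that needs `decimal` only when a method is called."""

    def parse(self, text: str) -> "Decimal":  # string annotation, resolved by the checker
        from decimal import Decimal  # runtime import deferred to the call

        return Decimal(text)


value = Parser().parse("1.5")
```

The trade-off is the one raised above: the import statement is repeated inside each method that needs it, but the annotations still refer to the real types.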

@martindurant
Member

I'm not usually too worried by typing. I'm sure it can be fixed, but fighting with mypy is never fun.

@kylebarron
Author

We can move the imports into the class; that's fine.

Hmm, moving the imports inside the class caused a pickling error. Maybe this won't work:

FAILED tests/test_store/test_object.py::TestObjectStore::test_serializable_store - TypeError: cannot pickle 'module' object

https://github.com/zarr-developers/zarr-python/actions/runs/13295245421/job/37125558963?pr=1661#step:6:299
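
The error can be reproduced in isolation: an instance that holds a module as an attribute cannot be pickled. The sketch below uses the standard-library `json` module as a stand-in for obstore, and both class names are hypothetical. The second class shows one possible workaround (dropping the module in `__getstate__` and restoring it in `__setstate__`); it is not necessarily the fix adopted in this PR.

```python
import json  # stand-in for obstore in this sketch
import pickle


class ModuleOnSelf:
    def __init__(self):
        self.backend = json  # a module stored as an instance attribute


class ModuleDroppedOnPickle:
    def __init__(self):
        self.backend = json

    def __getstate__(self):
        state = self.__dict__.copy()
        del state["backend"]  # modules are not picklable, so leave it out
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.backend = json  # re-attach the module after unpickling


try:
    pickle.dumps(ModuleOnSelf())
    error = None
except TypeError as exc:
    error = str(exc)  # the same kind of failure as in the CI log above

restored = pickle.loads(pickle.dumps(ModuleDroppedOnPickle()))
```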

Labels
needs release notes Automatically applied to PRs which haven't added release notes
Projects
Status: In review