DOC: Using Zarr with GPU arrays #2574

Open · jhamman opened this issue Dec 18, 2024 · 14 comments
Labels: documentation (Improvements to the documentation), help wanted (Issue could use help from someone with familiarity on the topic)

Comments

@jhamman (Member) commented Dec 18, 2024

Describe the issue linked to the documentation

Zarr Python 3 has new support for working with CuPy arrays. The only problem is no one knows how to use it!

@akshaysubr / @madsbk / @jakirkham - would you be up for adding a section to the docs? See #2568 for an in-progress docs refactor that should make it easy to add a page on this.

Suggested fix for documentation

No response

@jhamman added the documentation and help wanted labels Dec 18, 2024
@akshaysubr (Contributor)

@jhamman I can work on this section. I'd like to keep it brief for now and expand it once we integrate GPU compression and storage solutions since that's when users will see the full benefit of this pipeline.

@jhamman (Member, Author) commented Jan 4, 2025

Sounds great @akshaysubr! The docs refactor landed this week, so you can pick this up and add something to the advanced topics section of the user guide.

@TomAugspurger (Contributor) commented Jan 21, 2025

I looked into this a bit today. I think there are ~3 things to implement / document:

  1. Using Zarr with GPU memory (CuPy arrays). This is implemented.
  2. GPU-accelerated I/O (GPU direct storage / kvikio). This isn't implemented yet.
  3. GPU-accelerated decompression / decoding. This isn't implemented yet.

I can put up a PR for the first one. But I wanted to confirm that using zarr.config.set({"buffer": "zarr.core.buffer.gpu.Buffer", "ndbuffer": "zarr.core.buffer.gpu.NDBuffer"}) is the recommended approach (edit: see below).
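
For reference, here is that config call as a runnable snippet (a minimal sketch; assumes zarr was installed with CuPy available so the GPU buffer classes import):

import zarr

# Point the default buffer prototype at the GPU implementations.
zarr.config.set({
    "buffer": "zarr.core.buffer.gpu.Buffer",
    "ndbuffer": "zarr.core.buffer.gpu.NDBuffer",
})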

If it is, does zarr.storage.GpuMemoryStore provide anything over MemoryStore? If I don't update the config to use GPU buffers, I think we get some weird behavior with GpuMemoryStore, where data is automatically converted from host to device memory on set, but converted back to host memory on get through the Zarr API:

In [1]: import zarr

In [2]: store = zarr.storage.GpuMemoryStore()

In [3]: z = zarr.create_array(store=store, shape=(10000, 10000), chunks=(1000, 1000), dtype='int32')
/home/coder/zarr-python/src/zarr/core/buffer/gpu.py:99: UserWarning: Creating a zarr.buffer.gpu.Buffer with an array that does not support the __cuda_array_interface__ for zero-copy transfers, falling back to slow copy based path
  return cls(buffer.as_array_like())

In [4]: z[:100, :100] = 1
/home/coder/zarr-python/src/zarr/core/buffer/gpu.py:99: UserWarning: Creating a zarr.buffer.gpu.Buffer with an array that does not support the __cuda_array_interface__ for zero-copy transfers, falling back to slow copy based path
  return cls(buffer.as_array_like())

In [5]: type(z[:1000, :1000])
Out[5]: numpy.ndarray

In [6]: type(store._store_dict["c/0/0"])
Out[6]: zarr.core.buffer.gpu.Buffer

I think that's not what we want. Using a zarr.storage.MemoryStore (with the updated buffer prototype) seems to give us what we want: data stored in device memory, and CuPy arrays when reading.

Edit: Reading through #2473 (comment), this might not be ideal:

Agree, you do not want to change the global default_buffer_prototype. This was never my intention; maybe we should rename it to cpu_buffer_prototype?

Since IIUC that will also put metadata in GPU memory. I'll do a bit more exploration.

@dcherian (Contributor)

So does the current implementation only include GpuMemoryStore? For some reason, I thought this was superseding Kvikio.

@TomAugspurger (Contributor)

Here's what currently works:

import cupy as cp

import zarr.api.asynchronous
import zarr.core.buffer.cpu
import zarr.core.buffer.gpu
import zarr.storage


async def main() -> None:
    src = cp.random.uniform(size=(100, 100))  # allocate on the device
    store = zarr.storage.LocalStore("data.zarr")
    z = await zarr.api.asynchronous.create_array(
        store=store, name="a", shape=src.shape, chunks=(10, 10), dtype=src.dtype, overwrite=True
    )

    await z.setitem(
        (slice(10), slice(10)), src[:10, :10], prototype=zarr.core.buffer.gpu.buffer_prototype
    )

    result = await z.getitem(
        (slice(10), slice(10)), prototype=zarr.core.buffer.gpu.buffer_prototype
    )
    print(type(result))  # <class 'cupy.ndarray'>

if __name__ == "__main__":
    import asyncio

    asyncio.run(main())

There's also GpuMemoryStore, but right now I think that's mostly just a convenience that automatically moves incoming data to on-device memory.

Notably, this uses the prototype argument to getitem and setitem, and so is only available through the async API. I think we'll want a single configuration option to control this, so that it's usable through the rest of Zarr's API (I'll write up a separate issue for designing this).

For some reason, I thought this was superseding Kvikio.

I think kvikio will still be useful for users who want to read data directly into device memory (avoiding any intermediate allocations on the host before copying to the device). kvikio isn't currently compatible with zarr v3, but I'm hoping to look into that sometime.

@akshaysubr (Contributor)

Yes, the prototype argument to getitem and setitem is how to control what array type to use internally. This should automatically be changed to a CuPy array by setting the config default buffer and ndbuffer to the GPU versions.

from zarr.core.config import config
import zarr.core.buffer.gpu as gpu

# GPU buffers
gpu_buffer_mapping = {"buffer": gpu.Buffer, "ndbuffer": gpu.NDBuffer}

config.set(gpu_buffer_mapping)

This will then set the default_buffer_prototype to the GPU version and therefore all API calls should be able to automatically use it (assuming they are all using this method to get the prototype argument).
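
An end-to-end sketch of that flow through the synchronous API (assumes cupy and a CUDA device; the register calls mirror the working example further down the thread):

import cupy as cp

import zarr
import zarr.core.buffer.gpu
import zarr.storage
from zarr.registry import register_buffer, register_ndbuffer

# Register the GPU buffer classes and make them the configured defaults.
register_buffer(zarr.core.buffer.gpu.Buffer)
register_ndbuffer(zarr.core.buffer.gpu.NDBuffer)
zarr.config.set({"buffer": "zarr.core.buffer.gpu.Buffer"})
zarr.config.set({"ndbuffer": "zarr.core.buffer.gpu.NDBuffer"})

z = zarr.create_array(
    store=zarr.storage.MemoryStore(),
    shape=(100, 100),
    chunks=(10, 10),
    dtype="float64",
)
z[:10, :10] = cp.ones((10, 10))  # __setitem__ resolves the GPU prototype
print(type(z[:10, :10]))         # expected: cupy.ndarray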

@akshaysubr (Contributor)

Looks like Array.__setitem__ calls Array.set_basic_selection, which internally uses the default_buffer_prototype function to get the buffer classes set in the config. So just setting up the default configs should be sufficient:

prototype = default_buffer_prototype()
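
For reference, default_buffer_prototype is roughly this lookup (a simplified sketch; the real code lives in zarr.core.buffer and resolves through zarr.registry):

from zarr.core.buffer import BufferPrototype
from zarr.registry import get_buffer_class, get_ndbuffer_class

def default_buffer_prototype() -> BufferPrototype:
    # Return whichever classes the "buffer"/"ndbuffer" config keys
    # currently name in the registry.
    return BufferPrototype(buffer=get_buffer_class(), nd_buffer=get_ndbuffer_class())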

@madsbk (Contributor) commented Jan 21, 2025

The problem is that default_buffer_prototype also changes how the metadata is read, I think?
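
(For instance, the v3 metadata fetch is a plain store get whose prototype defaults to default_buffer_prototype(), so with the GPU config active the zarr.json bytes would land in device memory. This is the line patched in the diff further down:)

zarr_json_bytes = await (store_path / ZARR_JSON).get()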

@akshaysubr (Contributor)

Would it make sense to always use the numpy prototype for metadata? So essentially pass in zarr.core.buffer.cpu.numpy_buffer_prototype in this get call?

@TomAugspurger (Contributor)

Would it make sense to always use the numpy prototype for metadata?

That's what I'm leaning towards. The one thing I wanted to understand before suggesting that is the difference between BufferPrototype.buffer and BufferPrototype.nd_buffer. Is it possible that .buffer is intended to be used for metadata, and nd_buffer is intended for data chunks? I haven't had a chance to go through the usage and history yet.
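
For context, BufferPrototype itself is just the pair of classes (paraphrased from zarr.core.buffer):

from typing import NamedTuple

from zarr.core.buffer import Buffer, NDBuffer

class BufferPrototype(NamedTuple):
    buffer: type[Buffer]       # flat, bytes-like memory (metadata, encoded chunks)
    nd_buffer: type[NDBuffer]  # shaped, dtyped memory (decoded chunk data)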

@jhamman (Member, Author) commented Jan 21, 2025

Would it make sense to always use the numpy prototype for metadata? So essentially pass in zarr.core.buffer.cpu.numpy_buffer_prototype in this get call?

I would actually like to see us implement a set of metadata-specific methods on the store. I started down this path in #1851.

@TomAugspurger (Contributor)

Is it possible that .buffer is intended to be used for metadata, and .nd_buffer is intended for data chunks? I haven't had a chance to go through the usage and history yet.

I don't think this is an option. Some data buffers (those passing through codecs before the ArrayBytesCodec) will just be dealing with plain buffers rather than ndarrays.
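
Schematically, the v3 codec stages are (class names from zarr.abc.codec):

# ArrayArrayCodec: NDBuffer -> NDBuffer  (e.g. transpose)
# ArrayBytesCodec: NDBuffer -> Buffer    (the array-to-bytes step)
# BytesBytesCodec: Buffer   -> Buffer    (e.g. gzip, zstd)
#
# Compression therefore only ever sees flat Buffers, so both halves of
# the prototype have to be GPU-aware.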

@akshaysubr your suggestion works, with one other tweak to ensure that host memory is used when writing:

diff --git a/src/zarr/core/array.py b/src/zarr/core/array.py
index 632e8221..75bb344b 100644
--- a/src/zarr/core/array.py
+++ b/src/zarr/core/array.py
@@ -38,6 +38,7 @@ from zarr.core.buffer import (
     NDBuffer,
     default_buffer_prototype,
 )
+from zarr.core.buffer.cpu import buffer_prototype as cpu_buffer_prototype
 from zarr.core.chunk_grids import RegularChunkGrid, _auto_partition, normalize_chunks
 from zarr.core.chunk_key_encodings import (
     ChunkKeyEncoding,
@@ -168,7 +169,7 @@ async def get_array_metadata(
         if zarray_bytes is None:
             raise FileNotFoundError(store_path)
     elif zarr_format == 3:
-        zarr_json_bytes = await (store_path / ZARR_JSON).get()
+        zarr_json_bytes = await (store_path / ZARR_JSON).get(prototype=cpu_buffer_prototype)
         if zarr_json_bytes is None:
             raise FileNotFoundError(store_path)
     elif zarr_format is None:
@@ -1295,7 +1296,7 @@ class AsyncArray(Generic[T_ArrayMetadata]):
         """
         Asynchronously save the array metadata.
         """
-        to_save = metadata.to_buffer_dict(default_buffer_prototype())
+        to_save = metadata.to_buffer_dict(cpu_buffer_prototype)
         awaitables = [set_or_delete(self.store_path / key, value) for key, value in to_save.items()]
 
         if ensure_parents:
@@ -1307,7 +1308,7 @@ class AsyncArray(Generic[T_ArrayMetadata]):
                     [
                         (parent.store_path / key).set_if_not_exists(value)
                         for key, value in parent.metadata.to_buffer_dict(
-                            default_buffer_prototype()
+                            cpu_buffer_prototype
                         ).items()
                     ]
                 )

With that, we get cpu.Buffer used for metadata and gpu.Buffer used for data when users configure things properly.

import cupy as cp

import zarr.core.buffer.cpu
import zarr.core.buffer.gpu
import zarr.storage
from zarr.registry import (
    register_buffer,
    register_ndbuffer,
)

register_buffer(zarr.core.buffer.gpu.Buffer)
register_ndbuffer(zarr.core.buffer.gpu.NDBuffer)

zarr.config.set({"buffer": "zarr.core.buffer.gpu.Buffer"})
zarr.config.set({"ndbuffer": "zarr.core.buffer.gpu.NDBuffer"})

src = cp.random.uniform(size=(100, 100))  # allocate on the device
store = zarr.storage.MemoryStore()
z = zarr.create_array(
    store=store, name="a", shape=src.shape, chunks=(10, 10), dtype=src.dtype, overwrite=True
)

z[:10, :10] = src[:10, :10]  # errors without the diff above
assert isinstance(store._store_dict["zarr.json"], zarr.core.buffer.cpu.Buffer), type(store._store_dict["zarr.json"])
assert type(z[:10, :10]) is cp.ndarray

If we're OK with that (host memory for metadata, GPU memory for data if requested) then I can clean this up and make a PR.

I would actually like to see us implement a set of metadata-specific methods on the store.

I don't think we need GPU stuff to motivate that, but IMO it's probably worth doing.

@akshaysubr (Contributor)

@TomAugspurger This looks good. Would it also be useful to add a helper method to set up the zarr configs for the GPU? Something like zarr.config.set(default_gpu_configs) that does all of the setup? This would be useful when we have GPU codecs so the codec mapping can also be updated easily this way.
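
For example (a hypothetical sketch; enable_gpu is an illustrative name, not an existing API):

import zarr

def enable_gpu() -> None:
    # One switch that applies all the GPU defaults; GPU codec entries
    # could be added to the same dict later.
    zarr.config.set({
        "buffer": "zarr.core.buffer.gpu.Buffer",
        "ndbuffer": "zarr.core.buffer.gpu.NDBuffer",
    })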

@TomAugspurger (Contributor)

Would it also be useful to add a helper method to set up the zarr configs for the GPU?

Yeah, I think so. It wasn't obvious to me why you might need to update both BufferPrototype.buffer and .nd_buffer, so others might be confused as well. Having a single configuration switch for "do the right thing if you have a gpu" will be valuable I think.
