Use 3 numpy arrays for manifest internally #107

TomNicholas · 2024-05-10T16:54:37Z

Supersedes #39 as a way to close #33, the difference being that this uses 3 separate numpy arrays to store the path strings, byte offsets, and byte range lengths (rather than trying to put them all in one numpy array with a structured dtype). Effectively implements (2) in #104.

Relies on numpy 2.0 (which is currently only available as a release candidate).


martindurant · 2024-05-10T17:37:05Z

Here is a script which generated a 9GB JSON file across many years of NWM data: https://gist.github.com/rsignell-usgs/d386c85e02697c5b89d0211371e8b944 . I'll see if I can find a parquet version, but you should reckon on 10x in on-disk size (parquet or compressed numpy).

Unfortunately, the references have been deleted, because the whole dataset is now also available as zarr. I may have the chance sometime to regenerate them, if it's important.

martindurant · 2024-05-10T17:40:04Z

Also, a super-simple arrow- or awkward-like string representation as contiguous numpy arrays could look something like

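import numpy as np

class String:
    """Strings packed into one bytes buffer, delimited by an offsets array."""

    def __init__(self, offsets, data) -> None:
        self.offsets = offsets  # len(offsets) == number of strings + 1
        self.data = data

    def __getitem__(self, item):
        if isinstance(item, (int, np.integer)):
            return self.data[self.offsets[item]: self.offsets[item + 1]].decode()
        elif isinstance(item, slice):
            # Keep one extra offset so the last selected string still has an end.
            start, stop, step = item.indices(len(self.offsets) - 1)
            if step != 1:
                raise NotImplementedError("only contiguous slices are supported")
            return String(self.offsets[start: stop + 1], self.data)
        else:
            raise TypeError(f"unsupported index type: {type(item).__name__}")

>>> s = String(np.array([0, 5, 10]), b"HelloWorld")
>>> s[1]
'World'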

TomNicholas · 2024-05-10T20:04:11Z

Here is a script which generated a 9GB JSON file across many years of NWM data: https://gist.github.com/rsignell-usgs/d386c85e02697c5b89d0211371e8b944 . I'll see if I can find a parquet version, but you should reckon on 10x in on-disk size (parquet or compressed numpy).

That's useful context for #104, thanks Martin!

TomNicholas

h5py (the latter from the scientific python nightly repo)

Can I do this by adding a special pip install command to a conda env?

TomNicholas · 2024-06-12T23:35:06Z

Now builds atop #139

keewis · 2024-06-13T10:14:23Z

Can I do this by adding a special pip install command to a conda env?

you can:

# - netcdf4
- h5netcdf
- pip
- pip:
  - -i https://pypi.anaconda.org/scientific-python-nightly-wheels/simple
  - --pre
  - h5py
  - numpy

(and if you need multiple packages from different sources, put the definition in requirements.txt files and include them in the environment file using -r <requirements.txt>)

martindurant · 2024-06-18T00:26:28Z

Note that an efficient packing of a string array would be all the character data concatenated, plus a separate offsets array. This is what arrow or awkward does. Parquet can also store strings this way, but alternating length-data is standard. Dense packings like that assume immutability. https://github.com/martindurant/fastparquet/blob/faster/fastparquet/wrappers.py#L51 has the simplest possible version of this (yes, fastparquet is intending to move to pure numpy within months, if I didn't say this already).
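
A minimal sketch of that packing, assuming UTF-8 encoded data and int64 offsets. Both choices, and the `pack_strings` and `unpack_string` helper names, are assumptions made for illustration rather than anything taken from arrow, awkward or fastparquet.

```python
import numpy as np


def pack_strings(strings):
    """Pack strings into one bytes buffer plus an offsets array."""
    encoded = [s.encode("utf-8") for s in strings]
    byte_lengths = np.array([len(b) for b in encoded], dtype=np.int64)
    # offsets[i] is where string i starts; the final entry marks the end.
    offsets = np.concatenate(([0], np.cumsum(byte_lengths)))
    return offsets, b"".join(encoded)


def unpack_string(offsets, data, i):
    """Slice string i back out of the packed buffer."""
    return data[offsets[i]:offsets[i + 1]].decode("utf-8")


offsets, data = pack_strings(["a.nc", "bb.nc", "ccc.nc"])
second = unpack_string(offsets, data, 1)
```

Changing the length of any one string would mean rewriting the buffer and shifting every later offset, which is why a packing like this is usually treated as immutable.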

TomNicholas · 2024-06-18T00:28:41Z

@martindurant does numpy's new string dtype not do that too?

martindurant · 2024-06-18T00:30:51Z

No, it allocates on the heap and stores pointers, I believe. That is not (necessarily) memory contiguous, but allows for mutability.

TomNicholas · 2024-06-18T05:41:03Z

This PR now works (passes all tests locally). The failures are the same as on main, tracked in #147.

Note that an efficient packing of a string array would be all the character data concatenated, and a separate offsets array. This is what arrow or awkward does.

What about using a pyarrow string array instead:

@rabernat @martindurant a pyarrow string array might be a bit more memory-efficient, but

a) I can store millions of chunk references using just a few MB with this new numpy dtype, which seems plenty good enough to me

In [1]: from virtualizarr.tests import create_manifestarray

In [2]: marr = create_manifestarray(shape=(100, 100, 100), chunks=(1, 1, 1))

In [3]: marr
Out[3]: ManifestArray<shape=(100, 100, 100), dtype=float32, chunks=(1, 1, 1)>

In [4]: marr.manifest._paths.nbytes / 1e6
Out[4]: 16.0

In [5]: (marr.manifest._paths.nbytes + marr.manifest._offsets.nbytes + marr.manifest._lengths.nbytes) / 1e6
Out[5]: 24.0

b) IIUC pyarrow string arrays are not N-dimensional arrays, and half the point of this PR is that my implementation of concat/stack/broadcasting for ManifestArrays becomes really simple if I can just delegate those operations to a wrapped set of numpy arrays.

TomNicholas · 2024-06-18T05:45:19Z

virtualizarr/manifests/manifest.py

    @classmethod
-    def validate_chunks(cls, entries: Any) -> Mapping[ChunkKey, ChunkEntry]:
-        validate_chunk_keys(list(entries.keys()))
+    def from_arrays(
+        cls,
+        paths: np.ndarray[Any, np.dtype[np.dtypes.StringDType]],
+        offsets: np.ndarray[Any, np.dtype[np.int32]],
+        lengths: np.ndarray[Any, np.dtype[np.int32]],
+    ) -> "ChunkManifest":
+        """
+        Create manifest directly from numpy arrays containing the path and byte range information.
+
+        Useful if you want to avoid the memory overhead of creating an intermediate dictionary first,
+        as these 3 arrays are what will be used internally to store the references.
+


@ayushnag @sharkinsspatial you might want to try building numpy arrays of references and passing them to this new constructor instead to keep memory usage down.

TomNicholas added 15 commits March 17, 2024 16:13

change entries property to a structured array, add from_dict

0c445fd

fix validation

3bc483f

equals method

20f2ded

re-implemented concatenation through concatenation of the wrapped structured array

be8af12

fixed manifest.from_kerchunk_dict

bd8ad22

fixed kerchunk tests

385290d

Merge branch 'main' into structured_array_manifest

309019a

Merge branch 'main' into structured_array_manifest

4132b32

Merge branch 'main' into structured_array_manifest

830dccc

change private attributes to 3 numpy arrays

c0180cc

add from_arrays method

e93d3b8

to and from dict working again

3913143

fix dtype comparisons

6a5d996

depend on numpy release candidate

8a77a0a

get concatenation and stacking working

a95117f

TomNicholas added the enhancement and performance labels May 10, 2024

TomNicholas marked this pull request as draft May 10, 2024 16:55

[pre-commit.ci] auto fixes from pre-commit.com hooks

b45b160

for more information, see https://pre-commit.ci

TomNicholas mentioned this pull request May 10, 2024

[WIP] Structured array for manifest #39

Closed

2 tasks

remove manifest-level tests of concatenation

7410a66

TomNicholas added 6 commits May 11, 2024 13:18

generalized create_manifestarray fixture

e1e8bf7

added tests of broadcasting

7e97e74

made basic broadcasting tests pass by calling np.broadcast_to on underlying numpy arrays

96b2841

generalize fixture for creating scalar ManifestArrays

00c1757

improve regression test for expanding scalar ManifestArray

06180b3

remove now-unneeded scalar broadcasting logic

dae048b

TomNicholas commented Jun 12, 2024

merge in hypothesis test for broadcasting

ef9e4d2

TomNicholas temporarily deployed to test-release June 12, 2024 23:35 — with GitHub Actions Inactive

TomNicholas added 4 commits June 13, 2024 10:59

add backwards compatibility for pre-numpy2.0

0a109bc

Merge branch 'main' into numpy_arrays_manifest

37bfac7

Merge branch 'main' into numpy_arrays_manifest

530bc6e

depend on numpy>=2.0.0

8d2dcd5

TomNicholas added 4 commits June 17, 2024 21:38

Merge branch 'main' into numpy_arrays_manifest

db5f36d

rewrote broadcast_to shape logic

1e62807

remove faulty hypothesis strategies

dc824bb

remove hypothesis from dependencies

a97e5d8

TomNicholas mentioned this pull request Jun 18, 2024

Bug in broadcast_to logic #146

Closed

un-xfail broadcasting test case, fixing #146

e53157e

TomNicholas commented Jun 18, 2024

TomNicholas added 4 commits June 18, 2024 11:04

ignore remaining mypy errors

daaa331

release notes

5f6c912

Merge branch 'main' into numpy_arrays_manifest

669f4f8

update dependencies in CI test env

7893c88

TomNicholas merged commit c4d4325 into main Jun 18, 2024
4 checks passed

TomNicholas deleted the numpy_arrays_manifest branch June 18, 2024 20:23

TomNicholas mentioned this pull request Jun 21, 2024

Unable to type hint new StringDType #151

Open

DahnJ mentioned this pull request Jun 28, 2024

Incrementally-populated Zarr Arrays zarr-developers/zarr-specs#300

Open

TomNicholas mentioned this pull request Jul 12, 2024

Support for numpy<2? #184

Closed

TomNicholas mentioned this pull request Oct 2, 2024

Rewrite manifest logic in Rust? #23

Open
