XL Datasets: Minimal Zarr-only dataset implementation #186

cwognum · 2024-08-27T23:50:51Z

Changelogs

Introduce a new DatasetV2 class in a new polaris.experimental module.
Explicitly rename the old dataset to DatasetV1, but create the Dataset alias for it.
Extract a BaseDataset class that encompasses all functionality that is shared between V1 and V2.
Restrict the use of __getitem__ to only allow indexing a single value or a row.
Add a new Zarr manifest for upload integrity.
- See Adding new Zarr manifest generation to DatasetV2 class #185
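
To illustrate the idea behind a manifest for upload integrity, here is a minimal sketch. It is not the code from #185: it assumes a manifest that maps each file's relative path to its MD5 digest, uses a plain directory as a stand-in for a Zarr archive, and the helper names (md5_of_file, build_manifest, manifest_md5sum) are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with path.open("rb") as fd:
        for chunk in iter(lambda: fd.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map every file under `root` (relative POSIX path) to its MD5 digest."""
    return {
        p.relative_to(root).as_posix(): md5_of_file(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def manifest_md5sum(manifest: dict[str, str]) -> str:
    """Collapse a manifest into one checksum by hashing its sorted entries."""
    digest = hashlib.md5()
    for rel_path, checksum in sorted(manifest.items()):
        digest.update(f"{rel_path}:{checksum}\n".encode())
    return digest.hexdigest()


# Stand-in for a Zarr archive (assumption): a directory with two chunk files.
with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp)
    (archive / "A").mkdir()
    (archive / "A" / "0").write_bytes(b"chunk-a")
    (archive / "B").mkdir()
    (archive / "B" / "0").write_bytes(b"chunk-b")

    manifest = build_manifest(archive)
    before = manifest_md5sum(manifest)

    # Changing any chunk changes the manifest, and therefore its checksum.
    (archive / "B" / "0").write_bytes(b"chunk-b-modified")
    after = manifest_md5sum(build_manifest(archive))
```

Comparing a single checksum of the manifest is cheaper than re-hashing every chunk on both sides, which is the appeal of this design for very large archives.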

Checklist:

Was this PR discussed in an issue? It is recommended to first discuss a new feature in a GitHub issue before opening a PR.
Add tests to cover the fixed bug(s) or the newly introduced feature(s) (if appropriate).
Update the API documentation if a new function is added, or an existing one is deleted.
Write concise and explanatory changelogs above.
If possible, assign one of the following labels to the PR: feature, fix, chore, documentation or test (or ask a maintainer to do it for you).

Closes #132

The implementation has been relatively straightforward given the PoC. There are two points that required some further thinking.

Naming and code organization: I wasn't sure how to name the new class, where to introduce it, and whether to rename the old class. For clarity, I decided to rename the old class, but we keep the Dataset alias for it so as not to break backwards compatibility. I also introduced a new experimental module to house the V2 implementation.
Indexing: With Add a converter from PDB to Zarr to the DatasetFactory #171, we made it possible for the Zarr root to contain groups. This raises the question of how to index those. For these cases, I decided to explicitly introduce an index array that needs to be part of that group.
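
The naming decision can be sketched as follows. This is a toy illustration under stated assumptions, not the actual Polaris classes: only the names BaseDataset, DatasetV1, DatasetV2 and Dataset come from this PR, while the constructors, the describe() method and the in-memory storage are hypothetical.

```python
from abc import ABC, abstractmethod


class BaseDataset(ABC):
    """Toy stand-in for the functionality shared between V1 and V2."""

    def __init__(self, name: str):
        self.name = name

    @abstractmethod
    def __len__(self) -> int:
        """Each version decides how to count its own rows."""

    def describe(self) -> str:
        # Hypothetical shared method: defined once, inherited by both versions.
        return f"{type(self).__name__}(name={self.name!r}, rows={len(self)})"


class DatasetV1(BaseDataset):
    """Toy V1: rows held in an in-memory table (a list of dicts here)."""

    def __init__(self, name: str, rows: list[dict]):
        super().__init__(name)
        self._rows = rows

    def __len__(self) -> int:
        return len(self._rows)


class DatasetV2(BaseDataset):
    """Toy V2: column-oriented storage (a dict of lists stands in for Zarr)."""

    def __init__(self, name: str, columns: dict[str, list]):
        super().__init__(name)
        self._columns = columns

    def __len__(self) -> int:
        return len(next(iter(self._columns.values()), []))


# The old name keeps working because it is bound to the renamed class.
Dataset = DatasetV1

legacy = Dataset("toy", rows=[{"A": 1, "B": 4}, {"A": 2, "B": 5}])
experimental = DatasetV2("toy", columns={"A": [1, 2], "B": [4, 5]})
```

Because Dataset is just another name for DatasetV1, existing imports and isinstance checks against Dataset continue to behave as before.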

On V1 and V2 compatibility

You can think of a Pandas DataFrame as a set of named NumPy arrays. For storage, the conversion from Pandas to a Zarr archive is therefore relatively straightforward. Note that this only holds for storage: it's not straightforward to implement the Pandas API on top of a Zarr archive, but this was decided to be out of scope.

An example of this conversion was implemented in this test case. A toy example would look like:

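    import pandas as pd
    import zarr

    df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

    # Store each DataFrame column as a Zarr array with the same name
    root = zarr.open("path/to/archive.zarr", mode="w")
    for col in df.columns:
        root.array(col, data=df[col].values)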


* updates for calculating zarr manifests & adding basic tests for it
* moving cache_dir assignment to DatasetV1 and DatasetV2 model validators
* Updating argument types for parquet utils
* Updating argument types for md5 util
* fixing DatasetV1 export & dataset model validators
* PR feedback updates
* Adding test that checks the length of the manifest after update
* PR feedback

Andrewq11

Thanks @cwognum! Left some comments but only a couple may be blocking.

polaris/dataset/_base.py

polaris/dataset/_subset.py

polaris/utils/v2_manifest.py

tests/test_dataset_v2.py

tests/test_evaluate.py

tests/test_dataset_v2.py

jstlaurent

I have a few comments and suggestions. Overall, good, solid work. 😄

polaris/dataset/_base.py

polaris/experimental/_dataset_v2.py

polaris/mixins/_checksum.py

polaris/utils/v2_manifest.py

cwognum · 2024-09-05T23:06:59Z

@jstlaurent @Andrewq11 Thank you for the review!

This ended up being a good opportunity to also review the Dataset code, most of which was written over a year ago. One of the things I realized is that the __getitem__() method was partially broken, and implementing it properly would've taken a good amount of time and complicated the code. This would've become even trickier for the DatasetV2 class. I therefore decided to change this part of the API and simplify the type of indexing we allow.
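
As a rough illustration of what "a single value or a row" can look like, here is a minimal sketch. It is not the Polaris implementation: the ToyDataset class, the dataset[row, column] tuple form and the choice of TypeError are all assumptions made for this example.

```python
class ToyDataset:
    """Hypothetical column store that only supports two kinds of indexing."""

    def __init__(self, columns: dict[str, list]):
        if len({len(values) for values in columns.values()}) > 1:
            raise ValueError("All columns must have the same length")
        self._columns = columns

    @staticmethod
    def _check_row(row) -> None:
        # Reject slices, lists, masks and booleans: exactly one integer row.
        if isinstance(row, bool) or not isinstance(row, int):
            raise TypeError(
                "Row index must be a single integer; "
                "slices, lists and masks are not supported"
            )

    def __getitem__(self, item):
        # dataset[row, column] returns a single value.
        if isinstance(item, tuple):
            if len(item) != 2:
                raise TypeError("Expected dataset[row, column]")
            row, column = item
            self._check_row(row)
            return self._columns[column][row]

        # dataset[row] returns the whole row as a dict of column name to value.
        self._check_row(item)
        return {name: values[item] for name, values in self._columns.items()}


ds = ToyDataset({"A": [1, 2, 3], "B": [4, 5, 6]})
row = ds[1]          # one full row
value = ds[2, "B"]   # one specific value

try:
    ds[0:2]          # slicing is deliberately not supported
    slice_rejected = False
except TypeError:
    slice_rejected = True
```

Supporting only these two forms keeps the method small enough to behave the same way regardless of how the data is stored underneath.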

Unfortunately, this does imply that some of the code examples we've shown in presentations are no longer accurate. See also https://github.com/polaris-hub/polaris-hub/pull/471. I am personally okay with this, since the Dataset API generally does not get a lot of attention compared to the Benchmark API. What do you think?

polaris/dataset/_base.py

polaris/experimental/_dataset_v2.py

polaris/utils/errors.py

Andrewq11

Looks good on my side. Thanks for all the work here!

More progress toward XL datasets 🚀

jstlaurent

Looks good. Still had a couple of things to say. 😉

polaris/dataset/_base.py

polaris/dataset/_subset.py

polaris/experimental/_dataset_v2.py

cwognum added 7 commits August 26, 2024 19:30

Extracted common interface between V1 and V2

2f4fd18

Skeleton structure for tests and Dataset V2. Small changes to shared API

81cee18

Implemented the test cases

613dcb2

Test-driven development! Yeah

Basic test cases passed

27c73ab

Now the fun starts...

Added additional validation

6484216

Improved docs

df33bfc

Fixed some reference errors in the docs

7d4b718

cwognum added the feature label Aug 27, 2024

cwognum and others added 7 commits August 27, 2024 19:52

Merge branch 'main' into feat/dataset-v2

295265c

Disable use of iloc to loc mapping for Dataset V2

0484d68

Updated import to prevent circular import

ca76f9d

Ruff check and format

f0b7c4b

fixing code check test

18bde88

Move code to dataset base class

75fe310

cwognum requested review from kirahowe, jstlaurent and Andrewq11 September 3, 2024 14:55

cwognum self-assigned this Sep 3, 2024

cwognum added this to the XL Datasets milestone Sep 3, 2024

Andrewq11 reviewed Sep 3, 2024

jstlaurent reviewed Sep 4, 2024

cwognum and others added 2 commits September 4, 2024 20:30

Merge branch 'main' into feat/dataset-v2

1194836

Addressed most feedback on the PR, still need to revisit the __getite…

024e71d

…m__ method

cwognum changed the title ~~XXL Datasets: Minimal Zarr-only dataset implementation~~ XL Datasets: Minimal Zarr-only dataset implementation Sep 5, 2024

cwognum mentioned this pull request Sep 5, 2024

Rethink how the owner for an artifact is specified #192

Open

Worked on the __getitem__ method

13fa9f1

cwognum requested review from Andrewq11 and jstlaurent September 5, 2024 23:07

Address special case of pointer columns

0e04c1f

Andrewq11 reviewed Sep 6, 2024

polaris/dataset/_base.py (outdated, resolved)

Andrewq11 reviewed Sep 6, 2024

polaris/experimental/_dataset_v2.py (outdated, resolved)

Andrewq11 reviewed Sep 6, 2024

polaris/utils/errors.py (resolved)

Andrewq11 approved these changes Sep 6, 2024

cwognum added 3 commits September 6, 2024 17:43

Renamed md5sum to zarr_manifest_md5sum for clarity, remove equality t…

d3a18d5

…est from the v2 dataset and moved the verify_checksum parameter to v1

Merge branch 'main' into feat/dataset-v2

6efee7d

Fix missing import

7bf7ac8

jstlaurent approved these changes Sep 11, 2024

polaris/dataset/_base.py (outdated, resolved)

polaris/dataset/_subset.py (outdated, resolved)

polaris/experimental/_dataset_v2.py (outdated, resolved)

polaris/experimental/_dataset_v2.py (resolved)

cwognum added 2 commits September 11, 2024 15:17

Added PR feedback

6d35122

Update decorators

8ae8e5e

cwognum merged commit d28c6e4 into main Sep 11, 2024
4 checks passed

cwognum deleted the feat/dataset-v2 branch September 11, 2024 19:43

cwognum mentioned this pull request Sep 11, 2024

Type the return type of @model_validator(mode="after") with Self #198

Merged

5 tasks

cwognum mentioned this pull request Oct 16, 2024

XL Datasets: Upload #191

Merged

5 tasks
