Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

XL Datasets: Minimal Zarr-only dataset implementation #186

Merged
merged 23 commits into from
Sep 11, 2024
Merged

Conversation

cwognum
Copy link
Collaborator

@cwognum cwognum commented Aug 27, 2024

Changelogs

  • Introduce a new DatasetV2 class in a new polaris.experimental module.
  • Explicitly rename the old dataset to DatasetV1, but create the Dataset alias for it.
  • Extract a BaseDataset class that encompasses all functionality that is shared between V1 and V2.
  • Restricted use of __getitem__ to only allow indexing a specific value or a row.
  • Add a new Zarr manifest for upload integrity.

Checklist:

  • Was this PR discussed in an issue? It is recommended to first discuss a new feature into a GitHub issue before opening a PR.
  • Add tests to cover the fixed bug(s) or the newly introduced feature(s) (if appropriate).
  • Update the API documentation if a new function is added, or an existing one is deleted.
  • Write concise and explanatory changelogs above.
  • If possible, assign one of the following labels to the PR: feature, fix, chore, documentation or test (or ask a maintainer to do it for you).

Closes #132

The implementation has been relatively straight-forward given the PoC. There are two points that required some further thinking.

  1. Naming and code organization: I wasn't sure how to name the new class, where to introduce it and whether to rename the old class. For clarity, I decided to rename the old class, but we do maintain the Dataset alias for it to not break backwards compatibility. I also introduced a new experimental module to house the V2 implementation.
  2. Indexing: With Add a converter from PDB to Zarr to the DatasetFactory #171, we made it possible for the Zarr root to contain groups. This raises the question how to index those. For these cases, I decided to explicitly introduce a index array that needs to be part of that group.

On V1 and V2 compatibility

You can think of a Pandas DataFrame as a set of named NumPy arrays. For storage, the conversion from Pandas to a Zarr archive is therefore relatively straight-forward. Note that this is limited to the storage - It's not straight-forward to implement the Pandas API on top of a Zarr archive, but this was decided to be out of scope.

An example of this conversion was implemented in this test case. A toy example would look like:

    df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

    root = zarr.open("path/to/arhive.zarr", "w")
    for col in df.columns:
        root.array(col, data=df[col].values)

@cwognum cwognum added the feature Annotates any PR that adds new features; Used in the release process label Aug 27, 2024
cwognum and others added 7 commits August 27, 2024 19:52
* updates for calculating zarr manifests & adding basic tests for it

* moving cache_dir assignment to DatasetV1 and DatasetV2 model validators

* Updating argument types for parquet utils

* Updating argument types for md5 util

* fixing DatasetV1 export & dataset model validators

* PR feedback updates

* Adding test that checks the length of the manifest after update

* PR feedback
@cwognum cwognum self-assigned this Sep 3, 2024
@cwognum cwognum added this to the XL Datasets milestone Sep 3, 2024
Copy link
Contributor

@Andrewq11 Andrewq11 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @cwognum! Left some comments but only a couple may be blocking.

polaris/dataset/_base.py Outdated Show resolved Hide resolved
polaris/dataset/_base.py Outdated Show resolved Hide resolved
polaris/dataset/_base.py Outdated Show resolved Hide resolved
polaris/dataset/_base.py Show resolved Hide resolved
polaris/dataset/_base.py Outdated Show resolved Hide resolved
polaris/dataset/_subset.py Outdated Show resolved Hide resolved
polaris/utils/v2_manifest.py Outdated Show resolved Hide resolved
tests/test_dataset_v2.py Outdated Show resolved Hide resolved
tests/test_evaluate.py Outdated Show resolved Hide resolved
tests/test_dataset_v2.py Outdated Show resolved Hide resolved
Copy link
Contributor

@jstlaurent jstlaurent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a few comments and suggestions. Overall, good, solid work. 😄

polaris/dataset/_base.py Outdated Show resolved Hide resolved
polaris/dataset/_base.py Outdated Show resolved Hide resolved
polaris/dataset/_base.py Outdated Show resolved Hide resolved
polaris/dataset/_base.py Outdated Show resolved Hide resolved
polaris/dataset/_base.py Outdated Show resolved Hide resolved
polaris/experimental/_dataset_v2.py Outdated Show resolved Hide resolved
polaris/experimental/_dataset_v2.py Outdated Show resolved Hide resolved
polaris/experimental/_dataset_v2.py Outdated Show resolved Hide resolved
polaris/mixins/_checksum.py Outdated Show resolved Hide resolved
polaris/utils/v2_manifest.py Outdated Show resolved Hide resolved
@cwognum cwognum changed the title XXL Datasets: Minimal Zarr-only dataset implementation XL Datasets: Minimal Zarr-only dataset implementation Sep 5, 2024
@cwognum
Copy link
Collaborator Author

cwognum commented Sep 5, 2024

@jstlaurent @Andrewq11 Thank you for the review!

This ended up being a good opportunity to also review the Dataset code that was mostly written over a year ago. One of the things I realized, is that the __getitem__() method was partially broken and implementing it properly would've taken a good amount of time and complicate the code. This would've become even more tricky for the DatasetV2 class. I decided to therefore change this part of the API and simplify the type of indexing we allow.

Unfortunately, this does imply that some of the code examples we've shown in presentations are no longer accurate. See also https://github.com/polaris-hub/polaris-hub/pull/471. I am personally okay with this, since the Dataset API generally does not get a lot of attention compared to the Benchmark API. What do you think?

polaris/dataset/_base.py Outdated Show resolved Hide resolved
Copy link
Contributor

@Andrewq11 Andrewq11 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good on my side. Thanks for all the work here!

More progress toward XL datasets 🚀

Copy link
Contributor

@jstlaurent jstlaurent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Still had a couple of things to say. 😉

polaris/dataset/_base.py Outdated Show resolved Hide resolved
polaris/dataset/_subset.py Outdated Show resolved Hide resolved
polaris/experimental/_dataset_v2.py Outdated Show resolved Hide resolved
polaris/experimental/_dataset_v2.py Show resolved Hide resolved
@cwognum cwognum merged commit d28c6e4 into main Sep 11, 2024
4 checks passed
@cwognum cwognum deleted the feat/dataset-v2 branch September 11, 2024 19:43
@cwognum cwognum mentioned this pull request Oct 16, 2024
5 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature Annotates any PR that adds new features; Used in the release process
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Larger-than-memory datasets with Zarr
3 participants