XL Datasets: Minimal Zarr-only dataset implementation #186
Conversation
Test-driven development! Yeah
Now the fun starts...
* updates for calculating zarr manifests & adding basic tests for it
* moving cache_dir assignment to DatasetV1 and DatasetV2 model validators
* Updating argument types for parquet utils
* Updating argument types for md5 util
* fixing DatasetV1 export & dataset model validators
* PR feedback updates
* Adding test that checks the length of the manifest after update
* PR feedback
Thanks @cwognum! Left some comments but only a couple may be blocking.
I have a few comments and suggestions. Overall, good, solid work. 😄
@jstlaurent @Andrewq11 Thank you for the review! This ended up being a good opportunity to also review the Dataset code, which was mostly written over a year ago. Unfortunately, this does imply that some of the code examples we've shown in presentations are no longer accurate. See also https://github.com/polaris-hub/polaris-hub/pull/471. I am personally okay with this, since the Dataset API generally does not get a lot of attention compared to the Benchmark API. What do you think?
Looks good on my side. Thanks for all the work here!
More progress toward XL datasets 🚀
…est from the v2 dataset and moved the verify_checksum parameter to v1
Looks good. Still had a couple of things to say. 😉
Changelogs

- Added a `DatasetV2` class in a new `polaris.experimental` module.
- Renamed the existing dataset class to `DatasetV1`, but created the `Dataset` alias for it.
- Added a `BaseDataset` class that encompasses all functionality that is shared between V1 and V2.
- Updated `__getitem__` to only allow indexing a specific value or a row.

Checklist:

- Label this PR as `feature`, `fix`, `chore`, `documentation` or `test` (or ask a maintainer to do it for you).

Closes #132
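The restricted `__getitem__` behavior can be illustrated with a small sketch. Note this is a hypothetical toy class, not the actual Polaris implementation: indexing either returns a single value (given a row and a column) or a full row, and nothing else.

```python
import numpy as np


class ToyDataset:
    """Hypothetical sketch of a dataset whose __getitem__ only supports
    indexing a specific value or a full row (not e.g. whole columns)."""

    def __init__(self, columns):
        # Mapping of column name -> 1D NumPy array
        self._columns = columns

    def __getitem__(self, item):
        if isinstance(item, tuple):
            # A specific value, e.g. ds[1, "y"]
            row, col = item
            return self._columns[col][row]
        # A full row, e.g. ds[1] -> mapping of column name -> value
        return {name: arr[item] for name, arr in self._columns.items()}


ds = ToyDataset({"x": np.array([1, 2, 3]), "y": np.array([4.0, 5.0, 6.0])})
value = ds[1, "y"]  # a single value
row = ds[1]  # a full row as a dict
```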
The implementation has been relatively straightforward given the PoC. There are two points that required some further thought.
I renamed the existing class to `DatasetV1`, but kept the `Dataset` alias for it to not break backwards compatibility. I also introduced a new `experimental` module to house the V2 implementation.

On V1 and V2 compatibility
You can think of a Pandas DataFrame as a set of named NumPy arrays. For storage, the conversion from Pandas to a Zarr archive is therefore relatively straightforward. Note that this is limited to storage: it is not straightforward to implement the Pandas API on top of a Zarr archive, but that was decided to be out of scope.
An example of this conversion was implemented in this test case. A toy example would look like: