Some notes on alignment with pangeo-forge-recipes around "keys" #32

Open
rabernat opened this issue Aug 16, 2021 · 1 comment

@rabernat

At today's Pangeo Forge meeting, @alxmrs told us a bit more about Xarray Beam. I'm happy to note that we are aligning a bit around certain abstractions. I'm not proposing to merge or share code or anything at this stage. Just happy to see that we have independently arrived at some similar concepts. Comparing side by side may also suggest improvements.

Xarray Beam

https://xarray-beam.readthedocs.io/en/latest/data-model.html#keys-in-xarray-beam

To keep track of how individual records could be combined into a larger (virtual) dataset, Xarray-Beam defines a Key object. Key objects consist of:

  • offsets: integer offsets for chunks from the origin, stored in an immutabledict
  • vars: The subset of variables included in each chunk, either as a frozenset, or as None to indicate “all variables”.
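
As a rough illustration of the two fields listed above, here is a minimal standard-library sketch. It is not `xarray_beam.Key` itself: `SketchKey` and its `create` helper are hypothetical names, and a sorted tuple of pairs stands in for the immutabledict so the example needs no third-party packages.

```python
from dataclasses import dataclass
from typing import FrozenSet, Mapping, Optional, Tuple


@dataclass(frozen=True)
class SketchKey:
    """Hypothetical stand-in for the Key described above (not xarray_beam.Key)."""

    # Sorted (dimension, offset) pairs; a stdlib substitute for an immutabledict.
    offsets: Tuple[Tuple[str, int], ...] = ()
    # None means "all variables".
    vars: Optional[FrozenSet[str]] = None

    @classmethod
    def create(cls, offsets: Mapping[str, int], vars=None) -> "SketchKey":
        """Build a key from a plain mapping and an optional iterable of names."""
        frozen_vars = None if vars is None else frozenset(vars)
        return cls(tuple(sorted(offsets.items())), frozen_vars)


# Two chunks that differ only in their offset along "time".
first = SketchKey.create({"time": 0, "lat": 0})
second = SketchKey.create({"time": 100, "lat": 0})

# A chunk that carries only one variable.
subset = SketchKey.create({"time": 0, "lat": 0}, vars={"temperature"})

# Because the keys are immutable and hashable, they can label chunks in a dict.
chunks = {first: "chunk-0", second: "chunk-1"}
```

Keeping both fields immutable is what makes such a key usable for grouping or as a dictionary key, which is presumably why the docs mention immutabledict and frozenset.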

Pangeo Forge

We currently call this an "index," although I agree that key is a better name. Our use of indexes is less well documented because it is internal rather than public API. We mention it here: https://pangeo-forge.readthedocs.io/en/latest/recipe_user_guide/file_patterns.html#inspect-a-filepattern

The index is its own special type of object used internally by recipes, a pangeo_forge_recipes.patterns.Index (which is basically a tuple of one or more pangeo_forge_recipes.patterns.DimIndex objects).

The code is pretty straightforward and the comments explain things better.

https://github.com/pangeo-forge/pangeo-forge-recipes/blob/49997cb52cff466bd394c1348ef23981e782a4d9/pangeo_forge_recipes/patterns.py#L70-L114
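
For readers who do not want to follow the link, the "tuple of per-dimension entries" idea can be sketched roughly as below. This is not the linked code: `SketchDimIndex`, `SketchIndex`, and `is_last` are hypothetical names, and apart from `sequence_len` (which comes up in this thread) the field names are assumptions.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SketchDimIndex:
    """Hypothetical per-dimension entry; only sequence_len is named in this thread."""

    name: str          # dimension name, e.g. "time" (assumed field)
    index: int         # position of this item along the dimension (assumed field)
    sequence_len: int  # total number of items along the dimension


# An index is then a tuple with one entry per dimension.
SketchIndex = Tuple[SketchDimIndex, ...]


def is_last(index: SketchIndex, dim: str) -> bool:
    """Return True if the index points at the final item along `dim`."""
    for entry in index:
        if entry.name == dim:
            return entry.index == entry.sequence_len - 1
    raise KeyError(dim)


# The 10th of 10 time steps for the 1st of 2 variables.
idx: SketchIndex = (
    SketchDimIndex("variable", 0, 2),
    SketchDimIndex("time", 9, 10),
)
```

Carrying the sequence length alongside the position lets a consumer answer questions such as "is this the last item?" without consulting the pattern that produced the index.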

These indexes are used both by FilePatterns (see #31) and for parallel writing of Zarr datasets.

cc @cisaacstern

@shoyer
Member

shoyer commented Aug 17, 2021

Thanks for the references! I've been picking this up as well in my review of #31. This is indeed a happy coincidence.

From an information perspective, it seems like Index keeps track of a little more information? We don't have an equivalent of sequence_len in xarray_beam.Key. I can see how that would be useful -- in Xarray-Beam, you have to supply dimension sizes explicitly in Rechunk. On the other hand, maybe you don't need that information for many operations, so it could be annoying to need to update that.

Did you consider using a frozenset instead of a tuple for Index?
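
To make the tuple versus frozenset question concrete, here is a small sketch using plain `(dimension, position)` pairs as hypothetical stand-ins for the per-dimension entries. It only illustrates the built-in Python semantics, not either library.

```python
# Hypothetical (dimension, position) pairs standing in for per-dimension entries.
time_entry = ("time", 3)
variable_entry = ("variable", 0)

# The same two entries, listed in different orders.
as_tuple_a = (time_entry, variable_entry)
as_tuple_b = (variable_entry, time_entry)

as_set_a = frozenset(as_tuple_a)
as_set_b = frozenset(as_tuple_b)

# Tuples compare element by element, so dimension order matters.
tuples_equal = as_tuple_a == as_tuple_b

# Frozensets ignore order, so the same entries compare and hash equal.
sets_equal = as_set_a == as_set_b
hashes_equal = hash(as_set_a) == hash(as_set_b)

print(tuples_equal, sets_equal, hashes_equal)
```

The trade-off is that a frozenset gives order-independent equality but gives up positional access, so any code that relies on the order of dimensions would need another way to recover it.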
