Add experimental code for data fragments. #282

Merged 29 commits on Sep 19, 2023

@JSKenyon (Collaborator) commented Aug 3, 2023

  • Tests added / passed

    $ py.test -v -s daskms/tests

    If the pep8 tests fail, the quickest way to correct
    this is to run autopep8 and then flake8 and
    pycodestyle to fix the remaining issues.

    $ pip install -U autopep8 flake8 pycodestyle
    $ autopep8 -r -i daskms
    $ flake8 daskms
    $ pycodestyle daskms
    
  • Fully documented, including HISTORY.rst for all changes
    and one of the docs/*-api.rst files for new API

    To build the docs locally:

    $ pip install -r requirements.readthedocs.txt
    $ cd docs
    $ READTHEDOCS=True make html
    

This PR is a WIP which investigates reading data from multiple sources (with potentially different backends) and utilising xarray functionality to merge the resulting datasets dynamically. Practically, this makes it possible to read the static contents of a measurement set (e.g. DATA, UVW) from one location (e.g. a read-only S3 bucket) and the mutable contents, such as FLAG, from another location. This may make it possible to implement a basic versioning system in which we create proxy datasets which hold some (mutable) data, but which point back at some parent object from which the remaining data can be retrieved.
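
To make the proxy idea concrete, here is a minimal sketch in plain Python. Dictionaries stand in for xarray.Dataset objects, and the `Fragment` class, its fields and the `resolve` helper are hypothetical names chosen for illustration; they are not the dask-ms API, and the actual merge in this PR goes through xarray.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Fragment:
    """Hypothetical stand-in for a dataset that may point back at a parent."""
    name: str
    data_vars: dict = field(default_factory=dict)
    parent: Optional["Fragment"] = None


def resolve(fragment):
    """Merge a fragment with its ancestors; the newest fragment wins on conflicts."""
    chain = []
    node = fragment
    while node is not None:
        chain.append(node)
        node = node.parent
    merged = {}
    # Apply the root first so that children override their parents.
    for node in reversed(chain):
        merged.update(node.data_vars)
    return merged


# Static columns live in the (read-only) root, mutable ones in the fragment.
root = Fragment("root", {"DATA": [1.0, 2.0], "UVW": [0.1, 0.2], "FLAG": [0, 0]})
flags = Fragment("flags_v1", {"FLAG": [0, 1]}, parent=root)

merged = resolve(flags)
```

In this sketch `merged` takes DATA and UVW from the root and FLAG from the fragment, which is the behaviour a versioned proxy dataset would need.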

@JSKenyon JSKenyon marked this pull request as draft August 3, 2023 13:54
@JSKenyon JSKenyon changed the title Add experimental code for reading data from multiple sources. Add experimental code for data fragments. Aug 8, 2023
@JSKenyon JSKenyon requested a review from sjperkins August 11, 2023 10:14
@JSKenyon
Collaborator Author

Ok, I think that this is ready for another pair of eyes, if only to sanity check what I have done so far. I think that it is pretty simple.
One thing to note is that I elected not to use the current CLI infrastructure. I did make an attempt but ran into issues with nested subparsers.

Currently the CLI is very basic and only provides the option to stat or rebase a fragment. The first of these simply reports the parents of the target fragment. The second allows a user to modify the parent in place. This is useful if you want to exclude bad/irrelevant parents.
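
As a rough illustration of what stat and rebase do, the sketch below models fragment metadata as a dictionary that maps each fragment name to its parent. The `store` layout and the function signatures are assumptions made for this example; they do not reflect how dask-ms stores fragment metadata or how its CLI is invoked.

```python
def stat(store, name):
    """Return the names of all ancestors of `name`, nearest parent first."""
    parents = []
    parent = store[name]["parent"]
    while parent is not None:
        parents.append(parent)
        parent = store[parent]["parent"]
    return parents


def rebase(store, name, new_parent):
    """Point `name` at `new_parent` in place, refusing to create a cycle."""
    if new_parent == name or name in stat(store, new_parent):
        raise ValueError(
            f"rebasing {name!r} onto {new_parent!r} would create a cycle"
        )
    store[name]["parent"] = new_parent


# Hypothetical metadata: each fragment records only its immediate parent.
store = {
    "root": {"parent": None},
    "bad_flags": {"parent": "root"},
    "corrected": {"parent": "bad_flags"},
}

before = stat(store, "corrected")
rebase(store, "corrected", "root")  # exclude the unwanted parent
after = stat(store, "corrected")
```

The cycle check is a design choice of the sketch: without it, a careless rebase could make a fragment its own ancestor and the parent walk would never terminate.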

The CLI could optionally be extended with the following, more complicated functionality:

  • merge: Write the contents of a fragment and its parents back to the root. This shouldn't be too difficult but can possibly wait until we rework __dask_ms_metadata__ i.e. it will be much easier if all the info to reproduce the appropriate xarray.Dataset objects is present in the metadata. This is important functionality as applications which don't use dask-ms directly won't be able to utilize the fragments directly.
  • composite: Produce a new fragment which cherry-picks data variables from multiple fragments. This is less urgent but may become important if users need to mix and match state from various fragments e.g. retaining the newest version of CORRECTED_DATA while rolling back the flags to an earlier fragment. This may be easier than merge as fragments are always zarr, so the ordering/grouping is implicit in the way the data is stored. The difficulty here will be the selection mechanism/CLI interface.
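
The composite idea could look something like the following sketch, which cherry-picks variables by name from several fragments. Fragments are again modelled as plain dictionaries, and the `composite` function and its `selection` argument are hypothetical; the selection mechanism for a real CLI is exactly the open question raised above.

```python
def composite(fragments, selection):
    """Build a new fragment from `selection`, a map of variable -> source fragment."""
    picked = {}
    for var, source in selection.items():
        if var not in fragments[source]:
            raise KeyError(f"{var!r} is not stored in fragment {source!r}")
        picked[var] = fragments[source][var]
    return picked


# Hypothetical fragments: "v2" is newer than "v1".
fragments = {
    "v1": {"FLAG": [0, 0, 1]},
    "v2": {"FLAG": [1, 1, 1], "CORRECTED_DATA": [0.5, 0.6, 0.7]},
}

# Keep the newest CORRECTED_DATA but roll the flags back to the earlier fragment.
mixed = composite(fragments, {"CORRECTED_DATA": "v2", "FLAG": "v1"})
```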

@JSKenyon JSKenyon marked this pull request as ready for review August 22, 2023 07:28
@JSKenyon JSKenyon marked this pull request as draft August 22, 2023 07:30
@JSKenyon JSKenyon marked this pull request as ready for review August 22, 2023 07:30
@JSKenyon
Collaborator Author

This PR doesn't depend on #284, but that PR is likely also required to exploit this functionality, as it is sometimes necessary to rechunk data being written to a fragment due to zarr chunk size limits.
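
To illustrate why rechunking can be needed, the sketch below picks row chunk sizes so that no chunk exceeds a byte limit. The limit used here (2**31 - 1 bytes), the array shape and the helper names are assumptions for the example; the actual limit depends on the zarr version and compressor in use, and #284 may take a different approach.

```python
import math


def max_rows_per_chunk(bytes_per_row, max_chunk_bytes):
    """Largest number of rows whose chunk stays within `max_chunk_bytes`."""
    if bytes_per_row > max_chunk_bytes:
        raise ValueError("a single row exceeds the chunk size limit")
    return max_chunk_bytes // bytes_per_row


def row_chunks(n_rows, bytes_per_row, max_chunk_bytes):
    """Split `n_rows` (must be positive) into near-equal chunks within the limit."""
    if n_rows <= 0:
        raise ValueError("n_rows must be positive")
    limit = max_rows_per_chunk(bytes_per_row, max_chunk_bytes)
    n_chunks = math.ceil(n_rows / limit)
    base, extra = divmod(n_rows, n_chunks)
    # The first `extra` chunks take one additional row.
    return tuple(base + 1 if i < extra else base for i in range(n_chunks))


# Assumed numbers: 4096 channels x 4 correlations of complex64 (8 bytes each).
bytes_per_row = 4096 * 4 * 8
chunks = row_chunks(
    n_rows=100_000,
    bytes_per_row=bytes_per_row,
    max_chunk_bytes=2**31 - 1,  # assumed limit, not taken from the PR
)
```

A tuple of explicit sizes like `chunks` is the kind of value that could then be supplied as the chunking along the row dimension before writing.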

@sjperkins sjperkins mentioned this pull request Aug 29, 2023
@sjperkins
Member

Could you also please rebase this PR on master?

@JSKenyon JSKenyon force-pushed the multisource-experimental branch from 26b7a91 to 4c7b856 Compare September 14, 2023 09:07
@JSKenyon
Collaborator Author

I have rebased to master. I hope I did it correctly - I haven't had much practice with rebase.

@sjperkins sjperkins merged commit bee456a into master Sep 19, 2023
13 checks passed
@sjperkins sjperkins deleted the multisource-experimental branch September 19, 2023 12:27