Add experimental code for data fragments. #282

JSKenyon · 2023-08-03T13:54:24Z

Tests added / passed
```
$ py.test -v -s daskms/tests
```
If the pep8 tests fail, the quickest way to correct
this is to run autopep8 and then flake8 and
pycodestyle to fix the remaining issues.
```
$ pip install -U autopep8 flake8 pycodestyle
$ autopep8 -r -i daskms
$ flake8 daskms
$ pycodestyle daskms
```
Fully documented, including HISTORY.rst for all changes
and one of the docs/*-api.rst files for new API

To build the docs locally:
```
pip install -r requirements.readthedocs.txt
cd docs
READTHEDOCS=True make html
```

This PR is a WIP which investigates reading data from multiple sources (potentially with different backends) and using xarray functionality to merge the resulting datasets dynamically. In practice, this makes it possible to read the static contents of a measurement set (e.g. DATA and UVW) from one location, such as a read-only S3 bucket, and the mutable contents (e.g. FLAG) from another. This may make it possible to implement a basic versioning system in which proxy datasets hold some (mutable) data but point back at a parent object from which the remaining data can be retrieved.
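As a rough illustration of the proxy idea, here is a toy model in plain Python. It does not use xarray or dask-ms; the dictionaries and variable contents are made up, and `ChainMap` simply stands in for "look in the fragment first, then fall back to the parent".

```python
from collections import ChainMap

# Toy stand-ins for two stores. In practice these would be datasets
# opened from different locations and possibly different backends.
parent_vars = {
    "DATA": [1.0, 2.0, 3.0],
    "UVW": [0.1, 0.2, 0.3],
    "FLAG": [False, False, False],
}
# The fragment only holds the mutable column.
fragment_vars = {"FLAG": [False, True, False]}

# Earlier mappings take precedence, so the fragment's FLAG shadows the
# parent's FLAG, while DATA and UVW are still served by the parent.
merged = ChainMap(fragment_vars, parent_vars)

print(sorted(merged))   # ['DATA', 'FLAG', 'UVW']
print(merged["FLAG"])   # [False, True, False]
```

Nothing is copied here: `merged["DATA"]` is the parent's object, which mirrors the intent of leaving the static data where it is.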

JSKenyon · 2023-08-11T10:30:22Z

OK, I think this is ready for another pair of eyes, if only to sanity check what I have done so far. I think it is pretty simple.
One thing to note is that I elected not to use the current CLI infrastructure. I did make an attempt but ran into issues with nested subparsers.

Currently the CLI is very basic and only provides the option to stat or rebase a fragment. The first simply reports the parents of the target fragment. The second allows a user to change the target fragment's parent in place, which is useful if you want to exclude bad/irrelevant parents.
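To make the two operations concrete, here is a minimal sketch over an in-memory dictionary. The names `ancestry_of` and `rebase_fragment`, the `"parent"` key and the store layout are all hypothetical; this is not the dask-ms implementation or its on-disk metadata format.

```python
def ancestry_of(store, name):
    """Return [name, parent, grandparent, ...] by following parent links."""
    chain = []
    seen = set()
    current = name
    while current is not None:
        if current in seen:
            # Covers self-parenthood and longer cycles.
            raise ValueError(f"cycle detected at {current!r}")
        seen.add(current)
        chain.append(current)
        current = store[current].get("parent")
    return chain


def rebase_fragment(store, name, new_parent):
    """Point fragment `name` at `new_parent`, modifying `store` in place."""
    if new_parent not in store:
        raise KeyError(f"unknown parent {new_parent!r}")
    old_parent = store[name].get("parent")
    store[name]["parent"] = new_parent
    try:
        ancestry_of(store, name)  # reject rebases that create a cycle
    except ValueError:
        store[name]["parent"] = old_parent
        raise


# Hypothetical chain: root <- flags_v1 <- flags_v2
store = {
    "root.ms": {"parent": None},
    "flags_v1": {"parent": "root.ms"},
    "flags_v2": {"parent": "flags_v1"},
}

before = ancestry_of(store, "flags_v2")  # "stat": report the parents
rebase_fragment(store, "flags_v2", "root.ms")  # skip the unwanted flags_v1
after = ancestry_of(store, "flags_v2")
```

Here `before` is `['flags_v2', 'flags_v1', 'root.ms']` and `after` is `['flags_v2', 'root.ms']`, so the rebase excludes `flags_v1` without touching its data.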

The CLI could optionally be extended with the following, more complicated functionality:

- `merge`: Write the contents of a fragment and its parents back to the root. This shouldn't be too difficult but can possibly wait until we rework `__dask_ms_metadata__`, i.e. it will be much easier if all the info needed to reproduce the appropriate `xarray.Dataset` objects is present in the metadata. This is important functionality, as applications which don't use dask-ms won't be able to make use of fragments directly.
- `composite`: Produce a new fragment which cherry-picks data variables from multiple fragments. This is less urgent but may become important if users need to mix and match state from various fragments, e.g. retaining the newest version of CORRECTED_DATA while rolling back the flags to an earlier fragment. This may be easier than `merge` as fragments are always zarr, so the ordering/grouping is implicit in the way the data is stored. The difficulty here will be the selection mechanism/CLI interface.
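One possible shape for the `composite` selection mechanism, sketched in plain Python under stated assumptions: fragments are modelled as dictionaries of variables, and the selection is a mapping from variable name to the fragment it should come from. The function name `composite` and the fragment names are hypothetical.

```python
def composite(fragments, selection):
    """Build a new variable mapping, taking each variable from a named fragment.

    fragments: dict mapping fragment name -> dict of data variables
    selection: dict mapping variable name -> fragment name to take it from
    """
    picked = {}
    for var, source in selection.items():
        if source not in fragments:
            raise KeyError(f"unknown fragment {source!r}")
        if var not in fragments[source]:
            raise KeyError(f"{var!r} is not present in fragment {source!r}")
        picked[var] = fragments[source][var]
    return picked


# Hypothetical history: flags were changed twice, corrected data once.
fragments = {
    "v1": {"FLAG": [False, False, True]},
    "v2": {"FLAG": [True, True, True], "CORRECTED_DATA": [1.5, 2.5, 3.5]},
}

# Keep the newest CORRECTED_DATA but roll the flags back to v1.
result = composite(fragments, {"CORRECTED_DATA": "v2", "FLAG": "v1"})
```

An explicit per-variable mapping like this is easy to validate, but it is only one option; expressing it on a command line is the open question raised above.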

JSKenyon · 2023-08-22T07:32:36Z

This PR doesn't depend on #284, but that PR is likely also required for this functionality to be exploited, as it is sometimes necessary to rechunk data being written to a fragment due to zarr chunk size limits.
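For a sense of the arithmetic involved, here is a small sketch that picks row chunks so that no chunk exceeds a byte limit. The helper names, the example column shape and the limit of `2**31 - 1` bytes are all assumptions for illustration; the actual limit depends on the zarr version and compressor in use.

```python
def max_rows_per_chunk(bytes_per_row, max_chunk_bytes):
    """Largest number of whole rows that fits within max_chunk_bytes."""
    if bytes_per_row <= 0:
        raise ValueError("bytes_per_row must be positive")
    if max_chunk_bytes < bytes_per_row:
        raise ValueError("a single row already exceeds the chunk limit")
    return max_chunk_bytes // bytes_per_row


def split_rows(n_rows, rows_per_chunk):
    """Return a tuple of row-chunk sizes that sums to n_rows."""
    n_full, remainder = divmod(n_rows, rows_per_chunk)
    return (rows_per_chunk,) * n_full + ((remainder,) if remainder else ())


# Assumed example: 4096 channels x 4 correlations of complex64 (8 bytes).
bytes_per_row = 4096 * 4 * 8
assumed_limit = 2**31 - 1  # assumption, not a documented zarr constant

rows = max_rows_per_chunk(bytes_per_row, assumed_limit)
row_chunks = split_rows(40000, rows)
```

With these numbers `rows` is 16383 and `row_chunks` is `(16383, 16383, 7234)`. A tuple of this form is what one would pass as the row chunking when rechunking an array before writing.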

sjperkins · 2023-09-14T08:29:33Z

Could you also please rebase this PR on master?

JSKenyon · 2023-09-14T09:09:24Z

I have rebased to master. I hope I did it correctly - I haven't had much practice with rebase.

JSKenyon marked this pull request as draft August 3, 2023 13:54

JSKenyon changed the title ~~Add experimental code for reading data from multiple sources.~~ Add experimental code for data fragments. Aug 8, 2023

JSKenyon requested a review from sjperkins August 11, 2023 10:14

JSKenyon marked this pull request as ready for review August 22, 2023 07:28

JSKenyon marked this pull request as draft August 22, 2023 07:30

JSKenyon marked this pull request as ready for review August 22, 2023 07:30

sjperkins mentioned this pull request Aug 29, 2023

Experimental clean-up release #285

Open

6 tasks

JSKenyon added 20 commits September 14, 2023 11:02

Add experimental code for reading data from multiple sources.

48da17d

WIP on proxy reads.

3b50340

Add further work on composite datasets.

c25bca9

Rename and document xds_to/from_fragment.

73278e2

Checkpoint further work on fragment code.

3a50c17

Rename experimental module.

473b346

Apply black.

58f8f28

Some tidying.

9ed2818

More tidying.

1713981

More docs.

4090974

Clarify variable names.

a679d63

More notes about storage options.

2867bed

Begin adding tests. Fix bug exposed by tests.

89f6dc3

Simplify.

ea17b50

Apply black.

8fcd44e

Use xds_from_table_fragment everywhere - xds_from_ms_fragment is just a wrapper around identical functionality.

fcafb7e

Add test/fix for self-parenthood.

9fd8a4f

Add pytest to testing dependencies. Check with Simon.

747007f

Add further tests.

0147ac2

More tests and more consistent use of tmp_path_factory.

91bd15b

JSKenyon added 8 commits September 14, 2023 11:02

Do not merge fragment attrs into parent (for now).

8d4d3bb

Add some tests for subtable fragments.

7b38704

Checkpoint changes to fragments code - first determine ancestry, then load datasets.

6792c8e

Add beginnings of fragments cli.

04a82d0

Simplify get_ancestry code.

f1cfeb1

Update HISTORY.rst.

122776e

Fix silent failure when accessing a subtable which doesn't exist in any parent.

f0c96f0

Add root_url to DaskMSStore. Fixes incorrect stores for s3.

4c7b856

JSKenyon force-pushed the multisource-experimental branch from 26b7a91 to 4c7b856 Compare September 14, 2023 09:07

Merge branch 'master' into multisource-experimental

5cdef02

sjperkins merged commit bee456a into master Sep 19, 2023
13 checks passed

sjperkins deleted the multisource-experimental branch September 19, 2023 12:27

This was referenced Sep 19, 2023

Remove xarray from testing dependencies #291

Merged

Dataset Versioning #248

Closed
