[fs] basic sync tool #14248

Open
wants to merge 66 commits into
base: main
Contributor

@danking danking commented Feb 5, 2024

CHANGELOG: Introduce hailctl fs sync which robustly transfers one or more files between Amazon S3, Azure Blob Storage, and Google Cloud Storage.

There are really two distinct conceptual changes remaining here. Given my waning time available, I am not going to split them into two pull requests. The changes are:

  1. basename always agrees with the basename UNIX utility. In particular, the basename of the folder /foo/bar/baz/ is not ''; it is 'baz'. The only folders or objects whose basename is '' are objects whose name literally ends in a slash, e.g. an object named gs://foo/bar/baz/.

  2. hailctl fs sync, a robust copying tool with a user-friendly CLI.
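
To make the basename rule concrete, here is a minimal sketch of the UNIX-utility behaviour next to Python's `posixpath.basename`, which returns `''` for a path with a trailing slash. The function name `unix_basename` is hypothetical and is not the implementation in this PR; it also does not model the special case of an object whose name literally ends in a slash, which needs knowledge of whether the URL names an object or a folder.

```python
import posixpath


def unix_basename(path: str) -> str:
    """Approximate the `basename` UNIX utility for slash-separated paths.

    Hypothetical helper for illustration only; not Hail's implementation.
    """
    stripped = path.rstrip('/')
    if not stripped:
        # The path was empty or consisted only of slashes. The utility
        # prints '/' for a path made only of slashes; returning '' for an
        # empty input is a choice made for this sketch.
        return '/' if path else ''
    return posixpath.basename(stripped)


# posixpath.basename keeps the "empty last component" reading of a
# trailing slash, which is the behaviour the PR moves away from for folders.
print(repr(posixpath.basename('/foo/bar/baz/')))
print(repr(unix_basename('/foo/bar/baz/')))
```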

hailctl fs sync comprises two pieces: plan.py and sync.py. The latter, sync.py, is simple: it delegates to our existing copy infrastructure, which has been lightly modified to support this use case. The former, plan.py, is a concurrent file system diff.

plan.py generates and sync.py consumes a "plan folder" containing these files:

  1. matches: files whose names and sizes match. Two columns: source URL, destination URL.

  2. differs: files or folders whose names match but which differ either in size or in type. Four columns: source URL, destination URL, source state, destination state. Each state is one of: file, dir, or a size. If either state is a size, both states are sizes.

  3. srconly: files only present in the source. One column: source URL.

  4. dstonly: files only present in the destination. One column: destination URL.

  5. plan: a proposed set of object-to-object copies. Two columns: source URL, destination URL.

  6. summary: a one-line file containing the total number of copies in plan and the total number of bytes which would be copied.
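
The classification above can be sketched as a small in-memory diff. This is an illustration under stated assumptions, not the code in plan.py: listings are modelled as dicts from a relative name to a size in bytes (or `None` for a folder), the function name `classify` is hypothetical, and the rule that the plan contains source-only files plus files whose sizes differ is my assumption about what a reasonable plan would include.

```python
from typing import Dict, List, Optional, Tuple

Listing = Dict[str, Optional[int]]  # relative name -> size in bytes, None for a folder


def classify(src: Listing, dst: Listing, src_root: str, dst_root: str):
    """Split two listings into the categories of a plan folder (sketch only)."""
    out: Dict[str, List[Tuple[str, ...]]] = {
        'matches': [], 'differs': [], 'srconly': [], 'dstonly': [], 'plan': []
    }
    total_bytes = 0
    for name in sorted(src.keys() | dst.keys()):
        s_url, d_url = f'{src_root}/{name}', f'{dst_root}/{name}'
        if name not in dst:
            out['srconly'].append((s_url,))
            if src[name] is not None:  # ASSUMPTION: only objects are planned
                out['plan'].append((s_url, d_url))
                total_bytes += src[name]
        elif name not in src:
            out['dstonly'].append((d_url,))
        else:
            s, d = src[name], dst[name]
            if s is None and d is None:
                continue  # both are folders; their children are compared instead
            if s is not None and d is not None:
                if s == d:
                    out['matches'].append((s_url, d_url))
                else:
                    # Both states are sizes when either one is a size.
                    out['differs'].append((s_url, d_url, str(s), str(d)))
                    out['plan'].append((s_url, d_url))  # ASSUMPTION: re-copy on size mismatch
                    total_bytes += s
            else:
                # Type mismatch: record it, but leave it out of the plan in this sketch.
                out['differs'].append((s_url, d_url,
                                       'dir' if s is None else 'file',
                                       'dir' if d is None else 'file'))
    summary = (len(out['plan']), total_bytes)
    return out, summary
```

For example, calling `classify({'a': 1}, {}, 'gs://gcs-bucket/a', 's3://s3-bucket/b')` puts `a` in both `srconly` and `plan`, with a summary of one copy and one byte.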

As described in the CLI documentation, the intended use of these commands is:

hailctl fs sync --make-plan plan1 --copy-to gs://gcs-bucket/a s3://s3-bucket/b
hailctl fs sync --use-plan plan1

The first command generates a plan folder and the second command executes the plan. Separating this process into two commands allows the user to verify exactly what will be copied, including the exact destination URLs. If hailctl fs sync --use-plan fails, the user can re-run hailctl fs sync --make-plan to generate a new plan which avoids copying files that were already copied successfully. After a successful run, the user can also re-run hailctl fs sync --make-plan to verify that every file was indeed copied.
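
The verification step between the two commands could be scripted roughly as follows. This is a sketch under assumptions: it assumes the `plan` file holds one copy per line with the two columns separated by a tab, and the helper names `read_plan` and `unexpected_destinations` are hypothetical rather than part of hailctl.

```python
from pathlib import Path
from typing import List, Tuple


def read_plan(plan_folder: str) -> List[Tuple[str, str]]:
    """Return (source URL, destination URL) pairs from <plan_folder>/plan.

    ASSUMPTION: one copy per line, columns separated by a single tab.
    """
    copies = []
    with open(Path(plan_folder) / 'plan', encoding='utf-8') as f:
        for line in f:
            line = line.rstrip('\n')
            if not line:
                continue
            src, dst = line.split('\t')
            copies.append((src, dst))
    return copies


def unexpected_destinations(copies: List[Tuple[str, str]], expected_prefix: str) -> List[str]:
    """List destination URLs that fall outside the prefix the user intended."""
    return [dst for _, dst in copies if not dst.startswith(expected_prefix)]
```

A user might call `unexpected_destinations(read_plan('plan1'), 's3://s3-bucket/b/')` and only proceed to `--use-plan` if the result is empty.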

Testing. This change has a few sync-specific tests but largely reuses the tests for hailtop.aiotools.copy.

Future Work. Propagating a consistent kind of hash across all clouds and using it to detect differences would be a better solution than the file-size-based comparison used here. If all the clouds always provided the same type of hash value, this would be trivial to add. Alas, at the time of writing, S3 and Google both support CRC32C for every blob (though, in S3, you must explicitly request it at object creation time), but Azure Blob Storage does not. ABS only supports MD5 sums, which Google does not support for multi-part uploads.
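
For reference, a hash-based comparison would rest on something like the following pure-Python CRC32C (Castagnoli). This is only a sketch of the checksum itself, not code from this PR; a real implementation would use an accelerated library. The helper `gcs_style_crc32c` is a hypothetical name, and it assumes the convention of base64-encoding the checksum's four big-endian bytes.

```python
import base64


def crc32c(data: bytes, crc: int = 0) -> int:
    """Bitwise CRC32C (reflected polynomial 0x82F63B78).

    Pass the previous result as `crc` to checksum data in chunks.
    """
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def gcs_style_crc32c(data: bytes) -> str:
    """Base64 of the big-endian checksum bytes (assumed metadata encoding)."""
    return base64.b64encode(crc32c(data).to_bytes(4, 'big')).decode('ascii')


# The standard check value for CRC32C over b"123456789" is 0xE3069283.
print(hex(crc32c(b'123456789')))
```

Because the checksum can be computed incrementally, a copier could accumulate it while streaming each part and compare the final value against the destination's reported checksum.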

Resolves #14654

@danking danking mentioned this pull request Feb 5, 2024
@danking danking force-pushed the new-new-copier branch 2 times, most recently from e4ad2c5 to c4d1a62 Compare February 8, 2024 01:00
Contributor Author

danking commented Feb 12, 2024

I made some pretty substantial changes over the weekend to allow me to copy our giant annotation database buckets. Let me clean those up before we review.

@danking danking force-pushed the new-new-copier branch 3 times, most recently from b42f926 to 5363d1f Compare February 12, 2024 22:15
Dan King added 19 commits February 27, 2024 17:17
@chrisvittal chrisvittal self-assigned this Jun 25, 2024
chrisvittal
chrisvittal previously approved these changes Aug 5, 2024
Collaborator

@chrisvittal chrisvittal left a comment

This is a good start at a user-friendly cross-cloud sync tool and I feel that it should be merged as is.

There's a lot of opportunity to take advantage of cloud-specific APIs, like the GCS storage transfer service, to make this and the more basic copier tool more robust.

@chrisvittal chrisvittal removed their assignment Aug 7, 2024
@chrisvittal chrisvittal self-assigned this Aug 7, 2024
@patrick-schultz patrick-schultz dismissed chrisvittal’s stale review August 19, 2024 19:32

Dismissing for now, as our CI currently thinks the approval makes this mergeable

@chrisvittal chrisvittal self-requested a review August 19, 2024 20:04
Collaborator

@chrisvittal chrisvittal left a comment

Re-approving after my previous review was dismissed to unblock the merge queue.

Collaborator

@patrick-schultz patrick-schultz left a comment

I think this is about ready to merge. Just one question.


import click
Collaborator

Is this a new dependency? Should we add it to requirements.txt?

@chrisvittal chrisvittal mentioned this pull request Sep 25, 2024
1 task
Successfully merging this pull request may close these issues.

[fs] The sync tool issue
4 participants