bundle diff and update #214

Open
ransomw1c opened this issue Jun 19, 2019 · 6 comments
Labels: CLI (issues around CLI for datamon), feature-request (a feature is a set of net new related use cases), P0 (highest priority, currently planned for work)

Comments

@ransomw1c
Contributor

in order to move data from Argo data-science workflows to production services (that is, the programs more directly connected to an externally visible web program) via datamon, the plan is to use either bundle ids or labels (which resolve to bundle ids) to provide a way of referring to workflow output from the production services. now, from one such reference (bundle id) to the next, there could in general be a fair number of duplicate files in the data-science artifact that production needs to access. production is using bundle download (not the FUSE fs) to access the data, so supposing that there's already a bundle downloaded to the production environment, there needs to be a way, given the next reference, to (1) describe the diff between the currently downloaded bundle and the newly referenced one (in terms of file lists, not file contents) as well as (2) update the downloaded bundle on disk such that only the differences are resolved, rather than the entirety of the bundle being downloaded again.

as is, the result of a bundle download contains a .datamon/ directory with the bundle's metadata (file lists with names and hashes). so to provide the diff, it'll suffice to compare the metadata available on GCS to the local .datamon/ directory. then updating the bundle on disk will consist of concurrently iterating through the diff, (i) adding missing files, (ii) removing additional files, and (iii) replacing any file with a differing hash. afterward, the .datamon/ directory metadata will need to be updated such that the result of an update is the same as the result of fresh bundle download: there is not a local history being stored within .datamon/ as with .git/.
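the comparison described above can be sketched roughly as follows. this is a minimal illustration, not datamon's implementation: `FileMeta`, `DiffEntry`, and `DiffFileLists` are hypothetical names, and the file lists are assumed to already be loaded into maps keyed by file name (one from the local .datamon/ metadata, one from GCS).

```go
package main

import "sort"

// FileMeta is a hypothetical, simplified view of one file-list entry.
type FileMeta struct {
	Size uint64
	Hash string
}

// DiffType names the three cases: missing locally, extra locally, hash differs.
type DiffType string

const (
	Added   DiffType = "A" // present on remote, missing locally
	Deleted DiffType = "D" // missing on remote, present locally
	Updated DiffType = "U" // present on both with different hashes
)

// DiffEntry describes one difference between the two file lists.
type DiffEntry struct {
	Type   DiffType
	Name   string
	Remote *FileMeta // nil when the file is absent from the remote list
	Local  *FileMeta // nil when the file is absent from the local list
}

// DiffFileLists compares metadata only (names and hashes), never file contents.
func DiffFileLists(remote, local map[string]FileMeta) []DiffEntry {
	var entries []DiffEntry
	for name, r := range remote {
		r := r // copy so the pointer below does not alias the loop variable
		l, ok := local[name]
		switch {
		case !ok:
			entries = append(entries, DiffEntry{Type: Added, Name: name, Remote: &r})
		case l.Hash != r.Hash:
			entries = append(entries, DiffEntry{Type: Updated, Name: name, Remote: &r, Local: &l})
		}
	}
	for name, l := range local {
		l := l
		if _, ok := remote[name]; !ok {
			entries = append(entries, DiffEntry{Type: Deleted, Name: name, Local: &l})
		}
	}
	// Map iteration order is random, so sort by name for stable output.
	sort.Slice(entries, func(i, j int) bool { return entries[i].Name < entries[j].Name })
	return entries
}
```

an empty result means the local bundle already matches the remote one, so an update built on top of this would have nothing to transfer.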


note that this issue is distinct from the similarly-titled #204 : that issue has to do with updating a bundle stored in GCS via local changes (specifically, via the FUSE filesystem abstraction). this issue is about updating a bundle stored locally (after a bundle download) with changes from GCS.

@ransomw1c ransomw1c added feature-request A feature is a set of net new related use cases P0 Highest priority. Current planned for work. CLI issues around CLI for datamon labels Jun 19, 2019
@ransomw1c ransomw1c self-assigned this Jun 19, 2019
@ransomw1c
Contributor Author

ransomw1c commented Jun 19, 2019

@jakedsouza @galvare2 here's a design sketch of the CLI to review

bundle diff

the bundle diff command diffs a bundle on the local filesystem with a bundle stored in GCS. it uses the same flags as bundle download, so in particular bundles in GCS can be specified with either a label or a bundle id.

output is a CSV list

<diff_type> , <filename> , <remote_size> , <remote_hash> , <local_size> , <local_hash>

where <diff_type> may be among the following

  • A (added) file name has been added to remote, not present in local
  • D (deleted) file name is not present on remote, is present on local
  • U (updated) file name is present on both the local and remote with different hash values on each

the <filename> is always present while <remote_*> and <local_*> are omitted in the case of deleted and added diff types, respectively.

$ datamon bundle diff --repo ransom-test-repo-20190408 --destination /some/downloaded/bundle --bundle 1Jbb3SicFGoKB7JQJZdCCwdBQwE
A , some/added/file , 1024 , 50b49bf9e99964053bd228e02ac2c283dfd4974353f7218565d3bdf326851ef6d090fa5d39941436f72425b80fda51e70b7e802998151cd25042c08b80766f1b , ,
D , some/deleted/file , , , 1024 , 50b49bf9e99964053bd228e02ac2c283dfd4974353f7218565d3bdf326851ef6d090fa5d39941436f72425b80fda51e70b7e802998151cd25042c08b80766f1b
U , some/changed/file , 512 , 50b49bf9e99964053bd228e02ac2c283dfd4974353f7218565d3bdf326851ef6d090fa5d39941436f72425b80fda51e70b7e802998151cd25042c08b80766f1a , 1024 , 50b49bf9e99964053bd228e02ac2c283dfd4974353f7218565d3bdf326851ef6d090fa5d39941436f72425b80fda51e70b7e802998151cd25042c08b80766f1c
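for a production service consuming this output, a parser along the following lines might be enough. it is only a sketch under assumptions not stated in the design: `DiffRecord` and `ParseDiffLine` are hypothetical names, whitespace around fields is treated as insignificant, and file names are assumed not to contain commas (the design does not say how such names would be quoted).

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// DiffRecord is a hypothetical parsed form of one line of diff output.
type DiffRecord struct {
	Type       string  // "A", "D", or "U"
	Name       string
	RemoteSize *uint64 // nil when the field is omitted
	RemoteHash string
	LocalSize  *uint64 // nil when the field is omitted
	LocalHash  string
}

// parseSize returns nil for an omitted (empty) size field.
func parseSize(field string) (*uint64, error) {
	if field == "" {
		return nil, nil
	}
	n, err := strconv.ParseUint(field, 10, 64)
	if err != nil {
		return nil, err
	}
	return &n, nil
}

// ParseDiffLine splits one line into its six comma-separated fields.
// Assumption: file names never contain commas.
func ParseDiffLine(line string) (DiffRecord, error) {
	parts := strings.Split(line, ",")
	if len(parts) != 6 {
		return DiffRecord{}, fmt.Errorf("expected 6 fields, got %d", len(parts))
	}
	for i := range parts {
		parts[i] = strings.TrimSpace(parts[i])
	}
	switch parts[0] {
	case "A", "D", "U":
	default:
		return DiffRecord{}, fmt.Errorf("unknown diff type %q", parts[0])
	}
	remoteSize, err := parseSize(parts[2])
	if err != nil {
		return DiffRecord{}, fmt.Errorf("remote size: %w", err)
	}
	localSize, err := parseSize(parts[4])
	if err != nil {
		return DiffRecord{}, fmt.Errorf("local size: %w", err)
	}
	return DiffRecord{
		Type:       parts[0],
		Name:       parts[1],
		RemoteSize: remoteSize,
		RemoteHash: parts[3],
		LocalSize:  localSize,
		LocalHash:  parts[5],
	}, nil
}
```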

bundle update

bundle update will again have the same flags as bundle download. in the case of bundle update, the --destination parameter will be expected to be a directory previously used as the --destination of a bundle download (there is a .datamon/ directory placed in the bundle download --destination that can be used to ensure this is the case, although such an implementation detail is intended to be transparent to the user).

after bundle update, the --destination directory will be the same as if bundle download had been used to download into an empty directory, except only the downloaded files will be added.

deleting files on the local copy should be optional, controlled by a flag

@ransomw1c
Contributor Author

ransomw1c commented Jun 19, 2019

initial draft notes

i haven't written a particularly performant diffing algorithm.

after bundle update, the --destination directory will be the same as if bundle download had been used to download into an empty directory, except only the downloaded files will be added.

more specifically, this means that the files will be the same. some empty directories might be left over.

@galvare2

If there is no difference, bundle update will do nothing, correct? So there would be no need to first do bundle diff and then only do bundle update if a diff is found?

@jakedsouza

I think for flood and seismic immediate requirements, bundle update seems more important than diff.

@ransomw1c
Contributor Author

ransomw1c commented Jun 26, 2019

If there is no difference, bundle update will do nothing, correct? So there would be no need to first do bundle diff and then only do bundle update if a diff is found?

yes, this is correct: under the hood, update uses the same data structure as diff, so if the diff is empty, nothing needs to happen during the update. in the current implementation, the metadata is re-written on update in every case, and there could be an additional check for empty diffs to prevent this. i'll make a note on #222 to that effect.

@ransomw1c
Contributor Author

ransomw1c commented Jun 27, 2019

inter-process concurrency (IPC)

requirements

in addition to the design sketch mentioned above, we need to address the possibility of multiple datamon processes accessing the same local bundle (a directory with .datamon folder containing metadata) at once. specifically, we'd like to be able to run multiple bundle update commands with the same --destination and not have undefined results.

discussion

a usual methodology for IPC by version control systems is lockfiles.

in git, `lockfile.h` describes how atomic writes and IPC locking are implemented via the same mechanism

 * * Mutual exclusion and atomic file updates. When we want to change
 *   a file, we create a lockfile `<filename>.lock`, write the new
 *   file contents into it, and then rename the lockfile to its final
 *   destination `<filename>`. We create the `<filename>.lock` file
 *   with `O_CREAT|O_EXCL` so that we can notice and fail if somebody
 *   else has already locked the file, then atomically rename the
 *   lockfile to its final destination to commit the changes and
 *   unlock the file.

so that, specifically, existence of the .git/index.lock lockfile can be used by the porcelain to determine whether another process is accessing the local repository.

in datamon, the same storage.Store interface is used to describe both the local store and the remote (GCS) store. atomicity of writes can be implemented behind the existing Store interface, while acquiring and releasing access to files requires additions to the interface. it's not clear to me whether modifying the internal abstractions is a great idea just to get a first cut of the required functionality.

another option is using a lockfile at the command level, decoupling the locking on operations at a particular --destination from the lock on particular files within the directory. this is the option i'm currently leaning toward. it could start simply, just by exiting the process with an error if the destination is locked. further changes could implement queuing locks with multiple lockfiles, such that the process could either (according to a parameter) exit if --destination is locked or wait until it has access.

it's possible that there's a channel-based way to implement IPC, yet lockfiles seem like a good place to start since they're a better-understood solution.

finally, in the case of bundle update, we could potentially allow multiple processes to be downloading files at once. i view this as a further optimization after locking on the destination, and will have to mull it over more. the idea is to append files on update such that the local bundle is the union of the initial download plus all updates.
