bundle diff and update #214

ransomw1c · 2019-06-19T17:55:07Z

in order to move data from Argo data-science workflows to production services (that is, the programs more directly connected to an externally visible web program) via datamon, the plan is to use either bundle ids or labels (which resolve to bundle ids) to provide a way of referring to workflow output from the production services. now, from one such reference (bundle id) to the next, there could in general be a fair number of duplicate files in the data-science artifact that production needs to access. production is using bundle download (not the FUSE fs) to access the data, so supposing that there's already a bundle downloaded to the production environment, there needs to be a way, given the next reference, to (1) describe the diff between the currently downloaded bundle and the referenced one (in terms of file lists, not file contents) as well as (2) update the downloaded bundle on disk such that only the differences are resolved, rather than the entirety of the bundle being downloaded again.

as is, the result of a bundle download contains a .datamon/ directory with the bundle's metadata (file lists with names and hashes). so to provide the diff, it'll suffice to compare the metadata available on GCS to the local .datamon/ directory. then updating the bundle on disk will consist of concurrently iterating through the diff, (i) adding missing files, (ii) removing additional files, and (iii) replacing any file with a differing hash. afterward, the .datamon/ directory metadata will need to be updated such that the result of an update is the same as the result of fresh bundle download: there is not a local history being stored within .datamon/ as with .git/.
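as a rough illustration of the comparison described above, here is a minimal Go sketch that diffs two file lists keyed by name. the `FileEntry`, `DiffEntry`, and `diffFileLists` names are hypothetical and are not taken from the datamon codebase; the sketch assumes both the remote metadata and the local .datamon/ metadata have already been loaded into maps.

```go
package main

import (
	"fmt"
	"sort"
)

// FileEntry is a hypothetical stand-in for one row of a bundle's file list.
type FileEntry struct {
	Size uint64
	Hash string
}

// DiffType names the three cases from the text: add, remove, replace.
type DiffType string

const (
	Added   DiffType = "A" // present on remote only
	Deleted DiffType = "D" // present on local only
	Updated DiffType = "U" // present on both, hashes differ
)

// DiffEntry describes one difference between the remote and local file lists.
type DiffEntry struct {
	Type   DiffType
	Name   string
	Remote *FileEntry // nil when the file is absent on remote
	Local  *FileEntry // nil when the file is absent locally
}

// diffFileLists compares file lists by name and hash only; file contents
// are never read. Results are sorted by name so the output is stable.
func diffFileLists(remote, local map[string]FileEntry) []DiffEntry {
	var out []DiffEntry
	for name, r := range remote {
		r := r // copy so the pointer below does not alias the loop variable
		l, ok := local[name]
		switch {
		case !ok:
			out = append(out, DiffEntry{Type: Added, Name: name, Remote: &r})
		case l.Hash != r.Hash:
			out = append(out, DiffEntry{Type: Updated, Name: name, Remote: &r, Local: &l})
		}
	}
	for name, l := range local {
		if _, ok := remote[name]; !ok {
			l := l
			out = append(out, DiffEntry{Type: Deleted, Name: name, Local: &l})
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Name < out[j].Name })
	return out
}

func main() {
	remote := map[string]FileEntry{"a": {1, "h1"}, "b": {2, "h2"}}
	local := map[string]FileEntry{"b": {2, "old"}, "c": {3, "h3"}}
	for _, d := range diffFileLists(remote, local) {
		fmt.Println(d.Type, d.Name)
	}
}
```

files whose hashes match on both sides produce no entry, so identical bundles yield an empty diff.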

note that this issue is distinct from the similarly-titled #204: that issue has to do with updating a bundle stored in GCS via local changes (specifically, via the FUSE filesystem abstraction). this issue is about updating a bundle stored locally (after a bundle download) with changes from GCS.


ransomw1c · 2019-06-19T21:40:58Z

@jakedsouza @galvare2 here's a design sketch of the CLI to review

`bundle diff`

the bundle diff command diffs a bundle on the local filesystem with a bundle stored in GCS. it uses the same flags as bundle download, so in particular bundles in GCS can be specified with either a label or a bundle id.

output is a CSV list

<diff_type> , <filename> , <remote_size> , <remote_hash> , <local_size> , <local_hash>

where <diff_type> may be among the following

A (added) file name has been added to remote, not present in local
D (deleted) file name is not present on remote, is present on local
U (updated) file name is present on both the local and remote with different hash values on each

the <filename> is always present while <remote_*> and <local_*> are omitted in the case of deleted and added diff types, respectively.

$ datamon bundle diff --repo ransom-test-repo-20190408 --destination /some/downloaded/bundle --bundle 1Jbb3SicFGoKB7JQJZdCCwdBQwE
A , some/added/file , 1024 , 50b49bf9e99964053bd228e02ac2c283dfd4974353f7218565d3bdf326851ef6d090fa5d39941436f72425b80fda51e70b7e802998151cd25042c08b80766f1b , ,
D , some/deleted/file , , , 1024 , 50b49bf9e99964053bd228e02ac2c283dfd4974353f7218565d3bdf326851ef6d090fa5d39941436f72425b80fda51e70b7e802998151cd25042c08b80766f1b
U , some/changed/file , 512 , 50b49bf9e99964053bd228e02ac2c283dfd4974353f7218565d3bdf326851ef6d090fa5d39941436f72425b80fda51e70b7e802998151cd25042c08b80766f1a , 1024 , 50b49bf9e99964053bd228e02ac2c283dfd4974353f7218565d3bdf326851ef6d090fa5d39941436f72425b80fda51e70b7e802998151cd25042c08b80766f1c
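for reference, one way the six-column rows could be produced is sketched below in Go with the standard `encoding/csv` package. the `fileInfo`, `diffRow`, and `writeDiff` names are hypothetical, and the writer emits plain comma-separated fields, so it does not reproduce the space padding around the commas shown in the example output.

```go
package main

import (
	"encoding/csv"
	"io"
	"os"
	"strconv"
)

// fileInfo is a hypothetical holder for the size and hash of one file.
type fileInfo struct {
	Size uint64
	Hash string
}

// diffRow is one line of diff output; Remote or Local is nil when the
// file is absent on that side.
type diffRow struct {
	Type   string // "A", "D" or "U"
	Name   string
	Remote *fileInfo
	Local  *fileInfo
}

// record lays the row out as
// diff_type, filename, remote_size, remote_hash, local_size, local_hash,
// leaving the columns for a missing side empty.
func (r diffRow) record() []string {
	rec := []string{r.Type, r.Name, "", "", "", ""}
	if r.Remote != nil {
		rec[2] = strconv.FormatUint(r.Remote.Size, 10)
		rec[3] = r.Remote.Hash
	}
	if r.Local != nil {
		rec[4] = strconv.FormatUint(r.Local.Size, 10)
		rec[5] = r.Local.Hash
	}
	return rec
}

// writeDiff writes all rows as CSV and reports any write error.
func writeDiff(w io.Writer, rows []diffRow) error {
	cw := csv.NewWriter(w)
	for _, r := range rows {
		if err := cw.Write(r.record()); err != nil {
			return err
		}
	}
	cw.Flush()
	return cw.Error()
}

func main() {
	rows := []diffRow{
		{Type: "A", Name: "some/added/file", Remote: &fileInfo{1024, "abc"}},
		{Type: "D", Name: "some/deleted/file", Local: &fileInfo{1024, "abc"}},
		{Type: "U", Name: "some/changed/file", Remote: &fileInfo{512, "abd"}, Local: &fileInfo{1024, "abe"}},
	}
	if err := writeDiff(os.Stdout, rows); err != nil {
		os.Exit(1)
	}
}
```

the short hash strings in `main` are placeholders rather than real hashes.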

`bundle update`

bundle update will again have the same flags as bundle download. in the case of bundle update, the --destination parameter will be expected to be a directory previously used as the --destination of a bundle download (there is a .datamon/ directory placed in the bundle download --destination that can be used to ensure this is the case, although such an implementation detail is intended to be transparent to the user).

after bundle update, the --destination directory will be the same as if bundle download had been used to download into an empty directory, except only the downloaded files will be added.

deleting files on the local copy should be optional, controlled by a flag
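a minimal Go sketch of how an update could turn a diff into work items, with deletion gated on an option, might look like the following. the `change`, `updatePlan`, and `planUpdate` names and the `deleteLocal` parameter are hypothetical; no flag name has been settled in this thread.

```go
package main

import "fmt"

// change is a hypothetical, pared-down diff entry.
type change struct {
	Kind string // "A", "D" or "U", as in the bundle diff output
	Name string
}

// updatePlan lists the files an update would fetch and remove.
type updatePlan struct {
	Download []string
	Remove   []string
}

// planUpdate maps diff entries to actions. Added and updated files are
// downloaded; files present only locally are removed when deleteLocal is
// set and left untouched otherwise.
func planUpdate(changes []change, deleteLocal bool) updatePlan {
	var p updatePlan
	for _, c := range changes {
		switch c.Kind {
		case "A", "U":
			p.Download = append(p.Download, c.Name)
		case "D":
			if deleteLocal {
				p.Remove = append(p.Remove, c.Name)
			}
		}
	}
	return p
}

func main() {
	changes := []change{
		{"A", "some/added/file"},
		{"D", "some/deleted/file"},
		{"U", "some/changed/file"},
	}
	fmt.Printf("%+v\n", planUpdate(changes, false))
	fmt.Printf("%+v\n", planUpdate(changes, true))
}
```

an empty diff yields an empty plan, so nothing would be downloaded or removed in that case.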

ransomw1c · 2019-06-19T21:49:20Z

initial draft notes

i haven't written a particularly performant diffing algorithm.

> after bundle update, the --destination directory will be the same as if bundle download had been used to download into an empty directory, except only the downloaded files will be added.

more specifically, this means that the files will be the same. some empty directories might be left over.

galvare2 · 2019-06-20T23:49:16Z

If there is no difference, bundle update will do nothing, correct? So there would be no need to first do bundle diff and then only do bundle update if a diff is found?

jakedsouza · 2019-06-20T23:52:43Z

I think for flood and seismic immediate requirements, bundle update seems more important than diff.

ransomw1c · 2019-06-26T20:26:41Z

> If there is no difference, bundle update will do nothing, correct? So there would be no need to first do bundle diff and then only do bundle update if a diff is found?

yes, this is correct: under the hood, update uses the same data structure as diff, so if the diff is empty, nothing needs to happen during the update. in the current implementation, the metadata is re-written on update in every case, and there could be some additional check for empty diffs to prevent this. i'll make a note on #222 to that effect.

ransomw1c · 2019-06-27T17:56:28Z

inter-process concurrency (IPC)

requirements

in addition to the design sketch mentioned above, we need to address the possibility of multiple datamon processes accessing the same local bundle (a directory with .datamon folder containing metadata) at once. specifically, we'd like to be able to run multiple bundle update commands with the same --destination and not have undefined results.

discussion

a usual methodology for IPC by version control systems is lockfiles.

in git, `lockfile.h` describes how atomic writes and IPC locking are implemented via the same mechanism

 * * Mutual exclusion and atomic file updates. When we want to change
 *   a file, we create a lockfile `<filename>.lock`, write the new
 *   file contents into it, and then rename the lockfile to its final
 *   destination `<filename>`. We create the `<filename>.lock` file
 *   with `O_CREAT|O_EXCL` so that we can notice and fail if somebody
 *   else has already locked the file, then atomically rename the
 *   lockfile to its final destination to commit the changes and
 *   unlock the file.

so that, specifically, existence of the .git/index lockfile can be used by the porcelain to determine whether another process is accessing the local repository.
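the mechanism quoted from git can be sketched in Go roughly as follows. this is an illustration of the create-exclusive-then-rename pattern using the standard `os` package, not code from git or datamon; `writeFileLocked` and `runLockfileDemo` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// writeFileLocked writes data to "<path>.lock", created with
// O_CREATE|O_EXCL so the call fails if another writer holds the lock,
// then renames the lockfile over path to commit and unlock.
func writeFileLocked(path string, data []byte) error {
	lock := path + ".lock"
	f, err := os.OpenFile(lock, os.O_WRONLY|os.O_CREATE|os.O_EXCL, 0o644)
	if err != nil {
		return err // matches os.ErrExist when the lock is already held
	}
	if _, err := f.Write(data); err != nil {
		f.Close()
		os.Remove(lock)
		return err
	}
	if err := f.Close(); err != nil {
		os.Remove(lock)
		return err
	}
	if err := os.Rename(lock, path); err != nil {
		os.Remove(lock)
		return err
	}
	return nil
}

// runLockfileDemo writes a file, simulates a competing lock holder, and
// checks that the contended write fails without touching the file.
func runLockfileDemo() error {
	dir, err := os.MkdirTemp("", "lockfile-demo")
	if err != nil {
		return err
	}
	defer os.RemoveAll(dir)
	target := filepath.Join(dir, "index")

	if err := writeFileLocked(target, []byte("v1\n")); err != nil {
		return err
	}
	// Stand in for another process that currently holds the lock.
	held, err := os.OpenFile(target+".lock", os.O_WRONLY|os.O_CREATE|os.O_EXCL, 0o644)
	if err != nil {
		return err
	}
	held.Close()
	if err := writeFileLocked(target, []byte("v2\n")); !errors.Is(err, os.ErrExist) {
		return fmt.Errorf("expected lock contention, got %v", err)
	}
	got, err := os.ReadFile(target)
	if err != nil {
		return err
	}
	if string(got) != "v1\n" {
		return fmt.Errorf("file changed despite held lock: %q", got)
	}
	fmt.Printf("contents after contended write: %q\n", got)
	return nil
}

func main() {
	if err := runLockfileDemo(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

one caveat of this pattern is that a process that dies between creating and renaming the lockfile leaves a stale lock behind, which then has to be cleaned up by some other means.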

in datamon, the same storage.Store interface is used to describe both the local store and the remote (GCS) store. atomicity of writes can be implemented behind the existing Store interface, while acquiring and releasing access to files requires additions to the interface. it's not clear to me whether modifying the internal abstractions is a great idea just to get a first cut of the required functionality.

another option is using a lockfile at the command level, decoupling the locking on operations at a particular --destination from the lock on particular files within the directory. this is the option i'm currently leaning toward. it could start simply, just by exiting the process with an error if the destination is locked. further changes could implement queuing locks with multiple lockfiles, such that the process could either (according to a parameter) exit if --destination is locked or wait until it has access.
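a first cut of the command-level option, where the process simply errors out if the destination is locked, could look something like this Go sketch. the lockfile name, `lockDestination`, and `errDestinationLocked` are all assumptions made for illustration; none of them come from datamon.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// errDestinationLocked is returned when another process holds the lock.
var errDestinationLocked = errors.New("destination is locked by another process")

// lockName is a hypothetical lockfile name, not one datamon uses.
const lockName = ".datamon-update.lock"

// lockDestination takes a per-destination lock by creating the lockfile
// exclusively. On success it returns a function that releases the lock.
func lockDestination(dest string) (func() error, error) {
	path := filepath.Join(dest, lockName)
	f, err := os.OpenFile(path, os.O_WRONLY|os.O_CREATE|os.O_EXCL, 0o600)
	if err != nil {
		if errors.Is(err, os.ErrExist) {
			return nil, errDestinationLocked
		}
		return nil, err
	}
	// Record the owning pid to help with debugging stale locks.
	fmt.Fprintf(f, "%d\n", os.Getpid())
	if err := f.Close(); err != nil {
		os.Remove(path)
		return nil, err
	}
	return func() error { return os.Remove(path) }, nil
}

// runDestinationLockDemo checks that a second lock attempt fails while
// the first is held and succeeds again after release.
func runDestinationLockDemo() error {
	dest, err := os.MkdirTemp("", "bundle-dest")
	if err != nil {
		return err
	}
	defer os.RemoveAll(dest)

	unlock, err := lockDestination(dest)
	if err != nil {
		return err
	}
	if _, err := lockDestination(dest); !errors.Is(err, errDestinationLocked) {
		return fmt.Errorf("expected locked destination, got %v", err)
	}
	if err := unlock(); err != nil {
		return err
	}
	unlock, err = lockDestination(dest)
	if err != nil {
		return fmt.Errorf("lock should be free after release: %v", err)
	}
	return unlock()
}

func main() {
	if err := runDestinationLockDemo(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("destination lock behaved as expected")
}
```

a waiting mode could later be layered on top by retrying `lockDestination` until it stops returning `errDestinationLocked`, though that would not by itself give the queuing behavior mentioned above.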

it's possible that there's a channel-based way to implement IPC, yet lockfiles seem like a good place to start since they're a better-understood solution.

finally, in the case of bundle update, we could potentially allow multiple processes to be downloading files at once. i view this as a further optimization after locking on the destination, and will have to mull it over more. the idea is to append files on update such that the local bundle is the union of the initial download plus all updates.

ransomw1c added the feature-request, P0, and CLI labels Jun 19, 2019

ransomw1c self-assigned this Jun 19, 2019

This was referenced Jun 21, 2019

init bundle diff #218

Merged

download cleanup, concurrency patterns and otherwise #220

Merged

init bundle update #222

Merged

ransomw1c mentioned this issue Jun 27, 2019

dedupe cmd Run funcs (stores) #224

Merged
