Function to merge two Snapshot values #5707

illicitonion · 2018-04-16T14:01:59Z

The Snapshot type represents a set of files, which have particular contents.

To run a process whose inputs come from more than one snapshot (e.g. "a snapshot containing a compiler binary" and "some source files to compile"), we need to be able to merge snapshots.

Exactly how we expose this in Python is an open question (#5502), but it's reasonable to assume we'll want an in-Rust implementation of merging which we expose to Python somehow. So, let's write a static function (and some tests!) in snapshot.rs.

Snapshots have two components, which will need handling separately:

Digest. This is the digest of the Directory protocol buffer which represents the tree of files. Note that there are strict ordering requirements (alphabetical) for entries in this.
path_stats. This is a list of the entries in the Snapshot, ordered in the order that they were captured (i.e. grabbing a snapshot for ["foo", "bar"] will have its path_stats in that order). There isn't necessarily an obviously correct ordering here, so we should probably preserve the ordering of the first snapshot passed, and insert additional PathStats in a stable order afterwards (deduplicating as we go).
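
A minimal sketch of the path_stats half of the merge described above. The `PathStat` enum here is a hypothetical stand-in for the real type, `merge_path_stats` is an illustrative name, and two choices are assumptions rather than anything decided in this issue: "stable order" is taken to mean sorted by path, and when both snapshots contain the same path the entry from the first snapshot wins.

```rust
use std::collections::HashSet;
use std::path::PathBuf;

/// Hypothetical stand-in for the real PathStat type: a path plus its kind.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
enum PathStat {
    File(PathBuf),
    Dir(PathBuf),
}

impl PathStat {
    fn path(&self) -> &PathBuf {
        match self {
            PathStat::File(p) | PathStat::Dir(p) => p,
        }
    }
}

/// Keeps the entries of `first` in their captured order, then appends the
/// entries of `second` whose paths were not already present, sorted by path
/// (an assumed choice of "stable order").
fn merge_path_stats(first: &[PathStat], second: &[PathStat]) -> Vec<PathStat> {
    let mut seen: HashSet<PathBuf> = HashSet::new();
    let mut merged = Vec::new();

    // Preserve the ordering of the first snapshot, deduplicating as we go.
    for stat in first {
        if seen.insert(stat.path().clone()) {
            merged.push(stat.clone());
        }
    }

    // Collect only the paths the first snapshot did not already provide.
    let mut extras: Vec<PathStat> = second
        .iter()
        .filter(|stat| !seen.contains(stat.path()))
        .cloned()
        .collect();
    extras.sort_by(|a, b| a.path().cmp(b.path()));
    // After sorting, duplicates within `second` are adjacent.
    extras.dedup_by(|a, b| a.path() == b.path());

    merged.extend(extras);
    merged
}
```

Merging the Digest half is a separate step: the merged Directory entries have to be re-sorted alphabetically before the new digest is computed, which this sketch does not cover.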


cosmicexplorer · 2018-04-17T20:44:43Z

@illicitonion made the below comment in #5703:

I'm not sure, but either (definitely for a future follow-up):

1. The Python code here should just have a Snapshot, not a collection of Snapshots.

2. The Python code should have to handle overlapping paths.

I don't currently have a strong preference between the two, and I suspect both will end up being practical, but I suspect that recommending towards 1 will be nicer for rule writers.

This was in response to having to use an OrderedSet to deduplicate paths from multiple input snapshots in CatExecutionRequest. The instance that comment was about actually only needed a single snapshot, so it was able to drop the OrderedSet, but having to deduplicate file paths in general seems like it could introduce subtle errors and it would be great to have that handled when we merge snapshots.

stuhood · 2018-04-25T00:44:57Z

I believe that this is nearly working... will add some tests tomorrow.

@illicitonion : I do wonder whether we're missing an abstraction that would be approximately "entire Directory structure in memory with digests for files only". Basically, we're currently using Directory/DirectoryNode/FileNode, but that causes us to bounce back and forth between Digest and Directory as we manipulate a Directory. If instead we had a recursive structure that directly referenced its children rather than indirecting through Digest, we could manipulate that more easily.

As a strawman, it might look like:

struct LoadedDirectory {
  name: String,
  files: Vec<FileNode>, // still referenced via digests
  directories: Vec<LoadedDirectory>, // directly referenced, with no digest indirection
}

Perhaps the end result would not be any easier to work with though.
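
To make the trade-off concrete, here is a rough sketch of what a merge could look like over such an in-memory tree. It is a variation on the strawman, not the structure proposed above: children are keyed by name in `BTreeMap`s (so entries stay alphabetically sorted and the `name` field becomes the map key), file contents are represented by a plain digest string, and the conflict rules (same file name with different digests, or a name that is a file on one side and a directory on the other) are assumptions made for illustration.

```rust
use std::collections::btree_map::Entry;
use std::collections::BTreeMap;

/// Hypothetical in-memory directory tree. Files are still referenced by the
/// digest of their contents; subdirectories are held directly.
#[derive(Clone, Debug, Default, PartialEq, Eq)]
struct LoadedDirectory {
    files: BTreeMap<String, String>, // file name -> content digest
    directories: BTreeMap<String, LoadedDirectory>,
}

/// Recursively merges `b` into a copy of `a`, with no Store lookups needed
/// because every child directory is already in memory.
fn merge(a: &LoadedDirectory, b: &LoadedDirectory) -> Result<LoadedDirectory, String> {
    let mut out = a.clone();

    for (name, digest) in &b.files {
        if out.directories.contains_key(name) {
            return Err(format!("{} is both a file and a directory", name));
        }
        match out.files.entry(name.clone()) {
            Entry::Occupied(existing) => {
                // Identical files are deduplicated; differing ones conflict.
                if existing.get() != digest {
                    return Err(format!("conflicting contents for file {}", name));
                }
            }
            Entry::Vacant(slot) => {
                slot.insert(digest.clone());
            }
        }
    }

    for (name, child) in &b.directories {
        if out.files.contains_key(name) {
            return Err(format!("{} is both a file and a directory", name));
        }
        let merged_child = match out.directories.get(name) {
            Some(existing) => merge(existing, child)?,
            None => child.clone(),
        };
        out.directories.insert(name.clone(), merged_child);
    }

    Ok(out)
}
```

The recursion itself is short, but the digests of every modified directory would still need recomputing before the result could be stored again.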

illicitonion · 2018-04-25T09:22:59Z

That's an interesting idea. I suspect if we were going to do that, we'd probably just add a transitive Vec<bazel_protos::remote_execution::Directory>, rather than inventing a new structure...

That would jump the memory overhead of a snapshot from a constant ~40 bytes to ~80 bytes per file, which I suspect would be a non-trivial increase in graph size; but actually, because we already keep the Vec<PathStat> in Snapshots, I guess we'd basically just end up slightly-more-than doubling the memory footprint, because the most expensive part of a Directory is the filenames. Possibly a slightly crazy idea is that we could have a custom Path implementation which points at a FileNode and shares the underlying path bytes, which would make this basically free.

I guess there's also a middleground of keeping an in-memory cache of Digest -> Directory in the Store.
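
A minimal sketch of that middle ground, under stated assumptions: `Digest` and `Directory` are simplified stand-ins for the real types, `CachingStore` and `load_directory` are hypothetical names, the on-disk lookup is modelled as a closure, and the cache is unbounded (a real one would presumably need a size limit).

```rust
use std::collections::HashMap;

/// Hypothetical stand-ins for the real Digest and Directory types.
type Digest = String;

#[derive(Clone, Debug, PartialEq)]
struct Directory {
    entries: Vec<String>,
}

/// Wraps a slow lookup (e.g. a read from disk) with an in-memory
/// Digest -> Directory cache.
struct CachingStore<F: Fn(&Digest) -> Option<Directory>> {
    load_from_disk: F,
    cache: HashMap<Digest, Directory>,
    disk_loads: usize, // counts cache misses, for illustration only
}

impl<F: Fn(&Digest) -> Option<Directory>> CachingStore<F> {
    fn new(load_from_disk: F) -> Self {
        CachingStore {
            load_from_disk,
            cache: HashMap::new(),
            disk_loads: 0,
        }
    }

    fn load_directory(&mut self, digest: &Digest) -> Option<Directory> {
        // Serve repeated lookups from memory.
        if let Some(dir) = self.cache.get(digest) {
            return Some(dir.clone());
        }
        // Otherwise fall back to the slow path and remember the result.
        self.disk_loads += 1;
        let dir = (self.load_from_disk)(digest)?;
        self.cache.insert(digest.clone(), dir.clone());
        Some(dir)
    }
}
```

Because a Directory is immutable once it has a digest, entries in such a cache never need invalidating, only evicting.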

Is the problem you're trying to address:

1. Loading a Directory from a Store by Digest is expensive/slow.
2. Loading a Directory from a Store by Digest is unergonomic.
3. A mess of Future chaining.
or something else?

If 1, we should probably inline/cache, but I'd want to see numbers showing it's a problem, because I believe reading from LMDB on an SSD shouldn't be too crazy...
If 2 or 3, I'd definitely be interested in seeing your calling code to see what problems you're bumping into :)

stuhood · 2018-04-25T15:37:50Z

2 and 3, mostly. But also 1, because we will "always" be operating on an entire recursive directory when we operate on a Snapshot, and needing to recursively re-load the Directory from disk as it's manipulated seems like it would make a lot of round trips.

This isn't a strong feeling yet... just a thought.

### Problem

As described in #5707: we need a way to merge `Snapshot` objects (although we have not yet decided how to expose them to `@rule`s).

### Solution

Add `Snapshot::merge`.

### Result

Fixes #5707.

cosmicexplorer mentioned this issue Apr 17, 2018

convert usages of the ExecuteProcess helper into simple @rules to simplify snapshot consumption for process execution #5703

Merged

stuhood self-assigned this Apr 23, 2018

stuhood mentioned this issue Apr 25, 2018

Add support for merging Snapshots #5746

Merged

stuhood closed this as completed in #5746 Apr 26, 2018
