Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Function to merge two Snapshot values #5707

Closed
illicitonion opened this issue Apr 16, 2018 · 4 comments
Closed

Function to merge two Snapshot values #5707

illicitonion opened this issue Apr 16, 2018 · 4 comments
Assignees

Comments

@illicitonion
Copy link
Contributor

The Snapshot type represents a set of files, which have particular contents.

In order to merge snapshots (e.g. "a snapshot containing a compiler binary" and "some source files to compile") for process execution, we need to be able to merge snapshots.

Exactly how we expose this in Python is an open question (#5502) but it's reasonable to assume we'll want an in-rust implementation of merging which we expose to the Python somehow. So, let's write a static function (and some tests!) in snapshot.rs.

Snapshots have two components, which will need handling separately:

  1. Digest. This is the digest the Directory protocol buffer which represents the tree of files. Note that there are strict ordering requirements (alphabetical) for entries in this.

  2. path_stats. This is a list of the entries in the Snapshot, ordered in the order that they were captured (i.e. grabbing a snapshot for ["foo", "bar"] will have its path_stats in that order). There isn't necessarily an obviously correct ordering here, so we should probably preserve the ordering of the first snapshot passed, and insert additional PathStats in a stable order afterwards (deduplicating as we go).

@cosmicexplorer
Copy link
Contributor

@illicitonion made the below comment in #5703:

I'm not sure, but either (definitely for a future follow-up):

  1. The Python code here should just have a Snapshot, not a collection of Snapshots.
  2. The Python code should have to handle overlapping paths.

I don't currently have a strong preference between the two, and I suspect both will end up being practical, but I suspect that recommending towards 1 will be nicer for rule writers.

This was in response to having to use an OrderedSet to deduplicate paths from multiple input snapshots in CatExecutionRequest. The instance that comment was about actually only needed a single snapshot, so it was able to drop the OrderedSet, but having to deduplicate file paths in general seems like it could introduce subtle errors and it would be great to have that handled when we merge snapshots.

@stuhood
Copy link
Member

stuhood commented Apr 25, 2018

I believe that this is nearly working... will add some tests tomorrow.

@illicitonion : I do wonder whether we're missing an abstraction that would be approximately "entire Directory structure in memory with digests for files only". Basically, we're currently using Directory/DirectoryNode/FileNode, but that causes us to bounce back and forth between Digest and Directory as we manipulate a Directory. If instead we had a recursive structure that directly referenced its children rather than indirecting through Digest, we could manipulate that more easily.

As a strawman, it might look like:

struct LoadedDirectory {
  name: String,
  files: FileNode // still referenced via digests
  directories: Vec<LoadDirectory> // directly referenced, with no digest indirection
}

Perhaps the end result would not be any easier to work with though.

@illicitonion
Copy link
Contributor Author

That's an interesting idea. I suspect if we were going to do that, we'd probably just add a transitive Vec<bazel_protos::remote_execution::Directory>, rather than inventing a new structure...

That would jump the memory overhead of a snapshot from a constant ~40 bytes to ~80 bytes per file, which I suspect would be a non-trivial increase in graph size; but actually, because we already keep the Vec<PathStat> in Snapshots, I guess we'd basically just end up slightly-more-than doubling the memory footprint, because the most expensive part of a Directory is the filenames. Possibly a slightly crazy idea is that we could have a custom Path implementation which points at a FileNode and shares the underlying path bytes, which would make this basically free.

I guess there's also a middleground of keeping an in-memory cache of Digest -> Directory in the Store.

Is the problem you're trying to address:

  1. Loading a Directory from a Store by Digest is expensive/slow.
  2. Loading a Directory from a Store by Digest is unergonomic.
  3. A mess of Future chaining.
    or something else?

If 1, we should probably inline/cache, but I'd want to see numbers showing it's a problem, because I believe reading from LMDB on an SSD shouldn't be too crazy...
If 2 or 3, I'd definitely be interested in seeing your calling code to see what problems you're bumping into :)

@stuhood
Copy link
Member

stuhood commented Apr 25, 2018

2 and 3, mostly. But also 1, because we will "always" be operating on an entire recursive directory when we operate on a Snapshot, and needing to recursively re-load the Directory from disk as it's manipulated seems like it would make a lot of round trips.

This isn't a strong feeling yet... just a thought.

stuhood pushed a commit that referenced this issue Apr 26, 2018
### Problem

As described in #5707: we need a way to merge `Snapshot` objects (although we have not yet decided how to expose them to `@rule`s.)

### Solution

Add `Snapshot::merge`.

### Result

Fixes #5707.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants