Function to merge two Snapshot values #5707
Comments
@illicitonion made the below comment in #5703:
This was in response to having to use an OrderedSet to deduplicate paths from multiple input snapshots in CatExecutionRequest. The instance that comment was about actually only needed a single snapshot, so it was able to drop the OrderedSet, but having to deduplicate file paths in general seems like it could introduce subtle errors, and it would be great to have that handled when we merge snapshots.
I believe that this is nearly working... will add some tests tomorrow. @illicitonion : I do wonder whether we're missing an abstraction that would be approximately "entire Directory structure in memory with digests for files only". Basically, we're currently using As a strawman, it might look like:
Perhaps the end result would not be any easier to work with though.
That's an interesting idea. I suspect if we were going to do that, we'd probably just add a transitive That would jump the memory overhead of a snapshot from a constant ~40 bytes to ~80 bytes per file, which I suspect would be a non-trivial increase in graph size; but actually, because we already keep the I guess there's also a middle ground of keeping an in-memory cache of Is the problem you're trying to address:
If 1, we should probably inline/cache, but I'd want to see numbers showing it's a problem, because I believe reading from LMDB on an SSD shouldn't be too crazy...
2 and 3, mostly. But also 1, because we will "always" be operating on an entire recursive directory when we operate on a This isn't a strong feeling yet... just a thought.
The Snapshot type represents a set of files, which have particular contents.
In order to combine inputs for process execution (e.g. "a snapshot containing a compiler binary" and "some source files to compile"), we need to be able to merge snapshots.
Exactly how we expose this in Python is an open question (#5502), but it's reasonable to assume we'll want an in-Rust implementation of merging which we expose to Python somehow. So, let's write a static function (and some tests!) in snapshot.rs.
Snapshots have two components, which will need handling separately:
Digest. This is the digest of the Directory protocol buffer which represents the tree of files. Note that there are strict ordering requirements (alphabetical) for entries in this.
path_stats. This is a list of the entries in the Snapshot, in the order that they were captured (i.e. a snapshot captured for ["foo", "bar"] will have its path_stats in that order). There isn't necessarily an obviously correct ordering here, so we should probably preserve the ordering of the first snapshot passed, and insert additional PathStats in a stable order afterwards (deduplicating as we go).
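
To make the two ordering rules above concrete, here is a rough sketch in Rust. It is not the code in snapshot.rs: the PathStat struct below is a simplified, hypothetical stand-in that carries only a path, and a directory level is modelled as a flat list of (file name, digest) string pairs rather than a real Directory protocol buffer. Treating "same name, different digest" as an error is also an assumption; the issue does not say how conflicts should be handled.

```rust
use std::collections::btree_map::Entry;
use std::collections::{BTreeMap, HashSet};

/// Hypothetical, simplified stand-in for an entry in a Snapshot's path_stats.
/// The real PathStat type carries more information than a path.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct PathStat {
    pub path: String,
}

/// Merge path_stats lists. Entries from the first list keep their order;
/// entries from later lists are appended in the order they are encountered,
/// and any path that has already been seen is skipped.
pub fn merge_path_stats(snapshots: &[Vec<PathStat>]) -> Vec<PathStat> {
    let mut seen: HashSet<&str> = HashSet::new();
    let mut merged = Vec::new();
    for path_stats in snapshots {
        for stat in path_stats {
            // `insert` returns true only the first time a path is seen.
            if seen.insert(stat.path.as_str()) {
                merged.push(stat.clone());
            }
        }
    }
    merged
}

/// Merge the file entries of one directory level, given as
/// (file name, digest) pairs. The result is sorted by name (byte-wise,
/// via BTreeMap), mirroring the ordering requirement on Directory entries.
/// Assumption: the same name with a different digest is reported as an error.
pub fn merge_file_entries(
    dirs: &[Vec<(String, String)>],
) -> Result<Vec<(String, String)>, String> {
    let mut by_name: BTreeMap<String, String> = BTreeMap::new();
    for entries in dirs {
        for (name, digest) in entries {
            match by_name.entry(name.clone()) {
                Entry::Vacant(slot) => {
                    slot.insert(digest.clone());
                }
                Entry::Occupied(existing) => {
                    if existing.get() != digest {
                        return Err(format!("conflicting contents for {}", name));
                    }
                }
            }
        }
    }
    Ok(by_name.into_iter().collect())
}
```

Keeping the two functions separate reflects the point made above: the path_stats side cares about capture order, while the digest side must be re-sorted regardless of input order. A real implementation would also need to recurse into subdirectories and store the merged Directory to obtain its digest, which this sketch leaves out.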