Sparse registry indexes should be viewable atomically #10928
Coming from withoutboats/rfcs#7 I was asked how The Update Framework (TUF)'s snapshots feature might help here. For context, TUF is a framework for cryptographically authenticating software updates, and its snapshots feature provides git-like consistency across several files. TUF has 4 roles:

- root: the trust anchor, which signs the keys for the other roles
- targets: signs the hashes of the actual artifacts being distributed
- snapshot: signs a consistent view of all the targets metadata
- timestamp: frequently re-signs the latest snapshot to guarantee freshness
Separately, TUF also supports a notion of delegated targets, where you can specify policies for how third-party keys are allowed to sign specific files. That would be useful for letting end users sign crates, but it isn't necessary to implement in order to use TUF for the sparse index, and it's something that can be pursued as follow-up work as part of e.g. a Sigstore integration (see https://internals.rust-lang.org/t/pre-rfc-using-sigstore-for-signing-and-verifying-crates/18115/2).

If you're curious whether it's bad for @bors to hold three roles: not necessarily. TUF splits them up to give you the option of eventually factoring them into separate systems, which reduces the overall attack surface of each system and lets the systems cross-check each other. But you can assign several roles to @bors to get started, and just having the root role separate gives you survivable compromise of @bors.
Does TUF have any advice/recommendations on how we get from a bunch of files to a "snapshot"? The link suggests that the snapshot should include a list of all the files, each with its hash, but this is impractical at the scale of crates.io. For git indexes we already have a Merkle tree, so the snapshot can just be the commit hash (or the tree hash), as suggested in your RFC. For sparse registries we will probably have to invent our own thing, but it might as well learn lessons from and take advice from those who went before us.
You'd probably need to talk to a TUF expert about that. I can point one at this issue if you'd like. It might be possible to use delegated targets as something akin to git tree objects.
I'm a TUF maintainer and happy to answer any questions here.
This is very similar to a proposal in TUF for this exact problem. The short version is that you can make separate "snapshots" for each file (with the name, version number, and hash), then combine them using a Merkle tree, and sign the root of that Merkle tree with the timestamp role. A sparse Merkle tree would likely make this more efficient, but give the same properties. The proposal is a draft that is mostly pending real-world implementation and validation.
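A rough sketch of that combination: per-file snapshots of (name, version, hash) are hashed into leaves and folded into a single root for the timestamp role to sign. FNV-1a stands in here for a real cryptographic hash (a real implementation would use SHA-256 or similar), and the `name:version:hash` leaf format is invented for illustration:

```rust
// Non-cryptographic FNV-1a, used only as a stand-in hash for this sketch.
fn fnv1a(data: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for &b in data {
        h ^= b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

/// One per-file "snapshot" entry: crate name, version number, content hash.
struct Entry {
    name: &'static str,
    version: u32,
    content_hash: u64,
}

/// Hash each entry to a leaf, then fold pairs of nodes until one root remains.
/// The timestamp role would sign only this root.
fn merkle_root(entries: &[Entry]) -> u64 {
    assert!(!entries.is_empty(), "need at least one entry");
    let mut level: Vec<u64> = entries
        .iter()
        .map(|e| fnv1a(format!("{}:{}:{:x}", e.name, e.version, e.content_hash).as_bytes()))
        .collect();
    while level.len() > 1 {
        let next: Vec<u64> = level
            .chunks(2)
            .map(|pair| {
                // An odd node out is paired with itself.
                let right = *pair.get(1).unwrap_or(&pair[0]);
                fnv1a(&[pair[0].to_be_bytes(), right.to_be_bytes()].concat())
            })
            .collect();
        level = next;
    }
    level[0]
}
```

Changing any one per-file entry changes a leaf and therefore the root, which is what lets a single small signed value vouch for the whole index.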
That is an extremely interesting read. Thank you!
Sparse Merkle trees work well for hashmap-like key/value data, and many are designed to scale well with constant updates. Here's a Rust implementation of one: https://github.com/diem/diem/blob/main/storage/jellyfish-merkle/src/lib.rs This paper describes a similar idea:
The tree shape should only affect performance, not the security properties, so splitting the tree per folder would work as long as all data is included. As @tarcieri mentioned, the sparse Merkle tree has performance benefits for frequent updates, as fewer nodes have to change when a leaf is updated.
I found some time today to reread the TUF specification. I've been using some terminology wrong; I don't think it's technically the size of
Which suggests that if TUF were applied directly,
@Eh2406 delegated targets are each stored in their own .json files, which can be independently updated (and I believe they also support nested delegations, which would allow for a delegation hierarchy?). Though their main use case is to delegate authority via different signing keys, I think it might also be possible to use them with the same signing keys to break up targets.json into many small files?
@Eh2406 TUF does the final artifact signing through
Thanks for opening the discussion! To start with, could someone please point me to documentation on these sparse HTTP-based registries, so that at least I can help better think about the problem, and advise on how to use TUF here?
The documentation is in a to-be-landed PR here: https://github.com/rust-lang/cargo/pull/11224/files#diff-5e9d17cb5bab9eebb8cb0dc5e276eea6f62ffbce72a0e763a28cf102493e2104
A couple of counterpoints:

Because git had an easy way to represent a snapshot, signing the commit hash was an easy way to prove authenticity of all the data in the commit. However, I don't think the logic necessarily goes the other way: to prove authenticity of all the data, you don't have to have a single snapshot. For example, every individual file could be signed separately. This is still a guarantee that the file came from crates.io at some point, so it still prevents the CDN or some other network MITM from tampering with the data and publishing something that has never been published. Files can come with a signed version or signed last-modified timestamp, so rollback attacks (using terminology from TUF) on them can still be prevented.

A global snapshot has the advantage that a freeze attack requires the attacker to freeze everything, rather than individual files, so such an attack is easier to spot. However, minimum version requirements in
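The per-file alternative described above (each index file carries a signed version or last-modified number) can be sketched as a client-side rollback check. Signature verification is assumed to have already happened, and all type and field names here are hypothetical; this only shows the monotonicity rule:

```rust
use std::collections::HashMap;

/// A per-file signed record, as described above: each index file carries
/// a signed, monotonically increasing modification number.
struct SignedRecord {
    crate_name: String,
    version: u64,
}

/// Tracks the highest version ever seen per crate and rejects anything older.
struct RollbackGuard {
    seen: HashMap<String, u64>,
}

impl RollbackGuard {
    fn new() -> Self {
        RollbackGuard { seen: HashMap::new() }
    }

    /// Returns false if a mirror/MITM served a record older than one we have
    /// already accepted (a rollback attack, in TUF terminology).
    fn accept(&mut self, rec: &SignedRecord) -> bool {
        let best = self.seen.entry(rec.crate_name.clone()).or_insert(0);
        if rec.version < *best {
            return false;
        }
        *best = rec.version;
        true
    }
}
```

Note that this on its own does not detect a freeze attack (serving a stale-but-valid record forever), which is where signed expiry times or a global snapshot come in.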
Your counterpoints are, of course, all correct. Dependency resolution, which is the primary job of the index, does not require any of the properties asked for in this issue. And all of the properties requested in this issue can be obtained by falling back on the git index. Nonetheless, some of these properties are needed for other desired functionality in cargo, under circumstances that make "just pull the git index" too slow. Specifically, suggesting alternative names when a package is not found.
Yes, at some level this is an alternative API for a changelog. For performance, any changelog system needs a "snapshot" mechanism for compressing the history into a description of the current state. At the extreme, the registry could just publish snapshots, and the changes can be reconstructed by taking diffs, which is the proposal being suggested here. Thank you for reminding us to think about whether a more explicit method for tracking changes leads to a better API.
I have collected some relevant data. This also got me thinking about one of the primary use cases: "suggesting alternative names when a package is not found". If the user typed in the package
I was recommended to try https://en.wikipedia.org/wiki/Incremental_encoding on the plain names file. With a naive implementation (which is only valuable for size comparisons) the file is 0.65 MB, and 0.35 MB after gzip. That is a nontrivial size savings. Of course, defining our own pre-compression step will require careful documentation and testing.
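For the curious, a naive front-coding (incremental encoding) pass over a sorted name list looks roughly like this. The `(prefix_len, suffix)` representation is just one possible serialization, and crate names are assumed to be ASCII so byte slicing is safe:

```rust
/// Each entry stores the length of the prefix shared with the previous
/// name, plus the differing suffix. Sorted input maximizes shared prefixes.
fn encode(names: &[&str]) -> Vec<(usize, String)> {
    let mut out = Vec::with_capacity(names.len());
    let mut prev = "";
    for &name in names {
        let shared = prev
            .bytes()
            .zip(name.bytes())
            .take_while(|(a, b)| a == b)
            .count();
        out.push((shared, name[shared..].to_string()));
        prev = name;
    }
    out
}

/// Rebuild the full names by reusing the prefix of the previous decoded name.
fn decode(encoded: &[(usize, String)]) -> Vec<String> {
    let mut out: Vec<String> = Vec::with_capacity(encoded.len());
    for (shared, suffix) in encoded {
        let mut name = match out.last() {
            Some(prev) => prev[..*shared].to_string(),
            None => String::new(),
        };
        name.push_str(suffix);
        out.push(name);
    }
    out
}
```

Crates that share long prefixes (serde, serde_json, serde_yaml, ...) shrink to a few bytes each, which is where the size savings come from.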
I believe that trying to correct typos on the client side would amplify the dangers of typosquatting. The index lacks information about popularity/reputation/relevance, so it can't pick a better crate from among candidates; it can only pick based on names alone, without knowing which ones are spam, and use of mere edit distance opens up the risk of picking easy-to-register close names (e.g. instead of correcting

Because typo handling is not a high-traffic service critical for Cargo's uptime, and crates.io has all of the necessary information available server-side to make good and safe suggestions, I think this particular case should be handled with a server-side API instead.
This came up again in discussions today. I tracked down the code I used for trying incremental encoding and ran it again to collect new numbers. After posting, I had hardened the implementation so the data successfully and always round-trips, as tested by proptest. Uncompressed, we're looking at 1.2 MB, which compresses to 658 KB. This imagines that we're storing a u16 version number and not a hash. Here is the code.
Have use cases for this emerged? The sparse protocol has been the default for a while now, and I haven't seen people asking for consistent snapshots (or maybe the git copy is sufficient?). I've implemented a private registry and a few registry-adjacent tools, and haven't needed the cross-crate consistency myself. BTW, even for normal use the git index has broken crates depending on crates that don't exist any more, so even use of the git snapshot requires dealing with "dangling pointers" between crates.
No new use cases have emerged. As you and I have both pointed out to many people, many times, dependency resolution does not need this issue to be solved. For now, we are recommending that users who need consistent views or an exact list of changes use the git protocol.

Where this conversation came up again is in discussions of untrusted mirrors, specifically signing schemes that prevent a mirror/MITM from serving an old version of an index file. In TUF terminology this is called a rollback attack. The signature on each index file could be short-lived to prevent this attack, but then crates.io would need to continuously refresh the signatures for all index files, which ends up being expensive when the registry gets big. A solution to this issue would allow most files to have a long-lived signature, with only the "consistent snapshot" file needing a frequently refreshed signature. (Probably stored in a separate "timestamp" file for bandwidth reasons, but that's an implementation detail.)

The other major use case for this is tools that want to process ALL changes to the index: for example, mirrors that want to proactively ingest all new publishes, or security scanners. These tools can currently reliably run off the git index, which can provide deltas between "the current state" and "the state when the tool was last run", as long as the tool has access to a file system to do the git operations on. Or they can be built on top of the just-released RSS support in crates.io, as long as they can guarantee that they always run more frequently than the RSS expiration time.
Since it hasn't been mentioned yet: TAP 16 (snapshot Merkle trees) would probably be the best mechanism on the TUF side to encode this, if that's desired (though I understand there's also work on using RSA accumulators, which could potentially be even more succinct).
BTW, you can find @mnm678's investigation of Merkle trees vs RSA accumulators in Chapter 3 of her PhD thesis. If I'm not mistaken, I think she and Zachary Newman found that Merkle trees generally work better.
For the past 2 weeks I've been studying an intriguing but overly complicated idea. I'm going to write down what I figured out in the hope that I can stop thinking about it and move on.

Some review: in order to have a snapshot of the index, one needs a list of every file and, for each file, a hash or version number. A simple CSV of

New ideas: it's kind of a shame to tie the large "list of names" to the "modification numbers". A new name getting added to the index happens an order of magnitude less frequently than a new version being published, and the names are ~75% of the size of the file. Most downloads of a new snapshot will end up downloading a large amount of name data that has not changed. So let's put them into two files:
To check the most current modification number for a crate, you find the position of the crate name in the first file and look up the value of the corresponding spot in the second file.

As @kornelski points out, most users do not need to know about a new name; they only want to check if any of the crates they used for resolution have been held back by a MITM. For this use case they do not need the list of names, they only need the modification numbers. But the entire file-size savings came from not storing the names in the second file. What to do...

Probabilistic data structures: there are data structures that act like hash maps but do not store the keys. The state-of-the-art algorithms can guarantee that

The complexity of basic operations:
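The position-based lookup described above can be sketched as follows, assuming both files are already parsed into memory (a sorted name list that changes rarely, and a parallel array of modification numbers that changes often; the names and numbers are illustrative):

```rust
/// Look up a crate's modification number by its position in the names file.
/// `names` comes from the rarely-updated names file (sorted), `mod_numbers`
/// is the parallel array from the frequently-updated second file.
fn modification_number(
    names: &[&str],
    mod_numbers: &[u32],
    krate: &str,
) -> Option<u32> {
    debug_assert_eq!(names.len(), mod_numbers.len());
    names
        .binary_search(&krate) // names are sorted, so O(log n)
        .ok()
        .map(|pos| mod_numbers[pos])
}
```

The point of the split is that a client who already has the names file only needs to re-download the (much smaller) second file to refresh every modification number.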
The

These file sizes seem big, because they are approaching optimal for

Whether any of this is useful or not I'm not sure. But hopefully I can be done thinking about it.

Edit: I was really hoping that by writing it up I could be done thinking about it. But it didn't work... I think the algorithm is using "arithmetic coding" (or some homegrown alternative), but I can explain it better using Huffman encoding. Take a step back. The size of a probabilistic map is

That's where Huffman encoding comes in. We encode the values according to the Huffman algorithm, then process things one bit at a time. The first data structure encodes the first bit and needs to be correct for every key, so it takes quite a bit of space; it avoids being a disastrous amount because it's only one bit wide. We can do this again for the next bit, but this time we can skip any keys whose Huffman encoding has already reached a leaf node. We don't care what value is returned by the structure for that bit of that key, because a reader processing that key will already have found a value. With every additional bit we have fewer keys to process and therefore a smaller data structure.
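To make the "approaching optimal" intuition concrete: a map that stores no keys cannot do much better than about N * H(V) bits, where H(V) is the Shannon entropy of the value distribution. A quick sketch of that estimate (the value arrays are illustrative, not crates.io data):

```rust
use std::collections::HashMap;

/// Shannon entropy, in bits per value, of a value distribution.
fn entropy_bits(values: &[u32]) -> f64 {
    let mut counts: HashMap<u32, usize> = HashMap::new();
    for &v in values {
        *counts.entry(v).or_insert(0) += 1;
    }
    let n = values.len() as f64;
    counts
        .values()
        .map(|&c| {
            let p = c as f64 / n;
            -p * p.log2()
        })
        .sum()
}

/// Information-theoretic lower bound (in bits) on a map that stores only
/// values, no keys: roughly N * H(V).
fn estimated_map_bits(values: &[u32]) -> f64 {
    values.len() as f64 * entropy_bits(values)
}
```

If most crates have not changed since the last snapshot, the value distribution is heavily skewed, H(V) is small, and the keyless map can be far smaller than a naive values array.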
Interesting! Nice to see @bitwiseshiftleft is writing Rust, too! @Eh2406 If you're generally interested in efficiently updatable verifiable map-like data structures, it would also be worth taking a look at Jellyfish Merkle Trees. The Rust implementation was originally developed by Libra, but there are several forks, for example: https://docs.rs/jmt/
Oh hello @tarcieri, nice to see you too. As I'm not an expert in the Rust registry system, I don't really know whether compressed_map or similar is suitable here, or if you'd rather just use Merkle trees. Merkle usually works well if you want short proofs and efficient updates, which seems appropriate here, whereas compressed_map is good for downloading some info on the state of the whole system if you don't need to know what keys are in the map. It's also only research-grade, and not updatable (none of these matrix things are) except by using it to compress deltas.

My compressed_map crate can be thought of as more or less the same idea as a Huffman code plus a BinaryFuse-like matrix map, much as @Eh2406 suggests. It's not quite a Huffman code, because the possibility of getting the correct answer "by accident" changes the cost model, but it's the same idea. The matrix map is much more complicated to solve than BinaryFuse (which came out later), and also slower, but it gets a few % better compression. This was my first (and so far only) significant Rust project, so that is part of why the code is difficult to read. Also, I never really finished the paper on it...
Sorry for the late comment, but you don't need to list hashes unless you are worried about malicious mirrors (see Section 5.6 in this paper). If there are no mirrors and only ever one canonical package index, then you can safely save space by omitting hashes.

Edit: it is possible I misunderstood what you meant by "snapshot". If you mean the TUF snapshot metadata, then my comment above holds true. If not, and you mean what is called the TUF targets metadata, then please ignore 🙂
I am uncertain whether I got the TUF terminology correct. In either case, malicious mirrors are exactly why this issue has come up again. Currently it requires a large amount of trust and coordination for a new official mirror to be set up; we only go through this effort for the official CDNs. If we had TUF-like protections against malicious mirrors, then the project would have a much easier time endorsing whatever other mirrors people find useful. (We are particularly excited about someone building mirrors for common CI providers.)
Problem
With git-based registries, any one commit from the registry represented an atomic picture of what all packages were up to at that point in time. This had a number of useful properties:
None of these are possible with the current design of the sparse http-based registries.
Proposed Solution
It's possible to re-obtain the list of all available crates by adding a `names-file` that lists all crates alphabetically at the root of the index. For crates.io this file would be large, but it would probably compress well enough to be usable.

If we add to that `names-file` the hash of each index file, then it uniquely identifies a snapshot of the index! Of course, the hashes, being mostly pseudorandom numbers, will compress really badly. At crates.io scale the file will be too big to live in one piece. We could split it up into several files, say one per folder in the current index structure. In order to get an atomic picture we would now need a `meta-file` recording the hashes of the `names-file`s. This works, but it would be a lot of overhead for smaller registries. (An RFC for this idea will require describing if/how the number of hash files is configurable.)

What happens if an index file is updated after the `names-file` cargo was looking at? cargo will get an index file whose hash does not match the atomic picture requested. So there needs to be some kind of cache buster that makes sure the received version of the file matches the requested version. We could put this in the name: `3/f/foo` can become `3/f/foo-abcdef123456`. However, this would require crates.io to have the same content at both of those locations (assuming stabilization of the current design of #2789), which is just asking for things to come out of sync. We could use a query parameter: `3/f/foo` gets you the latest version no matter the hash, and `3/f/foo?hash=abcdef123456` gets you the version with the hash `abcdef123456`. However, this requires the backing server to be able to look up versions of a file by their hash. Crates.io is currently using S3, which does not provide this functionality; you can look up old versions of a file, but only by the versionID S3 generated. So let's double the size of our `names-file`s by recording, for each crate, a cache-busting string (crates.io can use S3's versionID) and the hash, with the semantics that if you ask for that crate with the cache buster appended, you should always receive a response with that hash.

Notes
If this design is so much better than the current sparse registry proposal, why should we ever stabilize the current proposal?
Fundamentally, this design is going to be more work for the server to implement. The current sparse registry design has exactly the same contents as a git registry, and is therefore very easy for registries to upgrade to. There will be many users for whom the additional complexity is not needed, and for them the current sparse registry proposal is a better fit.
I did want to open this discussion to make sure we have a path to this more atomic design that is compatible with what we are stabilizing in sparse registries. If changes need to be made to sparse registries, now is the time.
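To make the proposal above concrete, here is a rough sketch of how a client might build the cache-busted request. The path computation follows the existing index layout (1/, 2/, 3/&lt;first char&gt;/, then &lt;first two&gt;/&lt;next two&gt;/ for longer names, as in `3/f/foo`); the `?cb=` query parameter name is invented for illustration:

```rust
/// Compute the index path for a crate name, following the layout used by
/// the current index (e.g. "foo" -> "3/f/foo", "serde" -> "se/rd/serde").
fn index_path(name: &str) -> String {
    let lower = name.to_lowercase(); // crate names are ASCII
    match lower.len() {
        0 => panic!("empty crate name"),
        1 => format!("1/{}", lower),
        2 => format!("2/{}", lower),
        3 => format!("3/{}/{}", &lower[..1], lower),
        _ => format!("{}/{}/{}", &lower[..2], &lower[2..4], lower),
    }
}

/// Append the per-crate cache-busting string from the names-file, so the
/// server (or CDN) is forced to serve exactly the version the snapshot
/// promised. The parameter name `cb` is a placeholder for this sketch.
fn snapshot_url(name: &str, cache_buster: &str) -> String {
    format!("{}?cb={}", index_path(name), cache_buster)
}
```

After fetching, the client would hash the response and compare against the hash recorded in the `names-file`; a mismatch means the snapshot guarantee was violated.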