Reproducible crate builds #8864
For each entry in the tar archive, we generate a new timestamp. Normally cargo will be fast enough that we get a consistent timestamp, but that need not be the case. There's very little reason to produce different timestamps for different files and it's slightly more efficient not to need to make multiple queries, so let's instead generate a single timestamp for all entries that we generate.
For projects supporting reproducible builds, it's possible to set the timestamp used in artifacts by setting SOURCE_DATE_EPOCH to a decimal Unix timestamp. This is helpful because it allows users to produce the exact same artifact, regardless of when the project was built, and it also means that services which generate crates from source can generate a consistent crate without having to store previously built artifacts. For all these reasons, let's honor the SOURCE_DATE_EPOCH environment variable if it's set and use the current timestamp if it's not.
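The two commit messages above boil down to "pick one timestamp up front, preferring SOURCE_DATE_EPOCH". A minimal sketch of that idea follows, using only the standard library. The function name `archive_timestamp` is hypothetical and this is not the code from the PR; in particular, silently falling back to the current time on a malformed value is an assumption made here for brevity.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Choose a single timestamp (seconds since the Unix epoch) to reuse for
/// every entry in the archive.
///
/// `source_date_epoch` is the raw value of the SOURCE_DATE_EPOCH variable,
/// if it was set. Assumption: a value that does not parse as a decimal
/// integer is ignored and the current time is used instead.
fn archive_timestamp(source_date_epoch: Option<&str>) -> u64 {
    source_date_epoch
        .and_then(|raw| raw.trim().parse::<u64>().ok())
        .unwrap_or_else(|| {
            // Query the clock once; a clock set before 1970 maps to 0.
            SystemTime::now()
                .duration_since(UNIX_EPOCH)
                .map(|d| d.as_secs())
                .unwrap_or(0)
        })
}
```

A caller would compute the value once, for example with `archive_timestamp(std::env::var("SOURCE_DATE_EPOCH").ok().as_deref())`, and pass the result to every header it writes rather than asking for the time per entry.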
Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @ehuss (or someone else) soon. If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes. Please see the contribution instructions for more information.
Btw it's not that hard to make flate2 consistent between different OSs. Just revert #6317.
Currently, when reading a file from disk, we include several pieces of data from the on-disk file, including the user and group names and IDs, the device major and minor, the mode, and the timestamp. This means that our archives differ between systems, sometimes in unhelpful ways. In addition, most users probably did not intend to share information about their user and group settings, operating system and disk type, and umask. While these aren't huge privacy leaks, cargo doesn't use them when extracting archives, so there's no value to including them.
Since using consistent data means that our archives are reproducible and don't leak user data, both of which are desirable features, let's canonicalize the header to strip out identifying information.
We set the user and group information to 0 and root, since that's the only user that's typically consistent among Unix systems. Setting these values doesn't create a security risk since tar can't change the ownership of files when it's running as a normal unprivileged user.
Similarly, we set the device major and minor to 0. There is no useful value here that's portable across systems, and it does not affect extraction in any way.
We also set the timestamp to the same one that we use for generated files. This is probably the biggest loss of relevant data, but considering that cargo doesn't otherwise use it and honoring it makes the archives unreproducible, we canonicalize it as well.
Finally, we canonicalize the mode of an item we're storing by looking at the executable bit and using mode 755 if it's set and mode 644 if it's not. We already use 644 as the default for generated files, and this is the same algorithm that Git uses to determine whether a file should be considered executable.
The tests don't test this case because there's no portable way to create executable files on Windows.
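The mode rule in this commit message is small enough to show directly. The sketch below is illustrative only: `canonical_mode` is a hypothetical name, and checking the owner's execute bit (`0o100`) is an assumption about which "executable bit" is meant, chosen because that is the bit Git inspects.

```rust
/// Collapse an on-disk Unix mode to one of two canonical values.
///
/// Assumption: "the executable bit" means the owner's execute bit (0o100).
/// All other permission bits, and the file-type bits, are ignored.
fn canonical_mode(on_disk_mode: u32) -> u32 {
    if (on_disk_mode & 0o100) != 0 {
        0o755 // executable for the owner: store as rwxr-xr-x
    } else {
        0o644 // everything else: store as rw-r--r--
    }
}
```

With this rule a file checked out under umask 077 (`0o600`) and the same file under umask 022 (`0o644`) both end up as `0o644` in the archive, which is what removes the umask from the output.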
a1e66f7 to e46ca84
Sure, if that's a thing we want to do, I can do that.
Thanks for the PR! This is definitely something we'd like to enable! I'm curious, though, about whether things could be fixed by just not setting the mtime on generated files? We already do that I think for normal files (skip setting the mtime and other metadata). Basically I'm curious if we could just avoid setting all these fields and get the same result.
No, I don't think that's going to be sufficient to enable reproducible builds. GNU tar archives store information about the user that's written into the archive. For example, on my system, we store information about my user ID in the archive, as well as the permissions of the files on my system. Two different users using the same toolchain should be able to produce identical archives, regardless of their user ID or umask settings.
Right I agree none of that should be in there, but is it currently in there for tarballs produced? AFAIK nothing puts it in there by default with the |
The user/group names are set to be empty by default but the mode and the user id/group are included. The mode should be preserved imo, beyond just the executable bit. Git does it as well, and some build systems might require specific modes.
Maybe one should just call |
Ah I see that this is buried in |
Currently, when reading a file from disk, we include several pieces of data from the on-disk file, including the user and group names and IDs, the device major and minor, the mode, and the timestamp. This means that our archives differ between systems, sometimes in unhelpful ways. In addition, most users probably did not intend to share information about their user and group settings, operating system and disk type, and umask. While these aren't huge privacy leaks, cargo doesn't use them when extracting archives, so there's no value to including them.
Since using consistent data means that our archives are reproducible and don't leak user data, both of which are desirable features, let's canonicalize the header to strip out identifying information.
Omit the inclusion of the timestamp for generated files and tell the tar crate to copy deterministic data. That will omit all of the data we don't care about and also canonicalize the mode properly.
Our tests don't check the specifics of certain fields because they differ between the generated files and the files that are archived from the disk format. They are still canonicalized correctly for each type, however.
Okay, as requested, I've switched to use the |
@bors: r+ Looks good to me, thanks!
📌 Commit 449ead0 has been approved by |
⌛ Testing commit 449ead0 with merge 8ff15f4aae822c13fcff31bdd7c58a0c09faacd7...
💔 Test failed - checks-actions
@bors: retry
⌛ Testing commit 449ead0 with merge 83c6b4c48e25549764939a352fd5127b4912d7e7...
💔 Test failed - checks-actions
@bors: retry
☀️ Test successful - checks-actions
Update cargo

10 commits in 2af662e22177a839763ac8fb70d245a680b15214..bfca1cd22bf514d5f2b6c1089b0ded0ba7dfaa6e
2020-11-12 19:04:56 +0000 to 2020-11-24 16:33:21 +0000
- Shrink the progress bar, to give more space after it. (rust-lang/cargo#8892)
- Add some comments to the toml code (rust-lang/cargo#8887)
- Start searching git config at new path (rust-lang/cargo#8886)
- Fix documentation for CARGO_PRIMARY_PACKAGE. (rust-lang/cargo#8891)
- Bump to 0.51.0, update changelog (rust-lang/cargo#8894)
- Publish target's "doc" setting when emitting metadata (rust-lang/cargo#8869)
- Relaxes expectation of `cargo test` tests to accept test execution time (rust-lang/cargo#8884)
- Finish implementation of `-Zextra-link-arg`. (rust-lang/cargo#8441)
- Reproducible crate builds (rust-lang/cargo#8864)
- Allow resolver="1" to explicitly use the old resolver behavior. (rust-lang/cargo#8857)
Pkgsrc changes:
 * Adjust patches, convert tabs to spaces so that tests pass.
 * Remove patches which are no longer needed (upstream changed)
 * Minor adjustments for SunOS, e.g. disable stack probes.
 * Adjust cargo checksum patching accordingly.
 * Remove commented-out use of PATCHELF on NetBSD, which doesn't work anyway...

Upstream changes:

Version 1.49.0 (2020-12-31)
============================

Language
-----------------------

- [Unions can now implement `Drop`, and you can now have a field in a union with `ManuallyDrop<T>`.][77547]
- [You can now cast uninhabited enums to integers.][76199]
- [You can now bind by reference and by move in patterns.][76119] This allows you to selectively borrow individual components of a type. E.g.
  ```rust
  #[derive(Debug)]
  struct Person {
      name: String,
      age: u8,
  }

  let person = Person {
      name: String::from("Alice"),
      age: 20,
  };

  // `name` is moved out of person, but `age` is referenced.
  let Person { name, ref age } = person;
  println!("{} {}", name, age);
  ```

Compiler
-----------------------

- [Added tier 1\* support for `aarch64-unknown-linux-gnu`.][78228]
- [Added tier 2 support for `aarch64-apple-darwin`.][75991]
- [Added tier 2 support for `aarch64-pc-windows-msvc`.][75914]
- [Added tier 3 support for `mipsel-unknown-none`.][78676]
- [Raised the minimum supported LLVM version to LLVM 9.][78848]
- [Output from threads spawned in tests is now captured.][78227]
- [Change os and vendor values to "none" and "unknown" for some targets][78951]

\* Refer to Rust's [platform support page][forge-platform-support] for more information on Rust's tiered platform support.

Libraries
-----------------------

- [`RangeInclusive` now checks for exhaustion when calling `contains` and indexing.][78109]
- [`ToString::to_string` now no longer shrinks the internal buffer in the default implementation.][77997]
- [`ops::{Index, IndexMut}` are now implemented for fixed sized arrays of any length.][74989]

Stabilized APIs
---------------

- [`slice::select_nth_unstable`]
- [`slice::select_nth_unstable_by`]
- [`slice::select_nth_unstable_by_key`]

The following previously stable methods are now `const`.

- [`Poll::is_ready`]
- [`Poll::is_pending`]

Cargo
-----------------------

- [Building a crate with `cargo-package` should now be independently reproducible.][cargo/8864]
- [`cargo-tree` now marks proc-macro crates.][cargo/8765]
- [Added `CARGO_PRIMARY_PACKAGE` build-time environment variable.][cargo/8758] This variable will be set if the crate being built is one the user selected to build, either with `-p` or through defaults.
- [You can now use glob patterns when specifying packages & targets.][cargo/8752]

Compatibility Notes
-------------------

- [Demoted `i686-unknown-freebsd` from host tier 2 to target tier 2 support.][78746]
- [Macros that end with a semi-colon are now treated as statements even if they expand to nothing.][78376]
- [Rustc will now check for the validity of some built-in attributes on enum variants.][77015] Previously such invalid or unused attributes could be ignored.
- Leading whitespace is stripped more uniformly in documentation comments, which may change behavior. You can read [this post about the changes][rustdoc-ws-post] for more details.
- [Trait bounds are no longer inferred for associated types.][79904]

Internal Only
-------------

These changes provide no direct user facing benefits, but represent significant improvements to the internals and overall performance of rustc and related tools.

- [rustc's internal crates are now compiled using the `initial-exec` Thread Local Storage model.][78201]
- [Calculate visibilities once in resolve.][78077]
- [Added `system` to the `llvm-libunwind` bootstrap config option.][77703]
- [Added `--color` for configuring terminal color support to bootstrap.][79004]

[75991]: rust-lang/rust#75991
[78951]: rust-lang/rust#78951
[78848]: rust-lang/rust#78848
[78746]: rust-lang/rust#78746
[78376]: rust-lang/rust#78376
[78228]: rust-lang/rust#78228
[78227]: rust-lang/rust#78227
[78201]: rust-lang/rust#78201
[78109]: rust-lang/rust#78109
[78077]: rust-lang/rust#78077
[77997]: rust-lang/rust#77997
[77703]: rust-lang/rust#77703
[77547]: rust-lang/rust#77547
[77015]: rust-lang/rust#77015
[76199]: rust-lang/rust#76199
[76119]: rust-lang/rust#76119
[75914]: rust-lang/rust#75914
[74989]: rust-lang/rust#74989
[79004]: rust-lang/rust#79004
[78676]: rust-lang/rust#78676
[79904]: rust-lang/rust#79904
[cargo/8864]: rust-lang/cargo#8864
[cargo/8765]: rust-lang/cargo#8765
[cargo/8758]: rust-lang/cargo#8758
[cargo/8752]: rust-lang/cargo#8752
[`slice::select_nth_unstable`]: https://doc.rust-lang.org/nightly/std/primitive.slice.html#method.select_nth_unstable
[`slice::select_nth_unstable_by`]: https://doc.rust-lang.org/nightly/std/primitive.slice.html#method.select_nth_unstable_by
[`slice::select_nth_unstable_by_key`]: https://doc.rust-lang.org/nightly/std/primitive.slice.html#method.select_nth_unstable_by_key
[`hint::spin_loop`]: https://doc.rust-lang.org/stable/std/hint/fn.spin_loop.html
[`Poll::is_ready`]: https://doc.rust-lang.org/stable/std/task/enum.Poll.html#method.is_ready
[`Poll::is_pending`]: https://doc.rust-lang.org/stable/std/task/enum.Poll.html#method.is_pending
[rustdoc-ws-post]: https://blog.guillaume-gomez.fr/articles/2020-11-11+New+doc+comment+handling+in+rustdoc
This series introduces reproducible crate builds. Since crates are essentially gzipped tar archives, we canonicalize the fields such that they don't contain extraneous and potentially privacy-leaking data such as user and group names and IDs, device major and minor, and system timestamps. Outside of the timestamps, the user probably did not intend to share information about their user or system, so this also improves developer privacy somewhat.
The individual commit messages include copious details about the individual changes involved and the rationale for this change, but roughly, the idea is that by setting the environment variable `SOURCE_DATE_EPOCH`, which is the preferred way to specify a fixed timestamp by the Reproducible Builds project, we will produce a fully reproducible archive. In any event, we will now produce consistent timestamps throughout the archive and avoid looking up the system time repeatedly.
If desired, I could hash the produced crate in the tests, but I feel that would be a little overkill, especially since it's possible that one of our dependencies (e.g., flate2) might change and result in us producing an equivalent but different archive. Since reproducible builds use a consistent toolchain, that's not a problem here.
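To make the reproducibility claim concrete, here is a rough, standard-library-only sketch that fills in one 512-byte tar header by hand. It is not cargo's code: cargo goes through the tar crate, and this sketch assumes the POSIX ustar field layout rather than the GNU variant mentioned in the discussion. The names `put_octal` and `canonical_header` are made up for illustration. The point is only that once uid, gid, names and device numbers are fixed and the timestamp is supplied by the caller, identical inputs give identical bytes.

```rust
/// Write `value` as zero-padded octal into `field`, ending with a NUL byte.
/// Panics if the octal text does not fit exactly (fine for a sketch).
fn put_octal(field: &mut [u8], value: u64) {
    let width = field.len() - 1;
    let text = format!("{:0width$o}", value, width = width);
    field[..width].copy_from_slice(text.as_bytes());
    field[width] = 0;
}

/// Build a ustar header for a regular file using only canonical metadata.
fn canonical_header(name: &str, size: u64, mtime: u64, executable: bool) -> [u8; 512] {
    assert!(name.len() <= 100, "long names need the prefix field; not handled here");
    let mut h = [0u8; 512];
    h[..name.len()].copy_from_slice(name.as_bytes());                    // name
    put_octal(&mut h[100..108], if executable { 0o755 } else { 0o644 }); // mode
    put_octal(&mut h[108..116], 0);                                      // uid
    put_octal(&mut h[116..124], 0);                                      // gid
    put_octal(&mut h[124..136], size);                                   // size
    put_octal(&mut h[136..148], mtime);                                  // mtime
    h[156] = b'0';                                                       // typeflag: regular file
    h[257..263].copy_from_slice(b"ustar\0");                             // magic
    h[263..265].copy_from_slice(b"00");                                  // version
    // uname (265..297) and gname (297..329) are left empty in this sketch.
    put_octal(&mut h[329..337], 0);                                      // devmajor
    put_octal(&mut h[337..345], 0);                                      // devminor
    // The checksum is the byte sum of the header with the checksum field
    // itself counted as eight spaces, stored as six octal digits, NUL, space.
    h[148..156].fill(b' ');
    let sum: u64 = h.iter().map(|&b| u64::from(b)).sum();
    h[148..156].copy_from_slice(format!("{:06o}\0 ", sum).as_bytes());
    h
}
```

Calling `canonical_header("src/lib.rs", 120, 1_600_000_000, false)` twice, on any machine and as any user, gives the same 512 bytes, whereas a header populated from on-disk metadata would vary with the user ID, umask and file modification time.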
Fixes #8612