Reproducible crate builds #8864

Merged
merged 4 commits into from
Nov 18, 2020

bk2204
Contributor

@bk2204 bk2204 commented Nov 15, 2020

This series introduces reproducible crate builds. Since crates are essentially gzipped tar archives, we canonicalize the fields such that they don't contain extraneous and potentially privacy-leaking data such as user and group names and IDs, device major and minor, and system timestamps. Outside of the timestamps, the user probably did not intend to share information about their user or system, so this also improves developer privacy somewhat.

The individual commit messages include copious details about the individual changes involved and the rationale for this change, but roughly, the idea is that by setting the environment variable SOURCE_DATE_EPOCH, which is the preferred way to specify a fixed timestamp by the Reproducible Builds project, we will produce a fully reproducible archive. In any event, we will now produce consistent timestamps throughout the archive and avoid looking up the system time repeatedly.

If desired, I could hash the produced crate in the tests, but I feel that would be a little overkill, especially since it's possible that one of our dependencies (e.g., flate2) might change and result in us producing an equivalent but different archive. Since reproducible builds use a consistent toolchain, that's not a problem here.

Fixes #8612

For each entry in the tar archive, we generate a new timestamp.
Normally cargo will be fast enough that we get a consistent timestamp,
but that need not be the case.  There's very little reason to produce
different timestamps for different files and it's slightly more
efficient not to need to make multiple queries, so let's instead
generate a single timestamp for all entries that we generate.

For projects supporting reproducible builds, it's possible to set the
timestamp used in artifacts by setting SOURCE_DATE_EPOCH to a decimal
Unix timestamp.  This is helpful because it allows users to produce the
exact same artifact, regardless of when the project was built, and it
also means that services which generate crates from source can generate
a consistent crate without having to store previously built artifacts.

For all these reasons, let's honor the SOURCE_DATE_EPOCH environment
variable if it's set and use the current timestamp if it's not.
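
The fallback described in this commit message can be sketched roughly as follows. This is an illustration only, not the code from the PR: the function names `resolve_timestamp` and `archive_timestamp` are hypothetical, and silently falling back to the current time when the variable holds something other than a decimal integer is an assumption made for this sketch.

```rust
use std::env;
use std::time::{SystemTime, UNIX_EPOCH};

/// Pick the timestamp for every archive entry: the parsed override when it
/// is a valid decimal number of seconds, otherwise `now`.
/// (Hypothetical helper; ignoring a malformed value is an assumption.)
fn resolve_timestamp(source_date_epoch: Option<&str>, now: u64) -> u64 {
    source_date_epoch
        .and_then(|raw| raw.trim().parse::<u64>().ok())
        .unwrap_or(now)
}

/// Query the environment and the system clock once, so that all entries
/// in the archive can share a single timestamp.
fn archive_timestamp() -> u64 {
    let now = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|elapsed| elapsed.as_secs())
        .unwrap_or(0);
    let from_env = env::var("SOURCE_DATE_EPOCH").ok();
    resolve_timestamp(from_env.as_deref(), now)
}
```

Calling `archive_timestamp()` once before writing any entries, and passing the result to each header, gives the "single timestamp for all entries" behavior described above.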
@rust-highfive

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @ehuss (or someone else) soon.

If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes.

Please see the contribution instructions for more information.

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Nov 15, 2020
@est31
Member

est31 commented Nov 16, 2020

especially since it's possible that one of our dependencies (e.g., flate2) might change and result in us producing an equivalent but different archive. Since reproducible builds use a consistent toolchain, that's not a problem here.

Btw it's not that hard to make flate2 consistent between different OSs. Just revert #6317.

Currently, when reading a file from disk, we include several pieces of
data from the on-disk file, including the user and group names and IDs,
the device major and minor, the mode, and the timestamp.  This means
that our archives differ between systems, sometimes in unhelpful ways.

In addition, most users probably did not intend to share information
about their user and group settings, operating system and disk type, and
umask.  While these aren't huge privacy leaks, cargo doesn't use them
when extracting archives, so there's no value to including them.

Since using consistent data means that our archives are reproducible and
don't leak user data, both of which are desirable features, let's
canonicalize the header to strip out identifying information.

We set the user and group information to 0 and root, since that's the
only user that's typically consistent among Unix systems.  Setting
these values doesn't create a security risk since tar can't change the
ownership of files when it's running as a normal unprivileged user.

Similarly, we set the device major and minor to 0.  There is no useful
value here that's portable across systems, and it does not affect
extraction in any way.

We also set the timestamp to the same one that we use for generated
files.  This is probably the biggest loss of relevant data, but
considering that cargo doesn't otherwise use it and honoring it makes
the archives unreproducible, we canonicalize it as well.

Finally, we canonicalize the mode of an item we're storing by looking at
the executable bit and using mode 755 if it's set and mode 644 if it's
not.  We already use 644 as the default for generated files, and this is
the same algorithm that Git uses to determine whether a file should be
considered executable.  The tests don't test this case because there's
no portable way to create executable files on Windows.
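
As a rough illustration of the canonicalization this commit message describes, the sketch below reduces on-disk metadata to a fixed set of values. The `CanonicalFields` struct and both functions are hypothetical names invented for this example, and treating "the executable bit" as the owner execute bit (`0o100`) is an assumption.

```rust
/// The header fields after canonicalization (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
struct CanonicalFields {
    uid: u64,
    gid: u64,
    username: &'static str,
    groupname: &'static str,
    device_major: u32,
    device_minor: u32,
    mtime: u64,
    mode: u32,
}

/// Collapse an on-disk mode to 755 or 644 depending on the executable bit.
/// Assumption: only the owner execute bit (0o100) is inspected.
fn canonical_mode(disk_mode: u32) -> u32 {
    if disk_mode & 0o100 != 0 {
        0o755
    } else {
        0o644
    }
}

/// Build the canonical fields for one entry. Everything except the mode and
/// the shared archive timestamp is a constant, so nothing about the packaging
/// user or system reaches the archive.
fn canonicalize(disk_mode: u32, archive_time: u64) -> CanonicalFields {
    CanonicalFields {
        uid: 0,
        gid: 0,
        username: "root",
        groupname: "root",
        device_major: 0,
        device_minor: 0,
        mtime: archive_time,
        mode: canonical_mode(disk_mode),
    }
}
```

Two files that differ only in umask-dependent bits, such as `0o100700` and `0o100755`, map to the same canonical fields.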
@bk2204
Contributor Author

bk2204 commented Nov 16, 2020

Btw it's not that hard to make flate2 consistent between different OSs. Just revert #6317.

Sure, if that's a thing we want to do, I can do that.

@alexcrichton
Member

Thanks for the PR! This is definitely something we'd like to enable!

I'm curious, though, about whether things could be fixed by just not setting the mtime on generated files? We already do that I think for normal files (skip setting the mtime and other metadata). Basically I'm curious if we could just avoid setting all these fields and get the same result.

@bk2204
Contributor Author

bk2204 commented Nov 17, 2020

I'm curious, though, about whether things could be fixed by just not setting the mtime on generated files? We already do that I think for normal files (skip setting the mtime and other metadata). Basically I'm curious if we could just avoid setting all these fields and get the same result.

No, I don't think that's going to be sufficient to enable reproducible builds. GNU tar archives store information about the user that's written into the archive. For example, on my system, we store information about my user ID in the archive, as well as the permissions of the files on my system. Two different users using the same toolchain should be able to produce identical archives, regardless of their user ID or umask settings.

@alexcrichton
Member

Right I agree none of that should be in there, but is it currently in there for tarballs produced? AFAIK nothing puts it in there by default with the tar crate and Cargo shouldn't go out of its way to do so.

@est31
Member

est31 commented Nov 17, 2020

The user/group names are set to be empty by default but the mode and the user id/group are included. The mode should be preserved imo, beyond just the executable bit. Git does it as well, and some build systems might require specific modes.

@est31
Member

est31 commented Nov 17, 2020

Maybe one should just call set_metadata_in_mode with HeaderMode::Deterministic as a param instead of hand rolling an implementation here? That would have a similar mode restriction code but idk.

@alexcrichton
Member

Ah I see that this is buried in set_metadata, but yeah in that case let's defer to the deterministic mode of the tar crate.

Currently, when reading a file from disk, we include several pieces of
data from the on-disk file, including the user and group names and IDs,
the device major and minor, the mode, and the timestamp.  This means
that our archives differ between systems, sometimes in unhelpful ways.

In addition, most users probably did not intend to share information
about their user and group settings, operating system and disk type, and
umask.  While these aren't huge privacy leaks, cargo doesn't use them
when extracting archives, so there's no value to including them.

Since using consistent data means that our archives are reproducible and
don't leak user data, both of which are desirable features, let's
canonicalize the header to strip out identifying information.

Omit the inclusion of the timestamp for generated files and tell the tar
crate to copy deterministic data.  That will omit all of the data we
don't care about and also canonicalize the mode properly.

Our tests don't check the specifics of certain fields because they
differ between the generated files and the files that are archived from
the disk format.  They are still canonicalized correctly for each type,
however.
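
To make the reproducibility property concrete, here is a standard-library-only sketch that writes a 512-byte ustar-style header from canonical values and ignores the user-specific metadata. It is not the tar crate's implementation and does not use its API: `DiskMeta`, `put_octal`, and `deterministic_header` are hypothetical names, and the fixed mtime of 0 is a placeholder assumption rather than the value any real tool uses.

```rust
/// Metadata as it might be read from disk (hypothetical type for this sketch).
/// Only `mode` is consulted; the other fields are deliberately ignored.
#[allow(dead_code)]
struct DiskMeta {
    uid: u32,
    gid: u32,
    mode: u32,
    mtime: u64,
}

/// Write `value` as zero-padded octal into `field`, ending with a NUL byte.
/// Panics if the value needs more digits than the field can hold.
fn put_octal(field: &mut [u8], value: u64) {
    let width = field.len() - 1;
    let text = format!("{:0width$o}", value, width = width);
    field[..width].copy_from_slice(text.as_bytes());
    field[width] = 0;
}

/// Build a ustar-style header for a regular file using canonical values.
fn deterministic_header(path: &str, size: u64, meta: &DiskMeta) -> [u8; 512] {
    assert!(path.len() <= 100, "this sketch does not handle long paths");
    let mut header = [0u8; 512];
    header[..path.len()].copy_from_slice(path.as_bytes());

    // Keep only whether the file is executable (owner bit, an assumption).
    let mode = if meta.mode & 0o100 != 0 { 0o755 } else { 0o644 };
    put_octal(&mut header[100..108], mode);
    put_octal(&mut header[108..116], 0); // uid
    put_octal(&mut header[116..124], 0); // gid
    put_octal(&mut header[124..136], size);
    put_octal(&mut header[136..148], 0); // mtime: placeholder constant
    header[156] = b'0'; // typeflag: regular file
    header[257..263].copy_from_slice(b"ustar\0");
    header[263..265].copy_from_slice(b"00");

    // The checksum is computed with its own field filled with spaces, then
    // stored as six octal digits followed by NUL and a space.
    header[148..156].fill(b' ');
    let sum: u32 = header.iter().map(|&byte| u32::from(byte)).sum();
    let text = format!("{:06o}\0 ", sum);
    header[148..156].copy_from_slice(text.as_bytes());
    header
}
```

Two users with different IDs, umasks, and file timestamps who package the same path and size get byte-identical headers from this function, which is the property the commit message is after.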
@bk2204
Contributor Author

bk2204 commented Nov 18, 2020

Okay, as requested, I've switched to use the tar crate's set_metadata_in_mode for canonicalization. Note that, also as requested, I've pushed up a new commit on top instead of squashing the commits, but before merge, they should be logically squashed into the final commit, since there's no longer any need to do specific prep work in separate commits. Let me know once you're happy and I'll do that.

@alexcrichton
Member

@bors: r+

Looks good to me, thanks!

@bors
Contributor

bors commented Nov 18, 2020

📌 Commit 449ead0 has been approved by alexcrichton

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Nov 18, 2020
@bors
Contributor

bors commented Nov 18, 2020

⌛ Testing commit 449ead0 with merge 8ff15f4aae822c13fcff31bdd7c58a0c09faacd7...

@bors
Contributor

bors commented Nov 18, 2020

💔 Test failed - checks-actions

@bors bors added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. labels Nov 18, 2020
@alexcrichton
Member

@bors: retry

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Nov 18, 2020
@bors
Contributor

bors commented Nov 18, 2020

⌛ Testing commit 449ead0 with merge 83c6b4c48e25549764939a352fd5127b4912d7e7...

@bors
Contributor

bors commented Nov 18, 2020

💔 Test failed - checks-actions

@bors bors added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. labels Nov 18, 2020
@alexcrichton
Member

@bors: retry

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Nov 18, 2020
@bors
Contributor

bors commented Nov 18, 2020

⌛ Testing commit 449ead0 with merge 668a6c6...

@bors
Contributor

bors commented Nov 18, 2020

☀️ Test successful - checks-actions
Approved by: alexcrichton
Pushing 668a6c6 to master...

@bors bors merged commit 668a6c6 into rust-lang:master Nov 18, 2020
bors added a commit to rust-lang-ci/rust that referenced this pull request Nov 24, 2020
Update cargo

10 commits in 2af662e22177a839763ac8fb70d245a680b15214..bfca1cd22bf514d5f2b6c1089b0ded0ba7dfaa6e
2020-11-12 19:04:56 +0000 to 2020-11-24 16:33:21 +0000
- Shrink the progress bar, to give more space after it. (rust-lang/cargo#8892)
- Add some comments to the toml code (rust-lang/cargo#8887)
- Start searching git config at new path (rust-lang/cargo#8886)
- Fix documentation for CARGO_PRIMARY_PACKAGE. (rust-lang/cargo#8891)
- Bump to 0.51.0, update changelog (rust-lang/cargo#8894)
- Publish target's "doc" setting when emitting metadata (rust-lang/cargo#8869)
- Relaxes expectation of `cargo test` tests to accept test execution time (rust-lang/cargo#8884)
- Finish implementation of `-Zextra-link-arg`. (rust-lang/cargo#8441)
- Reproducible crate builds (rust-lang/cargo#8864)
- Allow resolver="1" to explicitly use the old resolver behavior. (rust-lang/cargo#8857)
netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this pull request Feb 14, 2021
Pkgsrc changes:
 * Adjust patches, convert tabs to spaces so that tests pass.
 * Remove patches which are no longer needed (upstream changed)
 * Minor adjustments for SunOS, e.g. disable stack probes.
 * Adjust cargo checksum patching accordingly.
 * Remove commented-out use of PATCHELF on NetBSD, which doesn't work anyway...

Upstream changes:

Version 1.49.0 (2020-12-31)
============================

Language
-----------------------

- [Unions can now implement `Drop`, and you can now have a field in a union
  with `ManuallyDrop<T>`.][77547]
- [You can now cast uninhabited enums to integers.][76199]
- [You can now bind by reference and by move in patterns.][76119] This
  allows you to selectively borrow individual components of a type. E.g.
  ```rust
  #[derive(Debug)]
  struct Person {
      name: String,
      age: u8,
  }

  let person = Person {
      name: String::from("Alice"),
      age: 20,
  };

  // `name` is moved out of person, but `age` is referenced.
  let Person { name, ref age } = person;
  println!("{} {}", name, age);
  ```

Compiler
-----------------------

- [Added tier 1\* support for `aarch64-unknown-linux-gnu`.][78228]
- [Added tier 2 support for `aarch64-apple-darwin`.][75991]
- [Added tier 2 support for `aarch64-pc-windows-msvc`.][75914]
- [Added tier 3 support for `mipsel-unknown-none`.][78676]
- [Raised the minimum supported LLVM version to LLVM 9.][78848]
- [Output from threads spawned in tests is now captured.][78227]
- [Change os and vendor values to "none" and "unknown" for some targets][78951]

\* Refer to Rust's [platform support page][forge-platform-support] for more
information on Rust's tiered platform support.

Libraries
-----------------------

- [`RangeInclusive` now checks for exhaustion when calling `contains`
  and indexing.][78109]
- [`ToString::to_string` now no longer shrinks the internal buffer
  in the default implementation.][77997]
- [`ops::{Index, IndexMut}` are now implemented for fixed sized
  arrays of any length.][74989]

Stabilized APIs
---------------

- [`slice::select_nth_unstable`]
- [`slice::select_nth_unstable_by`]
- [`slice::select_nth_unstable_by_key`]

The following previously stable methods are now `const`.

- [`Poll::is_ready`]
- [`Poll::is_pending`]

Cargo
-----------------------
- [Building a crate with `cargo-package` should now be independently
  reproducible.][cargo/8864]
- [`cargo-tree` now marks proc-macro crates.][cargo/8765]
- [Added `CARGO_PRIMARY_PACKAGE` build-time environment
  variable.][cargo/8758] This variable will be set if the crate
  being built is one the user selected to build, either with `-p`
  or through defaults.
- [You can now use glob patterns when specifying packages &
  targets.][cargo/8752]


Compatibility Notes
-------------------
- [Demoted `i686-unknown-freebsd` from host tier 2 to target tier
  2 support.][78746]
- [Macros that end with a semi-colon are now treated as statements
  even if they expand to nothing.][78376]
- [Rustc will now check for the validity of some built-in attributes
  on enum variants.][77015] Previously such invalid or unused
  attributes could be ignored.
- Leading whitespace is stripped more uniformly in documentation
  comments, which may change behavior. You can read [this post about
  the changes][rustdoc-ws-post] for more details.
- [Trait bounds are no longer inferred for associated types.][79904]

Internal Only
-------------
These changes provide no direct user facing benefits, but represent
significant improvements to the internals and overall performance
of rustc and related tools.

- [rustc's internal crates are now compiled using the `initial-exec` Thread
  Local Storage model.][78201]
- [Calculate visibilities once in resolve.][78077]
- [Added `system` to the `llvm-libunwind` bootstrap config option.][77703]
- [Added `--color` for configuring terminal color support to bootstrap.][79004]


[75991]: rust-lang/rust#75991
[78951]: rust-lang/rust#78951
[78848]: rust-lang/rust#78848
[78746]: rust-lang/rust#78746
[78376]: rust-lang/rust#78376
[78228]: rust-lang/rust#78228
[78227]: rust-lang/rust#78227
[78201]: rust-lang/rust#78201
[78109]: rust-lang/rust#78109
[78077]: rust-lang/rust#78077
[77997]: rust-lang/rust#77997
[77703]: rust-lang/rust#77703
[77547]: rust-lang/rust#77547
[77015]: rust-lang/rust#77015
[76199]: rust-lang/rust#76199
[76119]: rust-lang/rust#76119
[75914]: rust-lang/rust#75914
[74989]: rust-lang/rust#74989
[79004]: rust-lang/rust#79004
[78676]: rust-lang/rust#78676
[79904]: rust-lang/rust#79904
[cargo/8864]: rust-lang/cargo#8864
[cargo/8765]: rust-lang/cargo#8765
[cargo/8758]: rust-lang/cargo#8758
[cargo/8752]: rust-lang/cargo#8752
[`slice::select_nth_unstable`]: https://doc.rust-lang.org/nightly/std/primitive.slice.html#method.select_nth_unstable
[`slice::select_nth_unstable_by`]: https://doc.rust-lang.org/nightly/std/primitive.slice.html#method.select_nth_unstable_by
[`slice::select_nth_unstable_by_key`]: https://doc.rust-lang.org/nightly/std/primitive.slice.html#method.select_nth_unstable_by_key
[`hint::spin_loop`]: https://doc.rust-lang.org/stable/std/hint/fn.spin_loop.html
[`Poll::is_ready`]: https://doc.rust-lang.org/stable/std/task/enum.Poll.html#method.is_ready
[`Poll::is_pending`]: https://doc.rust-lang.org/stable/std/task/enum.Poll.html#method.is_pending
[rustdoc-ws-post]: https://blog.guillaume-gomez.fr/articles/2020-11-11+New+doc+comment+handling+in+rustdoc
@ehuss ehuss added this to the 1.50.0 milestone Feb 6, 2022