Change generate-copyright to generate HTML, with cargo dependencies included #128353

jonathanpallant · 2024-07-29T16:22:46Z

x.py run generate-copyright now produces build/COPYRIGHT.html. This includes a new format for in-tree dependencies, and also adds out-of-tree cargo dependencies.

After consulting expert opinion, I have elected to include every top-level:

*NOTICE*
*AUTHOR*
*LICENSE*
*LICENCE*, and
*COPYRIGHT* file I can find - case-insensitive.

This is because the cargo package metadata's author field is not a list of copyright holders and does not meet the requirements of the Apache-2.0 license (which says you must include a NOTICE file with the binary if one was supplied by the author) nor the MIT license (which says you must include 'the above copyright notice').
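The file selection described above can be sketched as a case-insensitive substring match on top-level file names. This is a minimal sketch, not the tool's actual code: `INTERESTING`, `is_interesting_file` and `interesting_files` are hypothetical names chosen for illustration, and it assumes the caller has already listed only the top-level entries of a package.

```rust
/// Substrings that mark a top-level file as license-relevant.
/// The list mirrors the patterns named in the PR description.
const INTERESTING: [&str; 5] = ["NOTICE", "AUTHOR", "LICENSE", "LICENCE", "COPYRIGHT"];

/// Returns true if `file_name` contains any pattern, ignoring ASCII case.
/// Hypothetical helper for illustration, not the tool's real API.
pub fn is_interesting_file(file_name: &str) -> bool {
    let upper = file_name.to_ascii_uppercase();
    INTERESTING.iter().any(|pattern| upper.contains(pattern))
}

/// Filters a package's top-level file names down to the ones worth including.
pub fn interesting_files<'a>(top_level: &[&'a str]) -> Vec<&'a str> {
    top_level
        .iter()
        .copied()
        .filter(|name| is_interesting_file(name))
        .collect()
}
```

With this sketch, `LICENSE-APACHE`, `licence.txt` and `AUTHORS.md` would all be picked up, while `README.md` would not.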

I believe it would be appropriate to include this file with every Rust release, in order to do an even better job of appropriately recognising the efforts of the authors of the first-party and third-party libraries we are using here.

The output includes something like 524 copies of the Apache-2.0 text because they are not all identical. I think I count about 50 different variations by shasum - some differ in whitespace, while some have the boilerplate block at the bottom erroneously modified (don't modify the copy in the license, modify the copy you paste into your own source code!). Running gzip on the HTML file largely makes this problem go away, and the average browser is far happier with a ~6 MiB HTML file than the average Markdown viewer is with a ~6 MiB markdown file. But, if someone wants to, they could submit a follow-up which de-dups the license text files and adds back-links to earlier identical copies (for some value of 'identical copy').

$ xpy run generate-copyright
$ cd build
$ gzip -c COPYRIGHT.html > COPYRIGHT.gz
$ xz -c COPYRIGHT.html > COPYRIGHT.xz
$ ls -lh COPYRIGHT.*
-rw-r--r--  1 jonathan  staff   241K 29 Jul 17:19 COPYRIGHT.gz
-rw-r--r--@ 1 jonathan  staff   6.6M 29 Jul 11:30 COPYRIGHT.html
-rw-r--r--  1 jonathan  staff    59K 29 Jul 17:19 COPYRIGHT.xz

Here's an example COPYRIGHT.gz.
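The de-duplication follow-up suggested above could look roughly like the sketch below. It is only one possible reading of 'identical copy': texts are treated as equal once every run of whitespace is collapsed, so it would merge the whitespace-only variants but not the copies with a modified boilerplate block. The names `normalize` and `dedup_licenses` are hypothetical, and the sketch keys on the normalized text itself rather than on a checksum.

```rust
use std::collections::HashMap;

/// Collapses every run of whitespace to a single space so that copies which
/// differ only in spacing or line wrapping compare equal.
fn normalize(text: &str) -> String {
    text.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Groups package names by the normalized license text they ship.
/// Returns one entry per distinct text: (first package seen, later packages).
/// Input is a slice of (package name, license text) pairs.
pub fn dedup_licenses<'a>(licenses: &[(&'a str, &str)]) -> Vec<(&'a str, Vec<&'a str>)> {
    let mut index_by_text: HashMap<String, usize> = HashMap::new();
    let mut groups: Vec<(&'a str, Vec<&'a str>)> = Vec::new();
    for &(package, text) in licenses {
        let key = normalize(text);
        let existing = index_by_text.get(&key).copied();
        match existing {
            // Seen before: record a back-link to the first copy.
            Some(i) => groups[i].1.push(package),
            // New text: start a group with this package as the first copy.
            None => {
                index_by_text.insert(key, groups.len());
                groups.push((package, Vec::new()));
            }
        }
    }
    groups
}
```

Each returned group holds the first package that shipped a given text plus the packages whose copies could link back to it instead of repeating the text.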

rustbot · 2024-07-29T16:22:54Z

r? @Kobzol

rustbot has assigned @Kobzol.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

rustbot · 2024-07-29T16:22:56Z

These commits modify the Cargo.lock file. Unintentional changes to Cargo.lock can be introduced when switching branches and rebasing PRs.

If this was unintentional then you should revert the changes before this PR is merged.
Otherwise, you can ignore this comment.

Kobzol

Hi, thanks for the PR! Left some comments and nits.

From a high-level point of view, I have some doubts about the second commit. I don't much like the approach of building HTML manually out of strings. It seems quite easy to produce mangled/invalid HTML content or forget to escape something (especially since we're including data from Cargo manifests of external packages!).

Could we create something that is more structured and then generate an HTML page out of it automatically? We could either just generate the HTML from a Markdown file or generate a bunch of Rust structs and then use some simple Jinja/Rinja/Tera/Askama/whatever template to actually build the website, with proper escaping and less custom code. We already have a bunch of useful third-party crates in our dependency tree, so I don't think we need to write our own manual code for rendering an HTML page.

src/tools/generate-copyright/src/cargo_metadata.rs

src/tools/generate-copyright/src/main.rs

Kobzol · 2024-07-29T18:22:39Z

src/tools/generate-copyright/src/main.rs

+
+    let root_path = std::path::absolute(".")?;
+    let workspace_paths = [
+        Path::new("./Cargo.toml"),


I think that we have more workspaces. E.g. recently src/tools/rustbook was added as a separate workspace. We should probably somehow pass their list from bootstrap, to avoid forgetting about them? 🤔

I would love to see a central slice in bootstrap listing all workspaces :D

Hmm, I took a look and "all workspaces" is not as easy as it sounds 😆 We currently have the main workspace, bootstrap workspace, rustbook workspace, but then also a bunch of workspaces for individual tools, such as RA. I'm not sure if we can generalize that list for all use-cases. For example, this code probably doesn't want to include the workspace of bootstrap and RA (?).

I think this tool should include anything we 'ship' with each 'Rust Release', for some value of 'ship' and some value of 'Rust Release'.

We don't ship the bootstrap binary in the release, so we don't need to scan that. We do ship rustc.exe, so we do need to scan that. Other binaries and libraries will probably be somewhere between those two examples.

#128588 dedups the list of workspaces to vendor between the two locations where we run cargo vendor. Maybe generate-copyright should use the same list of workspaces. That is the set of workspaces that may actually be used by ./x.py dist and ./x.py test. Everything else would fail to build on offline systems. For example, the library/stdarch workspace is not meant to be used. Some of its members are used as dependencies of the standard library, but the rest aren't vendored and bootstrap doesn't expose a way to run the stdarch tests either. Tidy also currently has a list of workspaces for which it checks licenses. That one should probably also be deduped with cargo vendor and generate-copyright.

That sounds reasonable but I'm not sure I know enough about the internal org of this repo to do it. Also I'm on vacation now.

@pietroalbini ?

src/tools/generate-copyright/src/main.rs

jonathanpallant · 2024-07-30T12:57:37Z

I'll review what's in the tree already and see if I can pick something to do the HTML escaping for sure, and perhaps the rendering too.

Kobzol · 2024-07-30T13:42:57Z

We already have https://github.com/rinja-rs/rinja for template rendering or https://github.com/pulldown-cmark/pulldown-cmark for converting Markdown into HTML.

jonathanpallant · 2024-07-30T17:09:53Z

Markdown might be a little tricky because some of the license texts are almost, but not quite, valid Markdown. And I already basically have what I want in Rust structs - so I'll look at rinja. Thanks!

jonathanpallant · 2024-07-30T18:44:15Z

OK, I tried to use rinja but I couldn't make it render a recursive type. So I manually rendered the Node type, and used html_escape to escape the output.

Kobzol · 2024-07-31T07:39:36Z

This is much better, thanks! I tried to play with the recursive templates and it seems like it should be possible using this approach.

What about this? With this approach, we can use a template to render the Node, and at the same time any content of individual licenses or filenames etc. is automatically escaped by rinja.

jonathanpallant · 2024-07-31T18:27:35Z

OK, I made it work! No idea what I was doing wrong last time - the errors you get when a template fails to compile are inscrutable.
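For readers unfamiliar with the problem being discussed, here is a hand-rolled sketch of rendering a recursive tree to HTML with escaping applied to every name. It does not use rinja and is not the code from this PR: the `Node` enum below is a simplified, hypothetical stand-in for the tool's real type, and `escape_html` and `render` are illustrative names.

```rust
/// Hypothetical stand-in for the tool's tree type; the real `Node` differs.
pub enum Node {
    Directory { name: String, children: Vec<Node> },
    File { name: String },
}

/// Replaces the characters that are significant in HTML text and attributes.
pub fn escape_html(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for c in input.chars() {
        match c {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            '"' => out.push_str("&quot;"),
            '\'' => out.push_str("&#39;"),
            _ => out.push(c),
        }
    }
    out
}

/// Renders a node and all of its descendants as nested HTML list items.
pub fn render(node: &Node, out: &mut String) {
    match node {
        Node::File { name } => {
            out.push_str("<li>");
            out.push_str(&escape_html(name));
            out.push_str("</li>");
        }
        Node::Directory { name, children } => {
            out.push_str("<li>");
            out.push_str(&escape_html(name));
            out.push_str("<ul>");
            for child in children {
                // Recursion handles arbitrary nesting depth.
                render(child, out);
            }
            out.push_str("</ul></li>");
        }
    }
}
```

A template engine that escapes by default removes the need to remember the `escape_html` call at every site, which is the concern raised in the review.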

src/tools/generate-copyright/src/cargo_metadata.rs

src/tools/generate-copyright/Cargo.toml

bors · 2024-08-04T15:20:54Z

☔ The latest upstream changes (presumably #128634) made this pull request unmergeable. Please resolve the merge conflicts.

This tool now scans for cargo dependencies and includes any important looking license files. We do this because cargo package metadata is not sufficient - the Apache-2.0 license says you have to include any NOTICE file, for example. And authors != copyright holders (cargo has the former, we must include the latter).

This format works better with large amounts of structured data. We also mark which deps are in the stdlib

I can't find a way to derive rinja::Template for Node - I think because it is a recursive type. So I rendered it manually using html_escape.

jonathanpallant · 2024-08-06T11:04:23Z

I've rebased on main locally, and this tool is now broken:

$ xpy run generate-copyright
Building bootstrap
    Finished `dev` profile [unoptimized] target(s) in 0.18s
Building stage0 tool collect-license-metadata (aarch64-apple-darwin)
    Finished `release` profile [optimized + debuginfo] target(s) in 0.21s
gathering license information from REUSE
finished gathering the license information from REUSE in 33.71s
Building stage0 tool generate-copyright (aarch64-apple-darwin)
    Finished `release` profile [optimized + debuginfo] target(s) in 0.22s
Vendoring deps into /Users/jonathan/Documents/open-source/rust/build/vendor...
error: current package believes it's in a workspace when it's not:
current:   /Users/jonathan/Documents/open-source/rust/./library/std/Cargo.toml
workspace: /Users/jonathan/Documents/open-source/rust/Cargo.toml

this may be fixable by adding `library/std` to the `workspace.members` array of the manifest located at: /Users/jonathan/Documents/open-source/rust/Cargo.toml
Alternatively, to keep it out of the workspace, add the package to the `workspace.exclude` array, or add an empty `[workspace]` table to the package's manifest.
Error: Failed to complete cargo vendor

Command CARGO="/Users/jonathan/Documents/open-source/rust/build/aarch64-apple-darwin/stage0/bin/cargo" DEST="/Users/jonathan/Documents/open-source/rust/build/COPYRIGHT.html" DYLD_LIBRARY_PATH="/Users/jonathan/Documents/open-source/rust/build/aarch64-apple-darwin/stage0-bootstrap-tools/aarch64-apple-darwin/release/deps:/Users/jonathan/Documents/open-source/rust/build/aarch64-apple-darwin/stage0/lib" LICENSE_METADATA="/Users/jonathan/Documents/open-source/rust/build/license-metadata.json" OUT_DIR="/Users/jonathan/Documents/open-source/rust/build" RUSTC="/Users/jonathan/Documents/open-source/rust/build/aarch64-apple-darwin/stage0/bin/rustc" "/Users/jonathan/Documents/open-source/rust/build/aarch64-apple-darwin/stage0-tools-bin/generate-copyright" (failure_mode=Exit) did not execute successfully.
Expected success, got exit status: 1
Created at: src/core/build_steps/tool.rs:1113:23
Executed at: src/core/build_steps/run.rs:222:13

Build completed unsuccessfully in 0:00:39

I'm not sure what I've done wrong, and I'm ~~not~~ now out on vacation for three weeks.

jonathanpallant · 2024-08-06T11:14:20Z

Ah, @Veykril helped me fix it. Apparently libstd joined a workspace at ./library/Cargo.toml

Kobzol

Yes, this happened in #128534.

I wonder if this is expected? The root level files should also be under MIT/Apache-2.0.

src/tools/generate-copyright/src/cargo_metadata.rs

jonathanpallant · 2024-08-07T13:51:58Z

What does your license-metadata.json say?

Mine says:

        "license": {
          "copyright": [
            "The Rust Project Developers (see https://thanks.rust-lang.org)"
          ],
          "spdx": "Apache-2.0 OR MIT"
        },
        "name": ".",
        "type": "directory"

My HTML looks like this:

Kobzol · 2024-08-07T16:24:03Z

{
  "files": {
    "children": [
      {
        "children": [...],
        "license": {
          "copyright": [
            "NONE"
          ],
          "spdx": "NONE"
        },
        "name": ".",
        "type": "directory"
      }
    ],
    "type": "root"
  }
}

Interesting 🤔
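One way to spot this kind of discrepancy is to walk the metadata tree and list every entry whose SPDX expression is "NONE". The sketch below assumes the JSON has already been loaded into a simplified in-memory tree. `Entry` and `find_unlicensed` are hypothetical names, and the struct keeps only the fields visible in the excerpts above rather than the full layout of `license-metadata.json`.

```rust
/// Hypothetical in-memory model of one entry in `license-metadata.json`.
/// Field names follow the JSON excerpts in this thread; the layout is assumed.
pub struct Entry {
    pub name: String,
    pub spdx: String,
    pub children: Vec<Entry>,
}

/// Collects the path of every entry whose SPDX expression is "NONE".
pub fn find_unlicensed(entry: &Entry, prefix: &str, found: &mut Vec<String>) {
    // Build the path of this entry relative to the root of the walk.
    let path = if prefix.is_empty() {
        entry.name.clone()
    } else {
        format!("{prefix}/{}", entry.name)
    };
    if entry.spdx == "NONE" {
        found.push(path.clone());
    }
    // Descend into children, using this entry's path as their prefix.
    for child in &entry.children {
        find_unlicensed(child, &path, found);
    }
}
```

Calling `find_unlicensed(&root, "", &mut found)` on a tree like the second excerpt would report `.` itself, pointing at the root directory entry as the place to investigate.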

thejpster · 2024-08-07T17:55:24Z

Ok, not generate-copyright's problem then.

The license data generator does require reuse-tool 4.0, otherwise it'll ignore the new REUSE.toml file.

Kobzol · 2024-08-07T18:04:35Z

Hmm, I have reuse 4.0.3 (the Python tool). I'll try to debug what might be causing this.

Kobzol · 2024-08-07T18:18:26Z

I think it's some missing logic in collect-license-metadata that doesn't account for the fact that I have untracked files in the root. I suppose that this doesn't need to block this PR. Do you want to make some modifications here? Otherwise I think we can ship this.

thejpster · 2024-08-07T18:21:01Z

I have no further changes to generate-copyright.

Kobzol · 2024-08-07T18:21:24Z

Ok then, thank you!

@bors r+

bors · 2024-08-07T18:21:27Z

📌 Commit 99579f3 has been approved by Kobzol

It is now in the queue for this repository.

bjorn3 · 2024-08-07T18:32:31Z

Did you see #128353 (comment)?

Kobzol · 2024-08-07T19:04:43Z

Ah, right. Yeah, we now have a bunch of "workspace lists", deduping everything would be nice, although I'm not sure if we really use the same set of workspaces in all places. In any case, I don't think that this needs to block this PR. I'll try to take a look at it.

bjorn3 · 2024-08-07T19:54:02Z

Both tidy and generate-copyright should probably use the exact same set to ensure generate-copyright will never show a license forbidden by tidy. cargo vendor should be a (non-strict, aka both sides of the subset operator may be identical) subset of what tidy and generate-copyright use. If we vendor anything not known by generate-copyright, the latter would miss some licenses that would be included in the source tarball.
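The invariant described above (tidy and generate-copyright share one set, and the vendored set is a non-strict subset of it) could be checked with something like the following. This is a sketch only: `LICENSE_CHECKED_WORKSPACES` and `missing_from_license_set` are hypothetical names, and the manifest paths are illustrative examples taken from this thread, not the real list used by bootstrap.

```rust
use std::collections::BTreeSet;

/// Hypothetical shared list that both tidy and generate-copyright would read.
/// The entries are illustrative, not the real set of workspaces.
pub const LICENSE_CHECKED_WORKSPACES: &[&str] = &[
    "./Cargo.toml",
    "./library/Cargo.toml",
    "./src/tools/rustbook/Cargo.toml",
];

/// Returns vendored workspaces that the license tooling does not know about.
/// An empty result means the vendored set is a (non-strict) subset.
pub fn missing_from_license_set<'a>(
    vendored: &[&'a str],
    license_checked: &[&str],
) -> Vec<&'a str> {
    let known: BTreeSet<&str> = license_checked.iter().copied().collect();
    vendored
        .iter()
        .copied()
        .filter(|workspace| !known.contains(*workspace))
        .collect()
}
```

A non-empty result would mean some vendored workspace could ship in the source tarball without its licenses appearing in the generated file.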

…iaskrgr

Rollup of 8 pull requests

Successful merges:

- rust-lang#128221 (Add implied target features to target_feature attribute)
- rust-lang#128261 (impl `Default` for collection iterators that don't already have it)
- rust-lang#128353 (Change generate-copyright to generate HTML, with cargo dependencies included)
- rust-lang#128679 (codegen: better centralize function declaration attribute computation)
- rust-lang#128732 (make `import.vis` is immutable)
- rust-lang#128755 (Integrate crlf directly into related test file instead via of .gitattributes)
- rust-lang#128772 (rustc_codegen_ssa: Set architecture for object crate for 32-bit SPARC)
- rust-lang#128782 (unused_parens: do not lint against parens around &raw)

r? `@ghost`
`@rustbot` modify labels: rollup

Rollup merge of rust-lang#128353 - ferrocene:jonathanpallant/add-dependencies-to-copyright-file, r=Kobzol

Change generate-copyright to generate HTML, with cargo dependencies included

`x.py run generate-copyright` now produces `build/COPYRIGHT.html`. This includes a new format for in-tree dependencies, and also adds out-of-tree cargo dependencies.

After consulting expert opinion, I have elected to include every top-level:

* `*NOTICE*`
* `*AUTHOR*`
* `*LICENSE*`
* `*LICENCE*`, and
* `*COPYRIGHT*` file I can find - case-insensitive.

This is because the cargo package metadata's `author` field is not a list of copyright holders and does not meet the requirements of the Apache-2.0 license (which says you must include a NOTICE file with the binary if one was supplied by the author) nor the MIT license (which says you must include 'the above copyright notice').

I believe it would be appropriate to include this file with every Rust release, in order to do an even better job of appropriately recognising the efforts of the authors of the first-party and third-party libraries we are using here.

The output includes something like 524 copies of the Apache-2.0 text because they are not all identical. I think I count about 50 different variations by shasum - some differ in whitespace, while some have the boilerplate block at the bottom erroneously modified (don't modify the copy in the license, modify the copy you paste into your own source code!). Running `gzip` on the HTML file largely makes this problem go away, and the average browser is far happier with a ~6 MiB HTML file than the average Markdown viewer is with a ~6 MiB markdown file. But, if someone wants to, they could submit a follow-up which de-dups the license text files and adds back-links to earlier identical copies (for some value of 'identical copy').

```console
$ xpy run generate-copyright
$ cd build
$ gzip -c COPYRIGHT.html > COPYRIGHT.gz
$ xz -c COPYRIGHT.html > COPYRIGHT.xz
$ ls -lh COPYRIGHT.*
-rw-r--r--  1 jonathan  staff   241K 29 Jul 17:19 COPYRIGHT.gz
-rw-r--r--@ 1 jonathan  staff   6.6M 29 Jul 11:30 COPYRIGHT.html
-rw-r--r--  1 jonathan  staff    59K 29 Jul 17:19 COPYRIGHT.xz
```

Here's an example [COPYRIGHT.gz](https://github.com/user-attachments/files/16416147/COPYRIGHT.gz).

Kobzol · 2025-02-21T16:24:08Z

Both tidy and generate-copyright should probably use the exact same set to ensure generate-copyright will never show a license forbidden by tidy. cargo vendor should be a (non-strict, aka both sides of the subset operator may be identical) subset of what tidy and generate-copyright use. If we vendor anything not known by generate-copyright, the latter would miss some licenses that would be included in the source tarball.

This was implemented in #137020.

rustbot assigned Kobzol Jul 29, 2024

rustbot added the S-waiting-on-review and T-bootstrap labels Jul 29, 2024

Kobzol reviewed Jul 29, 2024

thejpster reviewed Jul 31, 2024

src/tools/generate-copyright/src/cargo_metadata.rs

thejpster reviewed Jul 31, 2024

src/tools/generate-copyright/Cargo.toml

jonathanpallant added 9 commits August 6, 2024 11:04

generate-copyright: Produce HTML, not Markdown

204e3ea

This format works better with large amounts of structured data. We also mark which deps are in the stdlib

generate-copyright: Fix typo

56f8479

generate-copyright: use cargo-metadata

dbab595

generate-copyright: use rinja to format the output

f7e6bf6

I can't find a way to derive rinja::Template for Node - I think because it is a recursive type. So I rendered it manually using html_escape.

REUSE.toml: Copyright text isn't parsed as Markdown.

37ab090

generate-copyright: Render Node with rinja too.

30ac7c9

generate-copyright: gather files inside interesting folders

5277b67

Update to rinja 0.3

4e24e9b

jonathanpallant force-pushed the jonathanpallant/add-dependencies-to-copyright-file branch from 6382c17 to 4e24e9b Compare August 6, 2024 11:04

Apparently library/std is now part of a workspace at library/

99579f3

Kobzol reviewed Aug 6, 2024

src/tools/generate-copyright/src/cargo_metadata.rs

bors added the S-waiting-on-bors label and removed the S-waiting-on-review label Aug 7, 2024

matthiaskrgr mentioned this pull request Aug 7, 2024

Rollup of 8 pull requests #128796

Merged

bors merged commit e342295 into rust-lang:master Aug 7, 2024
6 checks passed

rustbot added this to the 1.82.0 milestone Aug 7, 2024

thejpster mentioned this pull request Aug 16, 2024

Add rp-rs copyright notice embassy-rs/embassy#3261

Merged

jonathanpallant mentioned this pull request Aug 28, 2024

Draft: Add dependencies to copyright file ferrocene/rust#3

Closed

Kobzol mentioned this pull request Nov 22, 2024

generate-copyright: Now generates a library file too. #133208

Merged

2 tasks

jyn514 mentioned this pull request Feb 13, 2025

GenerateCopyright attempts to vendor sources during installation #136955

Closed

jonathanpallant deleted the jonathanpallant/add-dependencies-to-copyright-file branch February 22, 2025 17:12
