generate-copyright: Now generates a library file too. #133208

jonathanpallant · 2024-11-19T11:37:51Z

We only run reuse once, so the output has to be filtered to find only the files that are relevant to the library tree.

Outputs COPYRIGHT.html and COPYRIGHT-library.html.
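
A minimal sketch of the filtering idea described above, assuming the single `reuse` scan has already been flattened into a list of path/license records. The `FileLicense` type, the `filter_subtree` helper and the sample paths are all hypothetical and are not taken from the actual `generate-copyright` code, which works on a tree of metadata nodes.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical flattened record: one file plus its license expression.
#[derive(Debug, Clone, PartialEq)]
struct FileLicense {
    path: PathBuf,
    license: String,
}

/// Keeps only the entries that live under `subtree`.
/// `Path::starts_with` compares whole path components, so
/// `library-extra/x` is not treated as being inside `library`.
fn filter_subtree(all: &[FileLicense], subtree: &Path) -> Vec<FileLicense> {
    all.iter()
        .filter(|entry| entry.path.starts_with(subtree))
        .cloned()
        .collect()
}

/// Invented sample data standing in for one scan of the whole repository.
fn sample_scan() -> Vec<FileLicense> {
    let entry = |p: &str, l: &str| FileLicense {
        path: PathBuf::from(p),
        license: l.to_string(),
    };
    vec![
        entry("library/foo/src/lib.rs", "MIT OR Apache-2.0"),
        entry("library-extra/notes.txt", "CC0-1.0"),
        entry("compiler/bar/src/main.rs", "MIT OR Apache-2.0"),
        entry("library/baz/LICENSE", "MIT"),
    ]
}
```

Calling `filter_subtree(&sample_scan(), Path::new("library"))` keeps the two entries under `library/` and drops the rest, so the expensive scan runs once and each output file is rendered from its own filtered view.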

The license-metadata.json file is also now in the tree. We need a CI tool to check that it's correct.

r? kobzol

Remaining steps:

* [ ] Teach CI to double-check the license-metadata.json file is correct
* [ ] Add the COPYRIGHT.html and COPYRIGHT-library.html to the releases

Kobzol · 2024-11-19T12:26:24Z

It's a bit surprising to me that we commit the JSON file. My expectation was that the only reason we need to commit anything at all is to have a "pre-built" copyright file (in HTML format?) in the repo, which would be easily checkable by people without the need to run bootstrap. So in that case, I'd expect that on CI, we render the HTML file, and check if it is updated in git.

Does the JSON actually need to be in git?

jonathanpallant · 2024-11-19T13:37:58Z

> My expectation was that the only reason we need to commit anything at all is to have a "pre-built" copyright file (in HTML format?) in the repo, which would be easily checkable by people without the need to run bootstrap. So in that case, I'd expect that on CI, we render the HTML file, and check if it is updated in git.

As I understood it, building the COPYRIGHT-*.html files is trivial using bootstrap, because it's written in Rust. So we can let bootstrap build those.

But the license-metadata.json file requires reuse, which is a Python package, and we didn't want to force people to install it before they could run x.py dist or whatever. So that's the file we commit to git and check in CI.

Edit: Also, COPYRIGHT-*.html contains the results of scraping the cargo dependency tree for the workspace(s), so it would very quickly get out of sync with Cargo.lock, and that could be annoying. license-metadata.json only contains information that is already in git.

jonathanpallant · 2024-11-19T13:40:15Z

Ah ha ha ha:

The following files have no copyright and licensing information:
* ../license-metadata.json

Kobzol · 2024-11-19T19:08:51Z

> But the license-metadata.json file requires reuse, which is a Python package, and we didn't want to force people to install it before they could run x.py dist or whatever. So that's the file we commit to git and check in CI.

Bootstrap also requires Python (as of right now), plus a bunch of other dependencies, so that doesn't seem like a gigantic win to me, tbh.

> Edit: Also, COPYRIGHT-*.html contains the results of scraping the cargo dependency tree for the workspace(s), so it would very quickly get out of sync with Cargo.lock, and that could be annoying. license-metadata.json only contains information that is already in git.

It wouldn't ever get out of sync for the same reason the JSON won't get out of sync - we'd check in CI that it is in fact in sync :) Our Cargo.locks are also in git.

Maybe to take a step back - what does this PR try to achieve? My understanding was that we want to get rid of LICENSE-MIT, LICENSE-APACHE and the COPYRIGHT files (and maybe also a bunch of others), and automatically generate a copyright file instead. This is also what Pietro has originally described on Zulip:

> Hook collect-license-metadata and generate-copyright into CI, to ensure COPYRIGHT.md reflects the state of things.

So from this point of view, it makes more sense to me to commit the human-readable (and partially machine readable) COPYRIGHT.md/html file, rather than the JSON file, which is mostly a temporary build artifact anyway.

The question is: do we want to generate and commit copyright metadata that is readable by humans, or metadata that is readable by machines, or both? CC @pietroalbini

jonathanpallant · 2024-11-19T19:25:37Z

From the Zulip:

> Created the src/tools/collect-license-metadata tool, which takes the REUSE output and condenses it in a compact JSON file. By the time the effort is complete, the goal is to commit that JSON file in the repository, and enforce in CI that the file is up to date. That way, contributors don't need REUSE installed unless they change the licensing of the repo.
>
> Created the src/tools/generate-copyright tool, which takes the JSON generated by the previous tool and renders a COPYRIGHT.md file. An example of how that looks is in this gist.

The main goal of this PR is to generate two COPYRIGHT files, one for the standard library and one for the toolchain as a whole. Tomorrow I will back out the changes in this PR to where the JSON lives so we can have that discussion separately.

Kobzol · 2024-11-19T19:28:16Z

Sorry, I completely missed the sentence about committing the JSON file 🤦‍♂️ I'm still interested in discussing whether we really need to commit the JSON file, vs. just committing something human-readable.

jonathanpallant · 2024-11-20T09:03:41Z

So my view is you shouldn't commit anything you can reasonably regenerate in CI. So to me the choice is, should we commit nothing, or should we commit the JSON to avoid people needing to have reuse installed.

Kobzol · 2024-11-20T09:52:47Z

I never needed to handle JSON copyright data, but I find it very useful to have a human readable COPYRIGHT file in the root of a repo - it is a standard thing that many repositories have and people kind of expect it will be there, IMO. So it makes sense to me to automatically pregenerate this file.

jonathanpallant · 2024-11-20T10:09:07Z

The plan is to deploy it with every release, so it's in /usr/local/share/rust/whatever, or wherever your toolchain is installed to. Because it's the product of a relatively expensive compilation process, like rustc itself, or any of the other binaries we ship.

It's also only really useful to people who don't have the source code checked out. If you have the source code, you have much more fine-grained copyright information - in the source files themselves. This is just a summary of the source tree.

Kobzol · 2024-11-20T21:47:54Z

> It's also only really useful to people who don't have the source code checked out. If you have the source code, you have much more fine-grained copyright information - in the source files themselves. This is just a summary of the source tree.

Sure, but it's still common to have a COPYRIGHT file in the root of a repo anyway. People don't want to go through the source tree to figure out copyright, not to mention that most of our Rust files don't actually have any. And you have created the awesome HTML copyright page, so let's use it and commit it so that it is easily accessible!

Anyway, I didn't want to derail this PR, sorry. IMO, we should definitely remove the previous COPYRIGHT file(s) and replace them with automatically generated COPYRIGHT files, commit these, and ensure that they stay fresh in CI. As long as we do that, I don't mind if we also commit the JSON file, even though I find it a bit redundant.

jonathanpallant · 2024-11-21T10:03:12Z

I dropped the changes regarding committing the JSON file.

I can do another PR where I add ./x test tools/check-copyright which will run ./x run tools/generate-copyright and then do a diff to tell you if ./COPYRIGHT.html and ./COPYRIGHT-library.html are OK. But I need a bit of guidance on the difference between ./x run and ./x test.
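
A rough sketch of what the "regenerate, then diff" check proposed above might boil down to, assuming both the committed file and the freshly generated file have already been read into strings. The `Freshness` enum and `check_fresh` function are hypothetical names used for illustration only; they do not describe how `./x test` or any existing bootstrap step is implemented.

```rust
/// Result of comparing the committed file with a freshly generated one.
#[derive(Debug, PartialEq)]
enum Freshness {
    UpToDate,
    /// 1-based line number where the two files first disagree.
    Stale { first_diff_line: usize },
}

/// Compares the committed contents against the regenerated contents.
fn check_fresh(committed: &str, generated: &str) -> Freshness {
    if committed == generated {
        return Freshness::UpToDate;
    }
    let mut old_lines = committed.lines();
    let mut new_lines = generated.lines();
    let mut line = 1;
    loop {
        match (old_lines.next(), new_lines.next()) {
            // Lines agree so far: keep scanning.
            (Some(a), Some(b)) if a == b => line += 1,
            // The lines differ, or at least one file has ended.
            _ => return Freshness::Stale { first_diff_line: line },
        }
    }
}
```

A CI step built on this would fail whenever the result is `Stale` and print the offending line number, telling the contributor to rerun the generator and commit the result. Reading the two files (for example with `std::fs::read_to_string`) is left out to keep the sketch focused on the comparison.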

Kobzol

Thank you for reducing the scope of this PR. The code looks good. When I tried to generate the libstd copyright, I got these weird empty entries:

But I suppose it's still the same case as this.

Kobzol · 2024-11-22T07:36:41Z

src/tools/generate-copyright/src/main.rs

 /// Describes a tree of metadata for our filesystem tree
-#[derive(serde::Deserialize)]
+///
+/// Must match the JSON emitted by the `CollectLicenseMetadata` bootstrap tool.


I only noticed now that the definition of these (non-trivial) JSON structures is duplicated here and in collect-license-metadata. Eventually it would be nice to share these e.g. through the build_helper crate, but let's not complicate this PR with that.

src/tools/generate-copyright/src/cargo_metadata.rs

jonathanpallant · 2024-11-22T09:30:14Z

> But I suppose it's still the same case as this.

I'll poke at collect-license-metadata and see if we can make that better. But in general, reuse really doesn't like files without copyright information (which is kind of the point).

Kobzol · 2024-11-22T09:43:15Z

I haven't found an easy way to tell reuse to ignore untracked files, sadly. Anyway, that's not relevant to this PR, that's a separate issue.

jonathanpallant · 2024-11-22T10:15:28Z

If only CI runs the program, and CI doesn't have untracked files, it'll be moot.

Kobzol · 2024-11-22T10:48:07Z

Indeed. Thank you!

@bors r+

bors · 2024-11-22T10:48:10Z

📌 Commit 2932833 has been approved by Kobzol

It is now in the queue for this repository.

pietroalbini · 2024-11-22T18:42:39Z

> It's a bit surprising to me that we commit the JSON file. My expectation was that the only reason we need to commit anything at all is to have a "pre-built" copyright file (in HTML format?) in the repo, which would be easily checkable by people without the need to run bootstrap. So in that case, I'd expect that on CI, we render the HTML file, and check if it is updated in git.

Back when this effort started @Mark-Simulacrum stated that he didn't want to require anyone running ./x dist to have REUSE installed, which was fair. That of course required committing, and my thought was that it's better to commit the minimal set of information needed to regenerate the COPYRIGHT files (the JSON) rather than committing 8MB of HTML.

Also, the frequency of changes to the JSON and the HTML is vastly different. The JSON will change very rarely (only when we add files with different copyright to this repo), while the HTML will be regenerated fairly often (every time a dependency is bumped in a lockfile). Committing the HTML increases the annoyance for contributors (who will have to run an extra step) and requires everyone who updates dependencies to have REUSE installed.

Kobzol · 2024-11-22T18:51:17Z

> Also, the frequency of changes to the JSON and the HTML is vastly different. The JSON will change very rarely (only when we add files with different copyright to this repo), while the HTML will be regenerated fairly often (every time a dependency is bumped in a lockfile). Committing the HTML increases the annoyance for contributors (who will have to run an extra step) and requires everyone who updates dependencies to have REUSE installed.

Hmm, but the difference in that is not JSON vs HTML, but rather the fact that the JSON only contains licenses of in-tree source code, while the HTML right now contains also out-of-tree licenses, so strictly more data. We could also generate the HTML file with only in-tree data, which would make it much smaller. In other words, the JSON file is not the minimal set of information needed to generate COPYRIGHT, because it does not contain the out-of-tree licenses, right?

EDIT: Ah, I think I understand now. It is the "minimal set of information needed to generate COPYRIGHT, if you don't want to install REUSE". Well, I understand that, but it seems to me that this is quite a bit of complexity just so that you can avoid typing pip install reuse.

FWIW, REUSE is IMO quite easy to install. You already need to have Python installed to even run bootstrap, and some Python packages are needed to run extra tidy checks, which can also trip you up on CI after making certain changes.

pietroalbini · 2024-11-22T23:35:49Z

> FWIW, REUSE is IMO quite easy to install. You already need to have Python installed to even run bootstrap, and some Python packages are needed to run extra tidy checks, which can also trip you up on CI after making certain changes.

Sure, if we are ok with more people installing REUSE it can be less complexity, the choice was made since there was a strong preference not to have people install it. I guess the three options are:

1. Don't commit anything, and require everyone building docs or dist tarballs to have REUSE installed and to download all the vendored sources.
2. Only commit the JSON with the REUSE data, requiring only people changing the copyright status of the in-tree files to install REUSE, while still requiring everyone building docs or dist tarballs to download all the vendored sources. The file committed to the repo will be tiny and rarely updated.
3. Commit the resulting HTML, requiring everyone changing the lockfile to install REUSE and run a non-instantaneous extra command after the lockfile is updated. People building docs or dist tarballs won't have to do anything. The file committed to the repo will be large and semi-frequently updated.

In the end it's all tradeoffs 🙂

…mpiler-errors
Rollup of 8 pull requests

Successful merges:

- rust-lang#132090 (Stop being so bail-y in candidate assembly)
- rust-lang#132658 (Detect const in pattern with typo)
- rust-lang#132911 (Pretty print async fn sugar in opaques and trait bounds)
- rust-lang#133102 (aarch64 softfloat target: always pass floats in int registers)
- rust-lang#133159 (Don't allow `-Zunstable-options` to take a value )
- rust-lang#133208 (generate-copyright: Now generates a library file too.)
- rust-lang#133215 (Fix missing submodule in `./x vendor`)
- rust-lang#133264 (implement OsString::truncate)

r? `@ghost`
`@rustbot` modify labels: rollup

Rollup merge of rust-lang#133208 - ferrocene:split-copyright-html, r=Kobzol

generate-copyright: Now generates a library file too.

rustbot assigned Kobzol Nov 19, 2024

rustbot added the S-waiting-on-review and T-bootstrap labels Nov 19, 2024

generate-copyright: Now generates a library file too.

9dfc682

We only run reuse once, so the output has to be filtered to find only the files that are relevant to the library tree. Outputs build/COPYRIGHT.html and build/COPYRIGHT-library.html.

jonathanpallant force-pushed the split-copyright-html branch from c9292c3 to 9dfc682 Compare November 21, 2024 10:01

Kobzol reviewed Nov 22, 2024

generate-copyright: Fixup comment for get_metadata_and_notices.

2932833

bors added the S-waiting-on-bors label and removed the S-waiting-on-review label Nov 22, 2024

jieyouxu mentioned this pull request Nov 22, 2024

Rollup of 10 pull requests #133340

Closed

jonathanpallant mentioned this pull request Nov 22, 2024

Check copyright html #133341

Closed

compiler-errors mentioned this pull request Nov 23, 2024

Rollup of 8 pull requests #133360

Merged

bors merged commit ef433a3 into rust-lang:master Nov 23, 2024
6 checks passed

rustbot added this to the 1.85.0 milestone Nov 23, 2024
