Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check copyright html #133341

Closed

Conversation

jonathanpallant
Copy link
Contributor

Builds upon #133208, but now:

  • COPYRIGHT.html and `COPYRIGHT-library.html are committed
  • ./x run generate-copyright will replace them
  • ./x test generate-copyright will check they are correct, without changing them
  • mingw-check will run ./x test generate-copyright

COPYRIGHT.html is 7,531,570 bytes long, so you might prefer that we gzip them or something.

In a later PR, we should probably change dist to include these files with every release.

r? kobzol

We only run reuse once, so the output has to be filtered to find only the files that are relevant to the library tree.

Outputs build/COPYRIGHT.html and build/COPYRIGHT-library.html.
…correct.

Run ./x run generate-copyright to rebuild them.
Run ./x test generate-copyright to check them.
@rustbot rustbot added A-testsuite Area: The testsuite used to check the correctness of rustc S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-bootstrap Relevant to the bootstrap subteam: Rust's build system (x.py and src/bootstrap) T-infra Relevant to the infrastructure team, which will review and decide on the PR/issue. labels Nov 22, 2024
Copy link
Contributor

@Kobzol Kobzol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, crap, 8 MiB is a lot, I didn't realize that. Not so much for git, as it's a text file, but this will be hard to open on GitHub, which was something that I really wanted from this. Gzipping also won't help, as that won't be openable on GitHub (neither locally, without running it through gzip first). And in general, 8 MiB copyright file is probably a bit too much..

I think that we should do some work on reducing the size of the copyright file first. Some ideas:

  • Is In libstd needed anymore, now that we have also COPYRIGHT-library.html? I think that we could skip this.
  • We probably ship about 500 copies of the MIT and Apache licenses in the HTML file for the out-of-tree dependencies, which is less than ideal. We should deduplicate them somehow. If I remove the license notices completely, the size is just ~670 KiB, so about a tenth of the original. What if we:
    • Completely removed the notices, and only left the license ID? It's SPDX compliant in most cases anyway.
    • Replaced notices with links (URLs) to the notice, pointing to the source repository of the crate (should work for crates that have their repo configured).
    • Replaced notices with links to our own LICENSES directory (but this won't work for all licenses, and the external licenses might not 1:1 match ours).
    • Deduplicate the notices. Keep a list of licenses at the bottom of the HTML page, deduplicate them, and add links to them. We might need to strip the "copyright header" from the licenses, if used, to achieve significant savings, but maybe even before that it could help a lot.

@jonathanpallant
Copy link
Contributor Author

You can't dedup MIT because it contains the copyright notice.

There are I think 50-odd versions of the Apache license contained within, with minor typographical variations. I am not a lawyer and I don't know if it's ok to replace the copy provided by a project with a different one. At what point are the differences significant? We asked for legal help and they said to just include each file as we found it.

@jonathanpallant
Copy link
Contributor Author

Also, GitHub won't render HTML so the argument against zipping is moot.

@Kobzol
Copy link
Contributor

Kobzol commented Nov 22, 2024

Also, GitHub won't render HTML so the argument against zipping is moot.

True, I confused that with MD. Well, you can still download it with a single button click and open it in a browser easily, which is much simpler without GZIP. There's another argument against GZIPping though - it would no longer allow git to efficiently store changes to the file. Instead, it would become a binary file. It's only 300 KiB after gzipping, but that's still a 300 KiB binary file that will change maybe once a few weeks.

You can't dedup MIT because it contains the copyright notice.

My understanding was that the notice is useless and doesn't really mean anything anyway, as copyright is assigned through other means, but IANAL, of course.

Anyway, the technical side of this looks good to me, but I feel like I lack the proper context to decide whether it's fine to commit a 8 MiB text file. I could try to interpret the original goals from MCP 519, but I'm not sure if that's useful.

CC @pietroalbini Could you please comment on how did you envision the usage of the generated COPYRIGHT files? Should it be rather machine readable or human readable? HTML or Markdown or JSON? Do you consider it OK if the file has several MiBs?

@rust-log-analyzer

This comment has been minimized.

@jonathanpallant
Copy link
Contributor Author

Ah ha ha, I accepted your suggestion and it failed tidy.

Co-authored-by: Jakub Beránek <berykubik@gmail.com>
@jonathanpallant
Copy link
Contributor Author

My understanding was that the notice is useless and doesn't really mean anything anyway, as copyright is assigned through other means, but IANAL, of course.

The MIT text literally says:

The above copyright notice and this permission notice
shall be included in all copies or substantial portions
of the Software.

My naive reading is that including it is not optional because to not include it would be to not comply with the license.

But let's wait for Pietro.

Also fixes a bug where we were checking the *wrong* copyright file...
@jonathanpallant
Copy link
Contributor Author

The diff is failing because when you check the file in, Git replaces the \r\n newlines with \n and that changes the file. I'll need to strip those out for comparison.

@jonathanpallant
Copy link
Contributor Author

jonathanpallant commented Nov 22, 2024

We may also want to skip generating license-metadata.json if the file exists, and skip vendoring the dependencies if the folders exist, because otherwise it takes two minutes and downloads ~1.2 GB every single time you run it.

@Kobzol
Copy link
Contributor

Kobzol commented Nov 22, 2024

I don't think we should do JSON and vendor caching in this way. If you change something locally and need to regenerate the copyright (e.g. because CI tells you to run x run tools/generate-copyright), you will do that, and then.. it just silently does nothing if you already have the (stale) JSON file and vendor directory available. Forcing uses to remove files from the build directory for the command to work, without even giving them an indication they should do that, is not a good approach.

We should either perform no caching or have a mechanism for actually breaking the cache (but that would be very difficult here, I think). So I would suggest to just not do the caching. It should not be required to run the command often (unless you're actually developing the command, and in that case you can just comment out the JSON file generation and vendoring if you need a fast feedback loop).

@rust-log-analyzer
Copy link
Collaborator

The job mingw-check failed! Check out the build log: (web) (plain)

Click to see the possible cause of the failure (guessed by this bot)
#16 3.125 Building wheels for collected packages: reuse
#16 3.126   Building wheel for reuse (pyproject.toml): started
#16 3.375   Building wheel for reuse (pyproject.toml): finished with status 'done'
#16 3.376   Created wheel for reuse: filename=reuse-4.0.3-cp310-cp310-manylinux_2_35_x86_64.whl size=132720 sha256=026f3bb0f1aa8090b861fd0a0939cb1a782396d84c8aab7875096557d637a0f6
#16 3.377   Stored in directory: /tmp/pip-ephem-wheel-cache-tl8otjkp/wheels/3d/8d/0a/e0fc6aba4494b28a967ab5eaf951c121d9c677958714e34532
#16 3.379 Installing collected packages: boolean-py, binaryornot, tomlkit, reuse, python-debian, markupsafe, license-expression, jinja2, chardet, attrs
#16 3.782 Successfully installed attrs-23.2.0 binaryornot-0.4.4 boolean-py-4.0 chardet-5.2.0 jinja2-3.1.4 license-expression-30.3.0 markupsafe-2.1.5 python-debian-0.1.49 reuse-4.0.3 tomlkit-0.13.0
#16 3.783 WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
#16 DONE 3.9s
---
   Compiling rinja v0.3.5
   Compiling generate-copyright v0.1.0 (/checkout/src/tools/generate-copyright)
    Finished `release` profile [optimized] target(s) in 13.12s
##[endgroup]
Error: Failed get output from cargo-metadata: CargoMetadata { stderr: "error: manifest path `./Cargo.toml` does not exist\n" }
Caused by:
Caused by:
    `cargo metadata` exited with an error: error: manifest path `./Cargo.toml` does not exist
    
Command has failed. Rerun with -v to see more details.
  local time: Fri Nov 22 14:47:27 UTC 2024
  network time: Fri, 22 Nov 2024 14:47:27 GMT
##[error]Process completed with exit code 1.
Post job cleanup.

@jonathanpallant
Copy link
Contributor Author

Yeah I keep commenting it out, and sometimes accidentally commit the changes.

How about a special environment variable that if set, skips the downloads? RUST_SKIP_LICENSE_GENERATE or something?

@jonathanpallant
Copy link
Contributor Author

cargo metadata exited with an error: error: manifest path ./Cargo.toml does not exist

I think I assumed this would only run from the root of the tree, but the ../x test... line suggests we are not at the root. I guess I'll need to fix that.

@Kobzol
Copy link
Contributor

Kobzol commented Nov 22, 2024

Yeah I keep commenting it out, and sometimes accidentally commit the changes.

How about a special environment variable that if set, skips the downloads? RUST_SKIP_LICENSE_GENERATE or something?

We already have tens of similar one-off environment variables in bootstrap, and I'm not a fan of adding yet another one for what I see to be a niche use-case (once we merge this PR and maybe a few additional ones, it probably won't be something that people work on that often). It would also have to be threaded through two separate tools.

But if it makes your life easier and you think it's worth it, then add it :)

@pietroalbini
Copy link
Member

In general, I don't think we have to commit the HTML files to the repository.

We have to provide the copyright notices for all of our dependencies when we redistribute them, either in source or binary form. We distribute them in dist tarballs, and so the copyright file should be included there. We are not distributing them in our git repository though, so there is no need to include the HTML files in-tree.

That is the reason why in my original PR I added a split between generating the JSON and then rendering the HTML output: we can commit the JSON (removing the need to have REUSE installed locally), and generate the HTML on the fly only when it's actually needed. There is no need otherwise to have the split between the JSON step and the HTML step.

What I suggest we do is:

  • We include the HTML files in dist tarballs (both binary and source).
  • We include the HTML files in the documentation, so that they show up in doc.rust-lang.org
  • In the repository we state the license of the repository ("in general MIT OR Apache-2.0, look at reuse and the annotations on top of the files for more details"), and a link to the rendered version on doc.rust-lang.org (for nightly maybe?).

I don't think we should do JSON and vendor caching in this way. [...]

Assuming what I wrote above holds (we commit the JSON file and generate the HTML on the fly), JSON caching is moot as generate-copyright would always use the JSON file from disk, and collect-copyright-metadata would always invoke REUSE from scratch.

Regarding vendor caching (either with the current approach, or the "commit JSON" approach) a trivial way to bust the cache is to clear it when any Cargo lockfile changes.

We asked for legal help and they said to just include each file as we found it.

We received this recommendation from the Foundation's legal counsel, so I'd say we should follow it.

@Kobzol
Copy link
Contributor

Kobzol commented Nov 22, 2024

Thank you for the context! Now it makes much more sense to me, why there was the JSON vs HTML split. I agree that having the HTML hosted at our docs is essentially the best of both worlds, giving easy access to it through GitHub, while avoiding the committing of a huge HTML file to git.

What I don't understand yet is the stance regarding in-tree vs out-of-tree dependencies, because the JSON file does not contain the latter, as far as I'm aware.

@pietroalbini
Copy link
Member

What I don't understand yet is the stance regarding in-tree vs out-of-tree dependencies, because the JSON file does not contain the latter, as far as I'm aware.

In-tree vs out-of-tree is unrelated to the JSON file. The JSON discussion is just about whether to require people to invoke REUSE.

What I meant with the in-tree distinction is that in the git repository we only need to document the licensing of the source code contained in the git repository, not of any crates we depend on, and the majority of the file size for those reports is about crates we ship.

That information is already scattered around the repository in either file comments or REUSE.toml, and I don't think there is a need to duplicate it in a COPYRIGHT file committed in the git repository.

I think we should still provide a COPYRIGHT file, but it can be a manually written one saying that by default we are MIT OR Apache-2.0, where to find more granular information (REUSE.toml and individual files), and a link to the rendered version of the full copyright statement on doc.rust-lang.org.

@bors
Copy link
Collaborator

bors commented Nov 25, 2024

☔ The latest upstream changes (presumably #133236) made this pull request unmergeable. Please resolve the merge conflicts.

@jonathanpallant
Copy link
Contributor Author

OK, let me try again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-testsuite Area: The testsuite used to check the correctness of rustc S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-bootstrap Relevant to the bootstrap subteam: Rust's build system (x.py and src/bootstrap) T-infra Relevant to the infrastructure team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants