Check copyright html #133341
We only run reuse once, so the output has to be filtered to find only the files that are relevant to the library tree. Outputs build/COPYRIGHT.html and build/COPYRIGHT-library.html.
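The filtering step described in this commit message could look roughly like the sketch below. It is not the tool's actual code: the function name `filter_to_subtree` and the idea that the output is available as a list of paths are assumptions made for illustration.

```rust
use std::path::{Path, PathBuf};

/// Keep only the paths that live under `root` (for example `library`).
/// `Path::starts_with` compares whole path components, so a sibling such as
/// `library-extras/x.rs` is not treated as being inside `library`.
/// (Hypothetical helper, not taken from the real generate-copyright tool.)
fn filter_to_subtree(paths: &[PathBuf], root: &Path) -> Vec<PathBuf> {
    paths
        .iter()
        .filter(|p| p.starts_with(root))
        .cloned()
        .collect()
}
```

Running the scan once and filtering afterwards keeps the expensive step shared between both output files.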
…correct. Run ./x run generate-copyright to rebuild them. Run ./x test generate-copyright to check them.
Oh, crap, 8 MiB is a lot, I didn't realize that. Not so much for git, as it's a text file, but this will be hard to open on GitHub, which was something that I really wanted from this. Gzipping also won't help, as that won't be openable on GitHub (nor locally, without running it through gzip first). And in general, an 8 MiB copyright file is probably a bit too much.
I think that we should do some work on reducing the size of the copyright file first. Some ideas:
- Is `In libstd` needed anymore, now that we also have `COPYRIGHT-library.html`? I think that we could skip this.
- We probably ship about 500 copies of the MIT and Apache licenses in the HTML file for the out-of-tree dependencies, which is less than ideal. We should deduplicate them somehow. If I remove the license notices completely, the size is just ~670 KiB, so about a tenth of the original. What if we:
  - Completely removed the notices, and only left the license ID? It's SPDX compliant in most cases anyway.
  - Replaced notices with links (URLs) to the notice, pointing to the source repository of the crate (should work for crates that have their repo configured).
  - Replaced notices with links to our own `LICENSES` directory (but this won't work for all licenses, and the external licenses might not 1:1 match ours).
  - Deduplicated the notices. Keep a list of licenses at the bottom of the HTML page, deduplicate them, and add links to them. We might need to strip the "copyright header" from the licenses, if used, to achieve significant savings, but maybe even before that it could help a lot.
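As a rough illustration of the last idea, the sketch below stores each distinct notice text once and refers to it by index, which is the kind of index a link at the bottom of the page could point to. The `NoticeTable` type is hypothetical, and the sketch merges only byte-identical texts, so it makes no judgement about whether near-identical license copies may be treated as the same.

```rust
use std::collections::HashMap;

/// Stores each distinct notice text once and hands out an index for it.
/// (Hypothetical type, introduced only for this sketch.)
#[derive(Default)]
struct NoticeTable {
    texts: Vec<String>,
    index_by_text: HashMap<String, usize>,
}

impl NoticeTable {
    /// Returns the index of `text`, adding it to the table if it is unseen.
    /// Only byte-identical texts are merged; nothing is normalized.
    fn intern(&mut self, text: &str) -> usize {
        if let Some(&idx) = self.index_by_text.get(text) {
            return idx;
        }
        let idx = self.texts.len();
        self.texts.push(text.to_owned());
        self.index_by_text.insert(text.to_owned(), idx);
        idx
    }
}
```

Texts that differ in a copyright line or in whitespace get separate entries, so the savings depend on how many copies are truly identical.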
You can't dedup MIT because it contains the copyright notice. There are, I think, 50-odd versions of the Apache license contained within, with minor typographical variations. I am not a lawyer and I don't know if it's OK to replace the copy provided by a project with a different one. At what point are the differences significant? We asked for legal help and they said to just include each file as we found it.
Also, GitHub won't render HTML so the argument against zipping is moot.
True, I confused that with MD. Well, you can still download it with a single button click and open it in a browser easily, which is much simpler without GZIP. There's another argument against GZIPping though: git would no longer be able to store changes to the file efficiently. Instead, it would become a binary file. It's only 300 KiB after gzipping, but that's still a 300 KiB binary file that will change maybe once every few weeks.
My understanding was that the notice is useless and doesn't really mean anything anyway, as copyright is assigned through other means, but IANAL, of course. Anyway, the technical side of this looks good to me, but I feel like I lack the proper context to decide whether it's fine to commit an 8 MiB text file. I could try to interpret the original goals from MCP 519, but I'm not sure if that's useful. CC @pietroalbini Could you please comment on how you envisioned the usage of the generated COPYRIGHT files? Should it be machine-readable or human-readable? HTML or Markdown or JSON? Do you consider it OK if the file is several MiB in size?
Ah ha ha, I accepted your suggestion and it failed tidy.
Co-authored-by: Jakub Beránek <berykubik@gmail.com>
3899f7b to 7424afe
The MIT text literally says:
My naive reading is that including it is not optional because to not include it would be to not comply with the license. But let's wait for Pietro.
Also fixes a bug where we were checking the *wrong* copyright file...
The diff is failing because when you check the file in, Git replaces the |
We may also want to skip generating license-metadata.json if the file exists, and skip vendoring the dependencies if the folders exist, because otherwise it takes two minutes and downloads ~1.2 GB every single time you run it.
I don't think we should do JSON and vendor caching in this way. If you change something locally and need to regenerate the copyright (e.g. because CI tells you to run We should either perform no caching or have a mechanism for actually breaking the cache (but that would be very difficult here, I think). So I would suggest to just not do the caching. It should not be required to run the command often (unless you're actually developing the command, and in that case you can just comment out the JSON file generation and vendoring if you need a fast feedback loop). |
Yeah, I keep commenting it out, and sometimes accidentally commit the changes. How about a special environment variable that, if set, skips the downloads?
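For concreteness, such an opt-out might be read as in the sketch below. The variable name `GENERATE_COPYRIGHT_SKIP_DOWNLOADS` and both function names are invented for this example and do not exist in bootstrap; the parsing is kept separate from the environment lookup so it can be checked on its own.

```rust
use std::env;

/// Hypothetical variable name, chosen only for this sketch.
const SKIP_DOWNLOADS_VAR: &str = "GENERATE_COPYRIGHT_SKIP_DOWNLOADS";

/// Interprets the raw value: unset, empty, or "0" means "do not skip".
fn parse_skip_flag(value: Option<&str>) -> bool {
    match value {
        None => false,
        Some(v) => !v.is_empty() && v != "0",
    }
}

/// Reads the variable from the process environment.
/// A value that is not valid Unicode is treated as unset.
fn skip_downloads() -> bool {
    parse_skip_flag(env::var(SKIP_DOWNLOADS_VAR).ok().as_deref())
}
```

Defaulting to "do not skip" means a stale local cache can only be used by someone who asked for it explicitly.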
I think I assumed this would only run from the root of the tree, but the |
We already have tens of similar one-off environment variables in bootstrap, and I'm not a fan of adding yet another one for what I see to be a niche use-case (once we merge this PR and maybe a few additional ones, it probably won't be something that people work on that often). It would also have to be threaded through two separate tools. But if it makes your life easier and you think it's worth it, then add it :) |
In general, I don't think we have to commit the HTML files to the repository. We have to provide the copyright notices for all of our dependencies when we redistribute them, either in source or binary form. We distribute them in dist tarballs, and so the copyright file should be included there. We are not distributing them in our git repository though, so there is no need to include the HTML files in-tree. That is the reason why in my original PR I added a split between generating the JSON and then rendering the HTML output: we can commit the JSON (removing the need to have REUSE installed locally), and generate the HTML on the fly only when it's actually needed. There is no need otherwise to have the split between the JSON step and the HTML step. What I suggest we do is:
Assuming what I wrote above holds (we commit the JSON file and generate the HTML on the fly), JSON caching is moot as Regarding vendor caching (either with the current approach, or the "commit JSON" approach) a trivial way to bust the cache is to clear it when any Cargo lockfile changes.
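One way to picture the lockfile-based cache busting is the sketch below: fingerprint the contents of every Cargo lockfile and treat the vendor cache as stale whenever the fingerprint differs from the recorded one. All names here are hypothetical, and how the lockfiles are located and where the stamp is stored is left out.

```rust
/// 64-bit FNV-1a, written out by hand so the value stays stable across
/// compiler releases (std's `DefaultHasher` makes no such promise).
fn fnv1a(hash: u64, bytes: &[u8]) -> u64 {
    bytes
        .iter()
        .fold(hash, |h, &b| (h ^ u64::from(b)).wrapping_mul(0x0000_0100_0000_01b3))
}

/// Fingerprint of all lockfile contents, in the order given.
/// (Hypothetical helper, introduced only for this sketch.)
fn lockfiles_fingerprint(lockfiles: &[&[u8]]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325; // FNV-1a offset basis
    for contents in lockfiles {
        // Mix in the length so that moving bytes between files changes the result.
        hash = fnv1a(hash, &(contents.len() as u64).to_le_bytes());
        hash = fnv1a(hash, contents);
    }
    hash
}

/// The cache is stale when no stamp was recorded or the stamp differs.
fn cache_is_stale(recorded: Option<u64>, current: u64) -> bool {
    recorded != Some(current)
}
```

A caller would clear the vendor directory and rewrite the stamp whenever `cache_is_stale` returns true.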
We received this recommendation from the Foundation's legal counsel, so I'd say we should follow it.
Thank you for the context! Now it makes much more sense to me why there was the JSON vs HTML split. I agree that having the HTML hosted in our docs is essentially the best of both worlds, giving easy access to it through GitHub, while avoiding committing a huge HTML file to git. What I don't understand yet is the stance regarding in-tree vs out-of-tree dependencies, because the JSON file does not contain the latter, as far as I'm aware.
In-tree vs out-of-tree is unrelated to the JSON file. The JSON discussion is just about whether to require people to invoke REUSE. What I meant with the in-tree distinction is that in the git repository we only need to document the licensing of the source code contained in the git repository, not of any crates we depend on, and the majority of the file size for those reports is about crates we ship. That information is already scattered around the repository in either file comments or I think we should still provide a |
☔ The latest upstream changes (presumably #133236) made this pull request unmergeable. Please resolve the merge conflicts.
OK, let me try again.
Builds upon #133208, but now:

- `COPYRIGHT.html` and `COPYRIGHT-library.html` are committed
- `./x run generate-copyright` will replace them
- `./x test generate-copyright` will check they are correct, without changing them
- `mingw-check` will run `./x test generate-copyright`

`COPYRIGHT.html` is 7,531,570 bytes long, so you might prefer that we gzip them or something.

In a later PR, we should probably change `dist` to include these files with every release.

r? kobzol
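The split between replacing the committed files and merely checking them can be sketched as follows. This is an illustration under assumptions, not the code in this PR: the `Mode` enum and `sync_output` function are hypothetical names, and the real tool may compare the files differently.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// What to do with freshly generated output.
/// (Hypothetical type, introduced only for this sketch.)
enum Mode {
    /// Overwrite the committed file (the `run` behaviour described above).
    Write,
    /// Compare only and never touch the file (the `test` behaviour).
    Check,
}

/// Returns Ok(true) when the file on disk matches `generated` afterwards.
fn sync_output(path: &Path, generated: &str, mode: Mode) -> io::Result<bool> {
    match mode {
        Mode::Write => {
            fs::write(path, generated)?;
            Ok(true)
        }
        Mode::Check => match fs::read_to_string(path) {
            Ok(existing) => Ok(existing == generated),
            // A missing file counts as "out of date" rather than as an error.
            Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(false),
            Err(e) => Err(e),
        },
    }
}
```

In check mode a `false` result would be turned into a failing test with a hint to run the generate command, which is what lets CI catch stale files without modifying the tree.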