meta issue: Improve Debian package reported license #103

Closed
pombredanne opened this issue Feb 15, 2021 · 8 comments

@pombredanne
Member

See these issues for details:

@pombredanne
Member Author

See also #128

@pombredanne
Member Author

We have many levels of problems:

1. finding the copyright file of a package.

There are cases where the copyright file of a package is a symlink to the copyright file of another package, and we fail to get the copyright file in these cases.
For instance, in debian-unstable-slim the directory /usr/share/doc/libstdc++6 is a symlink to the directory /usr/share/doc/gcc-10-base, so the copyright file is /usr/share/doc/gcc-10-base/copyright
Short of following this link, we cannot access the copyright file: the source package gcc-10 is not installed and does not have a copyright file, so we cannot use the heuristic of falling back to the source package copyright when we cannot find one for the binary.
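As a rough illustration of following such links, here is a minimal sketch. It assumes the image has been extracted to a local directory; the function names `follow_links` and `find_copyright_file` and the `max_hops` limit are hypothetical and are not taken from scancode.io. Absolute link targets are re-rooted under the extracted root so they are not resolved against the host filesystem.

```python
import os
from pathlib import Path


def follow_links(path, rootfs, max_hops=10):
    """
    Follow a chain of symlinks starting at ``path``, keeping every hop
    inside the extracted root filesystem ``rootfs``.
    """
    for _ in range(max_hops):
        if not path.is_symlink():
            return path
        target = os.readlink(path)
        if os.path.isabs(target):
            # Absolute targets are relative to the image root, not the host.
            path = rootfs / target.lstrip("/")
        else:
            path = path.parent / target
    return path


def find_copyright_file(rootfs, package_name):
    """
    Return the path to the copyright file of ``package_name`` under
    ``rootfs``, or None if it cannot be found.
    """
    rootfs = Path(rootfs)
    # The doc directory itself may be a symlink (e.g. libstdc++6).
    doc_dir = follow_links(rootfs / "usr/share/doc" / package_name, rootfs)
    # The copyright file may also be a symlink to another package's file.
    copyright_file = follow_links(doc_dir / "copyright", rootfs)
    return copyright_file if copyright_file.is_file() else None
```

With the layout described above, `find_copyright_file(rootfs, "libstdc++6")` would end up at the `gcc-10-base/copyright` file inside the extracted image.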

2. dealing with copyright formats

2.1 not machine-readable

We do not partition files that are not machine-readable, and this may impact license detection accuracy. There are several opportunities to improve this, for instance with a heuristic that would split text regions into paragraph-like chunks based on the presence of some typical statements or even license rules such as:

On Debian systems, the complete text of the GNU General Public
License version 2 can be found in /usr/share/common-licenses/GPL-2.

Also in some almost structured files, we could split on lines starting with "License:" or "Copyright:" or "Copyright notice:" such as:
https://metadata.ftp-master.debian.org/changelogs//main/u/unzip/unzip_6.0-23+deb10u2_copyright or https://metadata.ftp-master.debian.org/changelogs//main/e/e2fsprogs/
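A minimal sketch of the line-prefix splitting idea, assuming only the three prefixes named above. The pattern `CHUNK_START` and the function `split_into_chunks` are illustrative names, and a real heuristic would likely need more markers:

```python
import re

# Line prefixes that open a new chunk in almost-structured files.
# The list is taken from the comment above and is not exhaustive.
CHUNK_START = re.compile(
    r"^(License|Copyright notice|Copyright)\s*:", re.IGNORECASE
)


def split_into_chunks(text):
    """
    Split ``text`` into paragraph-like chunks, starting a new chunk at
    each line that begins with one of the known prefixes.
    """
    chunks = []
    current = []
    for line in text.splitlines():
        if CHUNK_START.match(line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    # Drop chunks that contained only blank lines.
    return [chunk for chunk in chunks if chunk]
```

Each chunk could then be sent to license detection on its own instead of scanning the file as one block.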

2.2 structured copyright files

When we detect licenses in structured copyright files, we do not correctly handle whether a license is a known common license or not.

Only known common license symbols, as used in the first line of a license declaration, have a meaning. Other symbols (even when they look like an SPDX license id such as BSD-2-Clause) should be interpreted first based on the detection of the license text in the license paragraph they point to.
This is not done yet and impacts the quality of detection on the declared licenses.

3. incorrect license simplification

We have incorrect license simplification that is applied on the detected license expressions. We should not apply simplification for now and rather fix it in the license_expression library. See aboutcode-org/license-expression#49

4. Inaccurate license detection proper

We have incorrect license detection on multiple levels:

4.1 Incorrect mapping of common debian licenses

We do not have a correct mapping for the known license symbols of common licenses when we are trying to detect a license as an expression. The set of these is limited to the ones found in /usr/share/common-licenses/. For instance:

Apache-2.0
Artistic
BSD
GFDL -> GFDL-1.3
GFDL-1.2
GFDL-1.3
GPL -> GPL-3
GPL-1
GPL-2
GPL-3
LGPL -> LGPL-3
LGPL-2
LGPL-2.1
LGPL-3

And also the symbols with a trailing +
(NB: Artistic would need to be detected to find what we map it to)
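To make the list above concrete, here is a small sketch that normalizes a declared symbol against it. The names `COMMON_LICENSES`, `UNVERSIONED_ALIASES` and `normalize_common_license` are hypothetical; the sketch only covers the names listed in this comment and does not map them to any detector's license keys.

```python
# Names listed above as present in /usr/share/common-licenses/.
COMMON_LICENSES = {
    "Apache-2.0", "Artistic", "BSD",
    "GFDL-1.2", "GFDL-1.3",
    "GPL-1", "GPL-2", "GPL-3",
    "LGPL-2", "LGPL-2.1", "LGPL-3",
}

# Unversioned names and the versioned name they point to in the list above.
UNVERSIONED_ALIASES = {
    "GFDL": "GFDL-1.3",
    "GPL": "GPL-3",
    "LGPL": "LGPL-3",
}


def normalize_common_license(symbol):
    """
    Return a (name, or_later) tuple for a known common license symbol,
    or None when the symbol is not a common license.
    """
    symbol = symbol.strip()
    # A trailing "+" marks the "or later" variant of the symbol.
    or_later = symbol.endswith("+")
    name = symbol[:-1] if or_later else symbol
    name = UNVERSIONED_ALIASES.get(name, name)
    if name not in COMMON_LICENSES:
        return None
    return name, or_later
```

A symbol that returns None here, such as BSD-2-Clause, would be resolved through the license paragraph it points to rather than through this table.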

4.2 we do not detect correctly some license expression syntax from the declared license

For instance, this weird "academic free license >= 2.1, modified bsd license" where using a mapping in debian_licenses.txt may be the only way out.

Though we may be able to apply heuristics where we could replace a comma by " AND " before parsing a license declaration line as an expression.
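A minimal sketch of that comma heuristic, using a hypothetical `commas_to_and` helper. It leaves a comma alone when it is directly followed by "and" or "or", since such a comma already sits next to an operator; handling that case and parsing the result as an expression are out of scope here.

```python
import re

# A comma that is NOT directly followed by "and" or "or".
BARE_COMMA = re.compile(r"\s*,(?!\s*(?:and|or)\b)\s*", re.IGNORECASE)


def commas_to_and(declared_license):
    """
    Replace bare commas in a declared license line by " AND " so the
    line has a better chance of parsing as a license expression.
    """
    # Trailing commas carry no meaning, so drop them first.
    cleaned = declared_license.strip().rstrip(",").rstrip()
    return BARE_COMMA.sub(" AND ", cleaned)
```

On the example above, this turns the declaration into "academic free license >= 2.1 AND modified bsd license", which still needs its symbols cleaned up or mapped before it can be parsed.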

Because of 4.2 and 4.3 we return way too many unknown licenses

4.3 we are missing license detection rules to detect accurately the licenses

This is a matter of adding new license rules.

5. diagnosing detection errors is hard

We cannot easily diagnose and fix license detection issues because the details of the detection are not returned. For instance we cannot easily use scancode-analyzer to help spot and fix issues.

pombredanne added a commit to aboutcode-org/scancode-toolkit that referenced this issue Apr 8, 2021
This is the set of files found in a recent debian-unstable-slim Docker
image. The expectations have been regenerated as-is but not yet
reviewed.

See also:
- aboutcode-org/scancode.io#128
- aboutcode-org/scancode.io#103

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
@pombredanne
Member Author

@AyanSinhaMahapatra
Member

AyanSinhaMahapatra commented Jun 1, 2021

From aboutcode-org/scancode-toolkit#2518, detailing the improvements made in each level of the problem.

On the specific issue reported in #128, we have the unstructured copyright file of gcc-10-base debian package.

The updated debian copyright system has a complete overhaul of the license detection and fixing of certain bugs which made possible the improvement here:

Before Changes vs After Changes

Now, there are still minor inaccuracies here, which are being fixed.

On the progress made in the specific levels of issues discussed in this comment above:

In 2. Dealing with copyright formats:

2.1 Not machine-readable copyright files:

Status: Some critical bugs were fixed, and the file is now sent as a whole directly into scancode license detection, which gives much better results. WIP: break the file into chunks of text using the common paragraph separators seen in debian copyright files, for better detection.

2.2 structured Copyright Files

Status: Mostly done. Now working on handling rare cases by running tests on a dataset of collected copyright file samples from Debian (320K from 2019-11) and Ubuntu (200K files from 2020-06).

In debian copyright files, there are license paragraphs that carry a license name after License: followed by the license text. Sometimes license texts also appear in the Files paragraphs, and there are also the common debian licenses.

These licenses are then referenced in Files and Header paragraphs in license-expression-like strings, which refer to the license texts by name. We now fully parse these names and resolve the references to the license texts (instead of relying on a hand-crafted mapping), and we even resolve unparsable expressions if they are also present as names of license texts.

Filters were also added when reporting license detections, to summarize detections based on the Primary License Paragraph and Debian paragraphs, and to return only unique license detections. The option to simplify will be added after aboutcode-org/license-expression#53 is merged and released.

This significantly improves license detection in structured copyright files.

In 3. incorrect license simplification

Status: This is fixed in license-expression, and the fix is in the process of being merged.

In 4. Inaccurate License Detection proper

4.1 Common Licenses present in /usr/share/common-licenses/

Status: These are now handled correctly.

4.2 we do not detect correctly some license expression syntax from the declared license

Status: We now can parse the debian license expressions correctly, with cleaning and some specialized parsing of commas, according to the debian guidelines.

Previously, debian_licenses.txt held a mapping from every debian license expression seen after License: to the corresponding license expression.

Now, instead of using a mapping, these are handled by cleaning up symbols that aren't supported by nexB/license-expression and then parsing the result as a proper license expression.

4.3 we are missing license detection rules to detect accurately the licenses

Status: WIP. This has been made possible by making the license detection diagnosable (see 5. Diagnosing License Detection Problems).

New rules are added for common license detections, more rules are being added based on the added debian test files.

Then even more rules can be added by running nexB/scancode-analyzer on more debian copyright files.

In 5. Diagnosing License Detection Problems:

Status: License detections are now fully diagnosable.

Previously, the license detection in a debian copyright file had as its output only a license-expression string carrying all the detections, and hence it was hard to diagnose license detection problems. Now the license and copyright detection function returns a DebianDetector object with a list of LicenseDetection objects, which hold the original LicenseMatch objects created by scancode license detection. This makes it possible to diagnose the root cause of license detection issues and also makes it possible to plug the results from license detections in debian copyright files directly into https://github.com/nexB/scancode-analyzer for unique issue detection.

In 1. Bug in symlinks

Status: This is yet to be fixed.

@tdruez
Contributor

tdruez commented Aug 26, 2021

@AyanSinhaMahapatra @pombredanne anything else coming on this one or are we ready to close?

@tdruez
Contributor

tdruez commented Nov 15, 2021

@AyanSinhaMahapatra @pombredanne gentle ping, what's the latest status on this one?

@AyanSinhaMahapatra
Member

The PRs were

There are two sub issues remaining,

  1. correctly handling symlinks, as debian copyright files are often symlinked, as elaborated in 1. above. This could be tracked separately; I can open an issue for that.
  2. simplifying license-expressions with AND, which is tracked separately here: AND statements not flattened in dedup() license-expression#67

And @pombredanne opened some more relatively minor ones, I have these on my to-do list:

  1. The order of license matches as found in a Debian copyright files does not match the file order scancode-toolkit#2646
  2. Debian copyright file matched text are returned lowercased incorrectly scancode-toolkit#2645
  3. Missing license in Debian copyright file scancode-toolkit#2644
  4. Debian copyright license matches lines are not correct scancode-toolkit#2643
  5. Processing debian copyright files seems slow scancode-toolkit#2642

As these are tracked separately, and the major issues tracked here for debian were resolved and detection improved significantly, this meta issue could be closed.

@pombredanne pombredanne removed this from the 2021-08 milestone Apr 15, 2022
@AyanSinhaMahapatra
Member

Closing this as mentioned above :

As these are tracked separately, and the major issues tracked here for debian were resolved and detection improved significantly, this meta issue could be closed.
