Scancode reports on "originally licensed under X" as if it's a license notice #1794

Closed
MartinPetkov opened this issue Oct 25, 2019 · 2 comments

Comments

@MartinPetkov

Description

Not sure if this is a "bug" or if there's any good way to solve it short of embedding an NLP parser, but it's at least an edge case. In https://github.com/gettalong/kramdown/blob/179b81dcf057f8079fd9df5296ba858114d30f7a/README.md, there's this text:

kramdown was originally licensed under the GPL until the 1.0.0 release. However, due to the many requests it is now released under the MIT license and therefore can easily be used in commercial projects, too.

Try scanning it:

$ git clone https://github.com/gettalong/kramdown
$ cd kramdown && git checkout 179b81dcf057f8079fd9df5296ba858114d30f7a
$ scancode --verbose --license -n 12 --json-pp ./results.json --only-findings --info --strip-root --license-text ./README.md
$ cat ./results.json

You see this:

{
  "key": "gpl-1.0-plus",
  "score": 100,
  "name": "GNU General Public License 1.0 or later",
  "short_name": "GPL 1.0 or later",
  "category": "Copyleft",
  "is_exception": false,
  "owner": "Free Software Foundation (FSF)",
  "homepage_url": "http://www.gnu.org/licenses/old-licenses/gpl-1.0-standalone.html",
  "text_url": "http://www.gnu.org/licenses/old-licenses/gpl-1.0-standalone.html",
  "reference_url": "https://enterprise.dejacode.com/urn/urn:dje:license:gpl-1.0-plus",
  "spdx_license_key": "GPL-1.0-or-later",
  "spdx_url": "https://spdx.org/licenses/GPL-1.0-or-later",
  "start_line": 5,
  "end_line": 5,
  "matched_rule": {
    "identifier": "gpl_85.RULE",
    "license_expression": "gpl-1.0-plus",
    "licenses": [
      "gpl-1.0-plus"
    ],
    "is_license_text": false,
    "is_license_notice": true,
    "is_license_reference": false,
    "is_license_tag": false,
    "matcher": "2-aho",
    "rule_length": 4,
    "matched_length": 4,
    "match_coverage": 100,
    "rule_relevance": 100
  },
  "matched_text": "licensed under the GPL"
}

It (correctly imo) zeroes in on the "licensed under..." text, but it would be nice if it could take into account context and ignore this particular blurb. I understand if it's not feasible, but wanted to bring it up.
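To see at a glance which rule produced each match, the results file can be reduced to a few fields. This is a minimal sketch, not part of ScanCode: it assumes the JSON output has a top-level "files" list whose entries carry a "path" and a "licenses" list shaped like the match shown above. The helper names license_matches and load_scan are hypothetical.

```python
import json


def license_matches(scan):
    """Yield (path, license key, rule identifier, matched text) per match.

    Assumes scan["files"][i]["licenses"] holds match objects like the one
    shown above; missing fields come back as None.
    """
    for file_entry in scan.get("files", []):
        for match in file_entry.get("licenses", []):
            rule = match.get("matched_rule", {})
            yield (
                file_entry.get("path"),
                match.get("key"),
                rule.get("identifier"),
                match.get("matched_text"),
            )


def load_scan(path):
    """Load a ScanCode JSON results file from disk."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

With the scan from the steps above, something like `for row in license_matches(load_scan("./results.json")): print(*row, sep=" | ")` would list each match next to the rule that triggered it.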

System configuration

For bug reports, it really helps us to know:

  • What OS are you running on? (Windows/MacOS/Linux)
    Linux
  • What version of scancode-toolkit was used to generate the scan file?
    ScanCode version 3.0.2.post1114.8b6916601
  • What installation method was used to install/run scancode? (pip/source download/other)
    pip
@pombredanne
Member

Hi Martin!

Thank you for this report.

Comments about licenses are always the tougher ones (for the record, there are also two MIT references in the same file, both detected all right)...

There are two ways to deal with this:

  1. Use a rule with an is_negative: yes flag, for instance with this text: was originally licensed under the GPL until the 1.0.0 release ... The exact text is then removed entirely from the scanned stream. This happens at the very start of license detection. It needs to be done with caution, but the sentence here is specific enough.

  2. The other way is a rule with an is_false_positive: yes flag. Such a rule is matched like any other rule (but only exactly); if its match is still present after matches are merged, it is filtered out of the results. It could have the same text as above. This happens at the very end of license detection.

Both ways work. Usually negative is better for short, non-license-related words, such as something that refers to https://www.gpl.com. False positive is better when license-related words may be about a license in a larger context but are not about a license in a more specific context.
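The difference in timing between the two flags can be illustrated with a toy model. This is only a sketch under loud assumptions: it uses plain substring search, not ScanCode's matchers or rule files, and every name in it (NEGATIVE_TEXTS, FALSE_POSITIVE_TEXTS, LICENSE_PHRASES and the functions) is hypothetical.

```python
# Toy stand-ins for rule data; none of this is ScanCode's real format.
NEGATIVE_TEXTS = ["was originally licensed under the GPL until the 1.0.0 release"]
FALSE_POSITIVE_TEXTS = ["was originally licensed under the GPL until the 1.0.0 release"]
LICENSE_PHRASES = {
    "licensed under the GPL": "gpl-1.0-plus",
    "released under the MIT license": "mit",
}


def find_spans(text, phrase):
    """Return (start, end) for every occurrence of phrase in text."""
    spans = []
    start = text.find(phrase)
    while start != -1:
        spans.append((start, start + len(phrase)))
        start = text.find(phrase, start + 1)
    return spans


def find_matches(text):
    """Return sorted (start, end, key) for every known license phrase."""
    matches = []
    for phrase, key in LICENSE_PHRASES.items():
        matches.extend((s, e, key) for s, e in find_spans(text, phrase))
    return sorted(matches)


def detect_with_negative(text):
    """Negative style: blank out the text BEFORE any matching happens."""
    for negative in NEGATIVE_TEXTS:
        text = text.replace(negative, " " * len(negative))
    return [key for _, _, key in find_matches(text)]


def detect_with_false_positive(text):
    """False-positive style: match first, then filter at the very END."""
    fp_spans = [s for fp in FALSE_POSITIVE_TEXTS for s in find_spans(text, fp)]
    return [
        key
        for start, end, key in find_matches(text)
        if not any(fs <= start and end <= fe for fs, fe in fp_spans)
    ]
```

In this toy model both functions drop the GPL match for the kramdown sentence and keep the MIT one. They differ only in when the sentence is taken out of play, which is the distinction described above.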

BTW, I see that you use ScanCode version 3.0.2.post1114.8b6916601, which is a rather old version... BUT a somewhat recent commit. You may want to pull the latest develop or tag and run ./configure --clean && ./configure ;)

MartinPetkov added a commit to MartinPetkov/scancode-toolkit that referenced this issue Oct 27, 2019
This rule is from a scan of https://github.com/gettalong/kramdown/.
Specifically, the README notes that it was originally GPL licensed. This blurb
should not actually be matched.
MartinPetkov added a commit to MartinPetkov/scancode-toolkit that referenced this issue Oct 27, 2019
This rule is from a scan of https://github.com/gettalong/kramdown/.
Specifically, the README notes that it was originally GPL licensed. This blurb
should not actually be matched.

Signed-off-by: Martin Petkov <mpetkov@google.com>
@MartinPetkov
Author

Thank you for the in-depth explanation! I've tried my hand at fixing it and opened #1797. When I run the reproducing steps above, I correctly no longer see gpl-1.0-plus in the results. Let me know if that's acceptable.

You're right about the old version; I hadn't run ./configure on my local fork of the code.

pombredanne added a commit that referenced this issue Oct 28, 2019
viragumathe5 pushed a commit to viragumathe5/scancode-toolkit that referenced this issue Mar 13, 2020
This rule is from a scan of https://github.com/gettalong/kramdown/.
Specifically, the README notes that it was originally GPL licensed. This blurb
should not actually be matched.

Signed-off-by: Martin Petkov <mpetkov@google.com>