Reduce license detection false positives #3300

Open
AyanSinhaMahapatra opened this issue Mar 26, 2023 · 0 comments

This issue summarizes remaining work on scancode license detection false positives:

  1. Adding required phrases in all scancode license detection rules: Add required phrase rules automatically #3254
  2. Integrating the unknown license detection step (--unknown-licenses) more tightly with the current license detection post-processing, so that it is always used when we have no detections or only sub-par detections with multiple fragmented matches. This is complementary to step 1: required phrases restrict false positive detections, but they also increase false negatives by discarding approximate matches. To cover those cases we need the unknown license detection, so that we don't lose one of the good things about scancode license detection: strong approximate matching.

These two solution elements are either WIP or already implemented. We still need to put them together more effectively and test the result better, both to make sure we are doing much better on false positives and to prove that we are not failing to detect any piece of license-related text because of the stricter restrictions put on the detection rules.
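
As a rough illustration of the fallback condition in step 2, here is a minimal Python sketch. It is not scancode code: the function name `needs_unknown_fallback`, the `coverage` key, and both thresholds are hypothetical and only show the shape of the decision (no detections, or several low-coverage fragments).

```python
def needs_unknown_fallback(matches, min_coverage=80.0, fragment_count=2):
    """Return True when unknown license detection should be tried.

    `matches` is a list of dicts with a hypothetical "coverage" key
    (percentage of the rule text that was matched). The thresholds are
    illustrative assumptions, not scancode defaults.
    """
    # Case 1: nothing was detected at all.
    if not matches:
        return True

    # Case 2: sub-par detection, i.e. several fragmented low-coverage matches.
    weak = [m for m in matches if m["coverage"] < min_coverage]
    return len(weak) >= fragment_count
```

With these assumed thresholds, a single full-coverage match would not trigger the fallback, while two partial matches would.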

Further follow up is required to test/validate that false positives were actually reduced:

  1. To test, we will run license detection selectively on the existing scancode license rules: for each license_expression, we run detection on the group of rules with that license_expression, with scancode configured to use a license index from which the rules of that particular license_expression are excluded while all other rules are still indexed.
  2. We also need to do a review pass on all the open license detection related issues and make sure they are fixed/closed or summarized into more action items.
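
The hold-out test in item 1 could be organized roughly as in the Python sketch below. Everything here is an assumption made for illustration: rules are plain dicts with `identifier`, `license_expression` and `text` keys, and `detect` is a hypothetical callable standing in for a detection run against an index built from the given rules. It does not use the actual scancode API.

```python
from collections import defaultdict


def group_by_expression(rules):
    """Group rule dicts by their license_expression."""
    groups = defaultdict(list)
    for rule in rules:
        groups[rule["license_expression"]].append(rule)
    return dict(groups)


def holdout_runs(rules):
    """Yield (expression, held_out_rules, index_rules) for each expression.

    index_rules contains every rule except those of the held-out expression.
    """
    for expression, held_out in group_by_expression(rules).items():
        index_rules = [
            r for r in rules if r["license_expression"] != expression
        ]
        yield expression, held_out, index_rules


def find_false_positive_candidates(rules, detect):
    """Collect held-out rule texts that are still detected as something else.

    `detect(text, index_rules)` is a hypothetical callable returning a list
    of detected license expressions. Any hit is only a candidate for review,
    since some overlaps between licenses are legitimate.
    """
    report = {}
    for expression, held_out, index_rules in holdout_runs(rules):
        for rule in held_out:
            detected = detect(rule["text"], index_rules)
            if detected:
                report.setdefault(expression, []).append(
                    (rule["identifier"], detected)
                )
    return report
```

The resulting report maps each held-out license_expression to the rules whose text was still matched by rules of other expressions, which gives a list to review by hand.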

See also the long-running RFC issue on this topic: #2878, which has two more action items:

  1. This issue
  2. Improve gibberish license/copyright detection #2402