Reduce license detection false positives #3300

AyanSinhaMahapatra · 2023-03-26T06:31:56Z

This issue summarizes remaining work on scancode license detection false positives:

1. Adding required phrases to all scancode license detection rules: Add required phrase rules automatically #3254
2. Integrating the unknown license detection step (--unknown-licenses) more tightly with the current license detection post-processing, so that it is always used when we have no detections, or only sub-par detections made of multiple fragmented matches. This is complementary to step 1: required phrases restrict false positive detections, but they also increase false negatives by discarding approximate matches. Unknown license detection is needed to cover that gap, so that we don't lose one of the strengths of scancode license detection: strong approximate matching.

These are the two main solution elements, and both are either work in progress or already implemented. We still need to combine them more effectively and test the result more thoroughly, both to confirm that we are doing much better on false positives and to show that the stricter restrictions on detection rules do not cause us to miss any license-related text.
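As a rough illustration of how the two elements could fit together in post-processing, here is a minimal sketch. It does not use scancode's actual API: the `Match` class, the `min_coverage` threshold, and the `detect_unknown` callable are all hypothetical stand-ins chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class Match:
    """Hypothetical stand-in for a license match; not scancode's own class."""
    license_expression: str
    matched_text: str
    coverage: float  # share of the rule text covered by the match, 0-100
    required_phrases: tuple = ()


def has_required_phrases(match):
    """Keep a match only if every required phrase occurs in the matched text."""
    text = match.matched_text.lower()
    return all(phrase.lower() in text for phrase in match.required_phrases)


def needs_unknown_fallback(matches, min_coverage=80.0):
    """Trigger the fallback on no detections or on several low-coverage fragments."""
    if not matches:
        return True
    fragmented = len(matches) > 1
    return fragmented and all(m.coverage < min_coverage for m in matches)


def post_process(matches, detect_unknown):
    """Filter by required phrases, then fall back to unknown detection if needed.

    detect_unknown is an assumed callable returning a list of Match objects.
    """
    kept = [m for m in matches if has_required_phrases(m)]
    if needs_unknown_fallback(kept):
        return kept + detect_unknown()
    return kept
```

In this sketch, a fragment that lacks its required phrase is dropped first, and the unknown license step only runs when what remains is empty or fragmented, so a discarded approximate match can still surface as an unknown license instead of disappearing.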

Further follow up is required to test/validate that false positives were actually reduced:

To test, we will run license detection selectively on the existing scancode license rules: for each group of rules with a specific license_expression, we run detection on those rules with scancode configured to use a license index from which the rules for that license_expression are absent, while all other rules are indexed.
We also need to review all open issues related to license detection and make sure they are either fixed and closed or summarized into further action items.
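The held-out test procedure described above can be sketched as follows. This is only an outline under stated assumptions: rules are represented as plain dicts with illustrative `identifier`, `license_expression`, and `text` keys, and `detect` is a hypothetical callable standing in for a scancode run against a custom index. None of this is scancode's real interface.

```python
from collections import defaultdict


def group_by_expression(rules):
    """Group rule dicts by their license_expression."""
    groups = defaultdict(list)
    for rule in rules:
        groups[rule["license_expression"]].append(rule)
    return dict(groups)


def held_out_runs(rules):
    """Yield (expression, held_out_rules, index_rules) for each expression.

    index_rules contains every rule except those of the held-out expression.
    """
    for expression, held_out in group_by_expression(rules).items():
        index_rules = [r for r in rules if r["license_expression"] != expression]
        yield expression, held_out, index_rules


def evaluate(rules, detect):
    """Run detection on each held-out rule text against the reduced index.

    detect(text, index_rules) is an assumed callable that returns the list of
    license expressions detected in text.
    """
    report = {}
    for expression, held_out, index_rules in held_out_runs(rules):
        report[expression] = {
            rule["identifier"]: detect(rule["text"], index_rules)
            for rule in held_out
        }
    return report
```

Because the held-out expression is missing from the index, anything reported for a held-out rule text comes from another license's rules. Those entries are the candidates to review, either as false positives or as legitimately related licenses.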

See also the long-running RFC issue on this topic: #2878, which has two more action items:

This issue
Improve gibberish license/copyright detection #2402

AyanSinhaMahapatra added enhancement license scan major Priority: high labels Mar 26, 2023

AyanSinhaMahapatra self-assigned this Mar 26, 2023

AyanSinhaMahapatra added this to the v32.1 milestone Mar 26, 2023

AyanSinhaMahapatra mentioned this issue Mar 28, 2023

Hazelcast Community License wrongly identified as BSD-3-Clause #3293

Closed

AyanSinhaMahapatra mentioned this issue Apr 25, 2023

crash when using --unknown-licenses #3343

Closed

AyanSinhaMahapatra mentioned this issue Aug 27, 2023

MISDETECTION: AGPL detected when it isn't there #3498

Closed

AyanSinhaMahapatra mentioned this issue Sep 15, 2023

Wrong license detection in oauthlib #3512

Closed

AyanSinhaMahapatra modified the milestones: v32.2, v32.1 Jan 15, 2024

AyanSinhaMahapatra modified the milestones: v32.1, v32.2 Feb 19, 2024
