high relevance for some very low character count rules find lots of GPL false positives #2484
Add "only_known_words: yes" flag to these short GPL rules that are otherwise too often spurious false positive detections. Reported-by: Pierre Tardy <pierre.tardy@renault.com> Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Reported-by: Pierre Tardy <pierre.tardy@renault.com> Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
In refine_matches() call filter_if_only_known_words_rule() later in the process to ensure that small contained rules are not left at the end. Also format code Reported-by: Pierre Tardy <pierre.tardy@renault.com> Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
We need this such that we later treat stopwords as if they are unknown words. Reported-by: Pierre Tardy <pierre.tardy@renault.com> Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Calling tokens_by_line() with location and query_string arguments makes the code clearer and easier to read. Also apply minor formatting. Reported-by: Pierre Tardy <pierre.tardy@renault.com> Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Run tokens_by_line() first without a stopwords argument. This allows stopword tokens to be included in the token stream and later to be treated as "unknown" tokens. This way the presence of stopwords in a match can impact a license match score, and a license rule with `only_known_words: yes` will not be matched if there are unknown words or if there are stopwords mixed in with its rule words. Also add a simple end-to-end integration test for this. Reported-by: Pierre Tardy <pierre.tardy@renault.com> Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
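The commits above describe treating stopwords as if they were unknown tokens, so that a rule flagged `only_known_words: yes` only matches text made entirely of its own known rule words. A minimal sketch of that filtering idea (all names here are hypothetical; this is not ScanCode's actual API):

```python
# Sketch of the "only known words" match filter described above.
# Tokens are strings; None stands for an unknown token, and
# STOPWORDS for tokens now treated as if they were unknown.
STOPWORDS = {"the", "a", "an"}

def is_clean_match(matched_tokens, rule_requires_known_words):
    """Return True if a match may be kept.

    If the rule is flagged only_known_words, reject the match when
    the matched region contains any unknown token (None) or any
    stopword: both indicate extra text mixed into the rule words.
    """
    if not rule_requires_known_words:
        return True
    for tok in matched_tokens:
        if tok is None or tok in STOPWORDS:
            return False
    return True

# A verbatim match made only of known tokens is kept:
print(is_clean_match(["gpl", "30"], True))         # True
# A stopword mixed in between the rule words rejects the match:
print(is_clean_match(["gpl", "the", "30"], True))  # False
```

This is why the stopwords argument is dropped from the first tokenization pass: the stopwords must survive into the token stream for the filter to see them.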
Yes, it makes sense to lower this, likely to something such as 50?
Not at the moment. But if you are deleting these there are other things to consider:
Here we have a bug when this is used, as the
and this file:
So let me push something so that we can get something working and cleaner for a start. Also, do you have some common patterns of text around these false matches of yours that you can share and that we could consider?
@tardyp See the recent commits in https://github.com/nexB/scancode-toolkit/compare/2021-04-license-updates tagged with #2484
Hi @pombredanne, 60 looks good. I get the idea that there is a need for a mode where scancode detects as much as possible. Not sure that we want to add a false positive detection for this file type; it is very specific to my company. I am not even sure what this formula is about, but I think it might be GPL as in Gaz de Pétrole Liquéfié, LPG in English. Thanks for the fix, I think they will clearly work for us.
I have always feared that we would someday get some false positive for GPL, either from LPG gas or from this https://gplinc.com ... the day has come!
Add "only_known_words: yes" flag to these short GPL rules that are otherwise too often spurious false positive detections. Reported-by: Pierre Tardy <pierre.tardy@renault.com> Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Reported-by: Pierre Tardy <pierre.tardy@renault.com> Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
In refine_matches() call filter_if_only_known_words_rule() later in the process to ensure that small contained rules are not left at the end. Also format code Reported-by: Pierre Tardy <pierre.tardy@renault.com> Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
We need this such that we later treat stopwords as if they are unknown words. Reported-by: Pierre Tardy <pierre.tardy@renault.com> Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Calling tokens_by_line() with location and query_string arguments makes the code clearer and easier to read. Also apply minor formattings Reported-by: Pierre Tardy <pierre.tardy@renault.com> Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Run tokens_by_line() first without a stopwords argument. This allows stopword tokens to be included in the token stream and later to be treated as "unknown" tokens. This way the presence of stopwords in a match can impact a license match score and a license rule with `with_only_known_words: yes` annot be matched not only if there are unknown words but also if there are stopwords mixed its rule words. Also add a simple end-to-end integration test for this. Reported-by: Pierre Tardy <pierre.tardy@renault.com> Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
@tardyp At this stage this has been merged in the develop branch. It took a bit longer as there was an unrelated bug discovered as a side effect in how and when false positive rules were filtered, in the impact of "stop words" on the overall tokenization and score of a matched license, and in how rules that should be matched verbatim were handled.
One of our configuration XML files contains snippets like these (lots of them; the number may vary):
scancode finds tons of GPL detections due to these rules:

- `gpl-3.0_126.RULE`: "gpl 30"
- `gpl-1.0_16.RULE`: "gpl 10"
- `gpl-2.0_693.RULE`: "gpl 20"
They all have a relevance of 100, which seems high to me.
Would it make sense to decrease the relevance?
Is there a nice way to blacklist a rule?
(I'll just rm them from the data directory for now...)
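To illustrate why such short rules fire so easily: a two-token rule like "gpl 30" matches any text that tokenizes to the same two tokens, so an unrelated value in a configuration file becomes an exact, full-coverage match. A toy tokenizer and matcher sketch (not ScanCode's implementation, and the XML value below is an invented stand-in since the original snippet is not shown here):

```python
import re

def tokenize(text):
    # Lowercase word/number tokens, roughly like a license tokenizer.
    return re.findall(r"[a-z0-9]+", text.lower())

# The three short rules from the report, as token sequences.
RULES = {
    "gpl-1.0_16.RULE": tokenize("gpl 10"),
    "gpl-2.0_693.RULE": tokenize("gpl 20"),
    "gpl-3.0_126.RULE": tokenize("gpl 30"),
}

def find_matches(text):
    # Slide each rule's token window over the text's tokens and
    # report every exact run: a 2-token rule needs only 2 tokens
    # in a row to match, hence the false positives.
    toks = tokenize(text)
    hits = []
    for name, rule_toks in RULES.items():
        n = len(rule_toks)
        for i in range(len(toks) - n + 1):
            if toks[i:i + n] == rule_toks:
                hits.append(name)
    return hits

# An unrelated config value is an exact match for the short rule:
print(find_matches('<param name="fuel" value="GPL 30"/>'))
# prints ['gpl-3.0_126.RULE']
```

This is also why the fix lowers the relevance and flags these rules `only_known_words: yes` rather than deleting them: the rules are still legitimate for real license notices, but a bare exact-token match in arbitrary surrounding text should no longer score as a confident detection.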