Spurious detection of CeCILL and GPL in French translation of GFDL #553
Comments
@richardfontana thanks!
License translations and non-English licenses are interesting cases: because they are reasonably rare at scale, they introduce a subtle bias that often leads to false positive detections (even though at very low scores) of legalese in that language. Also, because the words used in these licenses are much less frequent than the words used in English licenses, they tend to be given more prominence by the detection engine machinery. Here for instance we have two CeCILL rules that are matched but with a very low "coverage", i.e. very few words of the original rule text were matched:
and:
These are very few words BUT interestingly enough there are two cases:
So a resolution is going to comprise all of these:
Note also that, as part of #139, we will add a language attribute so we know which language was matched for a license or license rule. This will help, for instance, with the addition of all the CC license translations as part of #514.
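To make the "coverage" notion above concrete, here is a minimal sketch, assuming coverage is the percentage of a rule's tokens found in the matched text. The function names `coverage` and `keep_match` and the token-position representation are hypothetical and chosen for illustration; they are not ScanCode's actual API.

```python
def coverage(rule_tokens, matched_positions):
    """Return the percentage (0-100) of rule tokens that were matched.

    rule_tokens: list of tokens from the rule text (hypothetical representation).
    matched_positions: positions in rule_tokens that were matched.
    """
    if not rule_tokens:
        return 0.0
    # Count each rule position at most once.
    return 100.0 * len(set(matched_positions)) / len(rule_tokens)


def keep_match(rule_tokens, matched_positions, minimum_coverage=10):
    """Keep a match only if its coverage reaches the rule's minimum coverage."""
    return coverage(rule_tokens, matched_positions) >= minimum_coverage


# A 50-token rule where only 3 tokens matched has 6% coverage and is dropped
# under an assumed minimum coverage of 10.
rule = ["tok%d" % i for i in range(50)]
low = keep_match(rule, [0, 1, 2])
ok = keep_match(rule, [0, 1, 2, 3, 4])
```

With a threshold like this, a match that only hits a handful of common words in a long rule is discarded instead of being reported at a very low score.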
BTW, the spurious GPL match is because of the FSF address:
So I am adding a new GFDL 1.1 detection rule with the plain text version of the docbook file. I am using @jgm's excellent pandoc for the conversion:
FWIW, these GNOME licenses are a treasure trove of quality markup: the docbook texts have hyperlinks to the essential conditions of the license. Quite a find, and great work that could be reused in many ways, e.g. to help with the legal review of a text.
* ensure that CeCILL rules for French texts have a minimum coverage set to 10. Cleanup and rename CeCILL rules along the way to include language in the rule names and attributes.
* refactor frequent_tokens.py to start having lists of frequent words per-language. Add a first list for a few common French words.
* add new GFDL 1.1 rule for its French translation and add test for the corresponding docbook file. Add new "negative" rules for some docbook tags.

Reported-by: Richard Fontana <rfontana@redhat.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
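As an illustration of the per-language frequent word lists mentioned in the commit message, here is a rough sketch. The `FREQUENT_TOKENS_BY_LANGUAGE` mapping, its word lists, and the `significant_tokens` helper are assumptions made for this example and do not reflect the actual contents of frequent_tokens.py.

```python
# Hypothetical per-language lists of very common words; the real lists in
# frequent_tokens.py may differ in both structure and content.
FREQUENT_TOKENS_BY_LANGUAGE = {
    "en": frozenset({"the", "of", "and", "to", "in"}),
    "fr": frozenset({"le", "la", "les", "de", "des", "du", "et"}),
}


def significant_tokens(tokens, language):
    """Return tokens that are not in the frequent word list for `language`.

    Unknown languages fall back to an empty list, so nothing is filtered.
    """
    frequent = FREQUENT_TOKENS_BY_LANGUAGE.get(language, frozenset())
    return [t for t in tokens if t.lower() not in frequent]


french = significant_tokens("la licence de documentation libre".split(), "fr")
```

The idea is that common French words such as "de" or "la" would no longer be treated as rare, high-value tokens just because most license texts are in English.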
@richardfontana The latest code in the |
@pombredanne I have been running ScanCode on my laptop.
Closing this as it was fixed.
In gnome-desktop 3.14.2 the file desktop-docs/fdl/fr/index.docbook contains what I believe is a French translation of the GNU FDL. ScanCode detects GPL and CeCILL in this text.
index.docbook-scancode.txt