Spurious detection of CeCILL and GPL in French translation of GFDL #553

Closed
richardfontana opened this issue Mar 11, 2017 · 8 comments

Comments

@richardfontana

In gnome-desktop 3.14.2 the file desktop-docs/fdl/fr/index.docbook contains what I believe is a French translation of the GNU FDL. ScanCode detects GPL and CeCILL in this text.
index.docbook-scancode.txt

@pombredanne
Member

@richardfontana thanks!

@pombredanne
Member

License translations and non-English licenses are interesting cases: because they are rare at scale, they introduce a subtle bias that often leads to false-positive detections (albeit with very low scores) of legalese in that language. Also, because the words used in these licenses are much less frequent than the words used in English licenses, they tend to be given more prominence by the detection engine machinery.

Here, for instance, we have two CeCILL rules that are matched but with a very low "coverage", i.e. very few words of the original rule text were matched:

          "matched_rule": {
            "identifier": "cecill-2.0_2.RULE",
[...]
            "matched_length": 16,
            "match_coverage": 0.46,
            "rule_relevance": 100
          },

and:

          "matched_rule": {
            "identifier": "cecill-2.0_2.RULE",
[...]
            "matched_length": 5,
            "match_coverage": 0.14,

These are very few words BUT interestingly enough there are two cases:

  1. some matched words are very frequent words in French
  2. or the matched words are French legalese

So a resolution is going to comprise all of these:

  1. set a minimum number of words to be matched (e.g. a minimum coverage) for these CeCILL rules, say at least 10% of the words. And eventually set the same on the other non-English licenses.

  2. add a new rule specifically for this translation of the GFDL, either in the raw docbook variant or in the plain cleaned text variant (the latter is likely better). This will take care of getting a proper, high-coverage match for this docbook file and will handle the false-positive detection of the GPL too.

  3. add a new set of "frequent words" for French. Eventually we will track a list of these common words for each and every language and will split this list from the English frequent words. I will need to find per-language lists of these "stop-word"-like words later.
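
Items 1 and 3 above could be sketched roughly like this. It is only an illustration under assumed names: `Match`, `MIN_COVERAGE`, `FREQUENT_WORDS`, `is_mostly_frequent` and `filter_matches` are hypothetical and are not ScanCode's actual API. Coverage is taken to be a percentage, as in the JSON above, and the French word list is a tiny made-up sample.

```python
from dataclasses import dataclass

# Hypothetical per-rule minimum coverage, in percent (not ScanCode's real data model).
MIN_COVERAGE = {"cecill-2.0_2.RULE": 10.0}

# Hypothetical per-language lists of very frequent ("stop-word"-like) words.
FREQUENT_WORDS = {
    "en": {"the", "of", "and", "to", "in"},
    "fr": {"le", "la", "les", "de", "des", "du", "et", "à"},
}


@dataclass
class Match:
    rule_identifier: str
    match_coverage: float  # percentage of the rule's words that were matched
    matched_words: tuple
    language: str = "en"


def is_mostly_frequent(match, threshold=0.5):
    """Return True if more than `threshold` of the matched words are frequent words."""
    if not match.matched_words:
        return True
    frequent = FREQUENT_WORDS.get(match.language, set())
    hits = sum(1 for word in match.matched_words if word.lower() in frequent)
    return hits / len(match.matched_words) > threshold


def filter_matches(matches):
    """Drop matches below their rule's minimum coverage or made mostly of frequent words."""
    kept = []
    for match in matches:
        # Rules without a configured minimum default to 0, i.e. no coverage filtering.
        if match.match_coverage < MIN_COVERAGE.get(match.rule_identifier, 0.0):
            continue
        if is_mostly_frequent(match):
            continue
        kept.append(match)
    return kept
```

With these assumptions, a CeCILL match at 0.46 coverage is discarded by the first check, and a match made only of words such as "de", "la", "les" is discarded by the second, whatever its coverage.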

Note also that, as part of #139, we will add a language attribute so we can know which language was matched for a license or license rule. This will help, for instance, with the addition of all the CC license translations as part of #514.

@pombredanne
Member

BTW, the spurious GPL match is because of the FSF address:
"matched_text": "51 Franklin [Street], Fifth Floor</[street]>, \n <[city]>Boston"

@pombredanne
Member

So I am adding a new GFDL 1.1 detection rule with the plain text version of the docbook file. I am using @jgm's excellent pandoc for the conversion:
$ pandoc -f docbook -t plain -o index.txt index.docbook
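
If several docbook files need the same treatment, the command could be wrapped in a small script. This is a minimal sketch, assuming `pandoc` is installed and on the PATH; the helper names `pandoc_command` and `docbook_to_plain` are made up for illustration and are not part of ScanCode or pandoc.

```python
import subprocess


def pandoc_command(source, target):
    """Build the pandoc argument list for a docbook to plain text conversion."""
    return ["pandoc", "-f", "docbook", "-t", "plain", "-o", target, source]


def docbook_to_plain(source, target):
    """Run pandoc; raises CalledProcessError if the conversion fails."""
    subprocess.run(pandoc_command(source, target), check=True)
```

Calling `docbook_to_plain("index.docbook", "index.txt")` would then run the same conversion as the command line above.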

@pombredanne
Member

FWIW, these GNOME licenses are a treasure trove of quality markup: the docbook texts have hyperlinks to the essential conditions of the license. Quite a find, and great work that could be reused in many ways, e.g. to help with the legal review of a text.

pombredanne added a commit that referenced this issue Mar 13, 2017
 * ensure that CeCILL rules for French texts have a minimum coverage
   set to 10. Cleanup and rename CeCILL rules along the way to include
   language in the rule names and attributes.
 * refactor frequent_tokens.py to start having lists of frequent words
   per-language. Add a first list for a few common French words.
 * add new GFDL 1.1 rule for its French translation and add test for
   the corresponding docbook file. Add new "negative" rules for
   some docbook tags.

Reported-by: Richard Fontana <rfontana@redhat.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
@pombredanne
Member

@richardfontana The latest code in the develop branch fixes your bug and is ready for your review. I am tracking other related foreign-language license improvements in #139.
Thanks++ for this report: please continue sending any oddities you find our way. Any other feedback for improvements is welcome too, of course.
BTW, do you run ScanCode on your desktop or on a server?

@richardfontana
Author

@pombredanne I have been running ScanCode on my laptop.

@richardfontana
Author

Closing this as it was fixed.
