Add all the CC licenses to the ScanCode detectable licenses #514

pombredanne · 2017-02-23T11:52:55Z

See https://github.com/creativecommons/creativecommons.org/tree/45420471049bbf7a6420a12c737376b6bf3fc9dd/docroot/legalcode
With all the translations and variants, this would be a good addition even if some are rather exotic...

pombredanne · 2017-02-23T11:53:46Z

Another source is https://github.com/creativecommons/license.rdf/tree/3b47cda59c2ef60ad713ca14e111b9867a1be997/cc/licenserdf/licenses

pombredanne · 2017-02-23T12:01:27Z

And this contains some code to handle them: https://github.com/warpr/licensedb.git

singh1114 · 2017-03-03T23:22:35Z

What is needed to be done to solve this issue?

pombredanne · 2017-03-03T23:24:34Z

Well, this is about adding new licenses, rules and detection tests as needed: check that each and every CC license, including all the translations, exists as a ScanCode license in src/licensedcode/data/licenses

pombredanne · 2017-03-03T23:25:13Z

the best way to do this would be to use some simple script to automate the basics

pombredanne · 2017-03-03T23:28:02Z

You should also check recent PRs related to adding new licenses, rules and tests for examples.

singh1114 · 2017-03-03T23:44:37Z

Any particular one that you want to point out?

pombredanne · 2017-03-03T23:46:28Z

Check the open and closed PRs. There are plenty of these, and some are linked in the wiki doc on licenses.

singh1114 · 2017-03-03T23:49:13Z

So what you are saying is that I need to create a script that compares the file names in the directory with the file names at the link?

And it will report the file names that do not exist in the directory you mentioned.
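
A minimal sketch of such a comparison, assuming the upstream file names have already been turned into candidate license keys. `missing_license_files` and its parameters are hypothetical names, not part of ScanCode:

```python
from pathlib import Path


def missing_license_files(candidate_keys, licenses_dir):
    """Return the candidate keys that have no <key>.LICENSE file yet.

    candidate_keys: iterable of license keys derived from upstream file names
    (hypothetical input for this sketch).
    licenses_dir: local checkout of src/licensedcode/data/licenses.
    """
    # Path.stem drops only the final suffix, so "cc-by-2.0.LICENSE" -> "cc-by-2.0".
    existing = {path.stem for path in Path(licenses_dir).glob("*.LICENSE")}
    return sorted(set(candidate_keys) - existing)
```

This only compares names. A license stored under a different key would still be reported as missing, so a name check alone is not enough to rule out duplicates.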

singh1114 · 2017-03-04T13:23:58Z

Please check whether the procedure I am about to follow is right.

Let's say that I am adding the license for this one: https://github.com/creativecommons/creativecommons.org/blob/45420471049bbf7a6420a12c737376b6bf3fc9dd/docroot/legalcode/by-nc-sa_3.0_es_gl.html

First of all, I will check whether the license already exists. As this is an HTML file, I need to open it in the browser. After that, I copy the content and save it in src/licensedcode/data/licenses, with a file name ending with .LICENSE and a corresponding .yml file. And then I am done.

pombredanne · 2017-03-04T15:56:02Z

@singh1114 let me come back with more details tomorrow. I actually have a script for something similar that I am working on for SPDX licenses as part of #41 .... Once I push this, you could use it as a template and/or update it to support CC licenses.

aviaryan · 2017-03-31T19:42:20Z

Hi @pombredanne .
I understand what is to be done in this issue. Should I try to solve it?
Here is what I am thinking as of now -
Since there are lots of licenses in that first link, I will create a script to parse those licenses, extract the plain text from them and place them in the data/licenses folder if they don't already exist there. Tests would be created in the same way.
What do you think?

pombredanne · 2017-04-01T14:58:26Z

@aviaryan PR welcome. The work consists of:

* adding the license texts as-is and the corresponding .yml data files. Not sure where parsing is needed.
* maybe adding new rules for direct URLs, short forms and notices for these licenses
* running the tests and ensuring things are not broken.

Not much more would be needed for now, as adding a license or rule automatically creates a test for it.
Adding more tests of detection of these licenses as found in the wild would be a nice to have of course.
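
A rough sketch of the first step, adding a text and its data file side by side. The `add_license` helper is hypothetical, and the data field names passed to it are assumptions: copy the exact set of fields from an existing .yml file in src/licensedcode/data/licenses before relying on this.

```python
import json
from pathlib import Path


def add_license(licenses_dir, key, text, **fields):
    """Write <key>.LICENSE and <key>.yml; return False if the key exists.

    The names in **fields (for example short_name or owner) are assumptions;
    mirror an existing .yml data file for the real field names.
    """
    licenses_dir = Path(licenses_dir)
    text_file = licenses_dir / f"{key}.LICENSE"
    data_file = licenses_dir / f"{key}.yml"
    if text_file.exists() or data_file.exists():
        return False  # never overwrite a license that is already there
    text_file.write_text(text, encoding="utf-8")
    # A JSON string is also a valid YAML double-quoted scalar, so json.dumps
    # gives safe quoting without requiring a YAML library.
    lines = [f"key: {json.dumps(key)}"]
    lines += [f"{name}: {json.dumps(value)}" for name, value in fields.items()]
    data_file.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return True
```

The quoted values differ in style from hand-written data files, which is a trade-off of avoiding a YAML dependency in this sketch.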

aviaryan · 2017-04-01T16:28:57Z

> adding the license texts as-is and the corresponding .yml data files. Not sure where parsing is needed.

In the first repo, the licenses are in HTML form with HTML formatting, not plain text. I assumed we wouldn't be able to use them as they are, would we? Also, some HTML pages there point to an external link for the license. (Example)

I am kind of occupied for the next 48 hours so will only be able to start work on this after that. I hope this isn't a problem.

singh1114 · 2017-04-01T18:46:13Z

@aviaryan I think for the HTML format, we need to open the file in the browser and get the real content from it in the browser.

@pombredanne was also mentioning something about the script. @pombredanne, Any update on that part?

aviaryan · 2017-04-01T19:11:48Z

> @aviaryan I think for the HTML format, we need to open the file in the browser and get the real content from it in the browser.

Yes, but I would just write a script to do that for me.

singh1114 · 2017-04-01T19:13:39Z

@aviaryan That was what I wanted to say. Probably I missed it in the last comment.

pombredanne · 2017-04-02T08:57:02Z

You could open the pages in lynx and you would get clean text, for instance:

                             [1]Creative Commons

                         Creative Commons Legal Code

                   Attribution-NonCommercial-NoDerivs 2.0
   CREATIVE COMMONS CORPORATION IS NOT A LAW FIRM AND DOES NOT PROVIDE
[........]
 usage guidelines, as may be published on its website or otherwise made
   available upon request from time to time.

   Creative Commons may be contacted at [2]https://creativecommons.org/.

                                                [3]« Back to Commons Deed

References

   1. https://creativecommons.org/
   2. https://creativecommons.org/
   3. https://creativecommons.org/licenses/by-nc-nd/2.0/

This would eschew most of the parsing.
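
If lynx is not at hand, a similar plain-text dump can be approximated with the Python standard library. This is a stand-in sketch, not what lynx does internally: it keeps no link references and drops blank lines, and `TextExtractor` and `html_to_text` are hypothetical names.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <head>, <script> and <style> content."""

    SKIPPED = {"head", "script", "style"}
    BLOCKS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag in self.BLOCKS:
            self.parts.append("\n")  # block elements start a new line

    def handle_endtag(self, tag):
        if tag in self.SKIPPED:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in self.BLOCKS:
            self.parts.append("\n")

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text with whitespace collapsed and blank lines dropped."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    lines = (" ".join(line.split()) for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)
```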

pombredanne · 2017-04-02T09:05:21Z

@singh1114 my mention of a script is about something different: dealing with the need to reconcile with existing license instances (e.g. there are some CC licenses already there).
The simple way to deal with that for now is to run the detection on each text: if it matches 100% to the same license id, then the license is already present, for instance: https://github.com/nexB/scancode-toolkit/blob/develop/src/licensedcode/data/rules/cc-by-2.0.yml
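
The reconciliation check could be sketched like this. The `detect` argument is a hypothetical callable that returns (license key, score) pairs for a text; it is an assumption standing in for the real ScanCode detection, whose API is not shown here.

```python
def is_already_present(text, expected_key, detect):
    """Return True if `text` is detected as `expected_key` with a 100% score.

    `detect` is a hypothetical callable returning an iterable of
    (license_key, score) pairs; plug in the actual detection in real use.
    """
    return any(
        key == expected_key and score >= 100
        for key, score in detect(text)
    )
```

Texts for which this returns False would then be the candidates to add as new licenses.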

pombredanne · 2017-04-02T09:09:42Z

Note also that some .txt files exist ... but most translations and older licenses do not have a plain text version.

singh1114 · 2017-04-02T18:02:20Z

@pombredanne That script will be very useful.

Does that mean, if the licenses have an HTML format we have to put the whole HTML text into the .txt file?

pombredanne · 2017-04-02T19:52:53Z

@singh1114 we want the plain text, not the HTML, e.g. a lynx dump or similar, i.e. no markup

aviaryan · 2017-04-04T06:03:20Z

@pombredanne ~~I have almost finished work on this issue.~~ There are approximately 800 new licenses to be added. Some of them have notices too, so we will have to create rules for those. Should I send them all in one PR, or split them up? Please suggest an approach.

How should we name translated licenses? cc-by-nd-3.0_au or cc-by-nd-3.0-au?

pombredanne · 2017-04-04T12:45:48Z

Thanks!
For the translated licenses, use dashes and no underscores.
For a start, I suggest you send a small PR first with a few licenses. That way I can review and provide feedback that will likely apply to many licenses. Once we have a few clean ones, you could submit several PRs in batches, maybe all the translations of a given license at once.

A massive 800-file PR would take too long to review.
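
As an illustration of the dash-only naming, here is a small sketch that maps an upstream legalcode file name to a key. The `cc_license_key` helper and the `cc-` prefix rule are assumptions inferred from the names used in this thread, and letter case is left untouched.

```python
def cc_license_key(filename):
    """Map a legalcode file name to a cc-prefixed key that uses only dashes."""
    stem = filename
    for suffix in (".html", ".txt"):
        if stem.endswith(suffix):
            stem = stem[: -len(suffix)]  # drop the file extension only
            break
    return "cc-" + stem.replace("_", "-")
```

For example, `cc_license_key("by-nc-sa_3.0_es_gl.html")` gives `cc-by-nc-sa-3.0-es-gl`.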

pombredanne · 2017-04-04T12:47:16Z

@aviaryan note that a script to automate adding these CC licenses, if you wrote such a thing, is of as much interest as the licenses themselves. Doing it manually is OK too, since once we are over the big hurdle, new additions will likely be small in the future.

aviaryan · 2017-04-04T14:42:28Z

@pombredanne Thanks for your feedback. Yes, I have written a script for the task. I am busy trying to make it as robust as possible.


* These introduce a bias in word frequencies that needs to be supported first. It otherwise skews license detection too much and is a risk for low perf and false positives. Link: #514 Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

pombredanne · 2022-02-02T07:18:13Z

At this stage I have checked that we have all the CC licenses except the non-English translations. This is tracked in #139, so I am closing this now.

pombredanne added license scan new and improved data labels Feb 23, 2017

pombredanne added the easy label Feb 28, 2017

pombredanne mentioned this issue Mar 11, 2017

Spurious detection of CeCILL and GPL in French translation of GFDL #553

Closed

pombredanne changed the title ~~Add all the CC licenses~~ Add all the CC licenses to the ScanCode detectable licenses Apr 2, 2017

aviaryan added a commit to aviaryan/scancode-toolkit that referenced this issue Apr 4, 2017

aboutcode-org#514 Add some CC licenses

0082efc

aviaryan mentioned this issue Apr 4, 2017

#514 Add some CC licenses #587

Merged

aviaryan added a commit to aviaryan/scancode-toolkit that referenced this issue Apr 5, 2017

aboutcode-org#514 Add some CC licenses

d0d8992

aviaryan added a commit to aviaryan/scancode-toolkit that referenced this issue Apr 5, 2017

aboutcode-org#514 Add some CC licenses

78d3b3d

* cc-GPL-2.0-pt * cc-LGPL-2.1-pt * cc-by-nc-nd-2.0-at * cc-by-nc-nd-2.0-au Signed-off-by: Avi Aryan <avi.aryan123@gmail.com>

aviaryan added a commit to aviaryan/scancode-toolkit that referenced this issue Apr 5, 2017

aboutcode-org#514 Fix issues with license naming conventions

5f06d5b

Also fix spdx key values Signed-off-by: Avi Aryan <avi.aryan123@gmail.com>

pombredanne added a commit that referenced this issue Apr 18, 2017

#514 Fix license position test after new CC licenses

39a76e5

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

pombredanne added the good first issue label Nov 3, 2017

AyanSinhaMahapatra mentioned this issue Dec 11, 2019

All Good First Issues List #1825

Closed

21 tasks

AyanSinhaMahapatra mentioned this issue Mar 16, 2020

Add or test all licenses in https://github.com/okfn/licenses #863

Open

pombredanne closed this as completed Feb 2, 2022
