Add all the CC licenses to the ScanCode detectable licenses #514

Closed
Tracked by #1825
pombredanne opened this issue Feb 23, 2017 · 27 comments

Comments

@pombredanne
Member

See https://github.com/creativecommons/creativecommons.org/tree/45420471049bbf7a6420a12c737376b6bf3fc9dd/docroot/legalcode
With all the translations and variants, this would be a good addition even if some are rather exotic...

@pombredanne
Member Author

And this contains some code to handle them: https://github.com/warpr/licensedb.git

@singh1114
Collaborator

What needs to be done to solve this issue?

@pombredanne
Member Author

Well, this is about checking that each and every CC license, including all the translations, exists as a ScanCode license in src/licensedcode/data/licenses, and adding new licenses, rules and detection tests as needed.

@pombredanne
Member Author

The best way to do this would be to use a simple script to automate the basics.

@pombredanne
Member Author

You should also check recent PRs related to adding new licenses, rules and tests for examples.

@singh1114
Collaborator

Any particular one that you want to point out?

@pombredanne
Member Author

Check the open and closed PRs. There are plenty of these, and some are linked in the wiki doc on licenses.

@singh1114
Collaborator

singh1114 commented Mar 3, 2017

So what you are saying is that I need to create a script that compares the file names in the directory with the file names at the link?

And it will report the file names that do not exist in the directory you mentioned.
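
The name comparison described above could be sketched roughly as follows. This is only an illustration, not a script from the ScanCode repository: `missing_licenses` is a hypothetical helper, and it assumes the upstream names have already been converted to ScanCode-style keys and that each existing license is stored as `<key>.LICENSE`.

```python
from pathlib import Path


def missing_licenses(upstream_keys, licenses_dir):
    """Return the upstream keys that have no <key>.LICENSE file in licenses_dir.

    Hypothetical helper: the <key>.LICENSE naming is an assumption based on
    the file layout discussed in this thread.
    """
    # Path.stem drops only the final suffix, so "cc-by-2.0.LICENSE" -> "cc-by-2.0".
    existing = {path.stem for path in Path(licenses_dir).glob("*.LICENSE")}
    return sorted(key for key in upstream_keys if key not in existing)
```

A name-only comparison cannot tell whether the same text already exists under a different key, so it is at best a first pass.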

@singh1114
Collaborator

Please tell me whether the procedure I am planning to follow is right or wrong.

Let's say that I am adding the license for this one: https://github.com/creativecommons/creativecommons.org/blob/45420471049bbf7a6420a12c737376b6bf3fc9dd/docroot/legalcode/by-nc-sa_3.0_es_gl.html

First of all, I will check whether the license already exists. As this is an HTML file, I need to open it in the browser. After that, I copy the content and save it in src/licensedcode/data/licenses, in a file whose name ends with .LICENSE, along with a corresponding .yml file. And then I am done.

@pombredanne
Member Author

@singh1114 let me come back with more details tomorrow. I actually have a script for something similar that I am working on for SPDX licenses as part of #41. Once I push this, you could use it as a template and/or update it to support CC licenses.

@aviaryan
Contributor

Hi @pombredanne .
I understand what is to be done in this issue. Should I try to solve it?
Here is what I am thinking as of now -
Since there are lots of licenses at that first link, I will create a script to parse those licenses, extract the plain text from them and place them in the data/licenses folder if they don't already exist. Tests would be created in the same way.
What do you think?

@pombredanne
Member Author

@aviaryan PRs welcome. The work consists of:

  1. Adding the license texts as-is and the corresponding .yml data files. Not sure where parsing is needed.
  2. Maybe adding new rules for direct URLs, short forms and notices for these licenses.
  3. Running the tests and ensuring things are not broken.

Not much more would be needed for now, as adding a license or rule automatically creates a test for it.
Adding more tests for detection of these licenses as found in the wild would be nice to have, of course.
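
Step 1 could be automated along these lines. This is a minimal sketch under stated assumptions: `add_license` is a hypothetical helper, and the two .yml fields it writes (`key` and `name`) are placeholders for illustration; the real data files should be modelled on the existing ones in src/licensedcode/data/licenses.

```python
from pathlib import Path


def add_license(licenses_dir, key, text, name):
    """Write <key>.LICENSE and <key>.yml unless either file already exists.

    Returns True if the files were written, False if the key was already
    taken. The .yml content is an assumed, minimal placeholder.
    """
    licenses_dir = Path(licenses_dir)
    text_file = licenses_dir / f"{key}.LICENSE"
    data_file = licenses_dir / f"{key}.yml"
    if text_file.exists() or data_file.exists():
        # Never overwrite a license that is already present.
        return False
    text_file.write_text(text, encoding="utf-8")
    data_file.write_text(f"key: {key}\nname: {name}\n", encoding="utf-8")
    return True
```

Refusing to overwrite keeps a bulk import from silently replacing licenses that are already curated.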

@aviaryan
Contributor

aviaryan commented Apr 1, 2017

> adding the license texts as-is and the corresponding .yml data files. Not sure were parsing is needed.

In the first repo, the licenses are HTML pages with HTML formatting, not plain text. I assumed we can't use them as they are, can we? Also, some HTML pages there point to an external link for the license text. (Example)


I am kind of occupied for the next 48 hours so will only be able to start work on this after that. I hope this isn't a problem.

@singh1114
Collaborator

@aviaryan I think for the HTML format, we need to open the file in the browser and get the real content from it in the browser.

@pombredanne was also mentioning something about a script. @pombredanne, any update on that part?

@aviaryan
Contributor

aviaryan commented Apr 1, 2017

> @aviaryan I think for the HTML format, we need to open the file in the browser and get the real content from it in the browser.

Yes, but I would just write a script to do that for me.

@singh1114
Collaborator

singh1114 commented Apr 1, 2017

@aviaryan That was what I wanted to say. Probably I missed it in the last comment.

@pombredanne changed the title from "Add all the CC licenses" to "Add all the CC licenses to the ScanCode detectable licenses" on Apr 2, 2017
@pombredanne
Member Author

You could simply open the pages in lynx and you would get clean text.
For instance:

                             [1]Creative Commons

                         Creative Commons Legal Code

                   Attribution-NonCommercial-NoDerivs 2.0
   CREATIVE COMMONS CORPORATION IS NOT A LAW FIRM AND DOES NOT PROVIDE
[........]
 usage guidelines, as may be published on its website or otherwise made
   available upon request from time to time.

   Creative Commons may be contacted at [2]https://creativecommons.org/.

                                                [3]« Back to Commons Deed

References

   1. https://creativecommons.org/
   2. https://creativecommons.org/
   3. https://creativecommons.org/licenses/by-nc-nd/2.0/

This would avoid most of the parsing.
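
A dump like the one above still carries lynx link markers such as `[1]` and a trailing References list. A minimal cleanup sketch follows, assuming the input is the text produced by something like `lynx -dump`; `clean_lynx_dump` is a hypothetical helper, and stripping indentation is a choice made here that may not be wanted if the text must stay exactly as-is.

```python
import re


def clean_lynx_dump(dump):
    """Remove lynx link markers and the trailing References list from a dump."""
    # Cut at the last line that consists only of "References" (the link list).
    matches = list(re.finditer(r"(?m)^References\s*$", dump))
    body = dump[:matches[-1].start()] if matches else dump
    # Remove inline link markers such as "[1]".
    body = re.sub(r"\[\d+\]", "", body)
    # Strip the centering indentation and collapse runs of blank lines.
    text = "\n".join(line.strip() for line in body.splitlines())
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip() + "\n"
```

A license text that itself contains bracketed numbers like `[1]` would lose them, so the output is worth spot-checking.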

@pombredanne
Member Author

@singh1114 the script I mentioned is for something different: reconciling new texts with licenses that already exist (e.g. some CC licenses are already there).
The simple way to deal with that for now is to run the detection on each text: if it matches 100% to the same license id, then the license is already present. For instance: https://github.com/nexB/scancode-toolkit/blob/develop/src/licensedcode/data/rules/cc-by-2.0.yml
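
The reconciliation check could look roughly like this. It is a sketch only: `detect` stands in for whatever runs ScanCode license detection and is passed in as a callable, and its return shape (pairs of license id and score on a 0 to 100 scale) is an assumption made for illustration, not the ScanCode API.

```python
def is_already_present(text, expected_key, detect):
    """Return True if detection matches text at 100% to expected_key.

    detect is a hypothetical callable: text -> iterable of
    (license_key, score) pairs, with score between 0 and 100.
    """
    return any(
        key == expected_key and score >= 100
        for key, score in detect(text)
    )


# Usage with a stand-in detector (for illustration only):
def fake_detect(text):
    return [("cc-by-2.0", 100.0)] if "Attribution 2.0" in text else []


print(is_already_present("Attribution 2.0 ...", "cc-by-2.0", fake_detect))
```

Anything short of a 100% match to the same id is treated as a candidate for a new license or rule and left for manual review.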

@pombredanne
Member Author

Note also that some .txt files exist, but most translations and older licenses do not have a plain-text version.

@singh1114
Collaborator

singh1114 commented Apr 2, 2017

@pombredanne That script will be very useful.

Does that mean, if the licenses have an HTML format we have to put the whole HTML text into the .txt file?

@pombredanne
Member Author

@singh1114 we want the plain text, not the HTML, e.g. a lynx dump or similar, i.e. no markup.

@aviaryan
Contributor

aviaryan commented Apr 4, 2017

@pombredanne I have almost finished work on this issue. There are approximately 800 new licenses to add. Some of them have notices too, so we will have to create rules for those. Should I send them all in one PR, or split them up? Please suggest an approach.


How should we name translated licenses? cc-by-nd-3.0_au or cc-by-nd-3.0-au ?

@pombredanne
Member Author

Thanks!
For the translated licenses, use dash and no underscores.
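
As an illustration of that naming rule, an upstream file name such as by-nc-sa_3.0_es_gl.html could be mapped to a key like this. `license_key` is a hypothetical helper, and the `cc-` prefix is assumed from the keys mentioned elsewhere in this thread.

```python
def license_key(filename):
    """Map an upstream legalcode file name to a dash-only, cc-prefixed key."""
    name = filename
    # Drop only a trailing ".html"; version dots such as "3.0" must survive.
    if name.endswith(".html"):
        name = name[: -len(".html")]
    # Dashes only, no underscores.
    name = name.replace("_", "-")
    return name if name.startswith("cc-") else "cc-" + name


print(license_key("by-nc-sa_3.0_es_gl.html"))  # cc-by-nc-sa-3.0-es-gl
print(license_key("by-nd_3.0_au.html"))        # cc-by-nd-3.0-au
```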
For a start, I suggest you send a small PR first with a few licenses. That way I can review and provide feedback that will likely apply to many licenses. Once we have a few clean ones, you could submit several PRs in batches, maybe all the translations of a given license at once.

A massive 800 files PR would take too long to review.
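
Batching by license could be sketched as below, assuming keys follow the dash-only pattern discussed here (for example cc-by-nc-nd-2.0-at), so that everything up to and including the version number identifies the base license. `batch_by_license` is a hypothetical helper.

```python
import re
from collections import defaultdict


def batch_by_license(keys):
    """Group license keys by base license (name plus version number)."""
    batches = defaultdict(list)
    for key in keys:
        # Base license = shortest prefix ending in "-<major>.<minor>".
        match = re.match(r"(.+?-\d+\.\d+)(?:-|$)", key)
        base = match.group(1) if match else key
        batches[base].append(key)
    return {base: sorted(members) for base, members in sorted(batches.items())}
```

Each resulting group could then go into its own PR, which keeps every review small.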

@pombredanne
Member Author

@aviaryan note that a script to automate adding these CC licenses, if you wrote one, is of as much interest as the licenses themselves. Adding them manually is OK too, since once we are over the big hurdle, new additions will likely be small.

@aviaryan
Contributor

aviaryan commented Apr 4, 2017

@pombredanne Thanks for your feedback. Yes, I have written a script for the task. I am busy trying to make it as robust as possible.

aviaryan added a commit to aviaryan/scancode-toolkit that referenced this issue Apr 4, 2017
aviaryan added a commit to aviaryan/scancode-toolkit that referenced this issue Apr 5, 2017
aviaryan added a commit to aviaryan/scancode-toolkit that referenced this issue Apr 5, 2017
* cc-GPL-2.0-pt
* cc-LGPL-2.1-pt
* cc-by-nc-nd-2.0-at
* cc-by-nc-nd-2.0-au

Signed-off-by: Avi Aryan <avi.aryan123@gmail.com>
aviaryan added a commit to aviaryan/scancode-toolkit that referenced this issue Apr 5, 2017
Also fix spdx key values

Signed-off-by: Avi Aryan <avi.aryan123@gmail.com>
pombredanne added a commit that referenced this issue Apr 18, 2017
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
pombredanne added a commit that referenced this issue Apr 20, 2017
 * These introduce a bias in word frequencies that needs to be supported
   first. It otherwise skews license detection too much and is a risk
   for low perf and false positives.

Link: #514
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
@pombredanne
Member Author

At this stage, I have checked that we have all the CC licenses except the non-English translations. These are tracked in #139, so I am closing this now.
