An app to detect licenses from the provided input license text #450

Draft: wants to merge 79 commits into base: main
Changes from 61 commits

Commits
0a4915a
create an app to detect licenses from input text
lf32 Jun 16, 2022
d5c4018
Merge branch 'main' into lf32-licensetext-devel
lf32 Jun 16, 2022
4c1483a
Merge branch 'main' of nexB/scancode.io into lf32-licensetext-devel
lf32 Jun 17, 2022
9c25575
Improved UI, Changed Scan Path
lf32 Jun 20, 2022
49fe7a5
changed card title from name to short_name
lf32 Jun 20, 2022
cb9de2d
Improved License Input and Details UI #450
lf32 Jun 20, 2022
d2bd8d0
Run License Detection On Text Submission with tempfile #450
lf32 Jun 20, 2022
a113c14
Rename license form, prevent text stripping, temp_file attr update #450
lf32 Jun 22, 2022
0fa7fbb
Detect unknown licenses and return matched text by default #450
lf32 Jun 22, 2022
ea6f4bb
Flush contents of the temfile before passing as an argument #450
lf32 Jun 22, 2022
947faaf
Flush contents of the tempfile before passing as an argument #450
lf32 Jun 22, 2022
dd62121
Merge branch 'lf32-licensetext-devel' of https://github.com/nexB/scan…
lf32 Jun 22, 2022
60af913
Deleted models.py and migrations #450
lf32 Jun 22, 2022
28d0dcf
Added comment related to flushing tempfile #450
lf32 Jun 22, 2022
e46107a
Commenting inside views with pound over docstring #450
lf32 Jun 22, 2022
4468085
Write comments in multiple lines inside views #450
lf32 Jun 22, 2022
8e7c72b
Input text is rendered inside the box over the textarea #450
lf32 Jun 22, 2022
a87ace7
Merge branch 'main' of https://github.com/nexB/scancode.io into lf32-…
lf32 Jun 24, 2022
91ca751
Display form errors, detect licenses for input text file #450
lf32 Jun 27, 2022
cfb7a37
Improve file error response in views #450
lf32 Jul 1, 2022
7eedf84
Moved license summary to details with new UI #450
lf32 Jul 1, 2022
2d4f7f6
Fixed broken short lines #450
lf32 Jul 1, 2022
bbe58d0
New line at the end of the page #450
lf32 Jul 1, 2022
5602a6e
Merge branch 'main' of nexB/scancode.io -> lf32-licensetext-devel
lf32 Jul 2, 2022
a19a762
Ace Editor Restored with improved UI #450
lf32 Jul 4, 2022
b77976d
Highlight and scroll down to matched text #450
lf32 Jul 5, 2022
0333932
Close all cards while opening a new one #450
lf32 Jul 5, 2022
b206145
Removed details option in the view #450
lf32 Jul 7, 2022
b53e45c
Restore details tab with matched text #450
lf32 Jul 16, 2022
fe66aa2
Merge remote-tracking branch 'origin' into lf32-licensetext-devel
lf32 Jul 23, 2022
ce9b294
Merge remote-tracking branch 'origin' into lf32-licensetext-devel
lf32 Jul 25, 2022
1293066
Highlight text with colors #450
lf32 Jul 28, 2022
dddb4ad
Merge branch 'main' of https://github.com/nexB/scancode.io into lf32-…
lf32 Jul 28, 2022
a6b34de
Fix failing text for `make valid` #450
lf32 Jul 28, 2022
1bdf9d1
Fix failing test for `make valid` #450
lf32 Jul 28, 2022
ab56010
Merge branch 'main' of into lf32-licensetext-devel
lf32 Jul 29, 2022
fda2356
Merge branch 'main' into lf32-licensetext-devel
lf32 Aug 4, 2022
b82d1a2
Merge branch 'main' into lf32-licensetext-devel
lf32 Aug 6, 2022
67462c5
Testing out new UI to match the projects page #450
lf32 Aug 7, 2022
0f675ff
Add match_text tests to scantext #450
lf32 Aug 9, 2022
76e8a03
Make scancode match text test pass
pombredanne Aug 9, 2022
8727cb7
Add list of matches to Token
pombredanne Aug 9, 2022
04b8d21
Merge branch 'main' into lf32-licensetext-devel
lf32 Aug 11, 2022
3f7b640
Add all detected values into the table #450
lf32 Aug 12, 2022
658df1f
Add details page to view matched license details #450
lf32 Aug 17, 2022
8edf207
Add a mini licenses file to test for development #450
lf32 Aug 17, 2022
787133a
Move import of Token to top #450
lf32 Aug 17, 2022
4a08ea9
Add `license_chart_data` to render charts #450
lf32 Aug 17, 2022
32c3931
Add eye friendly green color to match highlights #450
lf32 Aug 17, 2022
4442f7c
Highlight all detected licenses in one match #450
lf32 Aug 18, 2022
840b395
Fix tests by running make valid
lf32 Aug 18, 2022
60683d0
Match token args correction #450
lf32 Aug 18, 2022
68205f8
Add hyperlinks to urls in details #450
lf32 Aug 19, 2022
968c538
Update details, Improve UI #450
lf32 Aug 25, 2022
3108089
Merge v31.0.0 of 'main' into lf32-licensetext-devel
lf32 Aug 27, 2022
6cd0243
Ace editor upgraded from v1.4.12 to v1.9.5
lf32 Aug 27, 2022
65277f1
Improve details, highlights and tests
lf32 Aug 31, 2022
bdae386
Add report functionality to the summary
lf32 Sep 2, 2022
81d1c6d
Add new user interface and cleanup the old one
lf32 Sep 6, 2022
bd15385
Merge branch 'main' into lf32-licensetext-devel
lf32 Sep 7, 2022
5c88367
Remove unused tests and validate code format #450
lf32 Sep 7, 2022
57a1d62
Indent html, handle input, fix modules & tests#450
lf32 Sep 18, 2022
3def711
Add more licenses, highlight text properly #450
lf32 Sep 19, 2022
ebcb6e2
Set cursor style to help, add token attr desc #450
lf32 Sep 19, 2022
cda3a40
Connect left and right part of ui #450
lf32 Sep 21, 2022
9c404cc
Set predefined colors to 3 #450
lf32 Sep 21, 2022
6fab923
Add more colors & new test licenses
lf32 Sep 26, 2022
4760050
Add highlight color and get rid of modal #450
lf32 Sep 28, 2022
e3281f7
Merge branch 'main' into lf32-licensetext-devel
lf32 Oct 11, 2022
45acd64
Fix failing tests #450
lf32 Oct 17, 2022
b719948
Fix width for dropdown #450
lf32 Oct 17, 2022
e3c8d79
Merge branch 'main' into lf32-licensetext-devel
lf32 Oct 21, 2022
3bca8e7
Merge branch 'main' into lf32-licensetext-devel
lf32 Nov 23, 2022
a165bd1
Merge branch 'main' into lf32-licensetext-devel
lf32 Dec 10, 2022
73a6571
Merge branch 'main' into lf32-licensetext-devel
lf32 Jan 25, 2023
dd2e8ab
Write input file either in chunks or text #450
lf32 Feb 5, 2023
5c30510
Move scan to utility dropdown #450
lf32 Feb 5, 2023
a5fa2b4
Merge branch 'main' into lf32-licensetext-devel
lf32 Feb 27, 2023
9787f44
(tests) code format is vaildated #450
lf32 Feb 27, 2023
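
Several commits above (d2bd8d0, 947faaf, 28d0dcf) concern flushing the temporary file before its path is handed to the scanner. The snippet below is a minimal sketch of why that matters, not the code from this PR: `scan_path_for_text` and `read_back` are hypothetical names, and `scan_func` stands in for whatever call actually performs the license detection. It assumes a POSIX system, where a `NamedTemporaryFile` can be reopened by path while it is still open.

```python
import tempfile
from pathlib import Path


def scan_path_for_text(text, scan_func):
    """
    Write `text` to a named temporary file, flush it, and call
    `scan_func(path)` while the file still exists.

    `scan_func` is a hypothetical stand-in for the real scan call.
    """
    with tempfile.NamedTemporaryFile(
        mode="w", suffix=".txt", encoding="utf-8"
    ) as temp_file:
        temp_file.write(text)
        # Without flush(), the data may still sit in Python's write buffer,
        # so a reader opening the path could see an empty or partial file.
        temp_file.flush()
        return scan_func(temp_file.name)


def read_back(path):
    # Stand-in "scanner" for illustration: read the file contents by path.
    return Path(path).read_text(encoding="utf-8")


content = scan_path_for_text("MIT License\n", read_back)
```

The file is removed when the `with` block exits, so the scan has to happen inside it.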
1 change: 1 addition & 0 deletions scancodeio/settings.py
@@ -80,6 +80,7 @@
# Local apps
# Must come before Third-party apps for proper templates override
"scanpipe",
"scantext",
# Django built-in
"django.contrib.auth",
"django.contrib.contenttypes",
1 change: 1 addition & 0 deletions scancodeio/urls.py
@@ -52,6 +52,7 @@
path("admin/", admin.site.urls),
path("api/", include(api_router.urls)),
path("license/", include(licenses.urls)),
path("scantext/", include("scantext.urls")),
path("", include("scanpipe.urls")),
path("", RedirectView.as_view(url="project/")),
]
3 changes: 3 additions & 0 deletions scanpipe/templates/scanpipe/includes/navbar_header.html
@@ -9,6 +9,9 @@
<a class="navbar-item" href="{% url 'project_list' %}">
Projects
</a>
<a class="navbar-item" href="{% url 'license_scan' %}">
Scan
</a>
<a class="navbar-item" href="https://scancodeio.readthedocs.org/" target="_blank">
Documentation
</a>
21 changes: 21 additions & 0 deletions scantext/__init__.py
@@ -0,0 +1,21 @@
# SPDX-License-Identifier: Apache-2.0
#
# http://nexb.com and https://github.com/nexB/scancode.io
# The ScanCode.io software is licensed under the Apache License version 2.0.
# Data generated with ScanCode.io is provided as-is without warranties.
# ScanCode is a trademark of nexB Inc.
#
# You may not use this software except in compliance with the License.
# You may obtain a copy of the License at: http://apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software distributed
# under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR
# CONDITIONS OF ANY KIND, either express or implied. See the License for the
# specific language governing permissions and limitations under the License.
#
# Data Generated with ScanCode.io is provided on an "AS IS" BASIS, WITHOUT WARRANTIES
# OR CONDITIONS OF ANY KIND, either express or implied. No content created from
# ScanCode.io should be considered or used as legal advice. Consult an Attorney
# for any legal advice.
#
# ScanCode.io is a free software code scanning tool from nexB Inc. and others.
# Visit https://github.com/nexB/scancode.io for support and download.
27 changes: 27 additions & 0 deletions scantext/apps.py
@@ -0,0 +1,27 @@
# SPDX-License-Identifier: Apache-2.0
#
# http://nexb.com and https://github.com/nexB/scancode.io
# The ScanCode.io software is licensed under the Apache License version 2.0.
# Data generated with ScanCode.io is provided as-is without warranties.
# ScanCode is a trademark of nexB Inc.
#
# You may not use this software except in compliance with the License.
# You may obtain a copy of the License at: http://apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software distributed
# under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR
# CONDITIONS OF ANY KIND, either express or implied. See the License for the
# specific language governing permissions and limitations under the License.
#
# Data Generated with ScanCode.io is provided on an "AS IS" BASIS, WITHOUT WARRANTIES
# OR CONDITIONS OF ANY KIND, either express or implied. No content created from
# ScanCode.io should be considered or used as legal advice. Consult an Attorney
# for any legal advice.
#
# ScanCode.io is a free software code scanning tool from nexB Inc. and others.
# Visit https://github.com/nexB/scancode.io for support and download.

from django.apps import AppConfig


class ScantextConfig(AppConfig):
    name = "scantext"
43 changes: 43 additions & 0 deletions scantext/forms.py
@@ -0,0 +1,43 @@
# SPDX-License-Identifier: Apache-2.0
#
# http://nexb.com and https://github.com/nexB/scancode.io
# The ScanCode.io software is licensed under the Apache License version 2.0.
# Data generated with ScanCode.io is provided as-is without warranties.
# ScanCode is a trademark of nexB Inc.
#
# You may not use this software except in compliance with the License.
# You may obtain a copy of the License at: http://apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software distributed
# under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR
# CONDITIONS OF ANY KIND, either express or implied. See the License for the
# specific language governing permissions and limitations under the License.
#
# Data Generated with ScanCode.io is provided on an "AS IS" BASIS, WITHOUT WARRANTIES
# OR CONDITIONS OF ANY KIND, either express or implied. No content created from
# ScanCode.io should be considered or used as legal advice. Consult an Attorney
# for any legal advice.
#
# ScanCode.io is a free software code scanning tool from nexB Inc. and others.
# Visit https://github.com/nexB/scancode.io for support and download.

from django import forms


class LicenseScanForm(forms.Form):
    input_text = forms.CharField(
        strip=False,
        widget=forms.Textarea(
            attrs={
                "rows": 15,
                "class": "textarea has-fixed-size",
                "placeholder": "Paste your license text here.",
            }
        ),
        required=False,
    )
    input_file = forms.FileField(
        required=False,
        widget=forms.ClearableFileInput(
            attrs={"class": "file-input", "multiple": False},
        ),
    )
198 changes: 198 additions & 0 deletions scantext/match_text.py
@@ -0,0 +1,198 @@
#
# Copyright (c) nexB Inc. and others. All rights reserved.
# ScanCode is a trademark of nexB Inc.
# SPDX-License-Identifier: Apache-2.0
# See http://www.apache.org/licenses/LICENSE-2.0 for the license text.
# See https://github.com/nexB/scancode-toolkit for support or download.
# See https://aboutcode.org for more information about nexB OSS projects.
#

import attr
from licensedcode import query
from licensedcode.spans import Span
from licensedcode.stopwords import STOPWORDS
from licensedcode.tokenize import index_tokenizer
from licensedcode.tokenize import matched_query_text_tokenizer

TRACE = False
TRACE_MATCHED_TEXT = False
TRACE_MATCHED_TEXT_DETAILS = False


def logger_debug(*args):
    pass


if TRACE or TRACE_MATCHED_TEXT or TRACE_MATCHED_TEXT_DETAILS:

    use_print = True
    if use_print:
        prn = print
    else:
        import logging
        import sys

        logger = logging.getLogger(__name__)
        # logging.basicConfig(level=logging.DEBUG, stream=sys.stdout)
        logging.basicConfig(stream=sys.stdout)
        logger.setLevel(logging.DEBUG)
        prn = logger.debug

    def logger_debug(*args):
        return prn(" ".join(isinstance(a, str) and a or repr(a) for a in args))

    def _debug_print_matched_query_text(match, extras=5):
        """
        Print a matched query text including `extras` tokens before and after
        the match. Used for debugging license matches.
        """
        # Create a fake new match with extra tokens before and after
        new_match = match.combine(match)
        new_qstart = max([0, match.qstart - extras])
        new_qend = min([match.qend + extras, len(match.query.tokens)])
        new_qspan = Span(new_qstart, new_qend)
        new_match.qspan = new_qspan

        logger_debug(new_match)
        logger_debug(" MATCHED QUERY TEXT with extras")
        qt = new_match.matched_text(whole_lines=False)
        logger_debug(qt)


@attr.s(slots=True)
class Token:
    """
    Used to represent a token in collected query-side matched texts and SPDX
    identifiers.

    ``match_ids`` is a list of LicenseMatch ids to accommodate overlapping
    matches. For example, say we have these two matched text portions:
    QueryText: this is licensed under GPL or MIT
    Match1:    this is licensed under GPL
    Match2:            licensed under GPL or MIT

    Each Token would be assigned one or more LicenseMatch:
    this:     Match1         : yellow
    is:       Match1         : yellow
    licensed: Match1, Match2 : orange (mixing yellow and pink colors)
    under:    Match1, Match2 : orange (mixing yellow and pink colors)
    GPL:      Match1, Match2 : orange (mixing yellow and pink colors)
    or:       Match2         : pink
    MIT:      Match2         : pink
    """

    # original text value for this token.
    value = attr.ib()

    # line number, one-based
    line_num = attr.ib()

    # absolute position for known tokens, zero-based. -1 for unknown tokens
    pos = attr.ib(default=-1)

    # True if text/alpha False if this is punctuation or spaces
    is_text = attr.ib(default=False)

    # True if part of a match
    is_matched = attr.ib(default=False)

    # True if this is a known token
    is_known = attr.ib(default=False)

    # List of LicenseMatch ids that match this token
    match_ids = attr.ib(attr.Factory(list))
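
A note on the overlap example in the `Token` docstring: the assignment of match ids to tokens can be sketched in plain Python as below. This is only an illustration under stated assumptions, not the PR's implementation. `SimpleToken` and `assign_match_ids` are hypothetical names, matches are reduced to inclusive `(start, end)` position ranges, and every word is treated as a known token.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleToken:
    # Hypothetical, stripped-down stand-in for the Token class above.
    value: str
    pos: int
    match_ids: list = field(default_factory=list)


def assign_match_ids(words, match_spans):
    """
    Return one SimpleToken per word, each carrying the ids of all matches
    whose inclusive (start, end) position range covers it.
    """
    tokens = [SimpleToken(value=word, pos=i) for i, word in enumerate(words)]
    for match_id, (start, end) in match_spans.items():
        for token in tokens[start : end + 1]:
            token.match_ids.append(match_id)
    return tokens


words = "this is licensed under GPL or MIT".split()
# Match 1 covers "this is licensed under GPL"; match 2 covers
# "licensed under GPL or MIT".
tokens = assign_match_ids(words, {1: (0, 4), 2: (2, 6)})
overlap = [token.value for token in tokens if len(token.match_ids) > 1]
```

Tokens with more than one id are the ones a UI would render with a mixed highlight color.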


def tokenize_matched_text(
    location,
    query_string,
    dictionary,
    start_line=1,
    trace=TRACE_MATCHED_TEXT_DETAILS,
):
    """
    Yield Token objects with pos and line number collected from the file at
    `location` or the `query_string` string. `dictionary` is the index mapping
    of tokens to token ids.
    """
    pos = 0
    qls = query.query_lines(
        location=location,
        query_string=query_string,
        strip=False,
        start_line=start_line,
    )
    for line_num, line in qls:
        if trace:
            logger_debug(
                " tokenize_matched_text:", "line_num:", line_num, "line:", line
            )

        for is_text, token_str in matched_query_text_tokenizer(line):
            if trace:
                logger_debug(" is_text:", is_text, "token_str:", repr(token_str))

            # Determine if a token is is_known in the license index or not. This
            # is essential as we need to realign the query-time tokenization
            # with the full text to report proper matches.
            if is_text and token_str and token_str.strip():

                # we retokenize using the query tokenizer:
                # 1. to lookup for is_known tokens in the index dictionary

                # 2. to ensure the number of tokens is the same in both
                # tokenizers (though, of course, the case will differ as the
                # regular query tokenizer ignores case and punctuations).
                qtokenized = list(index_tokenizer(token_str))
                if not qtokenized:

                    yield Token(
                        value=token_str,
                        line_num=line_num,
                        is_text=is_text,
                        is_known=False,
                        pos=-1,
                    )

                elif len(qtokenized) == 1:
                    is_known = qtokenized[0] in dictionary
                    if is_known:
                        p = pos
                        pos += 1
                    else:
                        p = -1

                    yield Token(
                        value=token_str,
                        line_num=line_num,
                        is_text=is_text,
                        is_known=is_known,
                        pos=p,
                    )
                else:
                    # we have two or more tokens from the original query mapped
                    # to a single matched text tokenizer token.
                    for qtoken in qtokenized:
                        is_known = qtoken in dictionary
                        if is_known:
                            p = pos
                            pos += 1
                        else:
                            p = -1

                        yield Token(
                            value=qtoken,
                            line_num=line_num,
                            is_text=is_text,
                            is_known=is_known,
                            pos=p,
                        )
            else:

                yield Token(
                    value=token_str,
                    line_num=line_num,
                    is_text=False,
                    is_known=False,
                    pos=-1,
                )
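
The position bookkeeping in `tokenize_matched_text` is easy to miss: `pos` advances only for tokens found in the index dictionary, and unknown tokens get `-1`. The sketch below isolates that rule. It is a simplified illustration, not the scancode-toolkit code: `assign_positions` is a hypothetical helper, the regular expression is an assumed stand-in for `index_tokenizer`, and the dictionary is a plain set of lowercase words.

```python
import re


def assign_positions(text, dictionary):
    """
    Return (word, pos) pairs. Known words get consecutive zero-based
    positions; unknown words get -1 and do not advance the counter.
    """
    pos = 0
    result = []
    # Assumed simplification: split on runs of ASCII letters and digits.
    for word in re.findall(r"[A-Za-z0-9]+", text):
        if word.lower() in dictionary:
            result.append((word, pos))
            pos += 1
        else:
            result.append((word, -1))
    return result


known = {"licensed", "under", "the", "mit", "license"}
positions = assign_positions("Licensed under the FooBar MIT license", known)
```

Here `FooBar` is unknown, so `MIT` takes position 3 rather than 4. Keeping positions contiguous over known tokens is what lets match spans computed against the index line up with the original text.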
66 changes: 66 additions & 0 deletions scantext/templates/scantext/includes/license_detail_modal.html
@@ -0,0 +1,66 @@
<div class="modal license-details-modal">
<div class="modal-background"></div>
<div class="modal-card" style="margin-top: 10vh">
<header class="modal-card-head">
<p class="modal-card-title">{{ license.license_expression }}</p>
<button class="delete license-details-close-modal" aria-label="close"></button>
</header>
<section class="modal-card-body is-4by4">
<table class="table is-striped is-hoverable is-fullwidth is-size-6">
<tbody>
<tr>
<td><strong>Score</strong></td>
<td>{{ license.score }}</td>
</tr>
<tr>
<td><strong>Matched Line(s)</strong></td>
<td>{% if license.start_line == license.end_line %} {{ license.start_line }} {% else %} {{ license.start_line }} - {{ license.end_line }} {% endif %}</td>
</tr>
<tr>
<td><strong>Rule Identifier</strong></td>
<td>
{% if license.rule_text_url %}
<a href="{{ license.rule_text_url }}" target="_blank">{{ license.rule_identifier }}</a>
{% else %}
{{ license.rule_identifier }}
{% endif %}
</td>
</tr>
<tr>
<td><strong>Matcher</strong></td>
<td>{{ license.matcher }}</td>
</tr>
<tr>
<td><strong>Match Coverage</strong></td>
<td>{{ license.match_coverage }}</td>
</tr>
<tr>
<td><strong>Matched Length</strong></td>
<td>{{ license.matched_length }}</td>
</tr>
<tr>
<td><strong>Key(s)</strong></td>
<td>
{% for key in license.licenses %}
<a href="{{ key.reference_url }}" target="_blank"><span class="mr-2">{{ key.key }}</span></a>
{% endfor %}
</td>
</tr>
<tr>
<td><strong>Rule Relevance</strong></td>
<td>{{ license.rule_relevance }}</td>
</tr>
<tr>
<td><strong>Rule Length</strong></td>
<td>{{ license.rule_length }}</td>
</tr>
</tbody>
</table>
</section>
<footer class="modal-card-foot">
<button class="button is-outlined has-text-weight-semibold">
{% include 'scantext/includes/license_report.html' with license=license %}
</button>
</footer>
</div>
</div>
2 changes: 2 additions & 0 deletions scantext/templates/scantext/includes/license_report.html
@@ -0,0 +1,2 @@
<a class="has-text-danger" href="https://github.com/nexB/scancode.io/issues/new?labels=bug&title=License+detection+error+as+`{{ license.license_expression|pprint }}`
&body=Detection+level+details%0A```python%0A{%0A%20%20%20%20score+:+{{ license.score }}+%0A%20%20%20%20start_line+:+{{ license.start_line }}+%0A%20%20%20%20end_line+:+{{ license.end_line }}+%0A%20%20%20%20matched_length+:+{{ license.matched_length }}+%0A%20%20%20%20match_coverage+:+{{ license.match_coverage }}+%0A%20%20%20%20rule_identifier+:+{{ license.rule_identifier }}%0A}%0A```+%0A%0AMatched+Text%0A```%0A{{ license.matched_text }}%0A```+%0A%0AInput+Text%0A```%0A{{ license.matched_text }}%0A```" target="_blank">Report on Github</a>
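
The template above hand-encodes the issue title and body into the URL. As a hedged alternative sketch, the same prefilled link could be assembled in Python with `urllib.parse.urlencode`, which percent-encodes backticks, braces, and newlines consistently. `build_issue_url` is a hypothetical helper that is not part of this PR; the `labels`, `title`, and `body` query parameters are the ones already used in the template, and the target repository is left as a parameter.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_issue_url(base_url, title, body, labels="bug"):
    """
    Return a "new issue" URL with the title, body, and labels prefilled.
    All values are percent-encoded by urlencode.
    """
    query = urlencode({"labels": labels, "title": title, "body": body})
    return f"{base_url}?{query}"


body = "Detection level details\n```python\n{\n    score : 100.0\n}\n```\n"
url = build_issue_url(
    "https://github.com/nexB/scancode.io/issues/new",
    title="License detection error as `mit`",
    body=body,
)
# Decoding the query string gives back the original values.
decoded = parse_qs(urlsplit(url).query)
```

Building the URL in the view and passing it to the template as a single context variable would also keep the markup short.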
lf32 (Author), Sep 12, 2022

> Could you point me to the place where the design for the reporting to GitHub feature was discussed?

@tdruez here it is.

Contributor

This is not a "design discussion", this is code...

lf32 (Author), Sep 12, 2022

😕 It was only discussed in the meeting.

lf32 (Author), Sep 12, 2022

Reporting license details should be done at scancode-toolkit's issues page.

lf32 (Author), Sep 12, 2022

The project idea says: "It will also allow the integrated reporting of license detection issues in the app based on the results."
