Improve gibberish license/copyright detection #2402

JonoYang · 2021-02-16T01:07:32Z

I have compiled a text file that contains erroneous copyright detection values. I have removed quote characters and separated each copyright value by several lines.
bad-copyright-detections.txt

pombredanne · 2021-02-16T06:55:01Z

That's an interesting class of errors!
I guess that they all come from binaries? And because it is useful, we cannot stop detecting in binaries.

Some remarks:
These are first considered candidates by the copyright detector because they contain (c) or © (which is normalized to (c)). They then happen to match some of the lexing rules in our part-of-speech tagger, get tagged as recognized names, and as names they are finally recognized by the grammar.

We could refine candidate detection so that it is stricter when scanning binaries.
We could improve the crude/dumb "Named Entity Recognition" (NER) we have today to skip some of these. There are features for this in NLTK, and we could give a shot to https://github.com/explosion/spaCy and several other NLP tools that help there.
We could detect gibberish, either early in candidate detection, during NER/tagging, or after the fact in a post-scan plugin. For that there is https://github.com/casics/nostril/, a gibberish detector, which is now LGPL (and no longer GPL). See "Would you consider a license suitable for use as a library in non GPL apps?" casics/nostril#9
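
To make the first idea concrete, here is a minimal sketch of a candidate check that is stricter for binaries. It is not ScanCode's actual code: the names `normalize_markers` and `is_candidate` are hypothetical, only the (c) marker is considered, and the extra rule for binaries (require a plausible year or the word "copyright") is an assumption chosen for illustration.

```python
import re

# Hypothetical helpers for illustration; not ScanCode's real API.
COPYRIGHT_MARKER = re.compile(r"\(c\)", re.IGNORECASE)
YEAR = re.compile(r"\b(19|20)\d{2}\b")


def normalize_markers(line):
    """Normalize the copyright sign to the ASCII marker (c)."""
    return line.replace("\u00a9", "(c)")


def is_candidate(line, from_binary=False):
    """Return True if the line is worth passing on to the tagger and grammar."""
    line = normalize_markers(line)
    if not COPYRIGHT_MARKER.search(line):
        return False
    if not from_binary:
        return True
    # Assumed stricter rule for binaries: the marker alone is not enough,
    # the line must also carry a plausible year or the word "copyright".
    return bool(YEAR.search(line)) or "copyright" in line.lower()


if __name__ == "__main__":
    for text in ("(c)u", "(c) 2020 Example Corp"):
        print(repr(text), is_candidate(text, from_binary=True))
```

A rule like this would drop short strings such as (c)u when they come from a binary, at the cost of missing genuine statements in binaries that carry neither a year nor the word "copyright".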

pombredanne · 2021-07-07T08:46:11Z

Note that adopting https://github.com/nexB/pygmars/ as a replacement for NLTK should make it easier to reuse and integrate other libraries in the lexing process, including NER and gibberish detection. In pygmars, a tokenization rule can now be an arbitrary callable that behaves like re.match, so it may enable using other tools as part of token recognition (a.k.a. lexing).
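
As an illustration of a callable that behaves like re.match, here is a minimal sketch. It does not import pygmars, and how such a callable would be registered as a rule is left out because that depends on the pygmars API. The names `match_name` and `looks_like_gibberish` are hypothetical, and the vowel-ratio check is only a crude stand-in for a real gibberish detector.

```python
import re

# Pattern for a capitalized, letters-only token that might be a name.
WORD = re.compile(r"^[A-Z][A-Za-z]+$")
VOWELS = set("aeiouyAEIOUY")


def looks_like_gibberish(token):
    """Crude stand-in for a real gibberish detector (an assumption)."""
    letters = [c for c in token if c.isalpha()]
    if not letters:
        return True
    vowel_ratio = sum(c in VOWELS for c in letters) / len(letters)
    # Tokens that are almost all consonants or almost all vowels are suspicious.
    return vowel_ratio < 0.2 or vowel_ratio > 0.8


def match_name(token):
    """Behave like re.match: return a match object, or None to reject."""
    m = WORD.match(token)
    if m is None or looks_like_gibberish(token):
        return None
    return m


if __name__ == "__main__":
    for token in ("Example", "Xrrt", "(c)"):
        print(repr(token), match_name(token) is not None)
```

The point of the design is that the extra check hides behind the same interface as a regex rule, so the tagger would not need to know that a detector is involved.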

pombredanne · 2021-07-10T18:49:40Z

Another candidate for gibberish detection that works quite well is https://github.com/domanchi/gibberish-detector

pombredanne · 2021-08-19T16:00:54Z

See

C source code line mistaken for MPL licence #2304
Discard matches to single GPL word and other very short rules with mixed, non-matching case and/or in a binary an/or not on a single line and/or in giberish #2403

pombredanne · 2021-09-17T08:05:17Z

I ran this with bad-copyright-detections.txt

pip install gibberish-detector
gibberish-detector train examples/big.txt > big.model
Then in Python:

from gibberish_detector import detector

# Load the model trained above
gib_detector = detector.create_from_model('big.model')

# Split the file into unique whitespace-separated tokens
with open('bad-copyright-detections.txt') as f:
    data = sorted(set(f.read().split()))

for d in data:
    print(repr(d), ',', gib_detector.is_gibberish(d))

I then loaded this in libreoffice to do some evaluation. Here are the results:

string	gibberish	correct
!e	False	no
?a	False	no
.	False	no
(c)	False	no
(c)i	False	no
(c)u	False	no
\x00	False	no
\x0e	False	no
\x191/2/\x18$?R	False	no
A	False	no
A-	False	no
A\x7f\x7fl?	False	no
A&	False	no
A1	False	no
A8l	False	no
B	False	no
C	False	no
Cakd	False	no
Ci	False	no
co	False	yes
co9	False	no
Copyright	False	yes
CZal	False	no
D	False	no
E	False	no
E?	False	no
E1	False	no
E2	False	no
Eu	False	no
Ev	False	no
Evo	False	no
ew	False	no
FBee	False	no
G	False	no
H	False	no
H1o	False	no
I	False	no
io-ha\x15deg\x12q?	False	no
Mr	False	no
N	False	no
N1/2	False	no
NaluAd	False	yes
O	False	no
OEm	False	no
Ok\x19s1	False	no
Optional	False	yes
RO	False	yes
Sthg	False	yes
Thw	False	yes
U	False	no
Wad?	False	no
WrapNonExceptionThrows	False	yes
Xr66	False	no
Y	False	no
YO	False	no
(c)s	True	yes
\x00\x03\x00$LN2007\x00A\x00\x00\x00\x00\x00\x00\x06\x00$LN2008\x00th\x00\x00\x00\x00\x00\x00\x06\x00$LN2005\x00\x14	True	yes
\x7fCdegA-Aa*	True	yes
$?OAE+-OA	True	yes
1NE3	True	yes
1PS1	True	yes
3/4R\x08U*	True	yes
3\x12ae.Oi6	True	yes
A2C/a2EPSOA-D1/4,C/3A	True	yes
Aa	True	yes
AaASSA	True	yes
AAE	True	no
AaU	True	yes
AC3E	True	yes
AE1rvfZaUNXNJXNJ	True	yes
aeaaeO1/2	True	yes
AEIQDfR	True	yes
AEo	True	yes
AErA	True	yes
Ajc	True	yes
aMp@KPj\x13o*O BNeAa!'o-O\x06\x11R\x1boX\x04	True	yes
aO1rqaetet\x19il	True	yes
Ao3Vs	True	yes
AoD$?xAAA	True	yes
AoDIA	True	yes
ASS	True	yes
ASS?SSE-E2I*	True	yes
AThY	True	yes
AuNTO	True	yes
BjAO	True	yes
bvnay$?\x08xVx,i\x0fveoPSS	True	yes
C/ae3e,!cU	True	yes
C2CDCgC	True	yes
Cd\x03AU,AEoAd\x03AU,AEoA$?xAAA	True	yes
Cdegs	True	yes
CkI3H	True	yes
CR	True	yes
Cu	True	yes
CuN	True	yes
D8DYDoD	True	yes
Dlss	True	yes
dnOWEWly1	True	yes
DoPSOA	True	yes
Dq	True	yes
E,C/A+-I1PSxA(r)OA!O3/4OAaOAaO3/4	True	yes
E2EuA	True	yes
E3P	True	yes
EdegA	True	yes
Ee	True	yes
EP	True	yes
EuI1Y	True	yes
EvoS	True	yes
FaUSS1u	True	yes
FKo2	True	yes
FPS	True	yes
G3eeAIUq5w9\x11Wis2Wu\x000PS	Gae2	True
HPS3Ho	True	yes
I1PSEu	True	yes
I1PSIoY	True	yes
I1Y	True	yes
I3A	True	yes
IA5e	True	yes
IC-cC_	True	yes
IeIc	True	yes
IIIdOi09QBn	True	yes
IjKMr	True	yes
IldY	True	yes
Imu7UTa7eeem	True	yes
IOuOeO	True	yes
IoY	True	yes
IP	True	yes
Iu	True	yes
IU1u	True	yes
IUU	True	yes
IvR33	True	yes
IYte9thdegA8Ioie5	True	yes
Jw6	True	yes
Ks	True	yes
n$i1/44ThYV\x10th\x07C.'cOnVAoE	True	yes
nA+	True	yes
NE82	True	yes
NicI	True	yes
NN	True	yes
Nxe	True	yes
o//O\x7f\x17\x12O	True	yes
OA	True	yes
OAaE3E33/4	True	yes
OAaOA	True	yes
OaC2A	True	yes
OACn	True	yes
OAdegOA!ThEPaN	True	yes
Oae	True	yes
OAEauy	True	yes
OC2YI1UC2OA	True	yes
OE	True	yes
OeA-?Fbt3\x19Uea	True	yes
OeYeAE	True	yes
OPSE3e	True	yes
OuA	True	yes
Ox	True	yes
OXMssU	True	yes
OyA	True	yes
OYEYG4w2hssaeNUhbbAJMi	True	yes
OyN	True	yes
OYThx	True	yes
PI5uWi(r)&O\x08unO	True	yes
PJ	True	yes
PSa	True	yes
QAPYINae2A	True	yes
QLdXPo	True	yes
RSDS	True	yes
SA	True	yes
Sl	True	yes
SP	True	yes
SS	True	yes
SSA	True	yes
TCYu	True	yes
ThA3I\x18	True	yes
ThIoO(r)D1/2	True	yes
ThrEIuexOUoth	True	yes
Ua	True	yes
UAA7a	True	yes
UAA7ao	True	yes
UE	True	yes
UE3	True	yes
UE9HSS	True	yes
UIuUGw7	True	yes
UO	True	yes
Uos3	True	yes
V1oVaoA	True	yes
VDOI	True	yes
VkSS	True	yes
WY	True	yes
xA	True	yes
Xi	True	yes
XY	True	yes
yA!uAW	True	yes
YC	True	yes
YUUuIUI	True	yes
YV	True	yes

This is pretty good and I would expect even better from using a proper training set.
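
To put numbers on a table like this one, the rows can be tallied by detector verdict and manual judgement. This is a minimal sketch, assuming the table is saved as tab-separated text with the three columns shown. The `tally` and `accuracy` helpers are hypothetical, and the sample holds only four rows copied from the table, so its result says nothing about the full set.

```python
import csv
import io

# Four rows copied from the table above, tab-separated.
SAMPLE = "\n".join([
    "string\tgibberish\tcorrect",
    "(c)i\tFalse\tno",
    "Copyright\tFalse\tyes",
    "AAE\tTrue\tno",
    "RSDS\tTrue\tyes",
])


def tally(tsv_text):
    """Count rows per (detector verdict, manual judgement) pair."""
    counts = {}
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    for row in reader:
        key = (row["gibberish"], row["correct"])
        counts[key] = counts.get(key, 0) + 1
    return counts


def accuracy(counts):
    """Share of rows where the manual judgement says the verdict was right."""
    total = sum(counts.values())
    right = sum(n for (_, correct), n in counts.items() if correct == "yes")
    return right / total if total else 0.0


if __name__ == "__main__":
    counts = tally(SAMPLE)
    print(counts)
    print(accuracy(counts))
```

Splitting the counts by verdict shows where the errors sit: in the table above, most of the "no" rows are strings that the detector did not flag.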

AyanSinhaMahapatra · 2022-08-02T18:48:06Z

I was testing out the very basic gibberish detector at https://github.com/rrenaud/Gibberish-Detector, which https://github.com/domanchi/gibberish-detector (mentioned above) is based on, as it is much easier to integrate.

I used our scancode license texts and rules (minus the test set) as the training data and false positives and some of the license tags as test data.

It is pretty good at detecting non-gibberish, but not so good when the strings are ambiguous.

Attaching the results of the test here for reference; the probability reported is that of the text being non-gibberish.

bad -> false positives
good -> license tags

gibberish-test.csv
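
For readers unfamiliar with this kind of detector, here is a minimal sketch of a character-pair Markov model that scores text by how typical its letter transitions are. To my understanding this is the general idea behind these detectors, but the code below is not taken from either project: the function names, the add-one smoothing, and the tiny training set are assumptions for illustration.

```python
import math

# Characters the model knows about; everything else is dropped.
ALPHABET = "abcdefghijklmnopqrstuvwxyz "


def normalize(text, alphabet=ALPHABET):
    """Lowercase the text and keep only characters from the alphabet."""
    return [c for c in text.lower() if c in alphabet]


def train(lines, alphabet=ALPHABET):
    """Build a table of log probabilities for each character transition."""
    # Start every count at 1 (add-one smoothing) so unseen pairs stay possible.
    counts = {a: {b: 1 for b in alphabet} for a in alphabet}
    for line in lines:
        chars = normalize(line, alphabet)
        for a, b in zip(chars, chars[1:]):
            counts[a][b] += 1
    model = {}
    for a, row in counts.items():
        total = sum(row.values())
        model[a] = {b: math.log(n / total) for b, n in row.items()}
    return model


def score(text, model, alphabet=ALPHABET):
    """Average transition probability; higher means more text-like."""
    chars = normalize(text, alphabet)
    pairs = list(zip(chars, chars[1:]))
    if not pairs:
        return 0.0
    log_prob = sum(model[a][b] for a, b in pairs)
    return math.exp(log_prob / len(pairs))


if __name__ == "__main__":
    training = [
        "permission is hereby granted free of charge to any person obtaining a copy",
        "the software is provided as is without warranty of any kind",
        "redistribution and use in source and binary forms are permitted",
    ]
    model = train(training)
    for text in ("permission is hereby granted", "zxqj vkqz wqxz"):
        print(repr(text), score(text, model))
```

A string is then called gibberish when its score falls below a threshold, and the alphabet is a parameter so that digits or punctuation can be added. Short or ambiguous strings have few transitions to average over, which is one reason scores for them are less reliable.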

pombredanne · 2022-08-03T14:05:43Z

Very nice! What's your take on applicability to licenses then? Did you apply some boosting to legalese words?

AyanSinhaMahapatra · 2022-08-03T14:10:48Z

I still think we need better performance before integrating this. I was looking into the other implementation, which is a library. There are some additional steps there, so I'll try that with the same data. Another thing to note is that this only uses positive training, i.e. it only trains on good, non-gibberish values, so if we could do some negative boosting for legalese gibberish, that could improve the performance.

I did extend the character set to include a-z, A-Z, numbers and other characters.

Do you think it's worth spending time on this?

pombredanne · 2022-08-03T14:20:38Z

> Do you think it's worth spending time on this?

Let's be mindful not to get too much into the weeds as this can be hairy and yield only small improvements.
It may be best to start using this on copyright before applying to license?

pombredanne · 2024-06-28T11:06:58Z

Some samples for copyrights:

pombredanne mentioned this issue Aug 19, 2021

Discard matches to single GPL word and other very short rules with mixed, non-matching case and/or in a binary an/or not on a single line and/or in giberish #2403

Open

AyanSinhaMahapatra changed the title ~~Collection of erroneous copyright detection values~~ Improve gibberish license/copyright detection Mar 26, 2023

AyanSinhaMahapatra mentioned this issue Mar 26, 2023

Reduce license detection false positives #3300

Open
