Improve gibberish license/copyright detection #2402

Open
JonoYang opened this issue Feb 16, 2021 · 10 comments

@JonoYang
Member

I have compiled a text file that contains erroneous copyright detection values. I have removed quote characters and separated each copyright value by several lines.
bad-copyright-detections.txt

@pombredanne
Member

That's an interesting class of errors!
I guess that they all come from binaries? And because it is useful, we cannot stop detecting in binaries.

Some remarks:
These are at first considered as candidates by the copyright detector because they contain (c) or © (which is normalized to (c)). They then match some of the lexing rules in our part-of-speech tagger and get tagged as names, and as names they end up recognized by the grammar.

  1. We could refine the candidate detection to be stricter when scanning binaries?
  2. We could improve the crude/dumb "Named Entity Recognition" (NER) we have today to skip some of these. There are features in NLTK, and we could give a shot to https://github.com/explosion/spaCy and several other NLP tools that help there.
  3. We could detect gibberish, either early at the candidate stage, during NER/tagging, or after the fact in a post-scan plugin. For that, there is https://github.com/casics/nostril/, which is now LGPL (and no longer GPL); see "Would you consider a license suitable for use as a library in non GPL apps?" casics/nostril#9
    That's a gibberish detector.
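
As a rough illustration of option 1, here is a minimal sketch of a stricter candidate check for strings extracted from binaries. Everything in it is an assumption made for illustration: the function name `is_strict_candidate`, the marker pattern, and the two thresholds are hypothetical and are not part of the ScanCode copyright detector.

```python
import re

# Hypothetical marker pattern: "(c)", "©" or the word "copyright".
MARKER = re.compile(r"\(c\)|©|copyright", re.IGNORECASE)


def is_strict_candidate(line, min_alnum_ratio=0.7, min_word_len=3):
    """Return True if `line` looks like a plausible copyright candidate.

    Assumed heuristic for binary-extracted strings: require a marker, no
    control characters after it, a mostly alphanumeric remainder, and at
    least one run of `min_word_len` letters.
    """
    match = MARKER.search(line)
    if match is None:
        return False
    rest = line[match.end():].strip()
    if not rest:
        return False
    # Control characters are a strong hint of binary debris.
    if any(ord(ch) < 32 or ord(ch) == 127 for ch in rest):
        return False
    visible = [ch for ch in rest if not ch.isspace()]
    alnum_ratio = sum(ch.isalnum() for ch in visible) / len(visible)
    has_word = re.search(r"[A-Za-z]{%d,}" % min_word_len, rest) is not None
    return alnum_ratio >= min_alnum_ratio and has_word
```

A check like this would only make sense for binaries, since it would likely reject short but legitimate holders in source files.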

@pombredanne
Member

Note that the adoption of https://github.com/nexB/pygmars/ as a replacement for NLTK should make it easier to reuse and integrate other libraries in the lexing process, including NER and gibberish detection. In pygmars, a tokenization rule can now be an arbitrary callable that behaves like re.match, so it may enable using other tools as part of token recognition, a.k.a. lexing.
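
To make the "callable that behaves like re.match" idea concrete, here is a small self-contained sketch. It only shows the shape of such a callable (it returns a match object or None). How it would be registered with pygmars is deliberately left out, and `make_name_rule`, `NAME_LIKE` and `toy_is_gibberish` are hypothetical names with a placeholder detector.

```python
import re

# Tokens that could plausibly be a name: a letter, then letters, dots or hyphens.
NAME_LIKE = re.compile(r"[A-Za-z][A-Za-z.\-]*$")


def toy_is_gibberish(token):
    """Placeholder detector (an assumption): flag tokens that have no vowel."""
    return not any(ch in "aeiouyAEIOUY" for ch in token)


def make_name_rule(is_gibberish=toy_is_gibberish):
    """Build a callable that mimics `re.match`: a match object or None."""
    def rule(token):
        match = NAME_LIKE.match(token)
        # Reject tokens that fail the pattern or that the detector flags.
        if match is None or is_gibberish(token):
            return None
        return match
    return rule
```

Any real detector exposing a "string in, boolean out" function could be passed as `is_gibberish` in place of the toy one.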

@pombredanne
Member

Another candidate for gibberish that works quite well is https://github.com/domanchi/gibberish-detector

@pombredanne
Member

I ran this with bad-copyright-detections.txt

  • pip install gibberish-detector
  • gibberish-detector train examples/big.txt > big.model
  • in python:
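from gibberish_detector import detector

# Load the model trained in the previous step.
Detector = detector.create_from_model('big.model')
# Split the file on whitespace and de-duplicate the tokens.
with open('bad-copyright-detections.txt') as f:
    data = sorted(set(f.read().split()))
for d in data:
    print(repr(d), ',', Detector.is_gibberish(d))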

I then loaded this in LibreOffice to do some evaluation. Here are the results; the last column says whether the detector's answer is correct:

string gibberish correct
!e False no
?a False no
. False no
(c) False no
(c)i False no
(c)u False no
\x00 False no
\x0e False no
\x191/2/\x18$?R False no
A False no
A- False no
A\x7f\x7fl? False no
A& False no
A1 False no
A8l False no
B False no
C False no
Cakd False no
Ci False no
co False yes
co9 False no
Copyright False yes
CZal False no
D False no
E False no
E? False no
E1 False no
E2 False no
Eu False no
Ev False no
Evo False no
ew False no
FBee False no
G False no
H False no
H1o False no
I False no
io-ha\x15deg\x12q? False no
Mr False no
N False no
N1/2 False no
NaluAd False yes
O False no
OEm False no
Ok\x19s1 False no
Optional False yes
RO False yes
Sthg False yes
Thw False yes
U False no
Wad? False no
WrapNonExceptionThrows False yes
Xr66 False no
Y False no
YO False no
(c)s True yes
\x00\x03\x00$LN2007\x00A\x00\x00\x00\x00\x00\x00\x06\x00$LN2008\x00th\x00\x00\x00\x00\x00\x00\x06\x00$LN2005\x00\x14 True yes
\x7fCdegA-Aa* True yes
$?OAE+-OA True yes
1NE3 True yes
1PS1 True yes
3/4R\x08U* True yes
3\x12ae.Oi6 True yes
A2C/a2EPSOA-D1/4 ,C/3A True yes
Aa True yes
AaASSA True yes
AAE True no
AaU True yes
AC3E True yes
AE1rvfZaUNXNJXNJ True yes
aeaaeO1/2 True yes
AEIQDfR True yes
AEo True yes
AErA True yes
Ajc True yes
aMp@KPj\x13o*O BNeAa!'o-O\x06\x11R\x1boX\x04 True yes
aO1rqaetet\x19il True yes
Ao3Vs True yes
AoD$?xAAA True yes
AoDIA True yes
ASS True yes
ASS?SSE-E2I* True yes
AThY True yes
AuNTO True yes
BjAO True yes
bvnay$?\x08xVx,i\x0fveoPSS True yes
C/ae3e,!cU True yes
C2CDCgC True yes
Cd\x03AU,AEoAd\x03AU,AEoA$?xAAA True yes
Cdegs True yes
CkI3H True yes
CR True yes
Cu True yes
CuN True yes
D8DYDoD True yes
Dlss True yes
dnOWEWly1 True yes
DoPSOA True yes
Dq True yes
E,C/A+-I1PSxA(r)OA!O3/4OAaOAaO3/4 True yes
E2EuA True yes
E3P True yes
EdegA True yes
Ee True yes
EP True yes
EuI1Y True yes
EvoS True yes
FaUSS1u True yes
FKo2 True yes
FPS True yes
G3eeAIUq5w9\x11Wis2Wu\x000PS Gae2 True
HPS3Ho True yes
I1PSEu True yes
I1PSIoY True yes
I1Y True yes
I3A True yes
IA5e True yes
IC-cC_ True yes
IeIc True yes
IIIdOi09QBn True yes
IjKMr True yes
IldY True yes
Imu7UTa7eeem True yes
IOuOeO True yes
IoY True yes
IP True yes
Iu True yes
IU1u True yes
IUU True yes
IvR33 True yes
IYte9thdegA8Ioie5 True yes
Jw6 True yes
Ks True yes
n$i1/44ThYV\x10th\x07C.'cOnVAoE True yes
nA+ True yes
NE82 True yes
NicI True yes
NN True yes
Nxe True yes
o//O\x7f\x17\x12O True yes
OA True yes
OAaE3E33/4 True yes
OAaOA True yes
OaC2A True yes
OACn True yes
OAdegOA!ThEPaN True yes
Oae True yes
OAEauy True yes
OC2YI1UC2OA True yes
OE True yes
OeA-?Fbt3\x19Uea True yes
OeYeAE True yes
OPSE3e True yes
OuA True yes
Ox True yes
OXMssU True yes
OyA True yes
OYEYG4w2hssaeNUhbbAJMi True yes
OyN True yes
OYThx True yes
PI5uWi(r)&O\x08unO True yes
PJ True yes
PSa True yes
QAPYINae2A True yes
QLdXPo True yes
RSDS True yes
SA True yes
Sl True yes
SP True yes
SS True yes
SSA True yes
TCYu True yes
ThA3I\x18 True yes
ThIoO(r)D1/2 True yes
ThrEIuexOUoth True yes
Ua True yes
UAA7a True yes
UAA7ao True yes
UE True yes
UE3 True yes
UE9HSS True yes
UIuUGw7 True yes
UO True yes
Uos3 True yes
V1oVaoA True yes
VDOI True yes
VkSS True yes
WY True yes
xA True yes
Xi True yes
XY True yes
yA!uAW True yes
YC True yes
YUUuIUI True yes
YV True yes

This is pretty good and I would expect even better from using a proper training set.
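
For anyone who wants to tally the table rather than eyeball it, here is a minimal sketch. It assumes the rows are saved one per line in the same "string gibberish correct" layout as above, and it splits from the right because the string itself may contain spaces. The names `parse_row` and `tally` are made up for this example.

```python
from collections import Counter


def parse_row(line):
    """Parse one 'string gibberish correct' row, or return None if malformed."""
    parts = line.rstrip("\n").rsplit(" ", 2)
    if len(parts) != 3:
        return None
    string, flagged, correct = parts
    if flagged not in ("True", "False") or correct not in ("yes", "no"):
        return None  # header line or a row with a missing column
    return string, flagged == "True", correct == "yes"


def tally(lines):
    """Count rows by (flagged_as_gibberish, detector_was_correct)."""
    counts = Counter()
    for line in lines:
        row = parse_row(line)
        if row is None:
            counts["skipped"] += 1
            continue
        _, flagged, correct = row
        counts[(flagged, correct)] += 1
    return counts
```

Rows with a missing column are counted under "skipped" instead of being guessed.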

@AyanSinhaMahapatra
Member

I was testing out the very basic gibberish detector at https://github.com/rrenaud/Gibberish-Detector, which https://github.com/domanchi/gibberish-detector (mentioned above) is based on, as it is much easier to integrate.

I used our ScanCode license texts and rules (minus the test set) as the training data, and false positives and some of the license tags as test data.

It is pretty good at detecting non-gibberish but is not so good when the strings are ambiguous.

Attaching the results of the test here for reference. Here, the probability is that of the text being non-gibberish.

bad -> false positives
good -> license tags

gibberish-test.csv
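
For readers unfamiliar with this kind of detector: as far as I understand, it is a character-level Markov chain, trained on known-good text and scoring a string by how likely its character transitions are. Below is a minimal sketch of that general technique, not the code of either project. The accepted alphabet, the add-one smoothing, and the names `train` and `score` are assumptions made for this example.

```python
import math

# Assumed alphabet: lowercase letters and space; other characters are dropped.
ACCEPTED = "abcdefghijklmnopqrstuvwxyz "


def normalize(text):
    """Lowercase the text and keep only characters from ACCEPTED."""
    return [ch for ch in text.lower() if ch in ACCEPTED]


def train(corpus_lines):
    """Return log transition probabilities between consecutive characters."""
    # Add-one smoothing so that unseen transitions keep a non-zero probability.
    counts = {a: {b: 1 for b in ACCEPTED} for a in ACCEPTED}
    for line in corpus_lines:
        chars = normalize(line)
        for a, b in zip(chars, chars[1:]):
            counts[a][b] += 1
    log_probs = {}
    for a, row in counts.items():
        total = sum(row.values())
        log_probs[a] = {b: math.log(n / total) for b, n in row.items()}
    return log_probs


def score(text, log_probs):
    """Geometric mean of transition probabilities; higher means less gibberish."""
    chars = normalize(text)
    pairs = list(zip(chars, chars[1:]))
    if not pairs:
        return 0.0
    total = sum(log_probs[a][b] for a, b in pairs)
    return math.exp(total / len(pairs))
```

A string is then called gibberish when its score falls below a threshold chosen on held-out examples.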

@pombredanne
Member

very nice! what's your take on applicability to license then? Did you apply some boosting to legalese words?

@AyanSinhaMahapatra
Member

I still think we need better performance before we integrate this. I was looking into the other implementation, which is a library. There are some additional steps there, so I'll try that with the same data. Also, one thing to note is that this only uses positive training, i.e. it only trains on good non-gibberish values, so if we could do some negative boosting for legalese gibberish, that could improve the performance.

I did extend the character set to include a-z, A-Z, digits and other characters.
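
One cheap way to bring negative examples in without retraining is to use them only to calibrate the decision threshold. This is a sketch under that assumption: `best_threshold` is a hypothetical helper that takes the non-gibberish scores of known-good strings (for example license tags) and of known-bad strings (for example the false positives) and picks the cut-off with the highest accuracy. Both lists are assumed to be non-empty.

```python
def best_threshold(good_scores, bad_scores):
    """Return (threshold, accuracy) for the rule `score >= threshold` means good.

    Every observed score is tried as a threshold; the first one reaching
    the highest accuracy is kept.
    """
    candidates = sorted(set(good_scores) | set(bad_scores))
    total = len(good_scores) + len(bad_scores)
    best = (None, -1.0)
    for threshold in candidates:
        correct = sum(s >= threshold for s in good_scores)
        correct += sum(s < threshold for s in bad_scores)
        accuracy = correct / total
        if accuracy > best[1]:
            best = (threshold, accuracy)
    return best
```

When the two score ranges overlap, the returned accuracy gives a quick measure of how ambiguous the test data is for the model.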

Do you think it's worth spending time on this?

@pombredanne
Member

> Do you think it's worth spending time on this?

Let's be mindful not to get too much into the weeds as this can be hairy and yield only small improvements.
It may be best to start using this on copyright before applying to license?

@AyanSinhaMahapatra changed the title from "Collection of erroneous copyright detection values" to "Improve gibberish license/copyright detection" on Mar 26, 2023