Improve gibberish license/copyright detection #2402
Comments
That's an interesting class of errors! Some remarks:
Note that the adoption of https://github.com/nexB/pygmars/ as a replacement for NLTK should make it easier to reuse and integrate other libraries in the lexing process, including NER and gibberish detection. In pygmars, a tokenization rule can now be an arbitrary callable that behaves like re.match, so it may enable using other tools as part of token recognition (a.k.a. lexing).
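To make the "callable that behaves like re.match" idea concrete, here is a minimal sketch. It assumes the callable receives a string and returns a match object or None; how pygmars actually registers such a callable is not shown here and should be checked against its documentation. The name `match_gibberish` and the "long word with no vowels" heuristic are hypothetical placeholders for a real gibberish detector.

```python
import re

_WORD = re.compile(r"[A-Za-z]+")
_VOWELS = set("aeiouyAEIOUY")


def match_gibberish(string):
    """Return a match for a leading gibberish-looking word, else None (like re.match)."""
    m = _WORD.match(string)
    if m is None:
        return None
    word = m.group()
    # Placeholder heuristic (an assumption, not from this thread):
    # a word of five or more letters without any vowel is treated as gibberish.
    if len(word) >= 5 and not (_VOWELS & set(word)):
        return m
    return None
```

A real detector would replace the heuristic inside the function while keeping the same return contract.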
Another candidate for gibberish detection that works quite well is https://github.com/domanchi/gibberish-detector
I ran this with bad-copyright-detections.txt:

    from gibberish_detector import detector

    # Load a trained model file
    Detector = detector.create_from_model('big.model')

    # Check each unique whitespace-separated token
    with open('bad-copyright-detections.txt') as f:
        data = sorted(set(f.read().split()))
    for d in data:
        print(repr(d), ',', Detector.is_gibberish(d))

I then loaded this in LibreOffice to do some evaluation. Here are the results:
This is pretty good and I would expect even better results from using a proper training set.
I was testing out the very basic gibberish detector at https://github.com/rrenaud/Gibberish-Detector, which https://github.com/domanchi/gibberish-detector (mentioned above) is based on, as it is much easier to integrate. I used our scancode license texts and rules (minus the test set) as the training data, and false positives and some of the license tags as test data. It is pretty good at detecting non-gibberish but not so good when the values are ambiguous. Attaching the results of the test here for reference; here the probability is that of the text being non-gibberish.
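For readers unfamiliar with this kind of detector, here is a minimal sketch of the general technique: a character-level Markov chain trained only on known-good text, scoring a string by its average transition probability. This is not the code of either project above; the function names, the small `ALPHABET`, and the add-one smoothing are assumptions made for illustration.

```python
import math

# Assumption: a small illustrative alphabet (lowercase letters and space).
ALPHABET = "abcdefghijklmnopqrstuvwxyz "


def normalize(text):
    """Lowercase the text and keep only characters in ALPHABET."""
    return [c for c in text.lower() if c in ALPHABET]


def train(lines):
    """Learn log-probabilities of character transitions from known-good lines."""
    # Start every count at 1 (add-one smoothing) so unseen pairs
    # keep a small, non-zero probability.
    counts = {a: {b: 1 for b in ALPHABET} for a in ALPHABET}
    for line in lines:
        chars = normalize(line)
        for a, b in zip(chars, chars[1:]):
            counts[a][b] += 1
    log_probs = {}
    for a, row in counts.items():
        total = sum(row.values())
        log_probs[a] = {b: math.log(n / total) for b, n in row.items()}
    return log_probs


def score(text, log_probs):
    """Geometric mean of transition probabilities; higher looks less like gibberish."""
    chars = normalize(text)
    pairs = list(zip(chars, chars[1:]))
    if not pairs:
        return 0.0
    total = sum(log_probs[a][b] for a, b in pairs)
    return math.exp(total / len(pairs))
```

Usage would be along the lines of `model = train(license_lines)` followed by `score(candidate, model)`, where `license_lines` is a hypothetical list of known-good license or copyright text. Because only good text is seen in training, short or unusual but legitimate values can still score low, which matches the ambiguous cases mentioned above.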
Very nice! What's your take on applicability to licenses then? Did you apply some boosting to legalese words?
I still think we need better performance before integrating. I was looking into the other implementation, which is a library. There are some additional steps there, so I'll try that with the same data. Another thing to note is that this only uses positive training, i.e. it only trains on good, non-gibberish values, so some negative boosting for legalese gibberish could improve the performance. I did extend the character set to include a-z, A-Z, numbers and other characters. Do you think it's worth spending time on this?
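One simple way to bring negative examples in without retraining is to use them only to calibrate the decision threshold. The sketch below assumes you already have non-gibberish scores for a set of known-good values and a set of known-bad values (higher meaning less gibberish-like); `pick_threshold` and `is_gibberish` are hypothetical helpers, not part of either library, and this is not what was done in the tests above.

```python
def pick_threshold(good_scores, bad_scores):
    """Return the midpoint between the lowest good score and the highest bad score."""
    lowest_good = min(good_scores)
    highest_bad = max(bad_scores)
    if lowest_good <= highest_bad:
        # The two sets overlap: these are the ambiguous values, and no
        # single cut-off separates them cleanly.
        raise ValueError("good and bad scores overlap")
    return (lowest_good + highest_bad) / 2


def is_gibberish(score, threshold):
    """Treat anything scoring below the threshold as gibberish."""
    return score < threshold
```

If the ValueError is raised on real data, that is a sign that the scoring model itself needs improving, not the threshold.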
Let's be mindful not to get too much into the weeds as this can be hairy and yield only small improvements.
Some samples for copyrights:
I have compiled a text file that contains erroneous copyright detection values. I have removed quote characters and separated each copyright value by several lines.
bad-copyright-detections.txt