-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
remove_common_tokens
poorly handles "odd" licenses
#42
Comments
Yikes -- that's no good at all. Good digging. I'm going to:
Going to see how far I can get on all of this today. :) |
Sounds good, thanks for the quick response, but no need to rush! :) |
Oh gosh. I just realized that remove_common_tokens isn't even in 0.3.0, which explains why my cherry pick didn't resolve cleanly. But it probably snuck into the @Jake-Shadle were you using the static build off of GitHub releases, or was it via |
I was using cargo install. |
cargo install from crates.io ( |
Oh sorry, cargo install from crates.io, that's how I narrowed down the cause, by looking at the commits that had happened after the 0.3.0 version bump, as initially I had assumed some of the changes on my fork had been the cause, but when going back to an unmodified HEAD, it exhibited the same behavior. |
Ok, I think what might have happened is you did
So there is definitely a bug in If I've missed something, please let me know and try to get a reproducible test case and I'll dig into this more for that version. |
This is particularly diffucult, as the example that this LCS removal enables is one of Adobe's "%Copyright%" prefixes. Guess what else has Copyright at the start of a line? Yep, actual copyright statements. Think that this might need more of a refactoring, but I need to think on it for a bit.
Track the counts of common prefixes and store them. Take the most common and re-count against other found prefixes. If that count is at least 80% of the text, consider it to be a junk line prefix and remove it. There are probably better ways to do this, but this fits the bill for places where licenses tend to appear. Fixes #42
I am using askalono as a library and was trying to figure out why I was getting extremely low confidence scores vs the 0.3.0 CLI that I installed, which was correctly identifying one of the problematic license files, eg. https://github.com/rust-random/rand/blob/master/rand_core/LICENSE-MIT
Here's an example run where I just print the result of each preprocess step,
remove_common_tokens
runs first, and basically truncates all of relevant license text which results in the analysis being unable to do much of anything.This license is a bit odd with the 2 copyright headers at the beginning, and indeed, removing one of them won't trigger the truncation any longer.
The text was updated successfully, but these errors were encountered: