Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Changing a single word in a license shows as exact match when it should show a difference #140

Closed
Tracked by #309
goneall opened this issue Sep 29, 2019 · 14 comments
Labels

Comments

@goneall
Copy link
Member

goneall commented Sep 29, 2019

To reproduce, create a new license request and copy/paste the text for the Apple MIT License. Then change the copyright from Copyright: Copyright (c) 2006 by Apple Computer, Inc., All Rights Reserved. to Copyright: Me, All Rights Reserved.. It should show a close match and give you the option of submitting an issue or a new license request.

@goneall
Copy link
Member Author

goneall commented Sep 29, 2019

I did some testing and the spdx_license_matcher/computation.py score calculated on line 31 is 1.0 even though there are some changes.

One possible solution is to run the SPDX tools java compare even on the "perfect matches" to make sure it really matches. This could be simplified by just treating all matches as close matches and checking all licenses.

Another approach would be to improve the accuracy so that a single word change causes the score to be less than 1.0 and changing the limit to 1.0.

@Ugtan Let me know what you think.

@Ugtan
Copy link
Collaborator

Ugtan commented Sep 30, 2019

@goneall I haven't got a chance to look at it but I will definitely do after the end of this week(I am having exams this week).

But as far as I could remember the copyright text is not considered as substantial based on the SPDX guidelines that might be the reason as the copyright is ignored and hence the changes are ignored as well. But I do think there is a possibility of improving the process of matching and I will try to look at the best possible solution in some time soon. Thanks

@goneall
Copy link
Member Author

goneall commented Sep 30, 2019

But as far as I could remember the copyright text is not considered as substantial based on the SPDX guidelines

True - the problem can also be duplicated by changing one word in the body of the license text outside the copyright. The code will come back with a 1.0 score and a match even though there is a difference in the license text.

Since we are dealing with licenses where a single word can make a very big difference (e.g. inserting the word "not"), we need to make sure they don't show up as matching.

BTW - the matching algorithm depends on the license XML marking copyright text as alt or optional. There are quite a few licenses like the Apple MIT license which do not have proper markup. The scenerio I was testing was where it should match (as in a copyright) but didn't because the alt tag was not used for the copyright field. Since alt was not used, the license should not match. Other licenses, like the Apache 2.0 license, have the proper alt tags and will match.

@Ugtan
Copy link
Collaborator

Ugtan commented Oct 11, 2019

You are right. The matching algorithm definite needs some necessary tweaks to be able give accurate results.

BTW - the matching algorithm depends on the license XML marking copyright text as alt or optional. There are quite a few licenses like the Apple MIT license which do not have proper markup. The scenerio I was testing was where it should match (as in a copyright) but didn't because the alt tag was not used for the copyright field. Since alt was not used, the license should not match. Other licenses, like the Apache 2.0 license, have the proper alt tags and will match.

This might require a bit more involvement but I would definitely take the problem into consideration as the alt and optional texts has an overall impact on the performance of the matching algorithm. If you find any other issue, then I will request you to please report it here itself.

I will try my best to work on the issue(it will require a bit more involvement) but the problem I'm currently facing is the lack of time. So, it might take a while. @goneall

@goneall
Copy link
Member Author

goneall commented Oct 11, 2019

This might require a bit more involvement but I would definitely take the problem into consideration as the alt and optional texts has an overall impact on the performance of the matching algorithm. If you find any other issue, then I will request you to please report it here itself.

The Java license matching already takes into account the alt/optional tags. You are correct in that it is a lot more involved - it took me a few tries to get the Java version right.

If you can fix the code so that it would return a score of less than 1.0 when even one word changes, it should work. This is because the Java matching code would be called if it has a high enough score but is less than 1.0. The Java code will only match if it complies with the alt/optional markup.

With the current implementation returning a score of 1.0 when there is a single word different, it matches and the Java code doesn't get invoked.

@Ugtan
Copy link
Collaborator

Ugtan commented Oct 12, 2019

If you can fix the code so that it would return a score of less than 1.0 when even one word changes, it should work. This is because the Java matching code would be called if it has a high enough score but is less than 1.0. The Java code will only match if it complies with the alt/optional markup.

Oh.. Makes sense. I will try to improve the matching algorithm and I will ping you If I need help with something. Thanks for the clarification. :)

@goneall
Copy link
Member Author

goneall commented Jan 8, 2020

@Ugtan Ping - do you have some bandwidth to work on this issue? I want to get this issue resolved before deploying.

@Ugtan
Copy link
Collaborator

Ugtan commented Jan 8, 2020

Hello @goneall, I do want to work on the issue. Tomorrow is my last exam of the semester and I think I will get enough time to work on the issue from the day after tomorrow.

@goneall
Copy link
Member Author

goneall commented Feb 8, 2020

@Ugtan Pinging you on status of the issue - let me know how it is going.

@rtgdk
Copy link
Collaborator

rtgdk commented Feb 18, 2020

@Ugtan Are you working on this one? Or should we open it for new contributors?

@Ugtan
Copy link
Collaborator

Ugtan commented Feb 18, 2020

Hello @rtgdk. I'm indeed working on the issue. Sorry, it took me a while to write it all over again. I will try to push the changes by tomorrow.

@goneall
Copy link
Member Author

goneall commented Mar 19, 2020

@Ugtan Pinging you again on the issue - we could open this for others to work on if you don't have the bandwidth

@jlovejoy
Copy link
Member

@rtgdk - I take it we should leave this one open?

@goneall
Copy link
Member Author

goneall commented Jun 24, 2023

@rtgdk @jlovejoy
Actually - this was fixed a while ago. Closing the issue.

@goneall goneall closed this as completed Jun 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants