-
Notifications
You must be signed in to change notification settings - Fork 39
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Working version, needs improvement [WIP] #39
base: master
Are you sure you want to change the base?
Conversation
Signed-off-by: María Benavente <mariabenaventeg@gmail.com>
licensedb/internal/investigation.go
Outdated
candidates := [][]byte{} | ||
langs := []string{} | ||
for _, file := range files { | ||
lang, safe := enry.GetLanguageByExtension(file) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better to use GetLanguage
which is more accurate.
Hi @mariabg So cool that you've decided to help! Regarding (1), your approach makes much sense. We need to define the mapping between the languages and the functions which extract comments within the first 1024 bytes (supposing that we cover 99% of cases by this, the other 1% is too crazy to support). They can be regexps, can be something else. I guess regexps. Regarding (2), yeah this is not obvious. We need to take 50-100 examples and see how different heuristics work on them, then pick the best. I used some simple scans for e.g. "---" or "\n\n\n" to split licenses inside a file, some of those rules may work here too. The good news is that there is almost always the single license in a source file. (3) is better to create a separate function, say Finally, there should be the separate database for license headers extracted from SPDX github. |
Signed-off-by: María Benavente <mariabenaventeg@gmail.com>
Signed-off-by: María Benavente <mariabenaventeg@gmail.com>
Signed-off-by: María Benavente <mariabenaventeg@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @vmarkovtsev, I've just pushed the first fully working version of my code, and I left a couple of comments on it. Also, could you explain this "there should be the separate database for license headers extracted from SPDX github" a bit further? I don't understand where/how I should implement this 😅.
Let me know your thoughts on it, cheers!
licensedb/internal/db.go
Outdated
} | ||
append(db.QueryLicenseText(string(text))) | ||
if len(candidates) == 0 { | ||
// TO DO: split license-comments from description-comments. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As all of the files in the project are scanned for licenses I though about doing this split here for performance's sake, what do you think? Should it be done before to improve % on the output? If the license is clear, like on this project, is is found without problem.
Signed-off-by: María Benavente <mariabenaventeg@gmail.com>
Signed-off-by: María Benavente <mariabenaventeg@gmail.com>
Signed-off-by: María Benavente <mariabenaventeg@gmail.com>
Signed-off-by: María Benavente <mariabenaventeg@gmail.com>
Hi 👋I am currently working on this issue and I would like to discuss three questions I have in order to know which may be the best approach. Following the schema presented in the issue, I already coded the first step, along with the structure of the whole implementation. Please let me know if anything should be change from what I pushed so far.
Question 1: is there a simpler way than the following one to implement the parsing of the comments in the header? I thought about mapping comment syntaxes for languages (language as key, next to regex for different type of comments, both line and block). This seams to me a bit convoluted and you may have a more straight forward implementation in mind. Also, this selects every comment in the code, not just the header ones.
Question 2 is about the heuristics to split the license information from the descriptive comments. Should this be done through a fuzzy search of license keywords like “this file”, “contributor”, “you may”, “you may not” within every line? Were you thinking about more complex heuristics, and if so, which ones and where can I read about them?
Question 3: once these two separations are done, should I create a querying function for the comments or can globalLicenseDatabase().QueryLicenseText() be used for this purpose?