Working version, needs improvement [WIP] #39

mariabg · 2018-07-17T08:47:08Z

Hi 👋I am currently working on this issue and I would like to discuss three questions I have in order to know which may be the best approach. Following the schema presented in the issue, I already coded the first step, along with the structure of the whole implementation. Please let me know if anything should be change from what I pushed so far.
Question 1: is there a simpler way than the following one to implement the parsing of the comments in the header? I thought about mapping comment syntaxes for languages (language as key, next to regex for different type of comments, both line and block). This seams to me a bit convoluted and you may have a more straight forward implementation in mind. Also, this selects every comment in the code, not just the header ones.
Question 2 is about the heuristics to split the license information from the descriptive comments. Should this be done through a fuzzy search of license keywords like “this file”, “contributor”, “you may”, “you may not” within every line? Were you thinking about more complex heuristics, and if so, which ones and where can I read about them?
Question 3: once these two separations are done, should I create a querying function for the comments or can globalLicenseDatabase().QueryLicenseText() be used for this purpose?

Signed-off-by: María Benavente <mariabenaventeg@gmail.com>

vmarkovtsev · 2018-07-17T22:18:51Z

licensedb/internal/investigation.go

+	candidates := [][]byte{}
+	langs := []string{}
+	for _, file := range files {
+		lang, safe := enry.GetLanguageByExtension(file)


Better to use GetLanguage which is more accurate.

vmarkovtsev · 2018-07-17T22:31:38Z

Hi @mariabg

So cool that you've decided to help!

Regarding (1), your approach makes much sense. We need to define the mapping between the languages and the functions which extract comments within the first 1024 bytes (supposing that we cover 99% of cases by this, the other 1% is too crazy to support). They can be regexps, can be something else. I guess regexps.

Regarding (2), yeah this is not obvious. We need to take 50-100 examples and see how different heuristics work on them, then pick the best. I used some simple scans for e.g. "---" or "\n\n\n" to split licenses inside a file, some of those rules may work here too. The good news is that there is almost always the single license in a source file.

(3) is better to create a separate function, say QuerySourceFile(). We can further extend the command-line tool to optionally output the full report for all the files. Bonus points for generating an SPDX document from it.

Finally, there should be the separate database for license headers extracted from SPDX github.

Signed-off-by: María Benavente <mariabenaventeg@gmail.com>

mariabg

Hey @vmarkovtsev, I've just pushed the first fully working version of my code, and I left a couple of comments on it. Also, could you explain this "there should be the separate database for license headers extracted from SPDX github" a bit further? I don't understand where/how I should implement this 😅.

Let me know your thoughts on it, cheers!

mariabg · 2018-07-30T16:40:40Z

licensedb/internal/db.go

+	}
+	append(db.QueryLicenseText(string(text)))
+	if len(candidates) == 0 {
+		// TO DO: split license-comments from description-comments.


As all of the files in the project are scanned for licenses I though about doing this split here for performance's sake, what do you think? Should it be done before to improve % on the output? If the license is clear, like on this project, is is found without problem.

licensedb/internal/investigation.go

Signed-off-by: María Benavente <mariabenaventeg@gmail.com>

Find source code files

56f2257

Signed-off-by: María Benavente <mariabenaventeg@gmail.com>

vmarkovtsev requested changes Jul 17, 2018

View reviewed changes

mariabg added 4 commits July 18, 2018 10:31

use enry's GetLanguage() instead of GetLanguageByExtension()

9f98da6

Signed-off-by: María Benavente <mariabenaventeg@gmail.com>

Parse header comments

48cf528

Signed-off-by: María Benavente <mariabenaventeg@gmail.com>

Fix typo

901ab7e

Signed-off-by: María Benavente <mariabenaventeg@gmail.com>

Scan-individual-files implementation working (but needs improvement)

e691fdc

Signed-off-by: María Benavente <mariabenaventeg@gmail.com>

mariabg commented Jul 30, 2018

View reviewed changes

mariabg changed the title ~~First steps + some questions on this issue [WIP]~~ Working version, needs improvement [WIP] Jul 30, 2018

mariabg added 4 commits July 30, 2018 20:24

change var name for consistency

1c941d3

Signed-off-by: María Benavente <mariabenaventeg@gmail.com>

add investigateSourceFile() structure

8061b5a

Signed-off-by: María Benavente <mariabenaventeg@gmail.com>

add support future printing output the full report for all the files

7d2ccc0

Signed-off-by: María Benavente <mariabenaventeg@gmail.com>

minor changes + getting filenames where licenses were found

f8395c1

Signed-off-by: María Benavente <mariabenaventeg@gmail.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Working version, needs improvement [WIP] #39

Working version, needs improvement [WIP] #39

mariabg commented Jul 17, 2018

vmarkovtsev Jul 17, 2018

vmarkovtsev commented Jul 17, 2018 •

edited

Loading

mariabg left a comment •

edited

Loading

mariabg Jul 30, 2018 •

edited

Loading

Working version, needs improvement [WIP] #39

Are you sure you want to change the base?

Working version, needs improvement [WIP] #39

Conversation

mariabg commented Jul 17, 2018

vmarkovtsev Jul 17, 2018

Choose a reason for hiding this comment

vmarkovtsev commented Jul 17, 2018 • edited Loading

mariabg left a comment • edited Loading

Choose a reason for hiding this comment

mariabg Jul 30, 2018 • edited Loading

Choose a reason for hiding this comment

vmarkovtsev commented Jul 17, 2018 •

edited

Loading

mariabg left a comment •

edited

Loading

mariabg Jul 30, 2018 •

edited

Loading