Track lines when parsing copyright files #13

pombredanne · 2021-04-19T12:23:45Z

This would be helpful when reviewing results.

pombredanne · 2021-09-06T09:31:11Z

I have started to work on this, as the lack of proper start and end lines in Debian copyright license detection makes reviewing test failures rather hard.

I searched for a library that tracks line numbers and could not find any. And none of the existing ones seems really suitable to patch to add line numbers.

So my overall approach is going to be:

add a new module that can parse RFC822/deb822 formats and track lines for each element at the low level (essentially replacing our use of the standard email module)
create a new debcon2.py module modeled after debcon.py that will track line numbers at the paragraphs and field levels
create a new copyright2.py module modeled after copyright.py that will track line numbers at the paragraphs and field levels and is backed by debcon2.py
add tests
try copyright2.py in ScanCode for #2643
see if debcon2 and copyright2 can be simplified, and made to skip license tracking when it is not needed, so that they can replace the older debcon and copyright modules. Otherwise, keep the older debcon and copyright modules around with deprecation warnings and drop them in the future.

This adds support for tracking line numbers when processing copyright files. The approach is to support this only for copyright files and to keep a mapping of start/end line numbers by field name at the paragraph level. This way existing fields do not need modifications, and the only core code update is where paragraphs are created, which is limited to a single place. There is also a new deb822 module that replaces the email parser to parse copyright files and keep track of line numbers. Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
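As a rough illustration of the approach described in this commit message, here is a minimal sketch of a deb822-style parser that keeps a mapping of start/end line numbers by field name at the paragraph level. It is not the code from the new deb822 module: the `Paragraph` class, the `line_numbers_by_field` attribute and the `parse_paragraphs()` function are hypothetical names used only for this example, and the parsing is simplified.

```python
from dataclasses import dataclass, field


@dataclass
class Paragraph:
    # Field values keyed by field name (hypothetical structure).
    fields: dict = field(default_factory=dict)
    # Side mapping of field name -> (start_line, end_line), 1-based.
    # Field values themselves stay plain strings and need no changes.
    line_numbers_by_field: dict = field(default_factory=dict)


def parse_paragraphs(text, start_line=1):
    """Return a list of Paragraph objects parsed from deb822-style text."""
    paragraphs = []
    current = None  # paragraph being built, or None between paragraphs
    name = None     # name of the last field seen, for continuation lines
    for line_number, line in enumerate(text.splitlines(), start_line):
        if not line.strip():
            # A blank line ends the current paragraph.
            current = None
            name = None
            continue
        if line.startswith("#"):
            # Comment line: skipped in this sketch.
            continue
        if line[0] in " \t":
            # Continuation line: extend the previous field and its end line.
            if current is not None and name is not None:
                current.fields[name] += "\n" + line.strip()
                start, _ = current.line_numbers_by_field[name]
                current.line_numbers_by_field[name] = (start, line_number)
            continue
        field_name, sep, value = line.partition(":")
        if not sep:
            # Not a "Name: value" line: ignored in this sketch.
            continue
        if current is None:
            # Paragraph creation is the single place where tracking starts.
            current = Paragraph()
            paragraphs.append(current)
        name = field_name.strip()
        current.fields[name] = value.strip()
        current.line_numbers_by_field[name] = (line_number, line_number)
    return paragraphs


sample = (
    "Format: https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/\n"
    "Upstream-Name: foo\n"
    "\n"
    "Files: *\n"
    "Copyright: 2021 A\n"
    " 2020 B\n"
    "License: MIT\n"
)
paragraphs = parse_paragraphs(sample)
```

In this sketch the `Copyright` field of the second paragraph spans lines 5 to 6 of the sample, and the line numbers live next to the fields rather than inside them.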

Track line numbers in copyright files #13 Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

This allows passing a string that we know starts at some line offset. This is useful for Debian copyright parsing, for instance. See: #2643 See: aboutcode-org/debian-inspector#13 Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Add a new licensing.get_license_matches_from_query_string() function that accepts a text and a start line (defaulting to 1). This is used to build a Query which is then passed to the new Index.match_query() method. See: #2643 See: aboutcode-org/debian-inspector#13 Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
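To make the start line idea concrete, here is a small sketch of why an offset matters when matching text that was extracted from the middle of a file. It does not use the ScanCode API: `find_lines_with()` is a hypothetical helper, and the only thing it shares with the function described above is a `start_line` argument that defaults to 1.

```python
def find_lines_with(text, needle, start_line=1):
    """
    Return the line numbers of the lines of `text` that contain `needle`.
    Numbering begins at `start_line`, so that a text extracted from a larger
    file can be reported with line numbers relative to that file.
    """
    return [
        line_number
        for line_number, line in enumerate(text.splitlines(), start_line)
        if needle in line
    ]


# A License field value that starts at line 12 of a copyright file
# (the line number 12 is an assumption made up for this example).
license_text = "MIT\nPermission is hereby granted, free of charge"

relative = find_lines_with(license_text, "Permission")
absolute = find_lines_with(license_text, "Permission", start_line=12)
```

Without the offset the match is reported on line 2 of the extracted string; with `start_line=12` it is reported on line 13, which is where it sits in the original file.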

pombredanne · 2021-09-21T22:59:47Z

This has been implemented and released.

pombredanne mentioned this issue Sep 6, 2021

Debian copyright license matches lines are not correct aboutcode-org/scancode-toolkit#2643

Open

pombredanne mentioned this issue Sep 7, 2021

Track line numbers in copyright files #13 #22

Merged

pombredanne added a commit that referenced this issue Sep 14, 2021

Merge pull request #22 from nexB/13-track-line-numbers

5825272

Track line numbers in copyright files #13 Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

pombredanne closed this as completed Sep 21, 2021
