Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor scancode license texts #4523

Merged
merged 11 commits into from
Nov 12, 2021
Merged

Conversation

mnonnenmacher
Copy link
Member

Read the ScanCode license texts from the ScanCode installation instead of importing them to spdx-utils. Please see the commit messages for details.

@mnonnenmacher
Copy link
Member Author

@fviernau @sschuberth This is a proposal for how we could decouple the resolving of ScanCode license texts from the imported files, as a preparation for making the ScanCode version configurable. Please have a look and provide feedback, no need for code review until we agree on the approach. As an alternative we could also keep the imported texts and read license texts from the ScanCode installation only as a fallback, in case the installed version differs from the version for which we imported the files.

@sschuberth
Copy link
Member

Please have a look and provide feedback, no need for code review until we agree on the approach.

I agree with the approach.

As an alternative we could also keep the imported texts and read license texts from the ScanCode installation only as a fallback, in case the installed version differs from the version for which we imported the files.

I like the fact that we can get rid of the license resources this way. After all, ScanCode is just one of the scanners that can be used, so I always felt a bit uncomfortable giving its license texts a special role in ORT.

@fviernau
Copy link
Member

fviernau commented Oct 7, 2021

I also agree with the approach. I have some use cases and questions I'd like to discuss though:

  1. Shall we double check with ScanCode owners whether there is plans to keep the file structure of the resources installed with
    the relase
  2. Do we have a mechanism for bootstrapping ScanCode still and would that be affected? Or do we just
    require ScanCode to be installed?
  3. IIUC correctly the implementation derives the license resource dir from the path where the binary is installed.
    For e.g. running a different ScanCode version e.g. on staging vs. production using the same docker image should
    we have a way to inject the ScanCode path e.g. via configuration or environment variable?
  4. Do we need a way to inject the path from withing tests?

@sschuberth
Copy link
Member

Shall we double check with ScanCode owners whether there is plans to keep the file structure of the resources installed with
the relase

I don't believe that's necessary. I'd rather make the logic more generic, e.g. by recursively searching for a fixed example license file below the installation directory, and then using that directory from then on to look up license files.

Do we have a mechanism for bootstrapping ScanCode still and would that be affected? Or do we just
require ScanCode to be installed?

We still have that mechanism in place, but we should indeed discuss whether we keep the bootstrapping logic in general for command line tools, scanners and other tools.

should we have a way to inject the ScanCode path e.g. via configuration or environment variable?

That should be as simple as putting the desired ScanCode version in PATH / make it the first one found via PATH.

Do we need a way to inject the path from withing tests?

I don't think so, as also currently we do not have a mechanism to change the path for license files in tests. But also see my previous answer, we could tamper with PATH in tests after all.

val scanCodeDir = getPathFromEnvironment("scancode")?.parentFile?.absoluteFile

return scanCodeDir?.resolve("src/licensedcode/data/licenses")?.takeIf { it.isDirectory }
?: scanCodeDir?.resolve("../lib/site-packages/licensedcode/data/licenses")?.takeIf { it.isDirectory }
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pombredanne I'm looking for a simple heuristic to locate the ScanCode license directory based on the path of the scancode executable. The two paths above seem to work for the PIP install and the binary distribution of ScanCode. Are there any other paths you would suggest adding here? It's good enough it works for the standard ways to install ScanCode.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's sounds fine. We could also publish a tarball of the licenses alone if this helps? There is also https://scancode-licensedb.aboutcode.org/

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the link, I think we could use this as a fallback in case the license directory can not be found.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, and it will soon be also available at https://licensedb.org which has kindly been transferred and gifted over by the previous owner!

@mnonnenmacher
Copy link
Member Author

Shall we double check with ScanCode owners whether there is plans to keep the file structure of the resources installed with
the relase

I don't believe that's necessary. I'd rather make the logic more generic, e.g. by recursively searching for a fixed example license file below the installation directory, and then using that directory from then on to look up license files.

The thing is that the directory is not always below the path of ScanCode, and in most cases a simple heuristic might be good enough. We could add a way to configure the path to the licenses if required.

Do we have a mechanism for bootstrapping ScanCode still and would that be affected? Or do we just
require ScanCode to be installed?

We still have that mechanism in place, but we should indeed discuss whether we keep the bootstrapping logic in general for command line tools, scanners and other tools.

But in general the bootstrapping topic is separate to this change.

should we have a way to inject the ScanCode path e.g. via configuration or environment variable?

That should be as simple as putting the desired ScanCode version in PATH / make it the first one found via PATH.

I don't think we need this for the ScanCode path, but maybe for the path to the licenses.

@sschuberth
Copy link
Member

But in general the bootstrapping topic is separate to this change.

Agreed. I was meaning to say we need to discuss this, but outside the scope of this PR.

I don't think we need this for the ScanCode path, but maybe for the path to the licenses.

Well, with your current code the license path is calculated relative to the scancode executable's path, so making ORT find another path for the scancode executable should do the trick, or?

@mnonnenmacher mnonnenmacher force-pushed the refactor-scancode-license-texts branch 2 times, most recently from 6a8f3f1 to bea7bd3 Compare October 11, 2021 12:02
@mnonnenmacher
Copy link
Member Author

I have updated the PR to continue using the ScanCode version of the SPDX license texts. I still have to fix some tests before removing the draft status.

Copy link
Member

@sschuberth sschuberth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I stopped the review for now at spdx-utils: Remove the task to generate license ref text resources.

utils/spdx/src/main/kotlin/SpdxUtils.kt Outdated Show resolved Hide resolved
utils/spdx/src/main/kotlin/SpdxUtils.kt Outdated Show resolved Hide resolved
utils/spdx/src/main/kotlin/SpdxUtils.kt Outdated Show resolved Hide resolved
cli/src/main/kotlin/commands/ReporterCommand.kt Outdated Show resolved Hide resolved
utils/spdx/src/main/kotlin/SpdxUtils.kt Outdated Show resolved Hide resolved
utils/spdx/src/main/kotlin/SpdxUtils.kt Outdated Show resolved Hide resolved
@@ -21,6 +21,7 @@ variables:
GO_DEP_VERSION: 0.5.4
PYTHON_PIPENV_VERSION: 2018.11.26
RUST_VERSION: 1.35.0
SCANCODE_VERSION: 3.2.1rc2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be squashed to the previous commit, so really each commit in this PR would pass tests on its own?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, is this change really necessary as ScanCode still would get bootstrapped, or?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be squashed to the previous commit, so really each commit in this PR would pass tests on its own?

I have instead merged it with a new commit that removes the committed ScanCode license files which I forgot to do previously. FYI, I have also rewritten the commit that introduces the function to locate the ScanCode license dir to automatically do this in the license text functions. Otherwise all callers of those functions would have had to do this. This was actually the reason why this was put on hold until the utils refactoring was done.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, is this change really necessary as ScanCode still would get bootstrapped, or?

Only by the ScanCodeScannerFunTest, but this would not help for the license text related tests which do not call the ScanCode binary at all.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like detecting the ScanCode dir does not work on CI, I'll have to look into this next week.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commit message nit: the "previous commit" reference does not seem right anymore.

@mnonnenmacher mnonnenmacher force-pushed the refactor-scancode-license-texts branch 2 times, most recently from 1535874 to d3146cf Compare November 5, 2021 15:15
Signed-off-by: Martin Nonnenmacher <martin.nonnenmacher@bosch.io>
Add support for multiple license text directories instead of only a
single custom directory, to prepare for an upcoming change.

Signed-off-by: Martin Nonnenmacher <martin.nonnenmacher@bosch.io>
Do not require that the license text filename matches the license id
exactly, but also allow files without the LicenseRef prefix.

This is a preparation for using the ScanCode license text directory
where the filenames do not have the LicenseRef prefix.

Signed-off-by: Martin Nonnenmacher <martin.nonnenmacher@bosch.io>
Try to locate the directory which contains the ScanCode license texts
and use it for resolving license texts. This is a preparation for
removing the ScanCode license texts from the spdx-utils resources.

The advantage of the new approach is that the license texts always match
the installed ScanCode version.

Signed-off-by: Martin Nonnenmacher <martin.nonnenmacher@bosch.io>
The license texts can instead be loaded from an installed ScanCode since
the previous commit.

This requires to install ScanCode for `LinuxTest` and `WindowsTest` on
Azure pipelines, as otherwise tests related to ScanCode licenses would
fail.

Signed-off-by: Martin Nonnenmacher <martin.nonnenmacher@bosch.io>
The task is not required anymore as of the previous commit. Note that
the ScanCode license texts are still downloaded because they are used
for generating the SPDX license text resources.

Signed-off-by: Martin Nonnenmacher <martin.nonnenmacher@bosch.io>
Add a Gradle property to configure which version of the SPDX license
list is imported. This allows to import a specific release instead of
always importing the latest master. The property must point to a valid
tag [1] in the license-list-data repository.

[1]: https://github.com/spdx/license-list-data/tags

Signed-off-by: Martin Nonnenmacher <martin.nonnenmacher@bosch.io>
…ation"

This reverts commit ded4f20.

Since the ScanCode license texts are now only used for the SPDX licenses
it is not required anymore to sync them with the ScanCode version.
Instead, it is preferrable to use the latest version of those license
texts to get the best quality.

Signed-off-by: Martin Nonnenmacher <martin.nonnenmacher@bosch.io>
Done by running "./gradlew :utils:spdx:generateSpdxEnums".

Signed-off-by: Martin Nonnenmacher <martin.nonnenmacher@bosch.io>
And upgrade the SPDX license text resources by running
"./gradlew :utils:spdx:generateSpdxEnums".

Signed-off-by: Martin Nonnenmacher <martin.nonnenmacher@bosch.io>
Rename the test to match the name of the file which contains the
functions under test.

This is a fixup for d564080 where `SpdxUtils.kt` was renamed to
`Utils.kt`.

Signed-off-by: Martin Nonnenmacher <martin.nonnenmacher@bosch.io>
Copy link
Member

@fviernau fviernau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approve as I tested that it works with our internal setup. I didn't review it in detail.

@mnonnenmacher mnonnenmacher merged commit 6b7c80f into master Nov 12, 2021
@mnonnenmacher mnonnenmacher deleted the refactor-scancode-license-texts branch November 12, 2021 07:51
@mnonnenmacher
Copy link
Member Author

Merged, ignoring the unrelated failing test.

fviernau added a commit that referenced this pull request Jan 25, 2022
ScanCode may return SPDX license containing underscores characters which
is not allwed, see [1]. This results in ORT's scanner crashing due to an
SpdxException when it tries to parse the SPDX license key.

This issue has been first occured in 2020 and fixed by [2]. It got
re-introduced recently by [3].

Fix the issue based on the idea of the orignal fix [2]. The touched
function becomes less efficient. Delierately don't refactor for
efficiency because this fix can be reverted as soon as [3] is fixed.

[1] aboutcode-org/scancode-toolkit#2813
[2] fb0370f
[3] #4523

Signed-off-by: Frank Viernau <frank.viernau@here.com>
fviernau added a commit that referenced this pull request Jan 25, 2022
ScanCode may return SPDX license keys containing underscores characters
which is not allwed, see [1]. This results in ORT's scanner crashing
due to an SpdxException when it tries to parse the SPDX license key.

This issue has been first occured in 2020 and fixed by [2]. It got
re-introduced recently by [3].

Fix the issue based on the idea of the orignal fix [2]. The touched
function becomes less efficient. Delierately don't refactor for
efficiency because this fix can be reverted as soon as [3] is fixed.

[1] aboutcode-org/scancode-toolkit#2813
[2] fb0370f
[3] #4523

Signed-off-by: Frank Viernau <frank.viernau@here.com>
fviernau added a commit that referenced this pull request Jan 25, 2022
ScanCode may return SPDX license keys containing underscores characters
which is not allwed, see [1]. This results in ORT's scanner crashing
due to an SpdxException when it tries to parse the SPDX license key.

This issue has been first occured in 2020 and fixed by [2]. It got
re-introduced recently by [3].

Fix the issue based on the idea of the orignal fix [2]. The touched
function becomes less efficient. Delierately don't refactor for
efficiency because this fix can be reverted as soon as [3] is fixed.

[1] aboutcode-org/scancode-toolkit#2813
[2] fb0370f
[3] #4523

Signed-off-by: Frank Viernau <frank.viernau@here.com>
fviernau added a commit that referenced this pull request Jan 25, 2022
ScanCode may return SPDX license keys containing underscores characters
which is not allwed, see [1]. This results in ORT's scanner crashing
due to an SpdxException when it tries to parse the SPDX license key.

This issue has first occured in 2020 and been fixed by [2]. It got
re-introduced recently by [3].

Fix the issue based on the idea of the orignal fix [2]. The touched
function becomes less efficient. Delierately don't refactor for
efficiency because this fix can be reverted as soon as [3] is fixed.

[1] aboutcode-org/scancode-toolkit#2813
[2] fb0370f
[3] #4523

Signed-off-by: Frank Viernau <frank.viernau@here.com>
fviernau added a commit that referenced this pull request Jan 25, 2022
ScanCode may return SPDX license keys containing underscores characters
which is not allwed, see [1]. This results in ORT's scanner crashing
due to an SpdxException when it tries to parse the SPDX license key.

This issue has first occured in 2020 and been fixed by [2]. It got
re-introduced recently by [3].

Fix the issue based on the idea of the orignal fix [2]. The touched
function becomes a bit less efficient which is not an issue for the current
callers / use cases. However, a consequtive look-up of many license
texts would probably more efficient if the files in the directories were
listed only once.

[1] aboutcode-org/scancode-toolkit#2813
[2] fb0370f
[3] #4523

Signed-off-by: Frank Viernau <frank.viernau@here.com>
fviernau added a commit that referenced this pull request Jan 25, 2022
ScanCode has one SPDX license key containing an underscore characters
which is not allwed, see [1]. This results in ORT's scanner crashing
due to an SpdxException when it tries to parse the SPDX license key.

This issue has first occured in 2020 and been fixed by [2]. It got
re-introduced recently by [3].

Deliberatly don't fix the general problem with underscores in
`getLicenseTextFile()` in favor of a license ID specific work around,
because this can be implemented efficiently without doing a refactoring
first.

[1] aboutcode-org/scancode-toolkit#2813
[2] fb0370f
[3] #4523

Signed-off-by: Frank Viernau <frank.viernau@here.com>
fviernau added a commit that referenced this pull request Jan 25, 2022
ScanCode has one SPDX license key containing an underscore characters
which is not allwed, see [1]. This results in ORT's scanner crashing
due to an SpdxException when it tries to parse the SPDX license key.

This issue has first occured in 2020 and been fixed by [2]. It got
re-introduced recently by [3].

Deliberatly don't fix the general problem with underscores in
`getLicenseTextFile()` in favor of a license ID specific work around,
because this can be implemented efficiently without doing a refactoring
first.

[1] aboutcode-org/scancode-toolkit#2813
[2] fb0370f
[3] #4523

Signed-off-by: Frank Viernau <frank.viernau@here.com>
fviernau added a commit that referenced this pull request Jan 25, 2022
ScanCode has one SPDX license key containing an underscore characters
which is not allwed, see [1]. This results in ORT's scanner crashing
due to an SpdxException when it tries to parse the SPDX license key.

This issue has first occured in 2020 and been fixed by [2]. It got
re-introduced recently by [3].

Deliberatly don't fix the general problem with underscores in
`getLicenseTextFile()` in favor of a license ID specific work around,
because this can be implemented efficiently without doing a refactoring
first.

[1] aboutcode-org/scancode-toolkit#2813
[2] fb0370f
[3] #4523

Signed-off-by: Frank Viernau <frank.viernau@here.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants