Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle compound license expressions #1554

Closed
wants to merge 3 commits into from

Conversation

deitch
Copy link
Contributor

@deitch deitch commented Feb 8, 2023

As discussed with @kzantow here, syft currently does not handle compound licenses, or anything joined with OR (as opposed to AND).

This changes that.

  • creates a LogicalStrings struct that has a joiner, simple elements and compound child elements
  • adds tests for various parsing of strings into LogicalStrings and back to string
  • replaces the Licenses []string property in various places with a more complex compound expression struct
  • changes the various elements that populate it to use those
  • updates the tests to use those

Some key points and questions:

  1. I out this all in internal, but I am not convinced it shouldn't be in a different package; open to suggestions
  2. I left pretty much everything else that was a simple AND as it was, just part of the LogicalStrings struct. I think that is the right first step; it can be updated later.
  3. I did add parsing for the apk packages because (selfishly), that was what motivated me to put in the effort.

Ball back in the hands of the maintainers.

@spiffcs
Copy link
Contributor

spiffcs commented Feb 8, 2023

👋 Thanks for the PR @deitch

We've got some work in progress that also is touching the license side of syft and I'd like to incorporate this PR since it does a good portion of the compound logic.

The work in progress right now moves syft to consider license as a struct rather than string:

type License struct {
	Name           string
	SPDXExpression string
	Location       string
	Type           LicenseType
	Confidence     uint
}

Here is a sample of the data structure we're trying to move licenses to:

{
   "spdx-expression": "GPL-2.0-only", // optional
   "name": "output of classifer" // required
   "location": {
	   "path": "/lib/apk/SOMELICENSE",
	   "layerID": "sha256:ded7a220bb058e28ee3254fbba04ca90b679070424424761a53a043b93b612bf"
   },
   "type": "concluded",
   "confidence": 0.9
},
{
   "spdxLicensEexpression": "GPL-2.0-only",
   "name": "gpl-2", // required
   "location": {
	   "path": "/lib/apk/db/installed",
	   "layerID": "sha256:ded7a220bb058e28ee3254fbba04ca90b679070424424761a53a043b93b612bf"
   },
   "type": "declared"
},
{
   "spdxLicensExpression": "MIT AND (LGPL-2.1-or-later OR BSD-3-Clause)",
   "name": "gpl-2",
   "location": {
	   "path": "/lib/apk/db/installed",
	   "layerID": "sha256:ded7a220bb058e28ee3254fbba04ca90b679070424424761a53a043b93b612bf"
   },
   "type": "declared"
},
{
   "spdxLicensExpression": "MIT AND (LGPL-2.1-or-later OR BSD-3-Clause)",
   "name": MIT",
   "location": {
	   "path": "/lib/apk/db/installed",
	   "layerID": "sha256:ded7a220bb058e28ee3254fbba04ca90b679070424424761a53a043b93b612bf"
   },
   "type": "declared"
},
{
   "spdxLicensExpression": "MIT AND (LGPL-2.1-or-later OR BSD-3-Clause)",
   "name":BSD-3-Clause",
   "location": {
	   "path": "/lib/apk/db/installed",
	   "layerID": "sha256:ded7a220bb058e28ee3254fbba04ca90b679070424424761a53a043b93b612bf"
   },
   "type": "declared"
}

The confidence score comes from using this license string classifier library to match the contents of an unknown text (contents of the license file) to a set of known texts (license db).

Curious on your thoughts on how we fit compound expression handling into this kind of update - I thought it would be easier to move syft towards a more complex license descriptor before tackling the compound expressions, but doing both at the same time seems achievable given the work you put into this PR!

@deitch
Copy link
Contributor Author

deitch commented Feb 8, 2023

I think both are achievable, although I would split it into two stages, and the order doesn't matter (except for my bias towards getting this one in 😁 ).

The LogicalStrings looks like this:

type LogicalStrings struct {
	Compound []LogicalStrings
	Simple   []string
	Joiner
}

We could convert that into a LogicalLicenses that looks like:

type LogicalLicenses struct {
	Compound [] LogicalLicenses
	Simple   []License
	Joiner
}

where each License is the struct you reference above.

All of this assumes that you have a License for each element, e.g. MIT AND BSD AND GPL-2.0-only would be converted into a []License of size 3.

I would think that we would start with this one (besides my obvious bias), and then we can change the underlying structs when ready, but the flow and logic already will be in place. We still will just be passing around things.

I will get the 3 CI failures fixed, although I don't get what the static analysis is complaining about. Some help would be appreciated.

@spiffcs
Copy link
Contributor

spiffcs commented Feb 8, 2023

All of this assumes that you have a License for each element, 
e.g. MIT AND BSD AND GPL-2.0-only would be converted into a []License of size 3.

^ This is correct!

Let me look at the static analysis.

In the mean time - how much overlap are we seeing with libraries like:
https://github.com/github/go-spdx/blob/main/spdxexp

Is this PR meant to target SPDX expressions specifically, or are we trying to parse compound statements from other ecosystems that don't adheer to that license composite specification?

@spiffcs
Copy link
Contributor

spiffcs commented Feb 8, 2023

The static analysis had to do with a schema bump.

Ideally we don't do multiple changes of this license field back to back (another change to the proposed struct), but I've updated it for now so the checks can go green.

License will eventually be

string ---> X

Where x is what we land on as far as how to handle the complex expressions.

There is possibly another path where we keep license as []string and see what kind of exposed methods exist in the go-spdx library from github if we find there is enough overlap that it services what we want to accomplish with this PR

@deitch
Copy link
Contributor Author

deitch commented Feb 8, 2023

In the mean time - how much overlap are we seeing with libraries like:
https://github.com/github/go-spdx/blob/main/spdxexp

As far as I know, that validates against the list. I was much more looking for parsing all licenses it finds, handling approved ones ("MIT" for spdx; ID: "MIT" for cyclonedx, etc.) and non ("LicenseRef-BSD" for spdx; Name: "BSD" for cyclonedx, etc.).

Is this PR meant to target SPDX expressions specifically, or are we trying to parse compound statements from other ecosystems that don't adhere to that license composite specification?

I just expanded what already was there, but it pretty much was a subset of spdx anyways. The LogicalStrings.Process() was created generically so anything would work.

@deitch
Copy link
Contributor Author

deitch commented Feb 8, 2023

Where x is what we land on as far as how to handle the complex expressions.

Ah, that explains it. Thanks for handling that. I will resolve the other CI issues.

see what kind of exposed methods exist in the go-spdx library from github if we find there is enough overlap that it services what we want to accomplish with this PR

I looked at it, I didn't see any tools there to parse licenses into some of those complex structures. 🤷‍♂️

@kzantow
Copy link
Contributor

kzantow commented Feb 8, 2023

@deitch you're right -- I missed the fact that basically none of the license parsing in the GitHub library is exported, only a couple functions like Satisfies. An alternate library is: https://github.com/kyoh86/go-spdx but it doesn't look especially well maintained.

@kzantow
Copy link
Contributor

kzantow commented Feb 8, 2023

If we're going to roll our own parser, we should probably at least handle:

  • AND expressions
  • OR expressions
  • WITH cases
  • ( parenthesis )

@spiffcs
Copy link
Contributor

spiffcs commented Feb 8, 2023

I looked at it, I didn't see any tools there to parse licenses into some of those complex structures. 🤷‍♂️

I think I see a test case where they are handling the OR expression here:
https://github.com/github/go-spdx/blob/9f1f925e5ba1323fe379f6c1a0afd0582a32607d/spdxexp/parse_test.go#L48-L72

And then here is one testing against the paren case ():
https://github.com/github/go-spdx/blob/9f1f925e5ba1323fe379f6c1a0afd0582a32607d/spdxexp/parse_test.go#L147-L192

Something like:

(MIT AND Apache-1.0+)   OR   DocumentRef-spdx-tool-1.2:LicenseRef-MIT-Style-2 OR (GPL-2.0 WITH Bison-exception-2.2)

Becomes
https://github.com/github/go-spdx/blob/9f1f925e5ba1323fe379f6c1a0afd0582a32607d/spdxexp/parse_test.go#L845-L908

Which complex structure are not covered - the node structure they provide seems pretty comprehensive at first glance, but you have a lot more context loaded for this one so I defer to what you found!

@deitch
Copy link
Contributor Author

deitch commented Feb 8, 2023

If we're going to roll our own parser, we should probably at least handle:

which we do, see here. It has potential robustness improvements, but it does all of that.

@kzantow
Copy link
Contributor

kzantow commented Feb 8, 2023

which we do, see here
Ah - great! I missed the paren handling 👍

@deitch
Copy link
Contributor Author

deitch commented Feb 8, 2023

Which complex structure are not covered - the node structure they provide seems pretty comprehensive at first glance, but you have a lot more context loaded for this one so I defer to what you found!

How interesting, they did I as a binary tree, rather than lists of Simple and Compound. I can see pros and cons both ways.

If their thing was published, I would use it. But it wasn't, so I built it from scratch. Mostly. I had a need for something like this years ago in my JavaScript days, so the logic was somewhere back in my head.

I think we miss the WITH case, but I didn't see us hitting that right away, and we always can expand upon it.

@deitch
Copy link
Contributor Author

deitch commented Feb 8, 2023

Do you understand why some of the encoding tests are failing? The outputs - json and xml etc. - are nearly identical, but the package IDs are different, e.g.:

<             <component bom-ref="pkg:deb/debian/package-2@2.0.1?package-id=342b18b33e30ed0" type="library">
---
>             <component bom-ref="pkg:deb/debian/package-2@2.0.1?package-id=db4abfe497c180d3" type="library">

I see a few such cases that repeat a lot.

@deitch
Copy link
Contributor Author

deitch commented Feb 8, 2023

It appears to have something to do with these .golden files?

@kzantow
Copy link
Contributor

kzantow commented Feb 8, 2023

@deitch I'm seeing a number of issues unrelated to the .golden files, but these are snapshots that compare certain test executions with a known good version, they typically have an update flag if they need updating. You can just temporarily update the false to true in the code to get them updated.

@deitch deitch force-pushed the proper-license-expression branch from 109023a to 892c067 Compare February 8, 2023 19:32
@deitch
Copy link
Contributor Author

deitch commented Feb 8, 2023

I'm seeing a number of issues unrelated to the .golden file

Yup, working on those.

they typically have an update flag if they need updating. You can just temporarily update the false to true in the code to get them updated

So just go test ./... -- -update-spdx-json true ? OK, running that and seeing. I am sure lots still will fail, but if it gets past those, it will be good.

@kzantow
Copy link
Contributor

kzantow commented Feb 8, 2023

@deitch I typically run individual tests in IntelliJ after updating the code to have a true value for the flag. I think you're command is right but can't say for sure 😬

@deitch
Copy link
Contributor Author

deitch commented Feb 8, 2023

It didn't solve them, and the diffs in GHA output are a little confusing. 😁

I guess I can change it manually? Where would I do so?

There also are template version issues here 6.2.0 vs 7.0.0

@kzantow
Copy link
Contributor

kzantow commented Feb 8, 2023

@deitch I would suggest hitting pause on this PR for just a bit -- we're having some internal discussions about how we would like to see changes in the SBOM structure and other things and I don't want you to spend a bunch of time working on something that might need to change

@deitch deitch force-pushed the proper-license-expression branch from 892c067 to 1f3e5c0 Compare February 8, 2023 20:25
@deitch
Copy link
Contributor Author

deitch commented Feb 8, 2023

OK, think I got those. Probably one or two small things left.

@deitch
Copy link
Contributor Author

deitch commented Feb 8, 2023

I would suggest hitting pause on this PR for just a bit -- we're having some internal discussions about how we would like to see changes in the SBOM structure and other things and I don't want you to spend a bunch of time working on something that might need to change

@kzantow just saw this. Ah, well, and I was pretty proud of what we have done.

Part of the reason I am trying so hard to get it in is that I had a chicken and egg. syft missed non-standard licenses, so I got in #1540 , but now it struggled with compound licenses. To be fair, those were broken before too, but syft merrily treated them as unintelligible and returned them as NOASSERTION; it was the "let's handle non-standard sanely" that surfaced the issue this resolves.

Either way, syft now either ignores any non-standard licenses (pre-1540), or breaks on some of them (post-1540). I wanted to try and get it to handle them sanely.

Let me know what I can do to help move it along.

FWIW, the errors are down to a small number just the CLI tests have errors.

@deitch deitch force-pushed the proper-license-expression branch 3 times, most recently from f969d1d to 01ba9a6 Compare February 8, 2023 21:07
@kzantow
Copy link
Contributor

kzantow commented Feb 8, 2023

@deitch you've done some really good work! we just need to have a bit of a conversation how all the license changes work together before jumping in. Thank you very much for your contributions!

@seabass-labrax
Copy link

As a member of the SPDX Steering Committee, I would like to second @kzantow and thank you very much for your contribution, @deitch! Syft is already a very important part of the SPDX ecosystem, and I'm thrilled to see this development. 😃

@deitch
Copy link
Contributor Author

deitch commented Feb 9, 2023

You are most welcome. I will admit some self-interest, in that I am working with client projects that have a need, but the best OSS has different people coming together to pull together.

I will keep a close eye on this.

@deitch
Copy link
Contributor Author

deitch commented Feb 9, 2023

If you are at all interested in the last failing unit test, it is the following. We have some places in the tests where we have an implied AND for multiple licenses - e.g. "MIT BSD GPL" implies "MIT AND BSD AND GPL" - and others where we expect to use an explicit separator, and anything else is a straight string - e.g. "MIT License OR BSD License".

It is, quite literally, impossible to reconcile both of those. Either whitespace is a separator and, if there is no joiner, it is an implied AND, and thus "BSD Licenses" is illegitimate; or only explicit joiners are separators and thus "BSD MIT GPPL2+" is a single license string.

@deitch deitch force-pushed the proper-license-expression branch 4 times, most recently from 3ad35af to 8eb9f95 Compare February 9, 2023 14:52
@kzantow
Copy link
Contributor

kzantow commented Feb 13, 2023

Hey @deitch I didn't want to leave this PR hanging too long -- we all are interested in getting a solution here. Would you be able to join the community meeting on Thursday this week? We are keen to discuss it along with the related changes that @spiffcs mentioned! If not, maybe we could have some more synchronous discussion on the community slack, if you had some time.

@deitch
Copy link
Contributor Author

deitch commented Feb 13, 2023

Unfortunately, I will be unavailable at that time. I will message you on Slack.

@deitch deitch force-pushed the proper-license-expression branch 2 times, most recently from 0407f38 to 3cd400f Compare February 15, 2023 02:23
Signed-off-by: Avi Deitcher <avi@deitcher.net>
@deitch deitch force-pushed the proper-license-expression branch from 3cd400f to fafe494 Compare February 23, 2023 18:01
@deitch
Copy link
Contributor Author

deitch commented Feb 23, 2023

I rebased, but almost certainly broke a bunch of things. It looks like the schema version already was updated to 7.0.0, so I need to update it (7.0.1?). I unfortunately have no idea how to do that.

Signed-off-by: Christopher Phillips <christopher.phillips@anchore.com>
Signed-off-by: Avi Deitcher <avi@deitcher.net>
@deitch deitch force-pushed the proper-license-expression branch from fafe494 to 595701c Compare February 23, 2023 18:04
@deitch
Copy link
Contributor Author

deitch commented Feb 23, 2023

Oho! I figured that part out!

@spiffcs
Copy link
Contributor

spiffcs commented Mar 16, 2023

@deitch I'll make more time for this today and get this into a state was we discussed offline and then ping you for feedback. Thank you again for making this PR and taking the time to talk with me about the things you need Syft to do to meet your needs when using it.

@spiffcs spiffcs self-assigned this Mar 16, 2023
@deitch
Copy link
Contributor Author

deitch commented Mar 17, 2023

Yup, sounds good. As long as it is out of the current broken state, and into some state where the output is parsable - either a structure representing all of the licenses and their relationships, or a simple string that is standard but parsable (spdx format most likely) - it will be fine.

* main: (47 commits)
  Deprecate config.yaml as valid config source; Add unit regression for correct config paths (anchore#1640)
  chore: Update syft bootstrap tools to latest versions. (anchore#1682)
  Update documentation: (anchore#1680)
  chore: Update Stereoscope to 7928713c391e20abaede6a029f4ce37b628a4c8b (anchore#1681)
  fix: reduce logging for bad dpkg lines (anchore#1675)
  fix ruby classifier (anchore#1678)
  feat: add shared dir for easier cleanup (anchore#1676)
  chore(deps): bump github.com/google/go-containerregistry (anchore#1672)
  chore(deps): bump actions/setup-go from 3 to 4 (anchore#1671)
  fix: move defer after error to protect panic case (anchore#1670)
  feat: add argocd, helm, kustomize and kubectl binary classifiers (anchore#1663)
  defer closing file (anchore#1668)
  fix: remove author contributing to javascript CPEs (anchore#1669)
  fix: more python matching support (anchore#1667)
  Update syft bootstrap tools to latest versions. (anchore#1666)
  feat: add ruby classifier (anchore#1665)
  Update syft bootstrap tools to latest versions. (anchore#1658)
  fix: improved Python binary detection (anchore#1648)
  fix: suppress some known incorrect vendor candidates for npm CPEs (anchore#1659)
  fix: sanitize SPDX LicenseRefs (anchore#1657)
  ...

Signed-off-by: Christopher Phillips <christopher.phillips@anchore.com>
@spiffcs
Copy link
Contributor

spiffcs commented Mar 21, 2023

Attempt at a merge commit to main is done: cleaning up the tests and adding my changes on top to try and keep the diff as clean as possible

@spiffcs
Copy link
Contributor

spiffcs commented May 16, 2023

@deitch long awaited, but now complete:
#1743

^ The above PR has been merged. We have not done a syft release yet with it included, but if you build from main you're now guaranteed to have valid SPDX expressions (singular id or complex) under the licenses collection of packages in cases where it was possible to validate the license value lifted from source.

Licenses are no longer a []string and is instead a []License where if License.SPDXExpression exists, you should be able to parse it out in downstream tooling like the approach you took in the above PR.

Thanks a ton for bringing the syft license issue to our attention and we hope this is the first step in making it easier for consumers of syft to have more guarantees around what fields contain valid SPDX licenses.

We've also made some changes to how our internal model translate licenses to the cyclonedx and spdx formats:
https://github.com/anchore/syft/blob/main/syft/formats/common/cyclonedxhelpers/licenses.go
https://github.com/anchore/syft/blob/main/syft/formats/common/spdxhelpers/license.go

If you have any questions, comments, or suggestions for further changes you need feel free to reach out to me or the team.

If you feel like this PR has been closed in error and the current solution does not expose the data you need also feel free to comment back and we can talk further!

@spiffcs spiffcs closed this May 16, 2023
@deitch deitch deleted the proper-license-expression branch May 16, 2023 14:14
@deitch
Copy link
Contributor Author

deitch commented May 16, 2023

@spiffcs excellent! This really is pretty impressive.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

4 participants