Feature improved java cataloging #2769

Merged
merged 57 commits into from
Aug 5, 2024

@GijsCalis GijsCalis commented Apr 11, 2024

As announced in PR #2669, I've improved the package detection for Java/Maven packages.

I've added support for using the local Maven cache, because it usually contains all the required pom.xml files when scanning on a system where the code has been built.
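
To illustrate why the local cache helps: Maven stores each artifact's pom under a predictable path derived from its coordinates, so a scanner can look it up without any network access. The sketch below only shows that path layout. The function name `local_pom_path` is hypothetical and is not taken from the Syft code base; the default root `~/.m2/repository` is Maven's standard default, which can be overridden in settings.xml.

```python
from pathlib import Path
from typing import Optional


def local_pom_path(group_id: str, artifact_id: str, version: str,
                   repo_root: Optional[Path] = None) -> Path:
    """Return where a pom would be stored in a standard local Maven repository.

    Hypothetical helper for illustration only; not Syft's implementation.
    """
    if repo_root is None:
        # Maven's default local repository location.
        repo_root = Path.home() / ".m2" / "repository"
    # Dots in the groupId become directory separators.
    group_dir = Path(*group_id.split("."))
    return repo_root / group_dir / artifact_id / version / f"{artifact_id}-{version}.pom"


if __name__ == "__main__":
    print(local_pom_path("org.apache.commons", "commons-lang3", "3.12.0"))
```

If the file exists at that path, its licenses, parent reference, and properties can be read locally; otherwise a scanner would have to fall back to the remote repository or go without.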

As a result, scanning is significantly more complete and faster; see the table of test results below.
I've run the tests on the following projects:

| Syft version | Project and configuration | Time (s) | Total packages | Packages with version | Licenses found |
|---|---|---|---|---|---|
| v1.1.1 | httpcomponents with network | - | 21 | 12 | 7 |
| new | httpcomponents with network | - | 21 | 21 | 36 |
| v1.1.1 | petclinic no network | 1.3 | 24 | 5 | 2 |
| v1.1.1 | petclinic with network | 5.8 | 24 | 15 | 14 |
| new | petclinic no network, no local repo | 0.8 | 24 | 5 | 2 |
| new | petclinic no network, with local repo | 1.3 | 24 | 23 | 9 |
| new | petclinic with network, with local repo | 2.5 | 24 | 23 | 23 |
| v1.1.1 | zookeeper with network | 33.0 | 58 | 18 | 13 |
| v1.1.1 | zookeeper no network | 0.2 | 57 | 18 | 0 |
| new | zookeeper no network, no local repo | 0.4 | 57 | 54 | 1 |
| new | zookeeper with network, no local repo | 7.0 | 57 | 57 | 122 |
| new | zookeeper with network, with local repo | 5.0 | 57 | 57 | 122 |
| new | zookeeper no network, with local repo (after maven build) | 0.6 | 57 | 57 | 122 |

Also find attached some SBOM files generated by syft v1.1.1 and the version in this PR.
sbom.cyclonedx.httpcomponents-new.json
sbom.cyclonedx.httpcomponents-v1.1.1.json
sbom.cyclonedx.jackson-new.json
sbom.cyclonedx.jackson-v1.1.1.json
sbom.cyclonedx.zookeeper-new-no-network-with-local-repo-after-build.json
sbom.cyclonedx.zookeeper-v1.1.1-with-network.json
sbom.cyclonedx.petclinic-new-no-network-no-local-repo.json
Uploading sbom.cyclonedx.petclinic-v1.1.1-with-network.json…

Signed-off-by: Gijs Calis <51088038+GijsCalis@users.noreply.github.com>
Update configuration documentation
Improve maven groupid detection

@GijsCalis GijsCalis marked this pull request as ready for review April 12, 2024 16:01
@GijsCalis

Note: the 'Detect schema changes / Label changes' check failed, but it should pass when the job is re-run.

GijsCalis commented Apr 18, 2024

@kzantow, @willmurphyscode:
I realise this is quite a large PR, but it also makes significant improvements. Without these improvements, Syft does not work well enough for practical use, especially on Spring Framework based packages. The number of false positives caused by missing version numbers is simply too high. (Most of the projects at my company are based on Spring.)

I can split this PR into smaller parts, each adding part of the improvements:

  • use of local Maven repository:

    • Greater chance of finding licenses on systems that have built the package, because no network access (UseNetwork) is required.
    • Much faster scans when the network is used and the pom file is available locally, which also limits requests to the remote repository.
  • Fix bug in configuration: no default configuration is loaded for the java cataloger (see: https://github.com/GijsCalis/syft/blob/ff1c8431704181ede8edf3325004f3163f3283e6/cmd/syft/internal/options/java.go#L12)

  • Parsing of parent poms for property definitions (to improve resolving of properties) and processing imported managed dependencies (which contain version definitions of dependencies).
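
The last point can be sketched as follows. This is a minimal illustration of resolving `${...}` placeholders through a chain of parent poms and of taking a dependency's version from managed dependencies. It is not the code in this PR: the `Pom` class and the functions `lookup_property`, `resolve`, and `managed_version` are hypothetical names, and the sketch assumes that any imported BOM entries have already been merged into `managed_versions`.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, Optional

_PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")


@dataclass
class Pom:
    """Hypothetical, heavily simplified view of a parsed pom.xml."""
    properties: Dict[str, str] = field(default_factory=dict)
    # Maps "groupId:artifactId" to a version string (possibly a placeholder).
    managed_versions: Dict[str, str] = field(default_factory=dict)
    parent: Optional["Pom"] = None


def lookup_property(pom: Optional[Pom], name: str) -> Optional[str]:
    """Search this pom first, then each parent in turn."""
    while pom is not None:
        if name in pom.properties:
            return pom.properties[name]
        pom = pom.parent
    return None


def resolve(pom: Pom, value: str, depth: int = 0) -> str:
    """Replace ${name} placeholders; unknown ones are left untouched."""
    if depth > 10:  # guard against properties that reference each other
        return value

    def substitute(match: "re.Match[str]") -> str:
        found = lookup_property(pom, match.group(1))
        return match.group(0) if found is None else resolve(pom, found, depth + 1)

    return _PLACEHOLDER.sub(substitute, value)


def managed_version(pom: Pom, group_artifact: str) -> Optional[str]:
    """Find a version declared in dependency management of the pom or its parents."""
    current: Optional[Pom] = pom
    while current is not None:
        if group_artifact in current.managed_versions:
            # Resolve from the child's point of view so it can override properties.
            return resolve(pom, current.managed_versions[group_artifact])
        current = current.parent
    return None


if __name__ == "__main__":
    parent = Pom(properties={"jackson.version": "2.15.2"},
                 managed_versions={"com.fasterxml.jackson.core:jackson-databind": "${jackson.version}"})
    child = Pom(parent=parent)
    print(managed_version(child, "com.fasterxml.jackson.core:jackson-databind"))
```

A dependency declared without a version in the child pom can then still be reported with a version, which is the kind of case where a missing version number would otherwise cause false positives.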

What would be the best way forward?

@henrysachs

I would love to have this PR merged, but I can totally understand that Syft currently doesn't trigger external tools. This would also help to better catalog Gradle packages.

Signed-off-by: Keith Zantow <kzantow@gmail.com>

kzantow commented Jul 24, 2024

Hey @GijsCalis -- I made a significant number of changes since you submitted this PR as you may have noticed, so I just wanted to explain why and see if you had any time or desire to kick the tires on this branch to make sure it is still working as you expect for the cases you care about and had working. I tried to not miss anything, but I could have, since it was a large PR and maven is a large ecosystem 😬

I also wanted to say a HUGE THANK YOU for getting this work in motion and all the effort you've put in here! Thank you very much for doing so much of the work so we could understand a number of things that needed to be in place for this to work (like BOMs, which I hadn't dealt with during my Maven days long ago).

One of the blocking things in the original set of changes was keeping global state that tracked all the poms found in a static map. This could cause problems, for example, if Syft was being used as a library and scanned multiple things without resetting the map between runs. However, once I got into the nuts and bolts of how this worked, it was quickly apparent that having an in-memory cache of poms was necessary for performance and locating things in certain cases. This was the motivating factor behind most of the changes here. To accomplish this, I refactored maven things into a mavenResolver, so any time someone deals with any maven things beyond just parsing a pom, this should be used. It consolidates the logic for finding parent poms, resolving variables and inherited versions, etc., and also keeps a short-lived in-memory cache of the poms it found. This is actually especially important for the next thing.
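
The difference between a global static map and a short-lived cache owned by a resolver can be sketched like this. It is an illustration in Python rather than the Go code in this PR, and `PomResolver`, `find_pom`, and the `fetch` callback are hypothetical names. The point is only that each scan creates its own resolver, so cached poms cannot leak from one scan into the next.

```python
from typing import Callable, Dict, Optional, Tuple

# (groupId, artifactId, version)
Coordinates = Tuple[str, str, str]


class PomResolver:
    """Hypothetical per-scan resolver that owns its own pom cache."""

    def __init__(self, fetch: Callable[[Coordinates], Optional[str]]):
        # fetch is whatever locates a pom: scanned files, local repo, network.
        self._fetch = fetch
        # Instance attribute, not module-level state: it dies with the resolver.
        self._cache: Dict[Coordinates, Optional[str]] = {}

    def find_pom(self, coords: Coordinates) -> Optional[str]:
        """Return the pom for coords, calling fetch at most once per resolver."""
        if coords not in self._cache:
            self._cache[coords] = self._fetch(coords)
        return self._cache[coords]


if __name__ == "__main__":
    lookups = []

    def fake_fetch(coords: Coordinates) -> Optional[str]:
        lookups.append(coords)
        return "<project/>"

    resolver = PomResolver(fake_fetch)
    resolver.find_pom(("org.example", "demo", "1.0"))
    resolver.find_pom(("org.example", "demo", "1.0"))
    print(len(lookups))  # the second call is served from the cache
```

When the tool is used as a library, constructing a new resolver for every scan gives the performance benefit of caching within a scan without having to remember to reset shared state between scans.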

The original set of changes did not explicitly handle scanning a multi-module project source, but instead relied on the fact that the top-level pom.xml was scanned before the child directories' poms, so the "parent" pom was already in the cache. However, this behavior might not hold at some point in the future, so I thought it was important to track all poms together across a scan to make this behavior explicit. This also works for cases where a parent pom is in a sibling directory, for example, which I don't think would always have worked as desired.
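
A rough sketch of the "track all poms together" idea: index every pom found in the scan by its coordinates first, and only then resolve parents, so the result does not depend on which directory was visited first. The names `PomInfo`, `index_poms`, and `parent_chain` are hypothetical and the data model is deliberately simplified; this is not the mavenResolver implementation.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Tuple

# (groupId, artifactId, version)
Gav = Tuple[str, str, str]


@dataclass(frozen=True)
class PomInfo:
    """Hypothetical summary of one pom.xml found during a scan."""
    path: str
    coords: Gav
    parent: Optional[Gav] = None


def index_poms(poms: Iterable[PomInfo]) -> Dict[Gav, PomInfo]:
    """First pass: collect every pom in the scan, keyed by coordinates."""
    return {pom.coords: pom for pom in poms}


def parent_chain(pom: PomInfo, index: Dict[Gav, PomInfo]) -> List[PomInfo]:
    """Second pass: follow parent references using the complete index."""
    chain: List[PomInfo] = []
    seen = {pom.coords}
    current = pom
    while current.parent is not None and current.parent in index and current.parent not in seen:
        current = index[current.parent]
        seen.add(current.coords)  # protects against accidental cycles
        chain.append(current)
    return chain


if __name__ == "__main__":
    # The child is discovered before its parent, which lives in a sibling directory.
    child = PomInfo("app/pom.xml", ("org.example", "app", "1.0"), ("org.example", "parent", "1.0"))
    parent = PomInfo("parent/pom.xml", ("org.example", "parent", "1.0"))
    index = index_poms([child, parent])
    print([p.path for p in parent_chain(child, index)])
```

Because the lookup is by coordinates rather than by directory position, a parent in a sibling directory is found just as easily as one in the directory above.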

A few other notes: I've also integrated with the general cache, so any network requests for poms will be cached locally for improved subsequent performance. I also implemented a few more small details like the [probably infrequently used] alternate maven local repository directory lookup, and fixed as much of the possible property resolution usage that I could. An example is... it actually works to have a variable in an artifact name provided by a parent pom. I suspect this type of thing isn't used a whole lot but I'm sure there are people using properties for group, artifact, and version somewhere.

Using the repositories you linked above, the current results look similar to where you left the PR, although I didn't quite get the same numbers you had in your table, with the latest commit before my changes. Regardless, this is what things look like with the latest changes I pushed:

zookeeper
SYFT_JAVA_USE_NETWORK=true syft zookeeper -o json | jq '.artifacts|length'
120

SYFT_JAVA_USE_NETWORK=true syft zookeeper -o json | jq '[.artifacts.[]|.licenses|length]|add'
125

httpcomponents-core
SYFT_JAVA_USE_NETWORK=true syft httpcomponents-core -o json | jq '.artifacts|length'
39

SYFT_JAVA_USE_NETWORK=true syft httpcomponents-core -o json | jq '[.artifacts.[]|.licenses|length]|add'
31

spring-petclinic
SYFT_JAVA_USE_NETWORK=true syft spring-petclinic -o json | jq '.artifacts|length'
23

SYFT_JAVA_USE_NETWORK=true syft spring-petclinic -o json | jq '[.artifacts.[]|.licenses|length]|add'
22
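
For anyone who wants to reproduce these counts without jq, the same two numbers can be computed with a few lines of Python. This sketch assumes only what the jq filters above imply: a top-level `artifacts` array in which each entry may carry a `licenses` array. The function name `summarize` and the file name in the usage example are made up for illustration.

```python
import json
from typing import Any, Dict, Tuple


def summarize(sbom: Dict[str, Any]) -> Tuple[int, int]:
    """Return (number of artifacts, total number of license entries)."""
    artifacts = sbom.get("artifacts") or []
    # A missing or null "licenses" field counts as zero, as `length` does in jq.
    license_total = sum(len(artifact.get("licenses") or []) for artifact in artifacts)
    return len(artifacts), license_total


if __name__ == "__main__":
    # Assumes the scan output was saved first, for example:
    #   SYFT_JAVA_USE_NETWORK=true syft zookeeper -o json > zookeeper.json
    with open("zookeeper.json", encoding="utf-8") as handle:
        print(summarize(json.load(handle)))
```

One small difference from the jq version: for an SBOM with no artifacts, jq's `add` yields null, while this function returns 0.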

And some examples of the property resolution with the same things:

syft spring-petclinic
NAME                            VERSION  TYPE            
actions/checkout                v4       github-action    
actions/setup-java              v4       github-action    
bootstrap                       5.3.3    java-archive     
cache-api                       1.1.1    java-archive     
caffeine                        3.1.8    java-archive     
font-awesome                    4.7.0    java-archive     
gradle-wrapper                           java-archive     
h2                              2.2.224  java-archive     
jakarta.xml.bind-api            4.0.2    java-archive     
junit-jupiter                   1.19.8   java-archive     
mysql                           1.19.8   java-archive     
mysql-connector-j               8.3.0    java-archive     
postgresql                      42.7.3   java-archive     
spring-boot-devtools            3.3.0    java-archive     
spring-boot-docker-compose      3.3.0    java-archive     
spring-boot-starter-actuator    3.3.0    java-archive     
spring-boot-starter-cache       3.3.0    java-archive     
spring-boot-starter-data-jpa    3.3.0    java-archive     
spring-boot-starter-test        3.3.0    java-archive     
spring-boot-starter-thymeleaf   3.3.0    java-archive     
spring-boot-starter-validation  3.3.0    java-archive     
spring-boot-starter-web         3.3.0    java-archive     
spring-boot-testcontainers      3.3.0    java-archive   

syft httpcomponents-core
NAME                              VERSION             TYPE                           
actions/cache                     v4                  github-action                   
actions/checkout                  v4                  github-action  (+2 duplicates)  
actions/dependency-review-action  v4                  github-action                   
actions/setup-java                v4                  github-action                   
commons-cli                       1.8.0               java-archive                    
conscrypt-openjdk-uber            2.5.2               java-archive   (+1 duplicate)   
github/codeql-action/analyze      v3                  github-action                   
github/codeql-action/init         v3                  github-action                   
hamcrest                          2.2                 java-archive   (+2 duplicates)  
httpcore5                         5.3-beta2-SNAPSHOT  java-archive   (+2 duplicates)  
httpcore5-h2                      5.3-beta2-SNAPSHOT  java-archive                    
httpcore5-reactive                5.3-beta2-SNAPSHOT  java-archive                    
junit-jupiter                     5.10.3              java-archive   (+3 duplicates)  
log4j-core                        2.23.1              java-archive   (+2 duplicates)  
log4j-slf4j-impl                  2.23.1              java-archive   (+2 duplicates)  
mockito-core                      4.11.0              java-archive   (+2 duplicates)  
reactive-streams                  1.0.4               java-archive                    
rxjava                            2.2.21              java-archive                    
rxjava                            3.1.8               java-archive   (+1 duplicate)   
slf4j-api                         1.7.36              java-archive   (+2 duplicates)

syft zookeeper
NAME                             VERSION           TYPE                            
actions/cache                    v3                github-action  (+2 duplicates)   
actions/checkout                 v3                github-action  (+2 duplicates)   
actions/setup-java               v3                github-action  (+2 duplicates)   
actions/upload-artifact          v3                github-action  (+1 duplicate)    
apache-rat-tasks                 0.6               java-archive                     
asm                              3.1               java-archive                     
audience-annotations             0.12.0            java-archive   (+1 duplicate)    
bcpkix-jdk18on                   1.78              java-archive                     
bcprov-jdk18on                   1.78              java-archive                     
commons-cli                      1.5.0             java-archive   (+2 duplicates)   
commons-collections4             4.4               java-archive                     
commons-io                       2.11.0            java-archive                     
curator-recipes                  5.0.0             java-archive                     
curator-test                     5.0.0             java-archive                     
grizzly-servlet-webserver        1.9.8             java-archive                     
guava                            30.0-jre          java-archive                     
hamcrest-library                 2.2               java-archive                     
jackson-databind                 2.15.2            java-archive   (+3 duplicates)   
jersey-client                    1.1.5.1           java-archive                     
jersey-json                      1.1.5.1           java-archive                     
jersey-server                    1.1.5.1           java-archive                     
jetty-client                     9.4.53.v20231009  java-archive                     
jetty-server                     9.4.53.v20231009  java-archive   (+4 duplicates)   
jetty-servlet                    9.4.53.v20231009  java-archive   (+4 duplicates)   
jline                            2.14.6            java-archive   (+2 duplicates)   
jmh-core                         1.23              java-archive                     
jmh-generator-annprocess         1.23              java-archive                     
jmockit                          1.48              java-archive                     
junit-jupiter-api                5.6.2             java-archive   (+1 duplicate)    
junit-jupiter-engine             5.6.2             java-archive   (+5 duplicates)   
junit-jupiter-params             5.6.2             java-archive                     
junit-platform-runner            1.6.2             java-archive   (+1 duplicate)    
junit-vintage-engine             5.6.2             java-archive   (+2 duplicates)   
kerb-core                        2.0.0             java-archive                     
kerb-simplekdc                   2.0.0             java-archive                     
kerby-config                     2.0.0             java-archive                     
logback-classic                  1.3.14            java-archive                     
logback-core                     1.3.14            java-archive   (+4 duplicates)   
metrics-core                     4.1.12.1          java-archive   (+3 duplicates)   
mockito-core                     4.9.0             java-archive   (+3 duplicates)   
netty-handler                    4.1.105.Final     java-archive                     
netty-tcnative-boringssl-static  2.0.61.Final      java-archive                     
netty-transport-native-epoll     4.1.105.Final     java-archive                     
simpleclient                     0.9.0             java-archive                     
simpleclient_hotspot             0.9.0             java-archive                     
simpleclient_servlet             0.9.0             java-archive                     
slf4j-api                        2.0.13            java-archive   (+4 duplicates)   
snappy-java                      1.1.10.5          java-archive   (+3 duplicates)   
spotbugs-annotations             4.0.2             java-archive   (+1 duplicate)    
tools                            0.25.4            java-archive                     
zookeeper                        3.10.0-SNAPSHOT   java-archive   (+10 duplicates)  
zookeeper-client                 3.10.0-SNAPSHOT   java-archive                     
zookeeper-docs                   3.10.0-SNAPSHOT   java-archive                     
zookeeper-it                     3.10.0-SNAPSHOT   java-archive                     
zookeeper-jute                   3.10.0-SNAPSHOT   java-archive   (+3 duplicates)   
zookeeper-prometheus-metrics     3.10.0-SNAPSHOT   java-archive                     
zookeeper-recipes                3.10.0-SNAPSHOT   java-archive   

Again, THANK YOU for this, it is very much appreciated. Apologies that it has taken so long to get these changes reviewed and mergeable, and also for the length of this comment. 😁

@henrysachs

I just tried it with a few repos of mine, and it works like a charm for large Maven multi-module projects.

@wagoodman wagoodman left a comment

Really nice work @GijsCalis and @kzantow !!

@kzantow kzantow merged commit 9d40d11 into anchore:main Aug 5, 2024
11 checks passed
@kzantow kzantow added the enhancement New feature or request label Aug 9, 2024