Squashed all layers #3138
base: main
Conversation
Signed-off-by: tomersein <tomersein@gmail.com>
@tomersein - I know very little about the Syft internals, and I'm trying to understand this PR. From the code and comments I understand that the new option will catalog packages from all layers, but then only include packages that are visible in the squashed file system. How is that different from the regular squashed scope? (Or, rephrased: what is the difference between "cataloging" and "including"?) My main concern is whether this would eventually help to fix issue #1818. Many thanks!
hi @dbrugman,
Got it, thanks @tomersein
Hi @tomersein -- thanks for the contribution. I don't think we would want to merge this as-is, though. I wonder if there are any other things we may be able to do in order for you to accomplish what you're hoping to achieve. So I understand correctly: the use case is to be able to find the layer which introduced a package, right?
yes correct @kzantow, let me know what the gaps are so I can push some fixes/improvements.
@kzantow - please see my notes after the meeting yesterday
any update? :) @wagoodman
did some static analysis corrections and all checks now pass
@tomersein thank you for submitting a candidate solution to the problem of tracking the layer of first attribution. Let me first summarize how this PR achieves attribution. The first change involves adding a new file Resolver, which makes use of the squashed resolver and the all-layers resolver based on the use case. The second change is adding … Take for example a (rather silly) Dockerfile:

```dockerfile
FROM ubuntu:latest
RUN apt update -y
RUN apt install -y jq
RUN apt install -y vim
RUN apt install -y wget curl
```

And after build:

```json
[
  "sha256:c26ee3582dcbad8dc56066358653080b606b054382320eb0b869a2cb4ff1b98b",
  "sha256:5ba46f5cab5074e141556c87b924bc3944507b12f3cd0f71c5b0aa3982fb3cd4",
  "sha256:1fde57bfea7ecd80e6acc2c12d90890d32b7977fec17495446678eb16604d8c7",
  "sha256:9b6721789a2d1e4cef4a8c3cc378580eb5df938b90befefba0b55e07b54f0c33",
  "sha256:4097f47ebf86f581c9adc3c46b7dc9f2a27db5c571175c066377d0cef9995756"
]
```

Here we'll have multiple copies of the DPKG status file, which means classically we'll use the last layer for all evidence locations for packages (at least when it comes to the primary evidence location for the status file). Let's take a look at just …
```json
[
  {
    "path": "/usr/share/doc/vim/copyright",
    "layerID": "sha256:9b6721789a2d1e4cef4a8c3cc378580eb5df938b90befefba0b55e07b54f0c33",
    "accessPath": "/usr/share/doc/vim/copyright",
    "annotations": {
      "evidence": "supporting"
    }
  },
  {
    "path": "/var/lib/dpkg/info/vim.md5sums",
    "layerID": "sha256:9b6721789a2d1e4cef4a8c3cc378580eb5df938b90befefba0b55e07b54f0c33",
    "accessPath": "/var/lib/dpkg/info/vim.md5sums",
    "annotations": {
      "evidence": "supporting"
    }
  },
  {
    "path": "/var/lib/dpkg/status",
    "layerID": "sha256:4097f47ebf86f581c9adc3c46b7dc9f2a27db5c571175c066377d0cef9995756",
    "accessPath": "/var/lib/dpkg/status",
    "annotations": {
      "evidence": "primary"
    }
  },
  {
    "path": "/var/lib/dpkg/status",
    "layerID": "sha256:9b6721789a2d1e4cef4a8c3cc378580eb5df938b90befefba0b55e07b54f0c33",
    "accessPath": "/var/lib/dpkg/status",
    "annotations": {
      "evidence": "primary"
    }
  }
]
```

Note that we see the original layer the package was added (…). Here's what I see when running a before and after: …
It looks like when cataloging, ~138 packages were found, then before finalizing the number dropped to ~132, so that's good. But I noticed these runs took different times -- 8 seconds vs 11 seconds. Not a big difference, but given that this is a small and simple image it is worth looking at. I believe this is because we're essentially doing both a squashed scan and an all-layers scan implicitly, since the resolver will return all references from both resolvers (not deduplicating …).

Also note that there are several more executables and files cataloged! This is concerning, since this should behave no differently than the squashed cataloger from a count perspective. It's not immediately apparent what is causing this, but it is a large blocker for this change (at first glance I think it's because catalogers are creating duplicate packages and relationships, but only the packages are getting deduplicated, not the relationships... this should be confirmed though).

After reviewing the PR there are a few problems that seem fundamental: …
What's the path forward from here? I think there is a chance of modifying this PR to get it to a mergeable state, but it would require looking into the following things: …
The following changes would additionally be needed: …
@tomersein shout out if you want to sync on this explicitly, I'd be glad to help. A good default time to chat with us is during our community office hours. Our next one is going to be this Thursday at noon ET. If that doesn't work we can always chat through discourse group topics or DMs to set up a separate Zoom call.
Hi @wagoodman, please let me know if it's ok!
Hi @wagoodman,
I might need some more details or a direction in the code on how to do so. Moreover, feel free to put it under "waiting for discussion". I will not be able to attend the meeting, but I do watch the summary on YouTube. I will have time to develop this feature, which in my opinion can be useful :)
hi @wagoodman
hi @kzantow @wagoodman thanks!
It looks like these have not been addressed yet: …
I haven't done another in-depth review, though, to see if there are other outstanding issues.
hi @wagoodman,
I will try to elaborate on the issue:
the only way I can filter it is by the context of a given package, and whether or not it exists in the dpkg of a specific layer. I can provide some samples of Dockerfiles which demonstrate the issue. If I try to build it inside the resolver, I still can't manage to filter out packages that were deleted (and I don't want to change the interface of the resolver functions). Can we discuss possible solutions? @wagoodman @popey
hi! @wagoodman
@wagoodman delivered a new solution without the new field.
hi! @wagoodman
I think this PR is now ready for further deep review; let me know how to proceed further :)
An update - I wonder how we can make sure the annotation is always created/passed in future catalogers, or should I add it manually to all catalogers?
Hi @tomersein -- sorry for the delay getting this reviewed sufficiently. Since there have been some different revisions with varying areas of focus, I'm going to try to make this as thorough as possible. Apologies if some aspects seem like they may have come from left field; we tend to review things by getting the most important things addressed first, and you've done a great job here addressing the feedback to this point! 👍 It's possible that not all aspects were scrutinized the same way, so apologies if some of this seems like something we should have said earlier.
I would like to address a number of naming issues. (Naming is hard, we all know -- one of the hardest CS problems!) It's not as important for internal stuff, but it is definitely important for scope options that we would be unable to change later, and we might as well try to get the other things as "right" as we can while we're at it.
There are a number of spots where we probably should use different names, sometimes simple things like variable naming but also the name of the feature itself. Squashed means one thing: all layers have been squashed to the final representation; all-layers means that results found in each layer are included. As implemented, it looks like this takes the results from all-layers and deduplicates those, which doesn't seem as though it's the same thing as a "squashed" representation: it would include packages that are not present in the final representation. If this is the desired behavior, I would probably call this something like all-layers-deduplicated, or maybe even just add an option to run the deduplication and use all-layers.

Is this the current/intended behavior ☝️, or have I missed something? As an example: if I take a base image with an RPMDB and delete the RPMDB, "squashed" would not have RPM packages because the RPMDB is missing, but all-layers would have those packages, and I believe, by association, this PR would result in those packages being included.
If squashed is the right term, we should be consistent in naming "squashed" vs "squash" -- there are a number of spots in variable and CLI option naming where "squash" is used instead of "squashed".
Some other names which don't really seem descriptive enough or are confusing to me:
- the packagesToRemove function and the getPackagesToDelete function; these could be consolidated
- NewScopesTask -- this is specifically for one scope, not all scopes
Apologies if I've missed something; I think this is getting pretty close to the finish line, so thanks for your patience!
syft/pkg/collection.go (outdated)

```diff
@@ -285,7 +285,6 @@ func (c *Collection) Sorted(types ...Type) (pkgs []Package) {
 	for p := range c.Enumerate(types...) {
 		pkgs = append(pkgs, p)
 	}
-
```
You should revert this unnecessary newline change
fixed!
```diff
@@ -211,6 +211,7 @@ func toPackageModels(catalog *pkg.Collection, cfg EncoderConfig) []model.Package
 	for _, p := range catalog.Sorted() {
 		artifacts = append(artifacts, toPackageModel(p, cfg))
 	}
+
```
Unneeded whitespace change
fixed!
internal/task/scope_tasks.go (outdated)

```go
	return nil
}

return NewTask("scope-cataloger", fn)
```
This is not a cataloger, is it? I think this task is actually a Squashed-With-All-Layers-Cleanup task, since it only seems to be deleting extraneous packages and it doesn't run unless using the SquashedWithAllLayers scope, so it should be named appropriately.
fixed!
```go
@@ -0,0 +1,567 @@
package fileresolver
```
This filename has a typo; it was intended to be squash?
fixed!
internal/task/scope_tasks.go (outdated)

```go
fn := func(_ context.Context, _ file.Resolver, builder sbomsync.Builder) error {
	finalizeScope(builder)
	return nil
}
```
Instead of having an intermediate function here, repurpose the "finalizeScope" below with this signature.
fixed!
Hi @kzantow, thanks for the response. I'll start working on the comments. Let me know if you have any other questions!
This PR tries to solve the squashed-with-all-layers resolver issue, aligned to the newest version of Syft.
Please let me know how to proceed further. I guess the solution here is not perfect, but it does know how to handle deleted packages.
part of - #15