
Squashed all layers #3138

Open · wants to merge 26 commits into base: main

Conversation

tomersein
Contributor

@tomersein tomersein commented Aug 20, 2024

This PR tries to solve the squash-with-all-layers resolver issue, aligned with the newest version of Syft.
Please let me know how to proceed further. I guess the solution here is not perfect, but it does know how to handle deleted packages.

Part of #15

@dbrugman

@tomersein - I know very little about the Syft internals, and I'm trying to understand this PR. From the code and comments I understand that the new option will catalog packages from all layers, but then only include packages that are visible in the squashed file-system. How is that different from the regular squashed scope (or, I could probably rephrase this to: what is the difference between 'cataloging' and 'including')?

My main concern is whether this would (eventually) help to fix issue #1818

Many thanks!

@tomersein
Contributor Author

Hi @dbrugman,
In this PR I am trying to display only packages that exist in the squashed layer, and for those that do, to include all of the layers they exist in, so we can track down the layer in which they were added.

@dbrugman

Got it, thanks @tomersein

@kzantow
Contributor

kzantow commented Aug 29, 2024

Hi @tomersein -- thanks for the contribution. I don't think we would want to merge this as-is, though. I wonder if there are any other things we may be able to do in order for you to accomplish what you're hoping to achieve.

So I understand correctly: the use case is to be able to find the layer which introduced a package, right?

@tomersein
Contributor Author

tomersein commented Aug 29, 2024

Yes, correct @kzantow. Let me know what the gaps are so I can push some fixes/improvements.
I want to add some more information following your meeting yesterday:

  • The advantage of this solution is that you only need to scan once. When end users want to see the vulnerabilities in their container, all-layers can confuse them, since some of those packages no longer exist.
  • This solution can help users fix their vulnerabilities by updating the relevant layer the vulnerability originated from.

@kzantow - please see my notes after the meeting yesterday
@wagoodman I am available to do some fixes if needed, just let me know :)

@TimBrown1611

any update? :) @wagoodman

@tomersein
Contributor Author

Did some static analysis corrections and all checks now pass.
@kzantow @wagoodman

@wagoodman
Contributor

@tomersein thank you for submitting a candidate solution to the problem of tracking layer-of-first-attribution.

Let me first summarize how this PR achieves attribution. The first change adds a new file Resolver, which makes use of the squashed resolver and the all-layers resolver depending on the use case. The second change adds IsSquashedAllLayersResolver and IsSquashedLayer bools to the core LocationMetadata struct. The new file resolver surfaces locations that positively confirm whether the location is from the squashed layer and whether the new resolver is used. The last change updates the syft-json formatter to filter out all packages that have no locations from a squashed layer. This, in combination with the existing deduplication logic, would yield the same number of packages found but with additional layer attributions.
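
To make that concrete, here is a minimal sketch of the metadata change described above (the two boolean names come from this PR; the rest of the struct is illustrative, not the full upstream definition):

package file

// LocationMetadata as modified by this PR (sketch): the two booleans tell
// downstream consumers how a location was resolved, which is the
// abstraction leak discussed later in this review.
type LocationMetadata struct {
	Annotations map[string]string `json:"annotations,omitempty"`

	IsSquashedAllLayersResolver bool // set when the new combined resolver produced this location
	IsSquashedLayer             bool // set when the location is visible in the squashed filesystem
}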

Take for example a (rather silly) Dockerfile:

FROM ubuntu:latest
RUN apt update -y
RUN apt install -y jq
RUN apt install -y vim
RUN apt install -y wget curl

And after build:

$ docker inspect localhost/squash-all-layers | jq '.[].RootFS.Layers'
[
  "sha256:c26ee3582dcbad8dc56066358653080b606b054382320eb0b869a2cb4ff1b98b",
  "sha256:5ba46f5cab5074e141556c87b924bc3944507b12f3cd0f71c5b0aa3982fb3cd4",
  "sha256:1fde57bfea7ecd80e6acc2c12d90890d32b7977fec17495446678eb16604d8c7",
  "sha256:9b6721789a2d1e4cef4a8c3cc378580eb5df938b90befefba0b55e07b54f0c33",
  "sha256:4097f47ebf86f581c9adc3c46b7dc9f2a27db5c571175c066377d0cef9995756"
]

Here we'll have multiple copies of the DPKG status file, which means classically we'll use the last layer for all evidence locations for packages (at least when it comes to the primary evidence location for the status file).

Let's take a look at just vim's locations:

cat /tmp/after.json | jq '.artifacts[] | select(.name == "vim").locations'
[
  {
    "path": "/usr/share/doc/vim/copyright",
    "layerID": "sha256:9b6721789a2d1e4cef4a8c3cc378580eb5df938b90befefba0b55e07b54f0c33",
    "accessPath": "/usr/share/doc/vim/copyright",
    "annotations": {
      "evidence": "supporting"
    }
  },
  {
    "path": "/var/lib/dpkg/info/vim.md5sums",
    "layerID": "sha256:9b6721789a2d1e4cef4a8c3cc378580eb5df938b90befefba0b55e07b54f0c33",
    "accessPath": "/var/lib/dpkg/info/vim.md5sums",
    "annotations": {
      "evidence": "supporting"
    }
  },
  {
    "path": "/var/lib/dpkg/status",
    "layerID": "sha256:4097f47ebf86f581c9adc3c46b7dc9f2a27db5c571175c066377d0cef9995756",
    "accessPath": "/var/lib/dpkg/status",
    "annotations": {
      "evidence": "primary"
    }
  },
  {
    "path": "/var/lib/dpkg/status",
    "layerID": "sha256:9b6721789a2d1e4cef4a8c3cc378580eb5df938b90befefba0b55e07b54f0c33",
    "accessPath": "/var/lib/dpkg/status",
    "annotations": {
      "evidence": "primary"
    }
  }
]

Note that we see the original layer the package was added in (sha256:9b6...c33) as well as the final unrelated modification of the status file (sha256:409...756). This is great! This is exactly what we're looking for in terms of results. There might be some debate around including one and only one spot for primary evidence, but let's ignore that for now.

Here's what I see when running a before and after:

❯ syft localhost/squash-all-layers:latest -o table=/dev/null
 ✔ Loaded image                                                                                                                                                           localhost/squash-all-layers:latest
 ✔ Parsed image                                                                                                                      sha256:6a78fd79097acadb77e57cd1c32fca596c3addc3d99c77e4fc977032a2ab3eb2
 ✔ Cataloged contents                                                                                                                       6608654972fcc7d28136e3ecffca4bfe371d89f0737cef299bdf378c87146bcf
   ├── ✔ Packages                        [132 packages]
   ├── ✔ File metadata                   [5,418 locations]
   ├── ✔ Executables                     [809 executables]
   └── ✔ File digests                    [5,418 files]

❯ syft localhost/squash-all-layers:latest -s squash-with-all-layers -o table=/dev/null
 ✔ Loaded image                                                                                                                                                           localhost/squash-all-layers:latest
 ✔ Parsed image                                                                                                                      sha256:6a78fd79097acadb77e57cd1c32fca596c3addc3d99c77e4fc977032a2ab3eb2
 ✔ Cataloged contents                                                                                                                       6608654972fcc7d28136e3ecffca4bfe371d89f0737cef299bdf378c87146bcf
   ├── ✔ Packages                        [132 packages]
   ├── ✔ File metadata                   [40,226 locations]
   ├── ✔ Executables                     [1,618 executables]
   └── ✔ File digests                    [40,226 files]

It looks like ~138 packages were found during cataloging, then the number dropped to ~132 before finalizing, so that's good.

But I noticed these runs took different times -- 8 seconds vs 11 seconds, not a big difference, but given that this is a small and simple image it is worth looking at. I believe this is because we're essentially doing both a squashed scan + an all-layers scan implicitly, since the resolver will return all references from both resolvers (not deduplicating file.Locations by the way). This isn't a problem in and of itself, since it might be that any approach may need to do just this, but I think this explains the mechanisms of what's happening time-wise.

Also note that there are several more executables and files cataloged! This is concerning, since this should behave no differently than the squashed cataloger from a count perspective. It's not immediately apparent what is causing this, but it is a large blocker for this change (at first glance I think it's because catalogers are creating duplicate packages and relationships, but only the packages are getting deduplicated, not the relationships... this should be confirmed though).

After reviewing the PR there are a few problems that seem fundamental:

  1. LocationMetadata is being altered in a way where it's aware of the method used from the resolver perspective. Furthermore, since this is used during formatting this means that implicitly the format must know about the method of collecting the packages. This is less than ideal as it's leaking concerns about how to find the data vs the data itself. IsSquashedLayer is less of a problem (though not ideal) but IsSquashedAllLayersResolver is the main problem here.
  2. Converting an existing syft-json SBOM back into memory would not have any information about IsSquashedLayer on the file.LocationMetadata structs. This incongruity may cause some confusion for us down the line, especially if there is downstream processing (in syft) that depends on these being accurate.
  3. Package filtering is happening during formatting. This is a big one -- it looks like only the syft-json format has this filtering implemented, but porting to the remaining formats is not ideal. Ideally the SBOM object itself would be consistent before any formatters are used, so it hints at this kind of work being done further upstream processing-wise than where it is now. It might be that this could be refactored so that this is happening within the package cataloging task itself instead of downstream in the formatters.
  4. The package collection is aware of the file resolver method used, which is another abstraction leak -- it should only really know about packages, not how the packages were discovered. If there were a more resolver-agnostic definition that was core to what the package collection does then that would be different, but an equivalent configuration name hasn't come to mind yet.

What's the path forward from here? I think there is a chance of modifying this PR to get it to a mergeable state, but it would require looking into the following things:

  1. Probably migrate all package filter logic from the formatter to its own task (see the sketch after this list). Today we have a set of tasks that are run to coordinate how the SBOM is populated. There are some "post tasks" that always run after cataloging (such as the relationship task) that do additional work on what is already in the SBOM. It seems like we could add a new task, just after package cataloging, that would be responsible for actively removing packages that shouldn't be in the SBOM at all (as well as their relationships). This change should be enough to remove the need for IsSquashedAllLayersResolver on the file.LocationMetadata object.
  2. Directly related to the previous point, is there a way to remove the IsSquashedLayer bool on file.LocationMetadata? I don't obviously see a way to do this, but it should be addressed per fundamental problem 2. Should we expose this information to JSON? Should these be another location annotation instead? Can we convince ourselves that this is entirely a cataloging/post-cataloging/pre-formatting concern, which would make this a non-point we don't need to solve for?
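
A minimal sketch of what that cleanup task could look like, reusing the task plumbing visible later in this PR (the task name, the packagesToRemove helper, and builder.DeletePackages are assumptions, not settled API):

package task

import (
	"context"

	"github.com/anchore/syft/internal/sbomsync"
	"github.com/anchore/syft/syft/file"
)

// NewSquashedWithAllLayersCleanupTask sketches the post-cataloging task
// suggested in point 1: it runs only for the new scope and removes packages
// (and their relationships) that have no presence in the squashed filesystem.
func NewSquashedWithAllLayersCleanupTask() Task {
	fn := func(_ context.Context, resolver file.Resolver, builder sbomsync.Builder) error {
		// packagesToRemove is a hypothetical helper returning the IDs of
		// packages whose only evidence lives in all-layers-only (deleted) files
		builder.DeletePackages(packagesToRemove(resolver, builder)...)
		return nil
	}
	return NewTask("squashed-with-all-layers-cleanup", fn)
}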

The following changes would additionally be needed:

  1. Any []file.Location returned by the new resolver needs to be deduplicated (see the sketch after this list). We have file.LocationSet that should help with this, but note that the current implementation records squashed=true and false for potentially the same layer information, so this needs to be considered.
  2. We need to deduplicate the relationships coming from the several duplicate file parser calls that are happening. I think this would need to occur after we know what the duplicates are, which means there is some non-obvious interaction with the package cataloger here.
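
A minimal sketch of the deduplication in point 1, assuming file.LocationSet deduplicates on the location data (note the caveat above: entries that differ only by the squashed flag will not collapse, so that flag needs reconciling first):

package fileresolver

import "github.com/anchore/syft/syft/file"

// mergeLocations is a hypothetical helper: collapse the combined resolver
// results with file.LocationSet so exact duplicate locations are removed.
func mergeLocations(squashed, allLayers []file.Location) []file.Location {
	set := file.NewLocationSet(squashed...)
	set.Add(allLayers...)
	return set.ToSlice() // sorted, exact duplicates removed
}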

@tomersein shout out if you want to sync on this explicitly, I'd be glad to help. A good default time to chat with us is during our community office hours. Our next one is this Thursday at noon ET. If that doesn't work we can always chat through Discourse group topics or DMs to set up a separate Zoom call.

@TimBrown1611

Hi @wagoodman
Thanks for the comments. I will not be available to work on it this month, but I would like to develop it after that.
It would be helpful to discuss all the open issues so that once I am available again I can work on it. I will watch the discussion on YouTube, but I will not be able to join myself :)

Please let me know if that's ok!

@tomersein
Contributor Author

tomersein commented Nov 13, 2024

Hi @wagoodman ,
I've read through all of the comments again, and I'll try to fix some of them.
However, some of the requests sound more complex:

we need to deduplicate the relationships coming from the several duplicate file parser calls that's happening. I think it would need to occur after we know what the duplicates are, which means there is some interaction with the package cataloger here that is non obvious.

so I might need some more details, or a direction in the code on how to do so.

Moreover, feel free to put it under "waiting for discussion". I will not be able to attend the meeting, but I do watch the summary on YouTube. I will have time to develop this feature, which in my opinion can be useful :)

@tomersein
Contributor Author

Hi @wagoodman,
I made some changes according to your comments; please let me know how to proceed further.
Let me know if the changes I've made are going in the right direction so this PR can be merged in the future :)

@tomersein
Contributor Author

Hi @kzantow @wagoodman,
Kind reminder :)
If possible, please put this PR up for discussion so I can understand what further steps I should take to get it merged.

thanks!

@wagoodman
Contributor

wagoodman commented Dec 3, 2024

It looks like these have not been addressed yet:

Converting an existing syft-json SBOM back into memory would not have any information about IsSquashedLayer on the file.LocationMetadata structs. This incongruity may cause some confusion for us down the line, especially if there is downstream processing (in syft) that depends on these being accurate.

Directly related to the previous point 1, is there a way to remove the IsSquashedLayer bool on file.LocationMetadata?...

I haven't done another in-depth review though to see if there are other outstanding issues.

@tomersein
Contributor Author

Hi @wagoodman,
I didn't find a way to do so; I'd like to know if you have any suggestions or ideas on where this logic would fit better.

@tomersein
Contributor Author

tomersein commented Dec 4, 2024

I will try to elaborate on the issue:
When a file is deleted, its location will exist in "all-layers" but not in "squashed". I need a way to filter these packages, so that on the one hand I provide all of the locations of a given package, but on the other I don't include packages that don't actually exist.
Moreover, some locations can be shared by multiple packages, for example:

        {
          "path": "/var/lib/dpkg/status",
          "layerID": "sha256:6ef70ed67080fd791df3956b04f469caa1173aacbb778f47a65f6c2680ef1938",
          "accessPath": "/var/lib/dpkg/status",
          "annotations": {
            "evidence": "primary"
          }
        }

So the only way I can filter it is with the context of a given package, checking whether it exists in the dpkg database of a specific layer or not.

I can provide some sample Dockerfiles which demonstrate the issue.
The only way I could think of was to add a field to the location struct, which can indicate to the other components (like the SBOM) which resolver provided a given location.

If I try to build it inside the resolver, I still can't manage to filter out packages that were deleted (and I don't want to change the interface of the resolver functions).
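
To illustrate why a package-agnostic filter falls short, here is a naive sketch (existsInSquashed is hypothetical, not code from this PR): it keeps a package if any of its evidence paths still resolves in the squashed filesystem, which the shared /var/lib/dpkg/status path defeats, since that file still exists after a package is removed.

package fileresolver

import (
	"github.com/anchore/syft/syft/file"
	"github.com/anchore/syft/syft/pkg"
)

// existsInSquashed keeps a package if any of its evidence locations still
// resolves in the squashed filesystem.
func existsInSquashed(p pkg.Package, squashed file.Resolver) bool {
	for _, loc := range p.Locations.ToSlice() {
		// a shared path like /var/lib/dpkg/status resolves here even for
		// packages removed later, producing false positives -- hence the
		// need for package-level context described above
		if found, err := squashed.FilesByPath(loc.RealPath); err == nil && len(found) > 0 {
			return true
		}
	}
	return false
}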

Can we discuss possible solutions? @wagoodman @popey

@tomersein
Contributor Author

Hi @wagoodman!
I followed the gardening OSS meeting today and noticed there wasn't enough time to discuss this PR. Please let me know if you have any suggestions for an alternative approach besides using a new field in location :)
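
One alternative already raised in the review is to record this as a location annotation rather than a new field; a minimal sketch using the existing annotation mechanism (the "squashed" key is illustrative, not an agreed name):

package fileresolver

import "github.com/anchore/syft/syft/file"

// markSquashed tags a location found by the squashed resolver via the
// existing annotations map, avoiding a new field on LocationMetadata.
func markSquashed(loc file.Location) file.Location {
	return loc.WithAnnotation("squashed", "true")
}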

@github-actions github-actions bot added the json-schema Changes the json schema label Dec 6, 2024

@github-actions github-actions bot removed the json-schema Changes the json schema label Dec 6, 2024
@tomersein
Contributor Author

@wagoodman I delivered a new solution without the new field.
Let me know what you think :)

@tomersein
Contributor Author

Hi @wagoodman!

  • removed the field as requested
  • added unit tests

I think this PR is now ready for further deep review; let me know how to proceed :)

@tomersein
Contributor Author

An update:
I started testing this feature on multiple images, and I have an issue related to the annotations:
in some catalogers the annotation is created manually by the cataloger, so even when I pass it through the resolver it will be empty.
Moreover, this bug is related -> #3521
Sometimes the results are not consistent, and 2 packages are created instead of 1. Because of this, the overall results are not consistent (the sha in the package ID is not the same).

I wonder how we can make sure the annotation is always created/passed in future catalogers, or should I add it manually to all catalogers?

@kzantow kzantow self-requested a review December 17, 2024 18:11

@kzantow kzantow left a comment


Hi @tomersein -- sorry for the delay getting this reviewed sufficiently. Since there have been some different revisions with varying areas of focus, I'm going to try to make this as thorough as possible. Apologies if some aspects seem like they may have come from left field, but we tend to review things based on getting most important things addressed first, and you've done a great job here addressing the feedback to this point! 👍 But it's possible that not all aspects were scrutinized the same way, so apologies if some of this seems like we maybe should have said something about it earlier.

I would like to address a number of naming issues. (Naming is hard, we all know! One of the hardest CS problems!) It's not as important for internal stuff, at all, but is definitely important for scope options that we would be unable to change later, and we might as well try to get the other things as "right" as we can, while we're at it.

There are a number of spots where we should probably use different names -- sometimes simple things like variable naming, but also the name of the feature itself. Squashed means one thing: all layers have been squashed to the final representation, and all-layers means that results found in each layer are included. As implemented, it looks like this takes the results from all-layers and deduplicates them, which doesn't seem to be the same thing as a "squashed" representation: it would include packages that are not present in the final representation. If this is the desired behavior, I would probably call this something like all-layers-deduplicated, or maybe even just add an option to run the deduplication and use all-layers.

Is this the current/intended behavior ☝️ , or have I missed something? As an example: if I take a base image with an RPMDB and delete the RPMDB, "squashed" would not have RPM packages because the RPMDB is missing, but all-layers would have those packages, and I believe, by association, this PR would result in those packages being included.

If squashed is the right term, we should be consistent in naming "squashed" vs "squash" -- there are a number of spots in variable and CLI option naming where "squash" is used instead of "squashed".

Some other names which don't really seem descriptive enough or are confusing to me:

  • packagesToRemove function and getPackagesToDelete function; this could be consolidated
  • NewScopesTask -- this is specifically for one scope, not all scopes

Apologies if I've missed something; I think this is getting pretty close to the finish line, so thanks for your patience!

@@ -285,7 +285,6 @@ func (c *Collection) Sorted(types ...Type) (pkgs []Package) {
for p := range c.Enumerate(types...) {
pkgs = append(pkgs, p)
}

Contributor

You should revert this unnecessary newline change

Contributor Author

fixed!

@@ -211,6 +211,7 @@ func toPackageModels(catalog *pkg.Collection, cfg EncoderConfig) []model.Package
for _, p := range catalog.Sorted() {
artifacts = append(artifacts, toPackageModel(p, cfg))
}

Contributor

Unneeded whitespace change

Contributor Author

fixed!

return nil
}

return NewTask("scope-cataloger", fn)

Contributor

This is not a cataloger, is it? I think this task is actually a Squashed-With-All-Layers-Cleanup task, since it only seems to be deleting extraneous packages and it doesn't run unless using the SquashedWithAllLayers scope, so it should be named appropriately.

Contributor Author

fixed!

@@ -0,0 +1,567 @@
package fileresolver

Contributor

This filename has a typo; it was intended to be squash?

Contributor Author

fixed!

Comment on lines 15 to 18
fn := func(_ context.Context, _ file.Resolver, builder sbomsync.Builder) error {
finalizeScope(builder)
return nil
}

Contributor

Instead of having an intermediate function here, repurpose the "finalizeScope" below with this signature.
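
For illustration, the suggested refactor amounts to something like this sketch (the body of finalizeScope is elided; only the signature changes):

// finalizeScope takes the task function signature directly, removing the
// need for the intermediate wrapper above
func finalizeScope(_ context.Context, _ file.Resolver, builder sbomsync.Builder) error {
	// ... existing finalization logic using builder ...
	return nil
}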

Contributor Author

fixed!

@tomersein
Contributor Author

(Quoting @kzantow's review above.)

Hi @kzantow

Thanks for the response. I'll start working on the comments.
Regarding the mode - I think you didn't understand exactly what I meant. This scope will include only packages that exist in the final state of the image, but it will include all of the "history" of a package. Meaning: if you have a deleted RPMDB, its packages will not be included in the SBOM. If you added a package P1 in layer xxx, and then updated dpkg (for example, added another package P2), you will see that package P1 existed in 2 layers! So a user can understand when a package was added (and in Grype they will see where a CVE was introduced).

Let me know if you have any other questions!
By the way, given my explanation, I am not sure about the naming of "squash-with-all-layers" or "all-layers-deduplicated".

@kzantow kzantow self-assigned this Jan 6, 2025
Labels: none yet · Project status: Stalled · 5 participants