
Proposal: Stream sbom to disk (avoiding large memory footprint and OOMs) #3263

Open
HairyMike opened this issue Sep 23, 2024 · 7 comments
Labels
enhancement New feature or request needs-proposal Should be done but needs proposal/design for further discussion performance
Comments

@HairyMike

HairyMike commented Sep 23, 2024

What would you like to be added:
Currently, Syft builds the sbom report in memory before writing it to disk. I propose that instead of building in memory, we stream directly to disk.

Why is this needed:
To avoid OOMs

Additional context:
SBOM generation:
https://github.com/anchore/syft/blob/main/cmd/syft/internal/commands/scan.go#L199
https://github.com/anchore/syft/blob/main/internal/task/package_task_factory.go#L116
https://github.com/anchore/syft/blob/main/syft/create_sbom.go#L66

Report generation:
https://github.com/anchore/syft/blob/main/cmd/syft/internal/commands/scan.go#L208

@HairyMike HairyMike added the enhancement New feature or request label Sep 23, 2024
@spiffcs
Contributor

spiffcs commented Sep 23, 2024

👋 thanks for the issue @HairyMike -- we're trying to understand where the stress points are that prompted this suggestion:

  • Are you dealing with SBOMs that contain a large number of packages?
  • Are the contents of those packages so large that you're running out of memory?
  • Are there any specific formats you're trying to create that are causing memory issues on your machine?

Do you have an image we could use to start tinkering with what something like this could look like? There are a lot of assumptions we could bake into a solution here, and some of them might not actually solve the problem you encountered.

@wagoodman
Contributor

I don't see a way to stream the format agnostically as we find packages/relationships. But I imagine we could spool results to a sqlite file on disk and have an SBOM object backed by this sqlite DB drive how we format the final SBOM. This, however, could run into the same OOM issue depending on the nature of the image.

@HairyMike
Author

HairyMike commented Sep 23, 2024

Thanks for the replies @spiffcs and @wagoodman.

This is related to scans of large disks that contain a lot of files. For example, scanning disks attached to a CI machine like Jenkins (where scans can take many hours) can lead to OOMs if the machine doing the scanning doesn't have enough memory.

A few figures from a scan we attempted: on a 250 GB CI node we found 1.5M packages and consumed ~14 GB of memory. Scan time was ~15 hours on an m7i-flex.xlarge.

One way around it is to break the scans up into smaller chunks and produce multiple SBOMs, but I think streaming directly to disk would avoid the need for manual chunking or a high-memory scanner.

I don't currently have an image that can be used to reproduce this, although I may be able to provide one later.

My hope is that there's a way to incrementally create the SBOM during the scan; that would mean the size of the scan target is limited by disk space (cheap) rather than memory (expensive).

@wagoodman
Contributor

That's a lot of packages! Would you be willing to post a pprof profile for us to take a look at? This can be produced with:

export SYFT_DEV_PROFILE=mem

We're aware of a few memory adjustments to make based on anchore/stereoscope#233, but I'm interested in your specific profile to see if we have other findings here.

@wagoodman
Contributor

The team chatted about this one today and came to a few conclusions:

The changes we think we'd make to the system are:

  • Pass an SBOM writer object to catalogers rather than having them return slices of packages/relationships. This is not only less brittle from an API standpoint, it also gives us a writer interface whose implementation we can swap out (say, one that spools to disk).
  • There are a few different ways to spool out intermediate results, such as sqlite or protocol buffers -- we should explore more options here. But the first point is the larger one: make a facade for these so we can safely swap them out without exposing them on the public API.
  • The sbom.SBOM object would also need to be an interface, not a struct with data. This drives a lot of behavior downstream, so it is a very impactful change. It would be akin to the v1.Image and v1.Layer interfaces in the GGCR lib: interfaces that represent data, with different implementations to handle fetching and transforming it.
  • It would be fun to additionally have a -o sqlite option in syft.
  • One source of memory pressure today (on top of the stereoscope one mentioned earlier) is that relationships get copies of objects, not references to them. We can't change this behavior until v2 either, since folks type-assert information out of relationship objects.
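A minimal sketch of the first bullet, using hypothetical names (`SBOMWriter`, `memWriter`, and `fakeCataloger` are illustrative, not Syft's actual API): catalogers push results into a writer as they go instead of returning slices, so a disk-spooling implementation can later be swapped in behind the same facade without touching the catalogers.

```go
package main

import "fmt"

// Package and Relationship are trimmed stand-ins for syft's types (hypothetical).
type Package struct{ Name string }
type Relationship struct{ From, To string }

// SBOMWriter is the facade catalogers would write into; implementations
// could buffer in memory or spool to disk (e.g. sqlite) transparently.
type SBOMWriter interface {
	AddPackages(pkgs ...Package) error
	AddRelationships(rels ...Relationship) error
}

// memWriter is the simplest possible in-memory implementation.
type memWriter struct {
	pkgs []Package
	rels []Relationship
}

func (w *memWriter) AddPackages(pkgs ...Package) error {
	w.pkgs = append(w.pkgs, pkgs...)
	return nil
}

func (w *memWriter) AddRelationships(rels ...Relationship) error {
	w.rels = append(w.rels, rels...)
	return nil
}

// fakeCataloger shows the push model: results go straight to the writer
// instead of accumulating in a returned slice.
func fakeCataloger(w SBOMWriter) error {
	if err := w.AddPackages(Package{Name: "zlib"}, Package{Name: "curl"}); err != nil {
		return err
	}
	return w.AddRelationships(Relationship{From: "curl", To: "zlib"})
}

func main() {
	w := &memWriter{}
	if err := fakeCataloger(w); err != nil {
		panic(err)
	}
	fmt.Printf("%d packages, %d relationships\n", len(w.pkgs), len(w.rels))
}
```

Because catalogers only see the interface, the choice of backing store becomes a caller-side decision rather than part of the public API surface.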

@wagoodman wagoodman added this to the Syft 2.0 milestone Sep 27, 2024
@willmurphyscode willmurphyscode added needs-proposal Should be done but needs proposal/design for further discussion and removed needs-discussion labels Dec 18, 2024
@wagoodman
Contributor

wagoodman commented Dec 19, 2024

Something that came up in today's livestream is to consider using a memory-based DB that automatically spools to disk when needed. This needs research to find all of the options, but I want to say that sqlite still fits the bill here too: you still open the DB against a file, but use the cache_size pragma to ensure that a significant number of DB pages are kept in memory for fast access.
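For example (a config fragment; the exact page budget is an assumption, not a recommendation): a negative cache_size value sets the cache in KiB, so the following keeps roughly 256 MiB of DB pages in memory while the database itself lives on disk.

```sql
-- keep ~256 MiB of pages cached in memory (negative value = size in KiB)
PRAGMA cache_size = -262144;
-- WAL mode also helps sustained write throughput while spooling results
PRAGMA journal_mode = WAL;
```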

Something that would be interesting in terms of v1 compatibility: we could start introducing a new sbom.SBOMv2 interface to eventually replace the struct, keeping it in an internal package in the meantime and only using the type internally within package cataloger tasks themselves. Catalogers would still return their intermediate results, but the task that calls the cataloger could immediately write them out to an internal DB. It wouldn't be as optimal as writing to that DB directly within the cataloger, but it would be a great first step.

This new sbom.SBOMv2 interface would have all the methods necessary to add/get/remove packages, files, and relationships. Only at the last mile would today's sbom.SBOM struct be populated for the API -- a step we could omit when calling internally within the syft cmd package by using the internalized interfaces instead.

This can answer the question for folks with resource constraints:

  • not a lot of memory: use the sqlite-backed implementation of sbom.SBOMv2 and leverage disk
  • not a lot of disk space: use a map-backed implementation (or a sqlite :memory: connection) and leverage memory

With this approach syft is always writing results to a file rather than primarily keeping them in memory. This would be a really worthwhile path to try in v1 before needing to make a breaking change (and it's vital to keep it in an internal package in the meantime so we can iterate on it over time).
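One way the backing swap could look, again with hypothetical names (the real sbom.SBOMv2 method set is still to be designed): a single add/get/remove interface, with the implementation chosen by resource constraints -- a map for memory-rich hosts, an on-disk store satisfying the same interface for memory-poor ones.

```go
package main

import "fmt"

// Package is a trimmed stand-in for syft's package type (hypothetical).
type Package struct {
	ID   string
	Name string
}

// SBOMv2 sketches the add/get/remove surface described above; files and
// relationships would get analogous methods.
type SBOMv2 interface {
	AddPackage(p Package) error
	GetPackage(id string) (Package, bool)
	RemovePackage(id string) error
	PackageCount() int
}

// mapSBOM is the memory-backed implementation; a sqlite-backed type
// satisfying the same interface could be swapped in when memory is scarce.
type mapSBOM struct{ pkgs map[string]Package }

func newMapSBOM() *mapSBOM { return &mapSBOM{pkgs: map[string]Package{}} }

func (s *mapSBOM) AddPackage(p Package) error {
	s.pkgs[p.ID] = p
	return nil
}

func (s *mapSBOM) GetPackage(id string) (Package, bool) {
	p, ok := s.pkgs[id]
	return p, ok
}

func (s *mapSBOM) RemovePackage(id string) error {
	delete(s.pkgs, id)
	return nil
}

func (s *mapSBOM) PackageCount() int { return len(s.pkgs) }

func main() {
	// the backing is chosen once here; callers only ever see SBOMv2
	var doc SBOMv2 = newMapSBOM()
	doc.AddPackage(Package{ID: "p1", Name: "musl"})
	doc.AddPackage(Package{ID: "p2", Name: "busybox"})
	doc.RemovePackage("p1")
	fmt.Println(doc.PackageCount()) // 1
}
```

Keeping the interface internal at first, as suggested above, means these method signatures can change freely until the design settles.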

@wagoodman
Contributor

Something more tangible:

So what's needed next? We need to make a meaningful proposal for what the new sbom.SBOM interface should be, based on the usage in the above links.
