Discussion: Possibly ambiguous language regarding the use of cache artifacts #894
Interesting, the part that makes the least sense to me is
I generally read these sections as indicating that all cached content used for the build must have an immutable reference (i.e. a hash) of the content to be used for addressing. If the build system leverages a cache in a build, the provenance should include the cached content as it was originally cached. Therefore, a build system may cache dependencies pulled/requested by builds as long as the cache is immutably referenceable and the build system maintains a full record of the cache's sources to be included in the provenance. If, for some reason, the cached content differs from the original, then the resolvedDependencies would have to capture that. I would expect that this is an action that the build system would perform regardless of whether the cache is used in order for the following clause to be true:
Could you restate this? I'm having trouble parsing this, sorry.
I think I agree with the first part but I'm not sure about the second, because the Provenance spec seems to indicate that the use of a cached artifact must be recorded only if it impacts the build.
Just to clarify, do you mean that a cached artifact changes? So, say an artifact foo-1.1.0 is cached with hash A but at some point it's overwritten with hash B because it's not reproducible?
I think this is also a little confusing because it is dependent on the build being reproducible, which is not currently a requirement. |
Let me try to restate my argument instead of directly responding to all of the questions. I see a cache as an implementation detail of a build system. It saves time and processing by saving off "some state" of a build so that a future build with the same reference can reuse that state without having to regenerate it. To this end, a cached object is almost like an artifact itself, though an intermediate one. While SLSA isn't recursive, if we treat a cached item as an intermediate artifact, then that artifact should have a known provenance according to the build platform's targeted SLSA level. Therefore, if a build system is producing an artifact by means of a cache, I would expect it to be able to maintain the provenance of said cache and keep it associated with the cached artifact when that artifact is created. When a future build comes along and pulls in that cached artifact, the build system will then be able to inject the same (original) provenance into that of the final resulting artifact. The reproducibility of a cached artifact is a separate concern. Some caches may be reproducible, but others may not be. Caches could contain "raw" dependencies (e.g. go/npm/pip packages) or they could contain output from previous build steps. If any processing is required to produce a cached artifact then this should be represented in the resolvedDependencies in the final artifact's provenance. If all cached items used for a build are themselves considered to be consistent with some bar of reproducibility then they shouldn't disqualify an artifact pulling in those cached artifacts from a maximum of that same bar of reproducibility. A cache could literally just be a set of intermediate artifacts which are then aggregated for the final artifact. In essence, this wouldn't be any different than executing all required steps in the build without the use of a cache.
Ultimately, the management of a cache, including populating it and retrieving from it, is gated by the build system at some level. These processes should ensure (for Build L3 at least) that appropriate controls are in place for builds to refer to cached artifacts with immutable references, in a way that prevents cache poisoning. This may be using digests/hashes of requested resources, it may be using trusted output from previous build steps, it may be some other process. The important part is that the build system is capable of preventing the poisoning. If for some reason a cached artifact is overwritten then it shouldn't be retrievable using the old cached artifact's references.
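To make the "immutable reference" idea concrete, here is a minimal sketch of a content-addressed cache that keeps each entry's dependency record alongside the cached bytes. The class name, the in-memory storage, and the shape of the dependency records are assumptions for illustration only; nothing here is taken from the SLSA spec or from any real build platform.

```python
import hashlib


class ContentAddressedCache:
    """Toy in-memory cache keyed by the SHA-256 digest of each entry (hypothetical)."""

    def __init__(self):
        # digest -> (content, resolved dependencies recorded when the entry was created)
        self._entries = {}

    def put(self, content: bytes, resolved_dependencies: list) -> str:
        digest = hashlib.sha256(content).hexdigest()
        # setdefault never replaces an existing entry, so a later build
        # cannot overwrite what an earlier build stored under this digest.
        self._entries.setdefault(digest, (content, list(resolved_dependencies)))
        return digest

    def get(self, digest: str):
        content, deps = self._entries[digest]  # KeyError if the reference is unknown
        # Re-verify on read: tampered storage must not be served under the old reference.
        if hashlib.sha256(content).hexdigest() != digest:
            raise ValueError("cache entry does not match its digest")
        return content, list(deps)
```

Because the key is derived from the content, changed content necessarily gets a new reference, and the dependency record travels with the entry so a later build can carry it into its own provenance.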
Thanks for pointing this out. I agree that it is ambiguous and ill-defined. Here's a stab at a better set of definitions and requirements (we'll need diagrams, but hopefully this will do for now). Sorry for the long post.
Definitions (without build cache)
For example, suppose a build had the external parameters
Then
In provenance, intermediate artifacts SHOULD NOT be recorded while dependencies MUST be recorded at the prospective future Build L4.
Build cache
A build cache is an optimization to a build that reuses intermediate artifacts and/or dependencies from prior builds rather than building from scratch and/or fetching from an external resource, respectively. Logically a build cache SHOULD NOT have a material impact on the behavior of the build, meaning that the output SHOULD be identical whether or not the cache is used. However, in practice most build caches are vulnerable to "cache poisoning" attacks, where one build can insert build cache entries such that another build will behave differently than it would have had the build cache not been used. Continuing our example above, suppose
Therefore, unless excepted below:
Exception: If a build platform guarantees through its design that a build cache is not vulnerable to cache poisoning attacks, then cached intermediate artifacts can be ignored in the provenance while cached external dependencies can be treated the same as coming from the original source. In practice, this requires the following:
Coming back to the example, if the build were rearchitected such that the compilation of
Open questions
If you agree with the above, then does the discussion of build caches really belong at Build L3, or is it just an L4 thing? If a build opts in to a cache, e.g. with https://github.com/actions/cache, then I think it's out of scope for L3. What if it's enabled by default - does that affect L3 now?
Do you mean that all dependencies MUST be recorded in
The perspective that I shared earlier came from one where the build platform is managing the cache entirely. Builds might request data via the build platform and the platform is capable of determining whether a cached item can be returned or whether a new artifact must be built/retrieved. In this scenario, I think it is still valid to include restrictions on caches in L3 as indicated in the specification:
If builds are managing/updating caches, then I think that falls along the same "well-intentioned build" statement that is included in L3 as well. A side effect of this is that the cached dependencies may not be represented in the provenance as the control plane may not know about the dependencies.
Therefore, I believe that the clarification belongs in L3. An L4 requirement would be to effectively disallow any caches that are not controlled by the build platform.
This confused me initially, but after reading https://slsa.dev/provenance/v1#rundetails, it makes sense. These intermediate artifacts are those produced during a build and are of no use to future builds after completion.
Oops, yes. Edited to fix the typo.
Yeah, that makes sense. Right now, it says that, if a build cache is used, it MUST NOT be susceptible to cache poisoning from prior builds. What I was suggesting was that perhaps we could relax this to either that OR considering the cache untrusted, and thus treating anything fetched from the cache as equivalent to an external dependency. Though now that I say that, I'm not so sure. What do you think?
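As a rough illustration of the "treat the cache as untrusted" alternative, the sketch below records a cache hit the same way one would record any fetched external dependency: by location and digest. The function name and the cache URI are hypothetical; the `uri`/`digest` shape follows the resource descriptor layout used for resolvedDependencies in the v1 provenance format.

```python
import hashlib


def descriptor_for_cache_hit(cache_uri: str, content: bytes) -> dict:
    """Describe bytes fetched from an untrusted cache as an external dependency."""
    return {
        "uri": cache_uri,  # hypothetical cache location
        "digest": {"sha256": hashlib.sha256(content).hexdigest()},
    }


# Each cache hit is appended to resolvedDependencies like any other fetch.
resolved_dependencies = [
    descriptor_for_cache_hit("https://cache.example.com/foo-1.1.0", b"cached bytes"),
]
```

Under this option the provenance does differ depending on whether the cache was used, which is the trade-off being weighed in the comment above.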
The cache requirements cannot be enforced if the build platform is not in full control of it. Therefore, I think that we can clarify in L3 to indicate that a cache run/operated by the build platform MUST NOT be poisonable. If any other cache is used in the build itself then the no build sub-requirements statement would continue to hold. That being said, I would expect that anything the build platform pulls from the cache will ultimately be represented as a
Therefore, when the build platform generates the provenance for the artifact, it should be able to resolve and include all of the dependencies which are pulled from the cache. |
I'm not sure that is always practical or desirable. For example, consider a Bazel-based build platform that uses Remote Execution under the hood and caches intermediate artifacts using the Content Addressable Storage (CAS). Some builds have >100k intermediate artifacts. Recording all of these artifacts in the That's why I was thinking that one might just ignore the cache if the risk of cache poisoning is sufficiently low. |
Addresses slsa-framework#894 Signed-off-by: arewm <arewm@users.noreply.github.com>
Would a Bazel-based system be able to differentiate between intermediate artifacts and the resolved dependencies? I wasn't trying to say that everything pulled from the cache should be indicated as a resolved dependency. Instead, the resolved dependencies used to create the cached artifacts used in a build must be captured in the resulting build's provenance. #901 is an attempt to represent this. |
…atform. Addresses slsa-framework#894 Signed-off-by: arewm <arewm@users.noreply.github.com>
Oh, I see. Yeah, I think that aligns with my thinking. In other words, it would look the same whether or not the cache is used?
Yep, exactly. |
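A toy sketch of the point agreed on here: if every cache entry carries the resolved dependencies that went into producing it, the final dependency list comes out identical on a cold and a warm cache. The `build` function, the "compile" step (upper-casing bytes), and the dictionary cache are all stand-ins invented for illustration.

```python
import hashlib


def build(sources, cache):
    """Toy build: 'compile' each source, reusing cached results when available.

    Returns (output_bytes, resolved_dependencies).
    """
    outputs, resolved = [], []
    for name, content in sources:
        key = hashlib.sha256(content).hexdigest()
        if key in cache:
            # Cache hit: reuse the output and the dependencies recorded with it.
            out, deps = cache[key]
        else:
            # Cache miss: do the work and record what went into it.
            out = content.upper()
            deps = [{"name": name, "digest": {"sha256": key}}]
            cache[key] = (out, deps)
        outputs.append(out)
        resolved.extend(deps)
    return b"".join(outputs), resolved


sources = [("a.src", b"alpha"), ("b.src", b"beta")]
shared_cache = {}
cold = build(sources, shared_cache)  # populates the cache
warm = build(sources, shared_cache)  # served entirely from the cache
```

Here `cold` and `warm` are equal in both output and recorded dependencies, and nothing about the cached intermediate outputs themselves is listed.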
Thanks for all the responses! I'm going to respond to some points by both of you here.
I think I agree with this statement. And this is why I was considering the reproducibility of the cached intermediate artifact. Even with strong protections against cache poisoning, using a previously built Additionally, regardless of whether it's because of irreproducibility of the intermediate artifact or cache poisoning, we can't know the impact of using a cached artifact without repeating the build of
I'd argue that as SLSA is not recursive, the spec seems to indicate that we'd have recorded provenance for only the final artifacts and not for intermediate artifacts. I may be missing text that says otherwise though. I see Mark says something similar with "Cached intermediate artifacts MUST be considered dependencies and SHOULD have their own provenance" so I suspect I'm missing some information. |
A "SHOULD" seems weak. Is there a way to reword this into a MUST? Here's one try:
I don't think that's quite right but that's the sense of the direction I was trying to go. |
Discussed in community meeting on 24 July 2023. Action item for community: review PR #901 |
Currently, the Provenance spec reads as follows regarding caches:
Zeroing in on communicating with the cache specifically, I'm parsing this as "if using a cached artifact changes the build definition, record it in resolvedDependencies." First of all, is this a fair reading of the sentence?
If yes, it raises the question of what changing the build definition means. buildDefinition describes the inputs to the build. The two ways I see that using a cache can impact the inputs to the build are a) as a configuration to use the cache, and b) by treating a cached artifact as an input to the build, since it changes the build result in some way.
The first is entirely disconnected from actually using a cached artifact in my mind because a build can be configured to use a cache and end up with zero hits. Further, this configuration wouldn't actually end up in resolvedDependencies, but presumably one of the parameters fields? On the other hand, the second raises other questions about cache behavior.
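To illustrate the distinction between the two cases, here is a sketch of a v1-style provenance predicate in which the cache configuration sits under the parameters and cache hits, if recorded at all, sit under resolvedDependencies. The `useCache` parameter, the buildType URI, and the repository URL are hypothetical; where such a setting really belongs is exactly the open question raised above.

```python
def make_predicate(use_cache: bool, cache_hits: list) -> dict:
    """Assemble a minimal provenance predicate (illustrative values only)."""
    return {
        "buildDefinition": {
            "buildType": "https://example.com/buildtypes/demo/v1",  # hypothetical
            "externalParameters": {
                "repository": "https://example.com/repo.git",  # hypothetical
                "useCache": use_cache,  # case (a): configuration to use the cache
            },
            # case (b): cached artifacts treated as inputs to the build
            "resolvedDependencies": list(cache_hits),
        },
    }


# A build configured to use the cache that ended up with zero hits:
predicate = make_predicate(use_cache=True, cache_hits=[])
```

The configuration is present even though resolvedDependencies is empty, which is why the first case looks disconnected from actually using a cached artifact.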
The requirements table calls out the impact of caches for the isolation requirement:
This seems to indicate that using a cache must not have an impact on builds beyond a configuration to use or not use the cache (it also alludes to reproducibility if I'm reading this right). Which means cache-specific behavior is never expected to be recorded in Provenance? Regardless of what the intent is, I think some clarifying may be in order! 😄