Decoupling Location from Identity - Is this in the scope of purl? #127
Comments
Exactly. Which is why CycloneDX is heavily focused on security use cases, provenance being one of them. It's important to know where something was retrieved from, even if it was an internal mirror. When software is built/assembled, I'm not aware of any use case where the same artifact is retrieved from multiple repos and used. Just because something CAN be retrieved from multiple sources, doesn't mean it was. This is also where CycloneDX and SPDX vary dramatically in scope. As a pure BOM format, CycloneDX cares about what actually transpired, whereas SPDX (which I would not classify as an SBOM format, but it can be used for SBOM use cases) describes what something COULD be. A look at SPDX external references is all that's required for that to become obvious. Internal repo servers (most of them) do not support:
So although I can specify the internal repo from which I retrieved something, many repo servers do not provide the full transparency necessary to achieve these basic requirements. See also: https://owasp-scvs.gitbook.io/scvs/
Purl is already heavily used in SBOM use cases today with 100K+ CycloneDX adopters - most of which utilize purl. So I think we have to better understand what specific SBOM use case is not being addressed today. As far as I do think however, there's an opportunity for an organization to "opt out" of using location by supporting a way to specify no default repo and no repo url. This might be useful for private repos. If an organization wants to practice security through obscurity, this would provide them a way to achieve that, but I would recommend this be an opt-in feature as we would not want to cripple location for the majority in favor of the few. |
I'm curious how CycloneDX users will be able to find the correct endpoint within their intranet if they are using a docker installation configured to use a mirror, or a go installation that uses an internal proxy. What if they invoke a CLI tool which invokes the tools which do the fetching after several hops?
I don't think cloud native repos as they exist right now (mostly backed by S3 buckets) provide that kind of transparency either. All the user sees is a front-facing API with no visibility into where exactly the artifact comes from. In fact, in most cloud native environments, folks don't care where the artifact is located as long as its integrity can be verified and it is signed. |
As far as I can tell, this is what @iamwillbar was suggesting by making the |
@nishakm The answer is in the question. Since CycloneDX has a data model optimized for highly automated pipelines, it’s elementary to enhance, correct, or merge SBOMs during the execution of the pipeline. Inspecting the configuration to discover use of a mirror and correcting purls in the SBOM is quite simple. I believe Maven is one of only a few dependency management systems that also provide information on what repository each and every artifact was retrieved from. Most package managers are immature by comparison. But we should not see the immaturity of other systems as a reason to diminish the default behavior of purl.
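As a rough illustration of the "correct the purls during the pipeline" step described above, here is a minimal sketch. It assumes the SBOM has already been loaded as a dict with a `components` list whose entries carry a `purl` field (as in CycloneDX JSON). The function name `add_mirror_location`, the mirror host, and the sample purls are hypothetical, and the sketch appends the qualifier without re-sorting qualifiers into canonical order.

```python
from urllib.parse import quote


def add_mirror_location(sbom: dict, purl_type: str, mirror: str) -> dict:
    """Add a repository_url qualifier to purls of one type that lack it.

    `mirror` is assumed to come from inspecting the build configuration.
    """
    prefix = f"pkg:{purl_type}/"
    for component in sbom.get("components", []):
        purl = component.get("purl", "")
        # Skip other package types and purls that already state a location.
        if not purl.startswith(prefix) or "repository_url=" in purl:
            continue
        # Keep any subpath ("#...") at the end of the purl.
        head, sep, subpath = purl.partition("#")
        joiner = "&" if "?" in head else "?"
        component["purl"] = (
            f"{head}{joiner}repository_url={quote(mirror)}{sep}{subpath}"
        )
    return sbom


# Hypothetical SBOM fragment and mirror location.
sbom = {
    "components": [
        {"name": "lib", "purl": "pkg:maven/org.example/lib@1.0.0"},
        {"name": "left-pad", "purl": "pkg:npm/left-pad@1.3.0"},
    ]
}
add_mirror_location(sbom, "maven", "nexus.internal.example/maven")
```

Only the Maven component is touched here; the npm purl keeps relying on its type's default repository.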
You’ve just described how SolarWinds happened - blind trust in something without transparency or methods to validate. We should not be interested in promoting practices that support continued use of bad practices. We need to support efforts that promote further transparency, even if it’s difficult for some ecosystems to achieve today.
@rnjudge I could support the addition of a way to opt out or otherwise specify the location is unknown or not disclosed. I am not in favor of making location strongly recommended for the core purl spec as that one change would alter the meaning of every purl being used today. It’s a small, but breaking change. Pinging @pombredanne for feedback. |
Given @SteveLasker specifically references SBOM use cases I think this is a non-starter. Unless maybe if you are only using purls in SBOMs for intellectual property use cases? Where a package was retrieved from is important for software supply chain security use cases. The component might be the same on disk. But the provenance is quite different. And, if you are trying to look at supply chain risk, this information is important. I think it would be more beneficial to identify what Steve L thinks is missing from the existing purl format. If it's just a case of being able to remove the location information surely that can be done by the consumer when parsing? |
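To make the "remove it when parsing" suggestion concrete, here is a minimal sketch of a consumer dropping location qualifiers from a purl string. The function name `strip_location` and the sample purl are hypothetical, and treating `repository_url`, `download_url` and `vcs_url` as the location-bearing qualifiers is an assumption made for this example. It works on the raw string, so the remaining qualifiers keep their original encoding.

```python
# Qualifier keys treated as location hints (an assumption for this sketch).
LOCATION_QUALIFIERS = {"repository_url", "download_url", "vcs_url"}


def strip_location(purl: str) -> str:
    """Return the purl with location qualifiers removed."""
    # Split off the optional subpath first, then the optional qualifiers.
    rest, sub_sep, subpath = purl.partition("#")
    base, _, query = rest.partition("?")
    kept = [
        pair
        for pair in query.split("&")
        if pair and pair.partition("=")[0].lower() not in LOCATION_QUALIFIERS
    ]
    result = base
    if kept:
        result += "?" + "&".join(kept)
    if sub_sep:
        result += "#" + subpath
    return result


# Hypothetical purl that names an internal repository.
example = "pkg:maven/org.example/lib@1.0.0?repository_url=repo.example.org/release&type=jar"
identity_only = strip_location(example)
```

The producer keeps recording where the component came from, and each consumer decides whether that part matters for its use case.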
I think I understand @stevespringett's concern that the problem is differentiating what the location is from what the location could be. Therefore, I think this isn't a specification problem but a cloud native problem i.e. the notion of "it doesn't matter how the artifact got here as long as its checksum matches the published checksum and it is signed". Even in the highly automated environments existing now, the client tools do not report the endpoints they are hitting in order to fetch an artifact. So something like the |
Let’s tease apart a few things as I’m not suggesting this is problematic for all references. A source code repo is somewhat interesting if it can be disclosed. Most OSS projects can; most products won’t. The beauty of digital bits is we can encode them, generate digests (hashes) of them and sign them with independently verifiable signatures. As long as they remain the same, it doesn’t matter where they were. We know they weren’t tampered with and we know who attests to them with a signature. This is how SolarWinds was “quickly” found to not be a distribution attack, as the DLLs were signed and they matched the digests generated from the build environment. So, I get that location is interesting from a forensics perspective. In many cases that internal, proprietary information can’t or shouldn’t be disclosed. There is an issue with how to discover the SBoM from the point of an artifact that may not know it has an SBoM. When you have the SBoM, we need a way to know it’s referring to this very specific artifact. |
@SteveLasker can you provide a concrete example of a purl that would be problematic? I cannot think of any. Say you have a private Maven or Docker registry, and for the sake of arguments the same packages are available also in the public, default repository for this package type. For instance:
Say that my "private" image registry is at https://quay.io/ And my "private" Maven repo is at https://repository.jboss.org/nexus/content/repositories/ea Based on that I could:
Either way works. When things are private, feel free to handle it as you like. So I am not sure there is any issue here? As a recap: a purl is a URL and is a locator, and all URLs are also URIs. Therefore a purl is also an identifier. The fact there is a default location for a type as opposed to something always hardcoded in the purl string means that you can also think of a purl as a pure identifier for private purposes. The global uniqueness of this identifier is something that's handled by default by the default public package repositories of each type. If you happen to use a content identifier (say a sha256) instead, that's fine too. If you do not publish your packages on the default public repo and you do not provide a way to locate it with a qualifier, that's OK too. Rather useless as no one will be able to find it, but that's OK too. |
@SteveLasker now your question is in the context of #123 and the context there is that you may not have a canonical, default reference repository location for a new OCI type. I see no issue having the default location be optional for a given purl type. This will be weird and problematic as someone with just a purl will not be able to get the package; and therefore this is less useful; short of a purl type-provided default repository URL location or a In the end, when there may be a need to get to the package code, you would always need some repo or registry location of sorts at runtime and/or fetch time to effectively retrieve the package archives. It can stay private. In recap, a package type default repository location or a repository_url qualifier is useful and desirable to locate, but not essential to identify, especially if the identity is "strongly" content-defined like when you use sha256 as version. I have no problem with this. Weird but OK. |
Would this be a reasonable set of rules based on OCI's requirements:
The intent of these rules is:
Does this resonate with people? |
@stevespringett / @coderpatros I'm curious why location matters from a supply chain security perspective if you have a trusted content hash. If you can't trust the content hash, then adding location doesn't make it any more (or less) trusted. If you trust the content hash, then adding location doesn't make it any more (or less) trusted. Whatever trust you give to a content hash should be independent of location because it's the same content. Extending on this, if you have sufficient provenance and pedigree information to say that a given content hash is trusted, from then on the location should be irrelevant. Inversely, if you have information that a content hash can't be trusted (or insufficient information to say you can trust it) then again the location should be irrelevant. In the SolarWinds example, there was originally a belief that a content hash was trusted, and new information came to light that a content hash shouldn't be trusted. Adding location wouldn't have mitigated or changed that outcome because it was the underlying content that became untrusted, not the location it was stored in. In fact, the IoCs provided were content hashes, independent of location. |
@iamwillbar at a point in time a component that has been brought into some assembled piece of software, and where it was pulled from, may be "trusted". But that package repository/mirror/whatever is part of your supply chain. And not everyone in the supply chain validates hashes/signatures along the way. So understanding where something came from can be useful. Especially as the "same" component can be different, with a different hash, depending on where it was retrieved from. For example, nuget adds a signature to packages when they are uploaded. Some of those packages are also published as github release artifacts, distributed as part of an SDK, etc. Not knowing where it was retrieved from makes this situation very problematic. Signatures don't solve the problem either. They are only good assuming the signing keys, or release process, haven't been compromised. |
@coderpatros I completely agree that repositories, mirrors, etc. are part of the supply chain, but that's independent of whether purl must include a location to establish trust. In the specific OCI case that spawned this discussion the version is a sha256 hash of the content and it can be mirrored to any number of locations and that identity doesn't change. If the content is tampered with or changed intentionally that changes the identity of the package and consumers wouldn't inadvertently retrieve the new package. Likewise, information like vulnerabilities, pedigree, etc. can be attached to the content hash and used independent of the location because the identity of the package is intrinsically linked to its contents. Unnecessarily scoping information to the location may result in relevant information being missed because it's deemed not relevant. This isn't to say that a location can't be provided as a hint of where you might be able to retrieve the image, that's perfectly valid, but having a location doesn't change the identity or trustworthiness of a content-addressed package. |
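The content-addressing argument above can be sketched in a few lines. This is only an illustration: the mirror hostnames and the byte string standing in for an image are made up, and a real client would fetch the bytes over the network rather than from a dict.

```python
import hashlib


def sha256_digest(content: bytes) -> str:
    """Return the digest in the 'sha256:<hex>' form used for content addressing."""
    return "sha256:" + hashlib.sha256(content).hexdigest()


def matches_identity(content: bytes, expected_digest: str) -> bool:
    """Check content against the digest recorded as its identity."""
    return sha256_digest(content) == expected_digest


# Stand-in for the published artifact; its digest is the identity we persist.
original = b"example artifact bytes"
identity = sha256_digest(original)

# Hypothetical locations: two faithful mirrors and one serving altered bytes.
fetched = {
    "registry-a.example": original,
    "registry-b.example": original,
    "tampered.example": original + b"!",
}
results = {
    location: matches_identity(data, identity)
    for location, data in fetched.items()
}
```

Both faithful mirrors pass and the altered copy fails, without the check ever consulting the location. Whether the digest itself deserves trust is the separate provenance question raised earlier in the thread.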
Yeah, I just don't get how removing information helps. Wouldn't you just parse the purl to extract what you want for particular use cases? Or use the component hash from the SBOM? |
@coderpatros the proposal isn't to remove the concept of location but to acknowledge that for some ecosystems location does not make sense because it's not integral to the identity of the package. We're trying to define a new purl type where the concept of "location" doesn't make much sense: there is no default repository, content is often deployed to multiple repositories with no one of those being canonical, and content can be moved between repositories without its identity changing while it can still be proven that the content isn't tampered with. For any ecosystem where two repositories could serve different content for the same identifier, location should be mandatory for the purl, and I'd additionally recommend that a content hash be provided where possible. For ecosystems where the identity is intrinsically linked to the content regardless of location, the location should be optional (it can be provided as a hint for retrieval but not as part of identity comparison). |
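A minimal sketch of that comparison rule follows, assuming a hypothetical set of content-addressed purl types and made-up purl strings. The helper names `split_purl` and `same_identity` are illustrative only; the sketch ignores subpaths and does no normalization, which a real comparison would need.

```python
# Types whose identity is assumed to be fully content-defined (hypothetical set).
CONTENT_ADDRESSED_TYPES = {"oci"}


def split_purl(purl: str):
    """Split a purl into (type, base without qualifiers, qualifier dict)."""
    rest, _, _subpath = purl.partition("#")  # subpath ignored in this sketch
    base, _, query = rest.partition("?")
    qualifiers = dict(p.partition("=")[::2] for p in query.split("&") if p)
    purl_type = base[len("pkg:"):].partition("/")[0].lower()
    return purl_type, base, qualifiers


def same_identity(a: str, b: str) -> bool:
    """Compare two purls, treating location as a hint for content-addressed types."""
    type_a, base_a, qual_a = split_purl(a)
    _type_b, base_b, qual_b = split_purl(b)
    if base_a != base_b:
        return False
    if type_a in CONTENT_ADDRESSED_TYPES:
        # Location is only a retrieval hint here, so drop it before comparing.
        qual_a.pop("repository_url", None)
        qual_b.pop("repository_url", None)
    return qual_a == qual_b


# Illustrative purls; the digest and hosts are made up.
oci_hub = "pkg:oci/debian@sha256%3Aabc123?repository_url=docker.io/library/debian"
oci_quay = "pkg:oci/debian@sha256%3Aabc123?repository_url=quay.io/example/debian"
mvn_public = "pkg:maven/org.example/lib@1.0.0"
mvn_mirror = "pkg:maven/org.example/lib@1.0.0?repository_url=repo.example.org"

oci_match = same_identity(oci_hub, oci_quay)
maven_match = same_identity(mvn_public, mvn_mirror)
```

Under these assumptions the two OCI purls compare equal while the two Maven purls do not, which mirrors the "mandatory where repositories could differ, optional where identity is content-defined" split proposed above.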
I'm an outsider to purl (so please weigh this input accordingly 😅), but in reviewing purl it doesn't seem like the OCI use case is really much (if any) different from say, hosting a Git repo at GitHub vs Bitbucket vs self-hosted -- the commit hash is going to be identical, the underlying data bits are identical, but the location is completely different (and as such, the purl reference is too). |
If I may provide another use case for security not based on location (and I am not, by any means, a security expert): zero trust systems do not track location but identities like owners and maintainers. In this case, the location may change through the supply chain, but the SBOM or something else can track signatures and attestations by owners. |
@tianon you're right that is a fitting example for the relationship between identity and location (and in fact was/is being discussed in #59). If we take these three (fictional examples):
We know that this is the same commit because we know that the SHA1 hash of a Git commit is based on the commit and the state of the Git tree. I can push that same content to any number of repositories, and it is the same content. Though this isn't obvious from these examples because it requires that understanding of Git's internals and the knowledge that GitHub and BitBucket are both Git-based repositories. If I want to know where the software is located it's important to know the One way to solve this would be to consider
The
Since these macros can be easily converted to a common base class you can compare to see if they refer to the same software but you still have the option of knowing the suggested location of the software. |
These are just the ones I can think of. I'm sure there are others... I'm failing to find any good arguments for decoupling location from identity. |
@stevespringett we're not talking about signing or PKI at all though, we're talking about a content hash... if the content hash is in the purl (which is the proposal for No one is recommending that location be removed, just identifying that location is not fundamental to all ecosystems. Purl should reflect the realities of the ecosystems it is trying to represent, rather than trying to impose requirements on them. On your other points, I don't know that a location of 'hub.docker.com' does anything to address the threats you outline. It doesn't tell us anything about the contributors, physical location, provenance, pedigree. It may be interesting for forensics but the content hash itself verifies that the package is unchanged in comparison to the purl. |
I understand that. But the ask to decouple location from identity will affect every purl type, not just oci. That's a breaking change to the spec.
Agreed. And most ecosystems have a default repo, and the ones that do not clearly state they do not in the purl type definition. Golang is a good example which reads:
That's a very specific example and you're likely correct, it likely will not. But we are talking about the core purl spec here, not a specific type. If you look at any package on https://packagist.org/ you can absolutely perform that type of analysis. |
@stevespringett I don't think @SteveLasker is suggesting that location is removed from all purls, I think he's encouraging purl to acknowledge that there are (and will be) purl types where location is not a fundamental part of identity and should be optional. For purl types where location is required to establish identity (which is true for most purl types that exist today) it should continue to be there. For If we're saying that it's OK for a purl type to have no default repository and to not require a |
@iamwillbar I did submit a proposal for having "generic" purls in #126. Can this be a pattern that can be used for artifacts that don't follow the conventional centralized public repository pattern? This could also be a way for CycloneDX to not use such purls if they choose to. |
I did also bring up the use case of zero trust security which doesn't check endpoints but identities and signatures. If a client can verify the artifact's digest and signature, is there any need to check the location? |
It's not up to me. But I would advise extreme caution in supporting this for things like the git example above. Git uses SHA-1 for commits, but it is not intended for security use cases, which is why truncating the hash for convenience within a particular repo is common practice. Expanding on the git example above that drops location information... pkg:github/package-url/purl-spec@4860cee Changing that purl to something like |
We are asking if the purl-spec maintainers are willing to allow for a pattern that describes "non-centralized" locations or "moving" locations. Some examples that come to mind for me:
In the end, it is the same source code, probably coming from the same people, but just moved from one hosting mechanism to another. Personally, I don't think relying on "common knowledge" to triangulate a location is a good security practice. As you know, locating any of these artifacts, including the ones CycloneDX is using now, also relies on user configuration which purl does not capture. Maybe trying to figure out how to accurately describe "artifact movement" is something in scope for the package-url folks? |
@stevespringett, I completely agree that location is required to find the package if you need to find the package reference. The root of this issue is:
The location is required, but it would be provided at runtime, for that particular environment:

public --> wabbit-networks-shared-internal --> alpha-team --> staging-for-public --> wabbit-networks-public-registry
       \-> wabbit-networks-shared-internal --> delta-team /
public --> acme-rockets-shared-internal --> dev-team-a --> staging-for-prod-env-foo --> prod-env-foo
       \-> acme-rockets-shared-internal --> dev-team-b --> staging-for-prod-env-bar --> prod-env-bar

By separating identity from location, you can use the unique identity to match the intended package when the location is provided dynamically, at runtime, for that environment. The SBoM has the identity
It's not that location isn't important, it's that it's not known when the purl is persisted. |
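The "identity is persisted, location is supplied at runtime" idea can be sketched as follows. Everything here is an assumption for illustration: the `ArtifactIdentity` class, the `resolve` function, the environment-to-registry table, the hostnames, and the shortened digest. The output uses the familiar `registry/name@digest` image reference form.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArtifactIdentity:
    """What the SBoM would persist: a name and a content digest, no location."""
    name: str
    digest: str


# Per-environment registry configuration, known only at runtime (hypothetical hosts).
ENVIRONMENT_REGISTRIES = {
    "alpha-team": "registry.alpha-team.wabbit-networks.example",
    "prod-env-foo": "registry.prod-env-foo.acme-rockets.example",
}


def resolve(identity: ArtifactIdentity, environment: str) -> str:
    """Combine the persisted identity with the environment's registry."""
    registry = ENVIRONMENT_REGISTRIES[environment]
    return f"{registry}/{identity.name}@{identity.digest}"


# The same persisted identity resolves differently in each environment.
debian = ArtifactIdentity(name="debian", digest="sha256:abc123")
alpha_ref = resolve(debian, "alpha-team")
prod_ref = resolve(debian, "prod-env-foo")
```

The two references differ only in the registry host; the name and digest carried over from the persisted identity are identical, which is what lets a consumer match them to the same SBoM entry.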
@coderpatros completely agree on your comments re: Git, I wasn't entirely clear, my intent was more to show that the |
In some ecosystems, yes, that information is known and exposed to build-time plugins. In most ecosystems, this information is not exposed today. It depends on the maturity of the ecosystem. As newer ecosystems become more mature, I would expect location information to become more widely available. The scope of purl is to Wouldn't a As stated in the other ticket, I would be open to the idea of a reserved word for However, decoupling location from identity, as the title of this ticket states, fundamentally changes what a purl is. Purl is useful because it includes location. I can see the need to only care about the identity part. Many SCA vendors use purl for identity only today but intend to advance their capabilities to include location in the future. |
@stevespringett I'm curious why the reserved word is needed, as you pointed out earlier golang (and others) don't have a default repository and don't require a repository_url (although one can still be optionally provided). Can we just codify that approach and allow each ecosystem to define whether location is required or not (again, providing very strong guidance on when it is OK to omit a location). |
In my experience, it's not the fact that it's on "Docker Hub" that specifically provides useful data alone about whether or not something is trustworthy, but more specifically the combination of "Docker Hub" and "specific Docker Hub user or organization which I trust" that does so (or even, maintainer of this particular image / repository within a larger organization).
For example, pulling The only thing that really sets the OCI objects apart from these other package types (from what I can see) is that they're designed to have an explicit content-addressable digest that is commonly used to refer to and fetch them, and that digest remains unchanged (by design) when the content is moved from one registry to another. However, you cannot ask a registry for said content without also knowing the full repository path from which to fetch it ( |
Just to build on what @tianon notes above. The difference between There will be a collection of "docker certified" content where you can trust Docker to have done some amount of verifications and vetting. No different than you trust content from Apple (or not) This goes back to the point I made above where location is a dynamic thing. But, I'll also cover the point about trust should not be placed in any one thing.
This is actually the point I'm trying to interject here. I'd suggest we should be investing in establishing identity, independent from the location. Because mature ecosystems will embrace the idea that content must be promoted into environments that are secure, and from within those environments, users can't and shouldn't be capable of going back. When something goes wrong, it is interesting to know when/where it got mutated. But, how does storing that information in an SBoM solve that? |
Maybe an aligned package URI spec could cater for this? |
Closing as #123 accounts for the location being an option, which makes purl super useful for identifying an artifact, independent from the location. Purls are persisted and location is dynamic for where the artifact may be at any point in time. |
I'm opening this issue as a question, as the readme states purl is scoped to:
While I recognize that assuming a location for an artifact has been a known pattern, it has also been a challenge for users that wish to take ownership of the content they depend upon. Even common/shared/OSS artifacts must be pulled from multiple locations, which makes an individual location a problematic concept.
A detailed post, with the context of the problem
Separating Identity From Location
TLDR:
From the SBoM community's perspective (CycloneDX and SPDX as examples), there's a desire to ensure a reference within an SBoM points to a very specific artifact. It could be a container image, helm chart, wasm or other types where SBoMs are relevant.
There are two dimensions to this decoupling:
For 1, you might be willing to say "this is the debian image from docker.io"; however, it's currently in my private registry. As long as the image is in the same repository as the SBoM, it can be resolved, and the URL part of the identifier is ignored as the debian image is said to be unique as it was in docker.io. Mirrors could also be resolved, maybe.
For 2, it's far more challenging. If the exact same debian image is pushed to docker hub, ecr public, github, mcr and quay, what would the URL be? Should the debian owner have to pick one?
Whether the user pulls the debian image from hub, ecr, or their private registry, the SBoM should be able to resolve the debian image, independently from where they got the image.
The proposal in #123, focuses on decoupling location from identity. Location is an optional hint in the oci-artifact purl PR.
What we've been trying to understand is whether purl, the specification, can decouple identity from location, or is purl always about identity & location?
If purl is always about location, then it makes consuming public content, in a secured & reliable manner, problematic as the same content will be available from multiple locations, and users want to pull the content into their private networks.
Is it possible to amend purl's scope to assure unique identity, and make location an optional parameter so it could be used reliably for SBoM and security scan result pointers?