Documentation around using PURLs as unique identifiers #242
Comments
Maybe related: #239 |
There's some discussion that happened way back that also might be relevant: #127 |
For example: |
Another example: |
The first example is certainly not the same as the rest. The OCI examples are the same by virtue of the ecosystem deciding to use a normalized content-sensitive digest as the version. The Docker examples might not be the same ones, since the Docker type allows tags as shown in your first example. |
I wouldn't consider those the same package for many |
To bring it back to the original issue: it can be difficult to understand the intent of a given PURL. Is this PURL being given purely as a "locator" or as a "unique identifier"? It leads to a lot of ambiguity. |
A PURL is a locator and a mostly unique way to identify a package. But this does not mean that there is a single unique PURL for a given package. This is pretty much the same way that any URL can locate a web page and act as a mostly unique identifier for the page, yet there can be multiple URLs that point to the same page, as with https://www.example.com and http://example.com @mlieberman85 you wrote:
This makes sense. Some improved docs and also actual code examples that generate the PURLs. @nishakm you wrote:
This was never the intent, nor is it possible to have something that guarantees a unique identifier. You could use a checksum for this, but it does not convey much beyond a unique content id. You could add a checksum as a qualifier, but that does not guarantee either that you cannot have two purls for the same thing. If you want to treat two different PURLs as the same thing, this is something for a system to handle, much like a URL crawler may have rules to treat two pages as being the same (in practice, FWIW, this is not based on exact content but on approximate high similarity in search engines).
The formalism exists: here the Debian package is from Debian, so using a generic package type is misleading and incorrect, yet always possible. |
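To make the "this is something for a system to handle" point concrete, here is a minimal sketch of how a consuming tool might decide that two different PURLs refer to the same package. It is not part of the spec and not a full PURL parser. The policy it applies (drop the locator-style qualifiers `repository_url`, `download_url` and `vcs_url`, ignore the subpath, compare everything else) is an assumption chosen for illustration, and the names `parse_purl`, `identity_key` and `same_package` are hypothetical. Type-specific name normalization is deliberately left out.

```python
from urllib.parse import parse_qsl

# Assumption: these qualifiers only say where to fetch a package,
# so this sketch leaves them out of the identity comparison.
LOCATOR_QUALIFIERS = {"repository_url", "download_url", "vcs_url"}


def parse_purl(purl: str) -> dict:
    """Split a PURL into components (simplified, not a full spec parser)."""
    scheme, sep, rest = purl.partition(":")
    if scheme.lower() != "pkg" or not sep:
        raise ValueError(f"not a PURL: {purl!r}")
    rest = rest.lstrip("/")
    rest, _, subpath = rest.partition("#")
    rest, _, query = rest.partition("?")
    if "@" in rest:
        path, version = rest.rsplit("@", 1)
    else:
        path, version = rest, ""
    ptype, _, remainder = path.partition("/")
    namespace, _, name = remainder.rpartition("/")
    qualifiers = {key.lower(): value for key, value in parse_qsl(query)}
    return {
        "type": ptype.lower(),
        "namespace": namespace,
        "name": name,
        "version": version,
        "qualifiers": qualifiers,
        "subpath": subpath,
    }


def identity_key(purl: str) -> tuple:
    """Build a comparison key under the policy described above."""
    parts = parse_purl(purl)
    kept = tuple(sorted(
        (key, value)
        for key, value in parts["qualifiers"].items()
        if key not in LOCATOR_QUALIFIERS
    ))
    return (parts["type"], parts["namespace"], parts["name"],
            parts["version"], kept)


def same_package(a: str, b: str) -> bool:
    """True if both PURLs map to the same identity key."""
    return identity_key(a) == identity_key(b)


a = "pkg:deb/debian/curl@7.50.3-1?arch=i386&distro=jessie"
b = ("pkg:deb/debian/curl@7.50.3-1?distro=jessie&arch=i386"
     "&repository_url=https%3A%2F%2Fdeb.example.org%2Fdebian")
c = "pkg:deb/debian/curl@7.50.3-1?arch=amd64&distro=jessie"

print(same_package(a, b))  # differs only in qualifier order and a locator
print(same_package(a, c))  # differs in arch, which this policy keeps
```

A different tool could reasonably pick a different policy, for example treating `arch` as irrelevant too. The point is only that the equivalence rule lives in the consuming system, not in the PURL string itself.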
@pombredanne: "A PURL is a locator and a mostly unique way to identify a package." If using PURL as an identification mechanism is not its core use case, then it should not be promoted as such. I think @mlieberman85's suggestion on adding documentation on how one could indicate whether the intent of the PURL is to identify the package and not to serve as a package location is a reasonable first step, but it doesn't solve the issue of PURLs not being the same across organizations or tools. Regarding Debian and other centralized packaging systems: due to the standardized nature of the packaging systems, the location and the identity do have a chance to merge. So I will concede this point. However, it would be nice to organize the PURL types documentation by central package ecosystems rather than in alphabetical order to show that uniformity within each ecosystem. |
@mlieberman85 I looked into the guac models at https://github.com/guacsec/guac/blob/068951468803a87d41592fd281c4e41d97fb16a6/pkg/assembler/graphql/model/nodes.go and "ontology" at https://docs.guac.sh/guac-ontology-definition/ If I get things correctly, your main identifiers are an artifact checksum and package nodes keyed by PURL (as a tree).
Side note: I also see a model for a "Source" node, which made me think. For instance, I see no conceptual difference between a Git checkout at a revision and a tarball of mostly the same content, say as an original code archive for a Debian package. The file tree and archive may not be bit-for-bit identical, but they would be the same content if you diff them while abstracting away minor things such as spaces, permissions and dates. I tend to prefer using a proper PURL for this rather than making your own up, but this is minor. Or consider instead the SPDX spec bits for VCS URLs at https://spdx.github.io/spdx-spec/v2.3/package-information/#77-package-download-location-field |
@pombredanne Here's an example of the way OSV uses pURLs: https://storage.googleapis.com/cve-osv-conversion/osv-output/CVE-2023-38545.json |
@nishakm I think that with things like:
... you may be focusing too much on edge cases. Here the Debian project ensures that names and versions are unique within their realm. There is nothing more to it than that in PURL, which just extends and builds upon the id and naming coordination of a package type's ecosystem. You also wrote:
I am not sure I get what you suggest... can you elaborate? |
@nishakm you wrote:
From a quick look, the way OSV handles it needs review. Would you mind opening a separate issue for this topic? |
@nishakm re: alpine, the alpine "release" stream would need to be clarified in the spec: https://github.com/package-url/purl-spec/blob/master/PURL-TYPES.rst#apk (which has another wart: the apk of Alpine is NOT the apk of openwrt at all. AFAIK they are entirely different packaging formats and not the same type). |
Maybe another way to discuss unique identifier vs. locator is what I expanded on in #257 ... A PURL is like an address. Copied from #257 (comment)
|
This helps. I think one thing that we are also really trying to clarify is "intentionality." When given a purl, it can be difficult to understand the intent behind it. Let's just say something like the following as a contrived example:
|
The mentioned URLs point to the same page because the authority resolves to the same IP address. For PURLs we have no authority and no resolver (DNS). I know that you are working on a PURL database, but why should we not use DNS if the PURL needs to be resolvable? I also see some issues with really old packages or changes in the ecosystem. For example, openSUSE renamed the repository from Leap_15.4 to 15.4. Is it necessary to adapt all openSUSE RPM PURLs in all existing SBOMs? Or is it a persistent URL which needs to be adapted just once at a registry like purl.archive.org or something decentralized? I think the specification needs to be clearer in its naming. Are we talking about a URL, a URI or a URN? URLs don't need to be globally unique, but it should be possible to locate the resource, download it and hash it. Right now it looks more like a URI which is sometimes resolvable by reconstructing the original URL, which depends heavily on the ecosystem and the community contributions. SWHID with Software Heritage as authority is something which I can resolve immediately. It is just a matter of trust. |
There is currently some confusion in the community about what practices someone should follow in order to ensure that a PURL can only be resolved to a specific unique package. I don't know if unique identification is a core use case, but it is currently unclear what folks can do to help eliminate ambiguity. Some ecosystems like containers can easily use a sha256, which is suitably unique, but in other ecosystems that might not be possible. Also, today a lot of tools will generate purls that don't include suitably unique information.
A potential solution to this is to provide some documentation around best practices for using PURL for the identifier use case. I know that each ecosystem might be different, but I think some high-level guidelines would help alleviate confusion.
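As a rough illustration of "sometimes resolvable by reconstructing the original URL," the sketch below maps a PURL to a URL with one hand-written rule per package type. Everything here is an assumption for the sake of example: the `RESOLVERS` table and the `resolve` function are hypothetical, and the two URL patterns (the PyPI project page and the npm registry tarball for unscoped packages) are conventions of those registries, not something the PURL spec defines. Types without a rule simply do not resolve, which is the ecosystem dependence described above.

```python
from typing import Callable, Optional

# Hypothetical per-type rules; each ecosystem needs its own.
RESOLVERS: dict[str, Callable[[str, str], str]] = {
    "pypi": lambda name, version:
        f"https://pypi.org/project/{name}/{version}/",
    "npm": lambda name, version:
        f"https://registry.npmjs.org/{name}/-/{name}-{version}.tgz",
}


def resolve(purl: str) -> Optional[str]:
    """Return a reconstructed URL, or None if no rule applies."""
    # Drop subpath and qualifiers; this sketch ignores both.
    body = purl.partition("#")[0].partition("?")[0]
    scheme, _, path = body.partition(":")
    if scheme != "pkg" or "@" not in path:
        return None
    coords, version = path.rsplit("@", 1)
    ptype, _, name = coords.strip("/").partition("/")
    builder = RESOLVERS.get(ptype.lower())
    # Namespaced names (for example npm scopes) are not handled here.
    if builder is None or "/" in name:
        return None
    return builder(name, version)


print(resolve("pkg:pypi/django@1.11.1"))
print(resolve("pkg:npm/foobar@12.3.1"))
print(resolve("pkg:deb/debian/curl@7.50.3-1"))  # no rule for deb here
```

If a registry changes its layout, as in the openSUSE rename mentioned above, only the rule in the table changes. The PURLs already recorded in SBOMs stay as they are.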
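As one possible shape for such a guideline, here is a small sketch of two helpers: one that attaches a sha256 `checksum` qualifier to a PURL, and one that checks whether a PURL carries any content-derived information at all. The function names `with_checksum` and `is_content_pinned` are hypothetical, and the rule "pinned means a sha256 digest as the version or a checksum qualifier" is an assumption made for illustration, not an agreed best practice. As noted elsewhere in this thread, a checksum identifies content and still does not guarantee that only one PURL exists for a package.

```python
import hashlib
import re

# A sha256 digest used as the version, with ':' possibly percent-encoded.
DIGEST_VERSION = re.compile(r"sha256(:|%3a)[0-9a-f]{64}", re.IGNORECASE)


def with_checksum(purl: str, artifact: bytes) -> str:
    """Return the PURL with a sha256 checksum qualifier added."""
    head, hash_sep, subpath = purl.partition("#")
    base, _, query = head.partition("?")
    # Replace any existing checksum qualifier rather than duplicating it.
    pairs = [q for q in query.split("&")
             if q and not q.startswith("checksum=")]
    pairs.append("checksum=sha256:" + hashlib.sha256(artifact).hexdigest())
    pairs.sort()  # keep qualifier keys in lexicographic order
    return base + "?" + "&".join(pairs) + hash_sep + subpath


def is_content_pinned(purl: str) -> bool:
    """True if the version is a sha256 digest or a checksum is present."""
    head = purl.partition("#")[0]
    base, _, query = head.partition("?")
    version = base.rsplit("@", 1)[1] if "@" in base else ""
    has_checksum = any(q.startswith("checksum=") for q in query.split("&"))
    return bool(DIGEST_VERSION.fullmatch(version)) or has_checksum


plain = "pkg:deb/debian/curl@7.50.3-1?arch=i386"
pinned = with_checksum(plain, b"bytes of the downloaded artifact")

print(pinned)
print(is_content_pinned(plain))
print(is_content_pinned(pinned))
```

A tool that generates PURLs for the identifier use case could run a check like `is_content_pinned` and warn when it fails, while still accepting plain name and version PURLs for ecosystems where the registry itself guarantees uniqueness.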