Documentation around using PURLs as unique identifiers #242

Open
mlieberman85 opened this issue Jul 18, 2023 · 17 comments
Labels
Ecma specification Work on the core specification PURL core specification Format and syntax that define PURL (excludes PURL type definitions)

Comments

@mlieberman85

There is currently some confusion in the community about which practices someone should follow to ensure that a PURL can only be resolved to a specific unique package. I don't know if unique identification is a core use case, but it is currently unclear what folks can do to help eliminate ambiguity. Some ecosystems, like containers, can easily use a sha256 digest, which is suitably unique, but in other ecosystems that might not be possible. Also, a lot of tools today generate PURLs that don't include suitably unique information.

A potential solution is to provide some documentation around best practices for using PURL for the identifier use case. I know that each ecosystem might be different, but I think some high-level guidelines would help alleviate confusion.

@nishakm

nishakm commented Jul 26, 2023

Maybe related: #239

@rnjudge
Contributor

rnjudge commented Jul 26, 2023

There's some discussion that happened way back that also might be relevant: #127

@nishakm

nishakm commented Oct 24, 2023

For example: pkg:docker/cassandra@latest, pkg:docker/cassandra@123456abcdef, pkg:docker/cassandra@sha256%123456abcdef, pkg:oci/cassandra@abcdef123456 and pkg:oci/my/local/cas@abcdef123456 are all the same thing. The pURL has to be detailed enough for a person or tool to have high confidence that they mean only one thing.

@nishakm

nishakm commented Oct 24, 2023

Another example: pkg:deb/kdenlive and pkg:generic/kdenlive_etc_etc?download_url=<link to deb package> are the same package. The pURL tells you how the package was downloaded but doesn't indicate that it is the same package. My opinion is that pURL does the former extremely well, but to do the latter, we need some formality around how to craft a pURL.

@bureado

bureado commented Oct 25, 2023

> For example: pkg:docker/cassandra@latest, pkg:docker/cassandra@123456abcdef, pkg:docker/cassandra@sha256%123456abcdef, pkg:oci/cassandra@abcdef123456 and pkg:oci/my/local/cas@abcdef123456 are all the same thing. The pURL has to be detailed enough for a person or tool to have high confidence that they mean only one thing.

The first example is certainly not the same as the rest. The OCI examples are the same by virtue of the ecosystem deciding to use a normalized content-sensitive digest as the version. The Docker examples might not be the same ones, since the Docker type allows tags as shown in your first example.
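
To make the tag-versus-digest distinction concrete, here is a minimal Python sketch. It is not part of the PURL spec or of any existing library: `split_purl` and `pin_kind` are hypothetical names, the splitter ignores qualifiers and subpaths, and it assumes that only a complete `sha256:` digest counts as content-addressed.

```python
import re
from typing import Optional, Tuple

# Assumption: only a full sha256 digest (64 lowercase hex characters) is
# treated as content-addressed. Short IDs and tags are not.
SHA256_RE = re.compile(r"^sha256:[0-9a-f]{64}$")


def split_purl(purl: str) -> Tuple[str, str, Optional[str]]:
    """Split a PURL into (type, namespace/name, version or None).

    This is a deliberately small splitter, not a spec-complete parser.
    """
    if not purl.startswith("pkg:"):
        raise ValueError(f"not a purl: {purl!r}")
    # Drop the subpath ('#...') and the qualifiers ('?...').
    rest = purl[len("pkg:"):].split("#", 1)[0].split("?", 1)[0]
    ptype, _, path = rest.partition("/")
    name_path, sep, version = path.rpartition("@")
    if not sep:  # no '@' present, so there is no version
        return ptype.lower(), path, None
    # Only the percent-encoded colon is decoded here.
    version = version.replace("%3A", ":").replace("%3a", ":")
    return ptype.lower(), name_path, version


def pin_kind(purl: str) -> str:
    """Classify how firmly the version component pins the content."""
    _, _, version = split_purl(purl)
    if version is None:
        return "unversioned"
    if SHA256_RE.match(version):
        return "digest"
    return "tag-or-other"
```

Under these assumptions, both `pkg:docker/cassandra@latest` and `pkg:docker/cassandra@123456abcdef` come back as `tag-or-other`, because a short hex string is not a full digest, so a tool could flag them as not safe to treat as the same image.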

@bureado

bureado commented Oct 25, 2023

> Another example: pkg:deb/kdenlive and pkg:generic/kdenlive_etc_etc?download_url=<link to deb package> are the same package. The pURL tells you how the package was downloaded but doesn't indicate that it is the same package. My opinion is that pURL does the former extremely well, but to do the latter, we need some formality around how to craft a pURL.

I wouldn't consider those the same package for many purl use cases. In a "download attribution" use case, let's say you did apt download kdenlive and that got logged using pkg:generic, honestly I wouldn't find that an appropriate resolution. If you were using pkg:generic here to perhaps reflect an AppImage downloaded straight from the home page, then I can think of several use cases where I wouldn't want pkg:deb/kdenlive to be considered the same as pkg:generic/kdenlive.

@mlieberman85
Author

To bring it back to the original point: it can be difficult to understand the intent of a given PURL. Is this PURL being given purely as a "locator" or as a "unique identifier"? That leads to a lot of ambiguity.

@pombredanne
Member

A PURL is a locator and a mostly unique way to identify a package. But this does not mean that there is a single unique PURL for a given package. This is pretty much the same way that any URL can locate a web page and act as a mostly unique identifier for the page. But there can be multiple URLs that point to the same page, as with https://www.example.com and http://example.com
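
As with URLs, some of the "many spellings" are purely syntactic and can be folded together mechanically, while others cannot. The Python sketch below is a partial, syntax-only normalization under stated assumptions: it lowercases the type, lowercases and sorts qualifier keys, and leaves everything else untouched. `normalize` is a hypothetical helper, not the full canonicalization algorithm from the spec, and it does no percent-encoding or type-specific name handling.

```python
def normalize(purl: str) -> str:
    """Partial, syntax-only normalization of a PURL string.

    Handles only: scheme/type case, optional slashes after the scheme,
    and qualifier key case and ordering. Everything else is kept as-is.
    """
    body, _, subpath = purl.partition("#")
    body, _, query = body.partition("?")
    scheme, colon, rest = body.partition(":")
    if scheme.lower() != "pkg" or not colon:
        raise ValueError(f"not a purl: {purl!r}")
    # 'pkg://type/...' is accepted and treated like 'pkg:type/...'.
    ptype, slash, remainder = rest.lstrip("/").partition("/")
    # Lowercase qualifier keys and sort the pairs by key.
    pairs = sorted(
        (key.lower(), value)
        for key, _, value in (item.partition("=") for item in query.split("&") if item)
    )
    out = f"pkg:{ptype.lower()}{slash}{remainder}"
    if pairs:
        out += "?" + "&".join(f"{key}={value}" for key, value in pairs)
    if subpath:
        out += "#" + subpath
    return out
```

Two spellings that differ only in type case or qualifier order normalize to the same string, but `pkg:generic/curl` and `pkg:deb/debian/curl` stay distinct: deciding whether those refer to the same package is a policy question for the consuming system, not something string normalization can answer.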

@mlieberman85 you wrote:

> A potential solution to this is in providing some documentation around best practices for using PURL for the identifier use case. I know that each ecosystem might be different, but some high level guidelines I think would help alleviate confusion.

This makes sense. Some improved docs and also actual code examples that generate the PURLs.

@nishakm you wrote:

> For example: pkg:docker/cassandra@latest, pkg:docker/cassandra@123456abcdef, pkg:docker/cassandra@sha256%123456abcdef, pkg:oci/cassandra@abcdef123456 and pkg:oci/my/local/cas@abcdef123456 are all the same thing. The pURL has to be detailed enough for a person or tool to have high confidence that they mean only one thing.

This was never the intent, nor is it possible to have something that guarantees a unique identifier. You could use a checksum for this, but it does not convey much beyond a unique content id. You could add a checksum as a qualifier, but that does not guarantee either that you cannot have two PURLs. If you want to treat two different PURLs as the same thing, this is something for a system to handle, much like a URL crawler may have rules to treat two pages as being the same (in practice, FWIW, search engines base this not on exact content but on approximate high similarity).

> Another example: pkg:deb/kdenlive and pkg:generic/kdenlive_etc_etc?download_url= are the same package. The pURL tells you how the package was downloaded but doesn't indicate that it is the same package. My opinion is that pURL does the former extremely well, but to do the latter, we need some formality around how to craft a pURL.

The formalism exists: here the Debian package is from Debian, so using a generic package type is misleading and incorrect, yet always possible.

@nishakm

nishakm commented Nov 5, 2023

@pombredanne: A PURL is a locator and a mostly unique way to identify a package
I understand PURL was never meant to be a unique identifier. However, many tools and advisory databases use it as a unique (not globally) identifier. Furthermore, many in the PURL community see no problems with using it as an identifier. This makes it hard for tools to understand if one PURL means the same thing as another PURL. If one organization crafts a PURL one way and another organization crafts it a different way for a package that is basically the same, then this breaks interoperability.

If using PURL as an identification mechanism is not its core use case, then it should not be promoted as such.

I think @mlieberman85's suggestion on adding documentation on how one could indicate whether the intent of the PURL is to identify the package and not as a package location is a reasonable first step, but it doesn't solve the issue of PURLs not being the same across organizations or tools.

Re: Debian and other centralized packaging systems: due to the standardized nature of these packaging systems, the location and the identity do have a chance to merge, so I will concede this point. However, it would be nice to organize the PURL types documentation by central package ecosystems rather than in alphabetical order to show that uniformity within each ecosystem.

@pombredanne
Member

@mlieberman85 I looked into the guac models at https://github.com/guacsec/guac/blob/068951468803a87d41592fd281c4e41d97fb16a6/pkg/assembler/graphql/model/nodes.go and "ontology" at https://docs.guac.sh/guac-ontology-definition/

If I get things correctly, your main identifiers are an artifact checksum and package nodes keyed by PURL (as a tree).
There are a few things to consider, but hey, I do not know much about the model planned usage!

  1. Track multiple PURLs for a package, because there can be more than one
  2. Or ensure you track all the qualifiers for all the variants (say multiple Debian arch.)
  3. Or supplement your Package graph with Artifact for content-based "unicity"
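
The first and third options could be combined roughly as follows. This is only an illustrative Python sketch: `ArtifactIndex` and its methods are hypothetical names and are not GUAC's actual model or API. It assumes each artifact is identified by a digest string and that a system records which PURLs it has seen for that digest.

```python
from collections import defaultdict


class ArtifactIndex:
    """Hypothetical index linking artifact digests and PURLs both ways."""

    def __init__(self):
        self._purls_by_digest = defaultdict(set)
        self._digests_by_purl = defaultdict(set)

    def record(self, digest: str, purl: str) -> None:
        """Note that `purl` was observed for the artifact with `digest`."""
        self._purls_by_digest[digest].add(purl)
        self._digests_by_purl[purl].add(digest)

    def purls_for(self, digest: str) -> set:
        """All PURLs recorded for one artifact (there can be several)."""
        return set(self._purls_by_digest.get(digest, ()))

    def same_artifact(self, purl_a: str, purl_b: str) -> bool:
        """True when both PURLs were recorded against a shared digest."""
        digests_a = self._digests_by_purl.get(purl_a, set())
        digests_b = self._digests_by_purl.get(purl_b, set())
        return bool(digests_a & digests_b)
```

With this shape, content-based equality lives in the index rather than in the PURL string, so two differently crafted PURLs can still be linked once the same digest has been observed for both.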

Side note: I also see a model for a "Source" node, which made me think. For instance, I see no conceptual difference between a Git checkout at a revision and a tarball of mostly the same content, say the original code archive for a Debian package. The file tree and archive may not be bit-for-bit identical, but they would have the same content if you diff them while abstracting minor things such as spaces, permissions and dates. I tend to prefer using a proper PURL for this rather than making your own up, but this is minor. Or consider instead the SPDX spec bits for VCS URLs at https://spdx.github.io/spdx-spec/v2.3/package-information/#77-package-download-location-field

@nishakm

nishakm commented Nov 6, 2023

@pombredanne Here's an example of the way OSV uses pURLs: https://storage.googleapis.com/cve-osv-conversion/osv-output/CVE-2023-38545.json
I wonder if you have a recommendation of how the pURL pkg:apk/alpine/curl?arch=source can include alpine:v3.15, or even something from the result of apk info curl.

@pombredanne
Member

pombredanne commented Nov 6, 2023

@nishakm I think that with things like:

> Another example: pkg:deb/kdenlive and pkg:generic/kdenlive_etc_etc?download_url= are the same package.

... you may be focusing too much on edge cases. Here the Debian project ensures that names and versions are unique within their realm. There is nothing more to it than that: PURL just extends and builds upon each package type ecosystem's own identifiers and naming coordination.

You also wrote:

> However, it would be nice to organize the PURL types documentation by central package ecosystems rather than in alphabetical order to show that uniformity within each ecosystem.

I am not sure I get what you suggest... can you elaborate?

@pombredanne
Member

@nishakm you wrote:

> @pombredanne Here's an example of the way OSV uses pURLs: https://storage.googleapis.com/cve-osv-conversion/osv-output/CVE-2023-38545.json I wonder if you have a recommendation of how the pURL pkg:apk/alpine/curl?arch=source can include alpine:v3.15, or even something from the result of apk info curl.

From a quick look, the way OSV handles it needs review. Would you mind opening a separate issue for this topic?

@pombredanne
Member

@nishakm re: alpine, the alpine "release" stream would need to be clarified in the spec: https://github.com/package-url/purl-spec/blob/master/PURL-TYPES.rst#apk (which has another wart: the apk of Alpine is NOT the apk of openwrt at all. AFAIK they are entirely different packaging formats and not the same type).

@pombredanne
Member

Maybe another way to discuss unique identifier vs. locator is what I expanded on in #257 ... A PURL is like an address

Copied from #257 (comment)

Here is a possible analogy that may not be too shabby! Say the PURL spec is like the spec for an address book of people and places. 🧑‍🤝‍🧑 🏙️

Each package type is like a country or state and defines how you can identify and locate a place reasonably uniquely. Uniquely enough that the post can deliver the mail. In a city with well defined streets and street numbers, you get a precise location with the street name and number and maybe an apartment number. In some cases you may want the address for a single person by name, or for the whole household. If someone is off the grid in the bayou or on some isolated mountain, crafting a proper address may be more hairy and fuzzy. Worst case, I may need GPS coordinates for these edge cases. I may also have many different ways to write an address or a name. Heck, some folks also live in orbit on the ISS and GPS will not work there!

@mlieberman85
Author

> @mlieberman85 I looked into the guac models at https://github.com/guacsec/guac/blob/068951468803a87d41592fd281c4e41d97fb16a6/pkg/assembler/graphql/model/nodes.go and "ontology" at https://docs.guac.sh/guac-ontology-definition/
>
> If I get things correctly, your main identifiers are an artifact checksum and package nodes keyed by PURL (as a tree). There are a few things to consider, but hey, I do not know much about the model planned usage!
>
>   1. Track multiple PURLs for a package, because there can be more than one
>   2. Or ensure you track all the qualifiers for all the variants (say multiple Debian arch.)
>   3. Or supplement your Package graph with Artifact for content-based "unicity"
>
> Side note: I also see a model for a "Source" node, which made me think. For instance, I see no conceptual difference between a Git checkout at a revision and a tarball of mostly the same content, say the original code archive for a Debian package. The file tree and archive may not be bit-for-bit identical, but they would have the same content if you diff them while abstracting minor things such as spaces, permissions and dates. I tend to prefer using a proper PURL for this rather than making your own up, but this is minor. Or consider instead the SPDX spec bits for VCS URLs at https://spdx.github.io/spdx-spec/v2.3/package-information/#77-package-download-location-field

This helps. I think one thing that we are also really trying to clarify is "intentionality", which can be difficult to discern from a PURL alone. Take the following contrived example:

pkg:deb/foo and pkg:deb/foo@1.0 -- Based on the spec today, how the first one should be interpreted appears to be ecosystem dependent. Does pkg:deb/foo without @1.0 mean latest? From a temporal perspective the first case might point to 1.0, but only temporarily. GUAC's use case is trying both to eliminate ambiguity and to highlight where there are unknowns, so that the end user can determine what action to take. Intent can be difficult to discern, and some basic guidelines, even if they are ecosystem dependent, would be helpful.
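
One way a tool could make its chosen reading explicit is sketched below in Python. The interpretation is an assumption, not something the spec mandates: a PURL without a version is treated as "any version" (an unknown), never as "latest", and a PURL with a version must match exactly. `split_version` and `covers` are hypothetical helpers that drop qualifiers and subpaths and assume any `@` inside a name or namespace is percent-encoded.

```python
def split_version(purl: str):
    """Split a PURL into (everything before the version, version or None).

    Qualifiers and subpath are dropped; this is not a full parser.
    """
    base = purl.split("#", 1)[0].split("?", 1)[0]
    head, sep, version = base.rpartition("@")
    return (head, version) if sep else (base, None)


def covers(pattern: str, candidate: str) -> bool:
    """True if `candidate` falls under `pattern`.

    Assumed reading: a pattern without a version stands for "any version";
    a pattern with a version must match that version exactly.
    """
    p_base, p_version = split_version(pattern)
    c_base, c_version = split_version(candidate)
    if p_base != c_base:
        return False
    return p_version is None or p_version == c_version
```

Under this reading, `pkg:deb/foo` covers `pkg:deb/foo@1.0`, but not the other way around, so a consumer can tell "version unknown" apart from "pinned to 1.0" instead of silently guessing.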

@johnmhoran johnmhoran added the PURL core specification Format and syntax that define PURL (excludes PURL type definitions) label Oct 22, 2024
@johnmhoran johnmhoran added the Ecma specification Work on the core specification label Nov 5, 2024
@jloehel

jloehel commented Dec 17, 2024

> A PURL is a locator and a mostly unique way to identify a package. But this does not mean that there is a single unique PURL for a given package. This is pretty much the same way that any URL can locate a web page and act as a mostly unique identifier for the page. But there can be multiple URLs that point to the same page, as with https://www.example.com and http://example.com

The mentioned URLs point to the same page because the authority resolves to the same IP address. For PURLs we have no authority and no resolver (DNS). I know that you are working on a PURL database, but why should we not use DNS if the PURL needs to be resolvable? I also see some issues with really old packages or changes in the ecosystem. For example, openSUSE renames the repository from Leap_15.4 to 15.4. Is it necessary to adapt all openSUSE RPM PURLs in all existing SBOMs? ... or is it a persistent URL which needs to be adapted just once at a registry like purl.archive.org or something decentralized?

I think the specification needs to be clearer in its naming. Are we talking about a URL, a URI or a URN? URLs don't need to be globally unique, but it should be possible to locate the resource, download it and hash it. Right now it looks more like a URI which is sometimes resolvable by reconstructing the original URL, which depends heavily on the ecosystem and the community contributions.

SWHID with Software Heritage as authority is something which I can resolve immediately. It is just a matter of trust.


7 participants