As a user, I want to know which repositories some content is in #2865

Closed
newswangerd opened this issue Jun 16, 2022 · 15 comments · Fixed by #3770

Comments

@newswangerd
Contributor

Is your feature request related to a problem? Please describe.
Content in pulp is distributed through repositories and distributions. If a client knows the name or HREF for some content, there's currently no way of knowing which repositories it's available in without looking in each repository.

Describe the solution you'd like
Add a ?content_in_latest= filter to the distributions and repositories APIs that allows users to specify a content HREF and get a list of repositories or distributions that contain the selected content.

Describe alternatives you've considered
Are there any existing ways to do this in pulp?

Additional context
This is needed for repository management in galaxy_ng

@ipanova
Member

ipanova commented Jun 22, 2022

As of today, it is possible to list all repo_versions (and publications) that contain a certain content unit or list of content units:
https://docs.pulpproject.org/pulpcore/restapi.html#tag/Repository_Versions

GET http :/pulp/api/v3/repository_versions/?content=<content_href>

[ 
  {
    "pulp_href": "/pulp/api/v3/repositories/rpm/rpm/07c41c5f-59e4-4371-942a-b6a006a6d2cf/versions/1/",
    "pulp_created": "2021-03-18T19:23:31.661940Z",
    "repository_href": "/pulp/api/v3/repositories/rpm/rpm/07c41c5f-59e4-4371-942a-b6a006a6d2cf/",
    "version": 1
  }
  ... 
]

GET http :/pulp/api/v3/publications/?content=<content_href>

[ 
  {
    "pulp_href": "/pulp/api/v3/publications/rpm/rpm/4d5bb614-4318-4408-8c18-ab8b6b4c016f/",
    "pulp_created": "2021-03-17T12:24:31.661940Z",
    "repository_version_href": "/pulp/api/v3/repositories/rpm/rpm/07c41c5f-59e4-4371-942a-b6a006a6d2cf/versions/1/"
  }
  ... 
]
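
For anyone who wants to script the two lookups above, here is a minimal Python sketch that only builds the request URLs. The base URL, the helper name `content_filter_url`, and the sample content HREF are assumptions for illustration; only the `content` query parameter and the two endpoint paths come from the examples above.

```python
from urllib.parse import urlencode, urljoin

# Assumption: a local Pulp instance; adjust to your deployment.
BASE_URL = "http://localhost:8080"


def content_filter_url(endpoint, content_href, base_url=BASE_URL):
    """Build a list URL that filters `endpoint` by a content HREF."""
    # urlencode percent-encodes the slashes in the HREF so it is safe in a query string.
    query = urlencode({"content": content_href})
    return urljoin(base_url, endpoint) + "?" + query


# Hypothetical content HREF, used only to show the shape of the URL.
example_href = "/pulp/api/v3/content/file/files/0000/"
versions_url = content_filter_url("/pulp/api/v3/repository_versions/", example_href)
publications_url = content_filter_url("/pulp/api/v3/publications/", example_href)
```

Either URL can then be passed to whatever HTTP client you already use, with your own authentication.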

@newswangerd
Contributor Author

oh neat, I didn't realize the repository version api existed. I was looking at /<repo_href>/repository-versions/

@ipanova
Member

ipanova commented Jun 22, 2022

yeah our bad - we have no docs on this :/

@ipanova
Member

ipanova commented Jun 22, 2022

with the existing implementation we could add a query param ?version=latest which would filter the latest repo version and show it instead of showing all of them.

Distribution search can be based off the repo_version search.

@newswangerd
Contributor Author

Adding a ?version=latest flag to repository_versions could be sufficient. The potential issues I see with this are:

  • The repository_versions API is not quite as data rich as the repositories API. You can't see names, remotes etc. This could be partially addressed by PoC: Related fields #2828
  • This will make developing UIs more challenging and the end user experience worse overall. Let's say I have a "Repositories" view in my UI. If I want to filter that list of repositories by content, I have to make an API call to a different endpoint and map the data so that the UI for repositories can interpret it. Without PoC: Related fields #2828, I would also have to make a bunch of additional API calls to get information about the repository itself. I also won't be able to combine the content filter with any other repository filters such as repo labels, names or created dates.
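
To illustrate the mapping step described in the second point, here is a rough sketch of what a client would have to do with a repository_versions result before a "Repositories" view could use it. The field names `repository_href` and `version` are taken from the sample response earlier in the thread; the function name and the shortened HREFs are made up for the example.

```python
from collections import defaultdict


def group_versions_by_repository(versions):
    """Map each repository HREF to the sorted version numbers that matched."""
    grouped = defaultdict(list)
    for record in versions:
        grouped[record["repository_href"]].append(record["version"])
    return {repo: sorted(numbers) for repo, numbers in grouped.items()}


# Hypothetical, shortened records standing in for a repository_versions response.
sample = [
    {"repository_href": "/repos/foo/", "version": 2},
    {"repository_href": "/repos/bar/", "version": 1},
    {"repository_href": "/repos/foo/", "version": 1},
]
grouped = group_versions_by_repository(sample)
```

Each key of `grouped` would still need its own follow-up request to fetch the repository's name, remote and labels, which is the extra cost mentioned above.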

@bmbouter
Member

I think our users would more or less appreciate all of these options, including:

(a) Adding the ?version=latest flag to the repository_versions API
(b) Adding similar filter capability to the repositories endpoint
(c) Adding similar filtering capability to the distributions endpoint

The distributions endpoint (c) could be a little trickier because some distributions have publications and some don't, so for this to be implemented in pulpcore it would have to support both types. Totally doable, there's just a bit more to it.

I want (a) because it's really easy, but then it requires the PoC Related Fields, which is a non-trivial piece of work. I think we should avoid tying this need up in that, so I believe the ideal path is to implement (b) and (c) as separate tickets. This ticket could be for one of them, and we'd need another ticket for the other.

For (b) it would be nice to have the ability to search against latest or not. For (c) it shouldn't take a "latest" option since the distribution already encapsulates that feature. Distributions can point to a repo which implies "latest" or a specific repo_version or publication, which implies "just this one".

Given all ^, I'm happy to defer to @newswangerd for whichever one you want.

A short aside on (a), maybe positioning it as latest_only=True/False (with default to False), would be good because I don't think semantically having the option accept mixed types, e.g. latest or in other cases "a url to a repo version" is a positive user experience.
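
As a sketch of the proposed semantics only (not of how pulpcore would implement it), a boolean `latest_only` could behave like the client-side filter below. It assumes the caller already has a mapping from repository HREF to `latest_version_href`; the function name and the shortened HREFs are hypothetical.

```python
def filter_latest_only(matches, latest_by_repo, latest_only=False):
    """Keep every matching repo version, or only those that are their repo's latest."""
    if not latest_only:
        return list(matches)
    return [
        match
        for match in matches
        if latest_by_repo.get(match["repository_href"]) == match["pulp_href"]
    ]


# Hypothetical repo versions that contain the content being searched for.
matches = [
    {"pulp_href": "/repos/foo/versions/1/", "repository_href": "/repos/foo/"},
    {"pulp_href": "/repos/foo/versions/2/", "repository_href": "/repos/foo/"},
    {"pulp_href": "/repos/bar/versions/1/", "repository_href": "/repos/bar/"},
]
# Hypothetical repository HREF -> latest_version_href mapping.
latest_by_repo = {
    "/repos/foo/": "/repos/foo/versions/2/",
    "/repos/bar/": "/repos/bar/versions/3/",
}
only_latest = filter_latest_only(matches, latest_by_repo, latest_only=True)
```

With a plain boolean the parameter has one type and one meaning, which is the point of the aside: it never has to accept a repo version URL as well.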

@bmbouter
Member

At the pulpcore meeting, we want to pursue using this ticket for implementing (b) and @newswangerd will file another ticket for (c) and @newswangerd or galaxy_ng will implement both?

@newswangerd
Contributor Author

I'll file another ticket for distributions. We can implement both

@ipanova
Member

ipanova commented Jun 29, 2022

The reason we have implemented (a) and not (b) is that the content's presence and distribution come from a repo version specifically.
Your story sounds like "As a user, I want to know which repos have content X in its latest version", which is not as generic as "As a user, I want to know which repo versions have content X". Not every distribution serves the latest version, as was pointed out in an earlier comment, but what will (b) show if I don't want to search against latest? The API will give you a repo result like this, but will not tell you which version specifically the content is in.

"results": [
        {
            "description": null,
            "latest_version_href": "/pulp/api/v3/repositories/container/container/6d0d0d25-b3c0-49a2-a798-358fbe6f5031/versions/0/",
            "name": "lala",
            "pulp_created": "2022-06-29T11:38:58.610462Z",
            "pulp_href": "/pulp/api/v3/repositories/container/container/6d0d0d25-b3c0-49a2-a798-358fbe6f5031/",
            "pulp_labels": {},
            "remote": null,
            "retain_repo_versions": null,
            "versions_href": "/pulp/api/v3/repositories/container/container/6d0d0d25-b3c0-49a2-a798-358fbe6f5031/versions/"
        }
    ]

And even if you're interested only in the latest repo version, there is no guarantee that by the time the user decides to consume/use/copy/etc. content from that repo 1) the repo will not have had other versions created, so the latest is no longer the latest you searched against, and 2) the new latest still has that content. That's why it is probably safer to search through repo versions: they are immutable, and the content will be there for as long as the repo version exists.
I agree on the other points, that the repo versions API is not as rich as the repo API and that different endpoints would need to be called, but I am hoping that #2828 will help.

@bmbouter
Member

bmbouter commented Jul 4, 2022

@ipanova thanks for such an informative post. Given that, is the recommendation to focus on developing #2828?

I still think from a usability perspective users want to search from repos and distributions directly, but given the concerns you raise, a path to doing that is not immediately clear to me. I guess one of the concerns is that if repo endpoints start returning serialized objects that aren't repos (for example), the endpoint is now polymorphic from an OpenAPI perspective. That being said, it kind of would be anyway if we implement #2828. Just some rambling thoughts on this. More conversation is welcome.

@newswangerd
Contributor Author

@ipanova Let me see if I can summarize your objections to this:

  1. Distributions can point to any repository version, so searching latest doesn't work
  2. The repository serializer only links to the latest repo version, so searching for content in any other version doesn't make sense
  3. Content is ultimately distributed from versions, so searching the versions directly is more reliable.
  4. The repo version pointers for distributions and repositories can be updated in-between searching for content and requesting content, so they don't accurately represent if the content is actually there.

1, 2 and 3 are great points. Since the content can be distributed from any version in the repository it doesn't make sense to limit your search to just the latest on the repositories endpoint.

I'm not 100% certain 4 matters as much. The best any REST API can do is communicate to the client what state the system is in right now. If I make a followup call to the system to request another piece of data there's no guarantee that the object still exists, that the object is the same as it was before, that the server is running, the authentication credentials are still valid, etc. The current endpoint for repository versions has this problem too. The repo versions are immutable, but they're also deletable. There's no guarantee the version exists if I make a followup call to grab my content from it.

With all this in mind, let's revisit some use cases for this:

  • U1: As an admin, I want to know who has access to download some content
  • U2: As a content consumer, I want to know where to go to download some content

As you pointed out, (b) doesn't make any sense for U1. The content can be distributed from any version of the repository, so knowing if it's in latest doesn't help you make that determination anymore, so it would be better to make a call to the repo versions api endpoint.

U2 is a little more complicated. In the ansible world, clients can only download content from a distribution. If we just implement (a), to determine where my content is available, I would need to request the list of repository versions, and then figure out which repository versions are part of a distribution I have access to. Since distributions are just pointers to repository versions, it seems like it would still make sense to provide a ?contains_content filter to the distributions endpoint.
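
To make the extra work in U2 concrete, here is a sketch of the client-side join that would be needed if only (a) existed. It assumes each distribution record exposes `repository` and `repository_version` HREFs (plugin-dependent, so treat those field names as an assumption) and it ignores publication-based distributions entirely. All names and HREFs are hypothetical.

```python
def distributions_serving(distributions, matching_versions, latest_by_repo):
    """Return names of distributions whose effective repo version contains the content."""
    matching = set(matching_versions)
    served = []
    for dist in distributions:
        # A pinned version wins; otherwise a repository pointer implies "latest".
        version = dist.get("repository_version") or latest_by_repo.get(
            dist.get("repository")
        )
        if version in matching:
            served.append(dist["name"])
    return served


# Hypothetical data: the content lives in foo v2 and bar v1.
matching_versions = ["/repos/foo/versions/2/", "/repos/bar/versions/1/"]
latest_by_repo = {
    "/repos/foo/": "/repos/foo/versions/2/",
    "/repos/bar/": "/repos/bar/versions/3/",
}
distributions = [
    {"name": "foo-latest", "repository": "/repos/foo/", "repository_version": None},
    {"name": "bar-latest", "repository": "/repos/bar/", "repository_version": None},
    {"name": "bar-pinned", "repository": None, "repository_version": "/repos/bar/versions/1/"},
]
serving = distributions_serving(distributions, matching_versions, latest_by_repo)
```

A server-side ?contains_content filter on distributions would replace this join, and the lookups feeding it, with a single request.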

Providing a ?contains_content on the repository APIs would also be very helpful from a usability perspective, even if it means that you have to make a followup request to get the list of versions that the content is actually in. Going back to U1, let's say an admin is trying to assess which clients may have downloaded some malicious content. If the content is in 3000 versions of repo foo, 200 versions of bar and 1 version of foobar, then the admin has to page through 3201 repository versions to find out that there are 3 repos that demand their attention. It would be easier to get a list of 3 repositories in one API call and then go through the list of repo versions for each repo to perform cleanup or whatever they need to do.
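
The arithmetic behind that example, as a small worked sketch. The page size of 100 is an assumption; the real number of requests depends on how pagination is configured.

```python
import math
from collections import Counter

# Repo version counts from the example above.
versions_per_repo = Counter({"foo": 3000, "bar": 200, "foobar": 1})
PAGE_SIZE = 100  # assumption: results per page

total_versions = sum(versions_per_repo.values())
# Requests needed to page through every matching repo version.
pages_via_versions = math.ceil(total_versions / PAGE_SIZE)
# Requests needed if the repositories endpoint could be filtered directly.
pages_via_repositories = math.ceil(len(versions_per_repo) / PAGE_SIZE)
```

Under that assumption the repo-versions route takes 33 requests before the admin learns that only 3 repositories are involved, against 1 request for a filtered repositories list.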

I guess this is a long-winded way of saying I really want us to implement (c). Limiting your search to just the latest version on a repo doesn't make as much sense, so (a) and (b) might not be necessary, but I would still love to have some content filtering capabilities on the repository endpoint.

cc @bmbouter

@mdellweg
Member

mdellweg commented Jul 7, 2022

Is the question about content in a Distribution controversial at this point?
I think the question of what is currently presented in a distribution is rather clear. Should we break this out as a separate issue?

@newswangerd
Contributor Author

Ah, maybe I misunderstood the comment. I just added #2952 to track the distribution filter.

@ipanova
Member

ipanova commented Jul 11, 2022

I don't think I will be able to make it to today's meeting in time, so let me leave some of my thoughts here.

> @ipanova Let me see if I can summarize your objections to this:
>
> 1. Distributions can point to any repository version, so searching latest doesn't work
> 2. The repository serializer only links to the latest repo version, so searching for content in any other version doesn't make sense
> 3. Content is ultimately distributed from versions, so searching the versions directly is more reliable.
> 4. The repo version pointers for distributions and repositories can be updated in-between searching for content and requesting content, so they don't accurately represent if the content is actually there.
>
> 1, 2 and 3 are great points. Since the content can be distributed from any version in the repository it doesn't make sense to limit your search to just the latest on the repositories endpoint.
>
> I'm not 100% certain 4 matters as much. The best any REST API can do is communicate to the client what state the system is in right now. If I make a followup call to the system to request another piece of data there's no guarantee that the object still exists, that the object is the same as it was before, that the server is running, the authentication credentials are still valid, etc. The current endpoint for repository versions has this problem too. The repo versions are immutable, but they're also deletable. There's no guarantee the version exists if I make a followup call to grab my content from it.
>
> With all this in mind, let's revisit some use cases for this:
>
> * U1: As an admin, I want to know who has access to download some content
> * U2: As a content consumer, I want to know where to go to download some content
>
> As you pointed out, (b) doesn't make any sense for U1. The content can be distributed from any version of the repository, so knowing if it's in latest doesn't help you make that determination anymore, so it would be better to make a call to the repo versions api endpoint.
>
> U2 is a little more complicated. In the ansible world, clients can only download content from a distribution. If we just implement (a), to determine where my content is available, I would need to request the list of repository versions, and then figure out which repository versions are part of a distribution I have access to. Since distributions are just pointers to repository versions, it seems like it would still make sense to provide a ?contains_content filter to the distributions endpoint.

In my previous 2 comments I was mostly expressing my concerns over (b) as the main API endpoint of reference for repo content search. (a) and (c) make sense to me.

> Providing a ?contains_content on the repository APIs would also be very helpful from a usability perspective, even if it means that you have to make a followup request to get the list of versions that the content is actually in. Going back to U1, let's say an admin is trying to assess which clients may have downloaded some malicious content. If the content is in 3000 versions of repo foo, 200 versions of bar and 1 version of foobar, then the admin has to page through 3201 repository versions to find out that there are 3 repos that demand their attention. It would be easier to get a list of 3 repositories in one API call and then go through the list of repo versions for each repo to perform cleanup or whatever they need to do.

I am still not sure I am sold on this. If I have malicious content, I plan to remove it.

  1. I will make a call to the repo_versions API endpoint with ?latest_only=True to identify the hrefs for foo, bar and foobar, and then
  2. make 3 repo API calls, one per repo, to remove the malicious content.
  3. Since repo versions are immutable, I can just delete them, so won't I perform 3201 DELETE calls on repo_versions anyway?

I am not very much against providing ?contains_content on the repository APIs; I just don't see any added value in it, because the result of the API call will tell me that one of the 3000 repo_versions from repo foo contains content X. Well, thanks, and what's next? Next, I am going to perform (1) with ?latest_only=False so I can find all affected repo_versions, and then (3) to delete them.

EDIT: well, I do see value for the sake of convenience and generic informative call, to not scroll through the X number of pages of all repo_version results.

> I guess this is a long-winded way of saying I really want us to implement (c). Limiting your search to just the latest version on a repo doesn't make as much sense, so (a) and (b) might not be necessary, but I would still love to have some content filtering capabilities on the repository endpoint.

yep +1 on (c)

> cc @bmbouter

@mdellweg mdellweg changed the title As a user, I want to know which repositories/distributions some content is in As a user, I want to know which repositories~/distributions~ some content is in Jul 14, 2022
@mdellweg mdellweg changed the title As a user, I want to know which repositories~/distributions~ some content is in As a user, I want to know which repositories~~/distributions~~ some content is in Jul 14, 2022
@mdellweg mdellweg changed the title As a user, I want to know which repositories~~/distributions~~ some content is in As a user, I want to know which repositories some content is in Jul 14, 2022
@daviddavis
Contributor

We would really like to have this feature. And we'd prefer to filter repositories (as opposed to repo versions) by content as we don't expose repo versions to our users. Our use of Pulp is that users simply create, publish, and distribute repos so they have no concept of repo versions.

daviddavis added a commit to daviddavis/pulpcore that referenced this issue Apr 26, 2023
@pulpbot pulpbot moved this to Needs review in RH Pulp Kanban board Apr 26, 2023
daviddavis added a commit to daviddavis/pulpcore that referenced this issue Apr 27, 2023
daviddavis added a commit to daviddavis/pulpcore that referenced this issue May 5, 2023
mdellweg pushed a commit that referenced this issue May 5, 2023
@pulpbot pulpbot moved this from Needs review to Done in RH Pulp Kanban board May 5, 2023