Optimize ACS stage #4274

Draft · dralley wants to merge 3 commits into main from acs-optimize

Conversation

@dralley (Contributor) commented Aug 14, 2023

ACS uses a lot of memory, apparently: https://bugzilla.redhat.com/show_bug.cgi?id=2219885

RemoteArtifact.objects.acs()
.filter(**{f"{checksum_type}__in": batch_checksums[checksum_type]})
.only("url", checksum_type)
.select_related("remote")
@dralley (Contributor, Author) Aug 14, 2023

TODO:

  • figure out how to get rid of this select_related if possible. Currently we set the hydrated remote object below, so I'm not sure we can get away with just using the PK. Unless we fetch the remotes separately and cache them? (See the sketch below.)
  • sanity-check the overall logic
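
A minimal sketch of the "fetch separately and cache" idea from the first bullet. This is an illustration, not the PR's code: `acs_remote_ids` is a hypothetical precomputed set of remote PKs, and `.acs()` is the queryset helper used in the snippet above.

```python
from pulpcore.plugin.models import Remote, RemoteArtifact

# One query up front: hydrate every remote an ACS RemoteArtifact can point at.
remote_cache = {r.pk: r for r in Remote.objects.filter(pk__in=acs_remote_ids)}

ras = (
    RemoteArtifact.objects.acs()
    .filter(**{f"{checksum_type}__in": batch_checksums[checksum_type]})
    .only("url", checksum_type, "remote")  # loads only the raw FK column, no JOIN
)
for ra in ras.iterator():
    # ra.remote_id is available without an extra query; a dict lookup replaces
    # paying for select_related("remote") on every row.
    remote = remote_cache[ra.remote_id]
```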

@dralley force-pushed the acs-optimize branch 2 times, most recently from a4ac583 to e9d81e5, on August 14, 2023 at 04:04
Comment on lines 468 to 476

+ acs_query = AlternateContentSource.objects.filter(pulp_domain=self.domain)
+ acs_exists = await acs_query.aexists()
  async for batch in self.batches():
-     acs_query = AlternateContentSource.objects.filter(pulp_domain=self.domain)
-     acs_exists = await acs_query.aexists()
Member

+1
Idea: Is it faster to not batch at all without ACS? Or should we move this check out to the point where we can decide whether the pipeline should include this stage?
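
A sketch of the second idea: decide once, while assembling the pipeline, whether the ACS stage belongs in it at all. The stage list and the `first_stage`/`domain` variables are simplified illustrations, not code from this PR:

```python
from pulpcore.plugin.models import AlternateContentSource
from pulpcore.plugin.stages import (
    ACSArtifactHandler,
    ArtifactDownloader,
    ArtifactSaver,
    QueryExistingArtifacts,
)

# Build the stage list conditionally instead of checking inside the stage on
# every batch: a domain with no ACS never pays for the stage at all.
stages = [first_stage, QueryExistingArtifacts()]
if AlternateContentSource.objects.filter(pulp_domain=domain).exists():
    stages.append(ACSArtifactHandler())
stages.extend([ArtifactDownloader(), ArtifactSaver()])
```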

Contributor (Author)

> Or should we move this check out to the point where we can decide whether the pipeline should include this stage?

Is commit temp3 what you're looking for?

Member

Exactly.

Also, I'm not grokking the concept of ACS properly... Are they tied to neither a repository nor a remote? Is that extremely generic exists query the best we can do?

Member

Looking at this line:

remote = models.ForeignKey("Remote", null=True, on_delete=models.PROTECT)

Why do we combine null=True with on_delete=PROTECT? If we could say an ACS is always tied to a remote, I would understand a lot more.
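
For contrast, this is what the stricter declaration would look like if an ACS were always tied to a remote (illustrative only, not code from the PR):

```python
# Non-nullable variant: every ACS must reference a Remote, and PROTECT then
# refuses to delete any Remote that an ACS still points at.
remote = models.ForeignKey("Remote", on_delete=models.PROTECT)  # null defaults to False
```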

Contributor (Author)

I'm not sure; I think Pavel wrote all of this. @dkliban, do you know?

Contributor (Author)

For that matter, I wonder why (according to the docs) it only has an effect with on_demand remotes. I would think you'd want the ACS to be available during immediate syncs too?

https://github.com/pulp/pulpcore/blob/3cb7ef4369fd612d17e2be7863845f6cc74c141f/docs/workflows/alternate-content-sources.rst#overview

Member

I don't think the download_policy should matter.

@ipanova (Member) Aug 22, 2023

ACS is not tied to any repo or remote, yet it should have a remote that it's going to use during the refresh action.
The idea was to always create that remote with the on_demand policy, so the refresh endpoint takes as little time as possible by creating only RemoteArtifacts (look at it as indexing content only). Those RAs can be used later on during repo syncs.
Each ACS is assumed to be global (per domain), so every rpm sync will check with the rpm-typed ACS content and prefer its binary data over the original sync remote.
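
A rough sketch of the refresh behavior described above. The `acs` and `discovered_entries` names are hypothetical; the real logic lives in pulpcore's ACS refresh task:

```python
from pulpcore.plugin.models import RemoteArtifact

# Refresh only indexes content: it records where the bytes live as
# RemoteArtifacts, without downloading anything (hence the on_demand remote).
ras = [
    RemoteArtifact(
        url=entry.url,                       # where the ACS can serve the bytes
        sha256=entry.sha256,                 # digest used to match content at sync time
        content_artifact=entry.content_artifact,
        remote=acs.remote,                   # the on_demand remote attached to the ACS
    )
    for entry in discovered_entries          # hypothetical: metadata parsed by refresh
]
RemoteArtifact.objects.bulk_create(ras, ignore_conflicts=True)
```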

@dralley force-pushed the acs-optimize branch 2 times, most recently from 504a221 to 4b1930d, on August 14, 2023 at 21:07
Comment on lines +499 to +510
d_artifact.urls = [
    existing_ras_dict[checksum]["url"]
] + d_artifact.urls
d_artifact.remote = existing_ras_dict[checksum]["remote"]
Contributor

After reading through the code I am mega-confused about how ACS even works, and reading the workflow docs did not clear it up for me: https://docs.pulpproject.org/pulpcore/workflows/alternate-content-sources.html

I think there is a big problem with just overwriting the d_artifact.remote field with the one from an existing RA, which could be from a different remote! It means all the current urls on the d_artifact could be rendered useless, since they won't have the correct remote to build the downloader from. I have a feeling this feature is not working entirely how we intend it to.

Contributor (Author)

Yeah, this isn't my strong suit either. If I understand correctly, the basic idea is to preferentially use a different remote instead of the original one for downloads. What I believe might make it OK is that, when it comes to saving remote artifacts, it's ultimately going to save new remote artifacts rather than modify existing ones. But in terms of the download process, it will be the new ones that are used?

I think! It would be worthwhile to verify those assumptions and document the actual details while I'm working on this code anyway.

@ggainey (Contributor) Aug 21, 2023

There's a lot of discussion in the original plan.io issue: https://pulp.plan.io/issues/7832

And some questions/answers in the hackmd: https://hackmd.io/@pulp/acs

if getattr(d_artifact.artifact, checksum_type):
    checksum = getattr(d_artifact.artifact, checksum_type)
    if checksum in existing_ras_dict:
        d_artifact.urls = [
@ipanova (Member) Aug 23, 2023

Bringing back, for transparency, some of the matrix conversation @dralley and I had.
I see a few issues with the current design:

1) The original intention for this line was for the declarative artifact to accept a list of URLs which the downloader would try in a loop. Those URLs would come from a mirrorlist, and the remote would either need no credentials at all, or the same credentials would work for every URL in the list. I'm not sure how worthwhile it is for the ACS stage to have a list of URLs - it would be a mix: one coming from the original remote, most likely needing credentials, and another from the ACS (with or without credentials, or different credentials?). The remote which is going to be used is the one from the ACS.
So maybe this needs to be reverted back to d_artifact.url = existing_ras_dict[checksum]["url"]
2) Another problem is that if the ArtifactDownload stage fails (namely, the download of binary data from the substituted ACS source), there is no fallback to the original remote downloader + url. In normal circumstances the ACS source is expected to be available, because that's the preferred and fastest way to get the binary data, but no one can guarantee normal circumstances 100% of the time. (See the fallback sketch below.)
3) We should first iterate through all available ACSes to download from and then fall back to the original remote, but instead we provide just one (this is currently a pipeline limitation, where a declarative artifact can reference just one remote).
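
A minimal sketch of the missing fallback from point 2, with hypothetical `acs_remote`/`acs_url` and `original_remote`/`original_url` variables; the current pipeline does not do this:

```python
async def download_with_fallback():
    # Prefer the substituted ACS source, but fall back to the original
    # remote + url if the ACS download fails, instead of failing the sync.
    try:
        downloader = acs_remote.get_downloader(url=acs_url)
        return await downloader.run()
    except Exception:
        downloader = original_remote.get_downloader(url=original_url)
        return await downloader.run()
```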

stale bot commented Nov 21, 2023

This pull request has been marked 'stale' due to lack of recent activity. If there is no further activity, the PR will be closed in another 30 days. Thank you for your contribution!

stale bot added the stale label on Nov 21, 2023

stale bot commented Dec 21, 2023

This pull request has been closed due to inactivity. If you feel this is in error, please reopen the pull request or file a new PR with the relevant details.

stale bot closed this on Dec 21, 2023
@dralley reopened this on Jan 24, 2024

stale bot commented Jan 24, 2024

This pull request is no longer marked for closure.

stale bot removed the stale label on Jan 24, 2024

stale bot commented Apr 24, 2024

This pull request has been marked 'stale' due to lack of recent activity. If there is no further activity, the PR will be closed in another 30 days. Thank you for your contribution!

stale bot added the stale label on Apr 24, 2024

stale bot commented May 24, 2024

This pull request has been closed due to inactivity. If you feel this is in error, please reopen the pull request or file a new PR with the relevant details.

stale bot closed this on May 24, 2024
@dralley reopened this on May 24, 2024

stale bot commented May 24, 2024

This pull request is no longer marked for closure.

Commits

Lift some queries that were performed per-batch outside of the loop, resolving an N+1 query pattern where N = the number of batches. Also use .iterator().

[noissue]
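
On the .iterator() part of that commit, a small illustration of why it matters (the queryset name and per-row handler are hypothetical):

```python
# .iterator() streams rows from the database in chunks instead of materializing
# and caching the entire result set on the queryset, keeping memory bounded.
for ra in remote_artifact_qs.iterator(chunk_size=2000):
    handle(ra)  # hypothetical per-row processing
```
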
Instead of performing a gigantic AND-ed OR clause query, break up the
list of remote artifacts by checksum type and perform one IN query per
type of checksum, which ought to be easily indexable.

[noissue]
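
A sketch of the reshaped lookup this commit describes, assuming `batch_checksums` maps each checksum type to the digests seen in the batch and reusing the `.acs()` queryset helper from the PR:

```python
from pulpcore.plugin.models import RemoteArtifact

# One IN query per checksum type, each answerable from a single index on that
# digest column, instead of one giant OR-of-ANDs across all columns at once.
existing_ras = []
for checksum_type, digests in batch_checksums.items():
    existing_ras.extend(
        RemoteArtifact.objects.acs()
        .filter(**{f"{checksum_type}__in": digests})
        .only("url", checksum_type)
    )
```
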
Lift the ACS domain check out of the stage entirely.

[noissue]
6 participants