Optimize ACS stage #4274
base: main
Conversation
RemoteArtifact.objects.acs()
.filter(**{f"{checksum_type}__in": batch_checksums[checksum_type]})
.only("url", checksum_type)
.select_related("remote")
TODO:
- figure out how to get rid of this select if possible - currently we set the hydrated remote object below so I'm not sure if we can get away with just using the PK? Unless we fetch separately and cache?
- sanity check overall logic
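The per-checksum-type grouping that feeds the query above needs no ORM and can be sketched in plain Python. This is an illustrative sketch, not pulpcore code: the name `batch_checksums` comes from the diff, everything else (`build_batch_checksums`, the sample dicts) is invented for the example.

```python
from collections import defaultdict

# Group the checksums present on a batch of artifacts by checksum type, so
# each type can then be fetched with a single indexable `__in` query instead
# of one big AND-ed/OR-ed clause.
CHECKSUM_TYPES = ("sha256", "sha512")

def build_batch_checksums(artifacts):
    """Map checksum type -> list of checksum values found in this batch."""
    grouped = defaultdict(list)
    for artifact in artifacts:
        for checksum_type in CHECKSUM_TYPES:
            value = artifact.get(checksum_type)
            if value:
                grouped[checksum_type].append(value)
    return dict(grouped)

batch_checksums = build_batch_checksums(
    [{"sha256": "aaa"}, {"sha512": "bbb"}, {"sha256": "ccc"}]
)
```

One query per checksum type then runs against a column index, rather than the database evaluating a large disjunction row by row.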
Force-pushed from a4ac583 to e9d81e5
acs_query = AlternateContentSource.objects.filter(pulp_domain=self.domain)
acs_exists = await acs_query.aexists()
async for batch in self.batches():
    acs_query = AlternateContentSource.objects.filter(pulp_domain=self.domain)
    acs_exists = await acs_query.aexists()
+1
Idea: Is it faster to not batch at all without ACS? Or should we move this check out to the point where we can decide whether the pipeline should include this stage?
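The hoisting under discussion can be shown in miniature without Django. In this sketch, `FakeQuery` and `run_stage` are invented stand-ins; the point is just that the existence check is loop-invariant, so evaluating it once removes one query per batch.

```python
import asyncio

class FakeQuery:
    """Stand-in for AlternateContentSource.objects.filter(...); counts how
    many times the existence check hits the 'database'."""
    def __init__(self):
        self.calls = 0

    async def aexists(self):
        self.calls += 1
        return False  # pretend no ACS is configured in this domain

async def run_stage(batches, query, hoisted):
    if hoisted:
        acs_exists = await query.aexists()  # evaluated once, outside the loop
    for batch in batches:
        if not hoisted:
            acs_exists = await query.aexists()  # N+1: one query per batch
        if not acs_exists:
            continue  # no ACS: pass content through unchanged

naive = FakeQuery()
asyncio.run(run_stage([[1], [2], [3]], naive, hoisted=False))

once = FakeQuery()
asyncio.run(run_stage([[1], [2], [3]], once, hoisted=True))
```

Moving the decision even further out, into pipeline construction, would let the stage be skipped entirely when no ACS exists.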
Or should we move this check out to the point where we can decide whether the pipeline should include this stage?
Is commit temp3 what you're looking for?
Exactly.
Also not grokking the concept of ACS properly... Are they neither tied to a repository nor a remote? Is that extremely generic exists() query the best we can do?
Looking at this line (pulpcore/pulpcore/app/models/acs.py, line 31 in 20afd1e):

remote = models.ForeignKey("Remote", null=True, on_delete=models.PROTECT)

Why do we combine null=True with on_delete=models.PROTECT? If we could say an ACS is always tied to a remote, I would understand a lot more.
I'm not sure, I think Pavel wrote all of this. @dkliban do you know?
For that matter, I wonder why it (according to the docs) only has an effect with on_demand remotes? I would think that you'd want the ACS to be available during immediate syncs too?
I don't think the download_policy should matter.
ACS is not tied to any repo or remote, yet it should have a remote that it is going to use during the refresh action.
The idea was to always create that remote with the on-demand policy, so the refresh endpoint takes as little time as possible by creating only RemoteArtifacts (look at it as indexing content only). Those RemoteArtifacts could be used later on during repo syncs.
Each ACS is assumed to be global (per domain), so every rpm sync will check against the rpm-typed ACS content and prefer its binary data over the original sync remote.
Force-pushed from 504a221 to 4b1930d
d_artifact.urls = [
    existing_ras_dict[checksum]["url"]
] + d_artifact.urls
d_artifact.remote = existing_ras_dict[checksum]["remote"]
After reading through the code I am mega-confused about how ACS even works, and reading the workflow docs did not clear it up for me: https://docs.pulpproject.org/pulpcore/workflows/alternate-content-sources.html
I think there is a big problem with just overwriting the d_artifact.remote field with the one from an existing RA, which could be from a different remote! It means all the current URLs on the d_artifact could be rendered useless, since they won't have the correct remote to build the downloader from. I have a feeling this feature is not working entirely how we intend it to.
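The concern can be reproduced in plain Python. This is a hypothetical miniature mirroring the diff's two assignments (the class and all the values are invented for illustration): after the substitution, the artifact's URL list mixes two sources while the single `remote` field matches only one of them.

```python
class DeclarativeArtifact:
    """Toy stand-in for pulpcore's declarative artifact."""
    def __init__(self, urls, remote):
        self.urls = urls
        self.remote = remote

# One pre-indexed ACS RemoteArtifact, keyed by checksum, as in the diff.
existing_ras_dict = {
    "abc123": {"url": "https://acs.example/pkg.rpm", "remote": "acs-remote"},
}

d_artifact = DeclarativeArtifact(
    urls=["https://origin.example/pkg.rpm"], remote="origin-remote"
)

checksum = "abc123"
if checksum in existing_ras_dict:
    # Mirrors the diff: prepend the ACS URL and overwrite the remote.
    d_artifact.urls = [existing_ras_dict[checksum]["url"]] + d_artifact.urls
    d_artifact.remote = existing_ras_dict[checksum]["remote"]

# d_artifact.urls now spans two remotes, but d_artifact.remote references
# only the ACS one, so a downloader built for the origin URL would carry
# the wrong remote's configuration and credentials.
```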
Yeah, this isn't my strong suit either. If I understand correctly, the basic idea is to preferentially use a different remote instead of the original one for downloads. I believe what might make it OK is that when it comes to saving remote artifacts, it's ultimately going to end up saving new remote artifacts rather than modifying existing ones. But in terms of the download process, it will be the new ones that are used?
I think! It would be worthwhile to verify those assumptions and document the actual details while I'm working on this code anyway.
There's a lot of discussion in the original plan.io issue : https://pulp.plan.io/issues/7832
And some questions/answers in the hackmd : https://hackmd.io/@pulp/acs
if getattr(d_artifact.artifact, checksum_type):
    checksum = getattr(d_artifact.artifact, checksum_type)
    if checksum in existing_ras_dict:
        d_artifact.urls = [
Bringing back for transparency some of the matrix conversation @dralley and I had.
I see a few issues with the current design:
1) The original intention for this line was for the declarative artifact to accept a list of URLs which the downloader would try in a loop. Those URLs would come from a mirrorlist, and the remote would either not need credentials at all or the credentials would be the same for every URL in that list. I am not sure how worthwhile it is for the ACS stage to have a list of URLs - this would be a mix of them, one coming from the original remote, most likely needing credentials, and another from the ACS (with or without credentials, or different credentials?). The remote which is going to be used is the one from the ACS.
So maybe this needs to be reverted back to d_artifact.url = existing_ras_dict[checksum]["url"]
2) Another problem is that in case the ArtifactDownload stage fails (namely the download of binary data from the substituted ACS source), there is no fallback to the original remote downloader + URL.
It is expected in normal circumstances that the ACS source is available, because that's the preferred and fastest way to get the binary data, but no one can guarantee normal circumstances 100%.
3) We should first iterate through all available ACSes to download from and then fall back to the original remote, but instead we provide just one (this is currently a pipeline limitation, where a declarative artifact can reference just one remote).
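The fallback behavior points 2) and 3) argue for might look like the sketch below. This is plain Python with hypothetical names (`download_with_fallback`, `flaky_fetch`), not the pulpcore downloader API: each (url, remote) candidate is tried in order, ACS sources first, and the error is only raised if every candidate fails.

```python
def download_with_fallback(candidates, fetch):
    """Try each (url, remote) pair in order; fetch() raises on failure."""
    last_error = None
    for url, remote in candidates:
        try:
            return fetch(url, remote)
        except Exception as exc:
            last_error = exc  # remember the failure, try the next source
    raise last_error

def flaky_fetch(url, remote):
    # Simulate the ACS source being unavailable.
    if "acs" in url:
        raise ConnectionError("ACS source down")
    return f"bytes-from:{url}"

result = download_with_fallback(
    [
        ("https://acs.example/pkg.rpm", "acs-remote"),        # preferred ACS
        ("https://origin.example/pkg.rpm", "origin-remote"),  # original remote
    ],
    flaky_fetch,
)
```

Because each candidate carries its own remote, this shape also sidesteps the single-remote limitation of the declarative artifact mentioned in point 3).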
This pull request has been marked 'stale' due to lack of recent activity. If there is no further activity, the PR will be closed in another 30 days. Thank you for your contribution!
This pull request has been closed due to inactivity. If you feel this is in error, please reopen the pull request or file a new PR with the relevant details.
This pull request is no longer marked for closure.
This pull request has been marked 'stale' due to lack of recent activity. If there is no further activity, the PR will be closed in another 30 days. Thank you for your contribution!
This pull request has been closed due to inactivity. If you feel this is in error, please reopen the pull request or file a new PR with the relevant details.
This pull request is no longer marked for closure.
Lift some queries that were performed per-batch outside of the loop, resolving an N+1 where N=number of batches. Also use .iterator() [noissue]
Instead of performing a gigantic AND-ed OR clause query, break up the list of remote artifacts by checksum type and perform one IN query per type of checksum, which ought to be easily indexable. [noissue]
Lift the ACS domain check out of the stage entirely. [noissue]
ACS uses a lot of memory, apparently: https://bugzilla.redhat.com/show_bug.cgi?id=2219885