Merge identical forks #5405
Conversation
Consider these requirements from pylint 3.2.5:

```
Requires-Dist: dill >=0.3.6 ; python_version >= "3.11"
Requires-Dist: dill >=0.3.7 ; python_version >= "3.12"
```

We will split on the Python version, but we may then pick a version of `dill` that is `>=0.3.7` in both branches and end up with an otherwise identical resolution in both forks. In this case, we merge the two forks and store only their conjoined markers.
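A minimal sketch of that merge step, using simplified stand-in types (`Graph`, `ForkResolution`, and `save_fork` are hypothetical illustrations, not uv's actual API):

```rust
/// Hypothetical, simplified stand-ins; uv's real `Resolution` and marker
/// types are considerably richer.
#[derive(Clone, PartialEq, Eq, Debug)]
struct Graph(Vec<(String, String)>); // sorted (package, version) pairs

#[derive(Debug)]
struct ForkResolution {
    graph: Graph,
    /// Markers under which this graph applies; merging a duplicate fork
    /// amounts to extending this disjunction instead of storing a new graph.
    markers: Vec<String>,
}

/// Before saving a newly solved fork, look for an existing fork with an
/// identical graph; if found, keep one graph with the combined markers.
fn save_fork(forks: &mut Vec<ForkResolution>, graph: Graph, marker: String) {
    if let Some(existing) = forks.iter_mut().find(|f| f.graph == graph) {
        existing.markers.push(marker);
    } else {
        forks.push(ForkResolution { graph, markers: vec![marker] });
    }
}

fn main() {
    let graph = Graph(vec![("dill".into(), "0.3.7".into())]);
    let mut forks = Vec::new();
    save_fork(&mut forks, graph.clone(), r#"python_version >= "3.12""#.into());
    save_fork(&mut forks, graph, r#"python_version == "3.11""#.into());
    assert_eq!(forks.len(), 1); // identical graphs merged into a single fork
}
```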
```diff
@@ -2426,6 +2462,30 @@ pub(crate) struct ResolutionDependencyEdge {
 }

 impl Resolution {
     fn universal() -> Self {
```
That method gets removed upstack.
```rust
    .map(ResolverMarkers::Fork)
    .unwrap_or(ResolverMarkers::Universal);
continue 'FORK;
}
```
I buy this, but two thoughts come to mind.

Firstly, is it possible for more than one duplicate resolution to exist at any given point in time? If so, I believe this would only find one of them. But I do not think that is the case, since this runs for every resolution before it is "saved," so more than one duplicate resolution should never appear.

Secondly, this is doing an exhaustive search over all existing resolutions to find a possible duplicate, and I suspect that the `Resolution::same_graph` routine is itself not especially cheap. I think this ends up being quadratic in the number of forks (which are themselves exponential in the number of dependencies, I think? or possibly in the depth of the dependency tree). I don't have a good feel for how big an issue that is in practice. Do we have a sense of what the common case is? I would guess the common case is that there aren't any duplicates, so perhaps we can optimize for that path. (To be clear, I don't mean to suggest that be done in this PR.)
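One way to optimize for that no-duplicates path, sketched under the assumption that a resolution can be summarized as a sorted `(package, version)` list (all names here are hypothetical, not uv's code): bucket resolutions by a cheap fingerprint and only run the expensive graph comparison on fingerprint collisions.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

type Packages = Vec<(String, String)>; // sorted (package, version) pairs

/// Cheap summary of a resolution; equal graphs always hash equal, so a miss
/// in the index proves there is no duplicate without any full comparison.
fn fingerprint(packages: &Packages) -> u64 {
    let mut hasher = DefaultHasher::new();
    packages.hash(&mut hasher);
    hasher.finish()
}

/// Index from fingerprint to resolution indices. In the common case (no
/// duplicate), each new resolution costs one hash and one lookup instead of
/// a scan over all existing forks.
fn find_duplicate(
    index: &HashMap<u64, Vec<usize>>,
    resolutions: &[Packages],
    candidate: &Packages,
) -> Option<usize> {
    index
        .get(&fingerprint(candidate))?
        .iter()
        .copied()
        // Hash collisions are possible, so confirm with the full comparison.
        .find(|&i| &resolutions[i] == candidate)
}
```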
> Firstly, is it possible for more than one duplicate resolution to exist at any given point in time? If so, I believe this would only find one of them. But I do not think that is the case, since this runs for every resolution before it is "saved," so more than one duplicate resolution should never appear.
We fork every time we see conflicting markers, but in many of those cases the requirements themselves are not conflicting (say, `numpy >= 1.16` for one and `numpy >= 1.19` for the other). When forking, we can't yet know whether we'll find a `numpy` compatible with both. I've also seen cases where we end up rejecting the package version we forked on in both branches, removing the conflicting requirements. By copying preferences over from previous forks, we try to coerce two forks into resolving the same package version. Basically, our strategy is to fork often to avoid failing on avoidable conflicts, while still producing a solution with as few divergences as possible.
Re perf: I agree that this is potentially costly, but I think we have to do it to get a desirable resolution. We get some short-cuts from std that make this cheaper: when two forks have a different number of packages, the check is a single `usize` comparison. We also usually have a small number of forks (and the more specific a fork is, the more likely it is that we skip future fork points because we're already more specific), so I see this more as a fixed cost of maybe 10^2/2 = 50 checks. There are of course pathological cases; for those, I think we just have to be a bit slow here to avoid redundant forks in the lockfile.
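The std short-cut mentioned here is just the length check that collection equality performs first. A minimal illustration (this `same_graph` is a stand-in, not uv's `Resolution::same_graph`):

```rust
use std::collections::BTreeMap;

/// Stand-in for a graph comparison: when the two forks contain a different
/// number of packages, the derived equality bails out after comparing the
/// lengths, i.e. a single `usize` comparison, before touching any contents.
fn same_graph(a: &BTreeMap<String, String>, b: &BTreeMap<String, String>) -> bool {
    a == b // `BTreeMap::eq` compares lengths first, then entries in order
}

fn main() {
    let small: BTreeMap<_, _> = [("numpy".to_string(), "1.19".to_string())].into();
    let large: BTreeMap<_, _> = [
        ("numpy".to_string(), "1.19".to_string()),
        ("dill".to_string(), "0.3.7".to_string()),
    ]
    .into();
    assert!(!same_graph(&small, &large)); // rejected by the length check alone
}
```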
## Summary

This was added in #5405 but is now the cause of an instability in `github_wikidata_bot`. Specifically, on the initial run, we fork in `pydantic==2.8.2`, via:

```
Requires-Dist: typing-extensions>=4.12.2; python_version >= '3.13'
Requires-Dist: typing-extensions>=4.6.1; python_version < '3.13'
```

In the end, we resolve a single version of `typing-extensions` (`4.12.2`)... But we don't recognize the two resolutions as the "same graph", because we propagate the fork markers, and so the "edges" have different markers on them...

In the second run through, when we have the forks in advance, we don't split on Pydantic... We just try to solve from the root with the current forks. This is fundamentally different, and I fear it will be the cause of many instabilities. But removing this graph check fixes the proximate issue. I don't really understand why this was added, since there was no test coverage in the PR.
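To make the failure mode concrete, a sketch with simplified types (this `Edge` struct is hypothetical; uv's real `ResolutionDependencyEdge` carries more data): two forks resolve the identical single version of `typing-extensions`, but the fork markers propagated onto the edges differ, so an edge-wise equality check does not see the "same graph".

```rust
/// Hypothetical, simplified dependency edge; the marker field is what breaks
/// equality in the scenario described above.
#[derive(PartialEq, Debug)]
struct Edge {
    from: &'static str,
    to: &'static str,
    marker: &'static str, // fork marker propagated onto the edge
}

fn main() {
    // Both forks resolved typing-extensions 4.12.2...
    let fork_a = [Edge {
        from: "pydantic",
        to: "typing-extensions==4.12.2",
        marker: "python_version >= '3.13'",
    }];
    let fork_b = [Edge {
        from: "pydantic",
        to: "typing-extensions==4.12.2",
        marker: "python_version < '3.13'",
    }];
    // ...but the propagated markers differ, so the graphs compare unequal.
    assert_ne!(fork_a, fork_b);
}
```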