-
Notifications
You must be signed in to change notification settings - Fork 1k
Add satisfiability check for case variants #1079
Conversation
acd581e
to
fad46df
Compare
internal/gps/selection.go
Outdated
} | ||
|
||
func (s *selection) popDep(id ProjectIdentifier) (dep dependency) { | ||
deps := s.deps[id.ProjectRoot] | ||
dep, s.deps[id.ProjectRoot] = deps[len(deps)-1], deps[:len(deps)-1] | ||
|
||
prlist := s.prLenMap[len(id.ProjectRoot)] | ||
s.prLenMap[len(id.ProjectRoot)] = prlist[:len(prlist)-1] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this supposed to do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is the question more about what prLenMap
does in general, or just this specific segment here?
it's popping the last element off the slice of project roots on the prLenMap
. it's dual to append we do up in *selection.pushDep()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes total sense now. Looks like I forgot Go slice semantics. 😛
internal/gps/selection.go
Outdated
} | ||
|
||
// TODO(sdboyer) bug here if it's possible that strings.ToLower() could | ||
// change the length of the string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will dropping in utf8.RuneCountInString()
for len()
fix this?
What if the map was keyed on the lowercase string rather than their length or rune-count?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i believe i considered that approach initially, then dismissed it. lemme reconstruct my thought process...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Were you betting on len()
being a shortcut/hueristic to do less ToLower
calls overall?
I was thinking that if you can't use len()
(O(1)
), and are forced to scan each one to count the runes anyways (O(n)
), then ToLower()
isn't any more complex at least. But there might be other things to consider, or constant factors dominating.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i can't reconstruct the original thought process, but it's fine, because i just bit the bullet and dug into how the toolchain does this.
They have a special ToFold()
helper, which generates a stable, folded representation of a string that we can key the map on. at least, i guess that's what it does - i thought this was harder. but, we're at least sorta covered here, because the compiler is the ultimate arbiter of what's acceptable here. of course, i'm relying on current implementation of just one compiler, not a spec, and maybe case-insensitive filesystems will disagree about this definition of case equivalence...but now we're in the domain of some truly far-fetched cases, so whatever.
ToFold()
is still O(n) even in its fast case, but we no longer need the slice on the other side, so we have fewer checks to do on the other side of the lookup. So, likely still slower than the O(1) len()
, but at least we get something back. Either way, it's worth it to not have to care about.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lololol apparently i found toFold()
before - #433 (comment) - and forgot about it when i actually sat down to implement this
ugh lol this is metastasizing a bit (as solver checks kinda tend to) - i realized that there are more cases to address slightly differently:
i'm gonna punt on the first one, because it's kinda vanishingly unlikely to happen right now, and i can't think of a way to do it without inducing a relatively expensive validation check in the main solving loop. a good solution down the road might be to precalculate error conditions like that in "as needed" may also include up in dep, outside of the solver, for case-variant imports within the root project. alternately, and probably preferably, it might also be checked as part of the second item, though, does need to be done now, and kinda needs its own failure type. it needs to be clearly opinionated about the fact that the dependers are doing the wrong thing, whereas |
ok, this has grown more. in the course of getting tests for the alternate failure type (also now implemented) where we can unambiguously infer canonical import path information from the internal import path patterns (num 2 above), i ran into difficulties getting the harnesses to behave. so, now the gps solver testing harness has effectively been made case-insensitive for the "root portion" (up to the first this is a significant choice, because it's now basically just dumb to not follow suit in the real in looking at all that, i also realized that we have a nasty bug - on a case-insensitive filesystem, having case-variant project roots results in a situation where there are multiple |
02ca36d
to
f2d1529
Compare
} | ||
|
||
f := func(name string, pi1, pi2 ProjectIdentifier) { | ||
t.Run(name, func(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NBD, but for cases like this, sometimes I like to define a struct for the test state, with a run
method:
type testName struct {...}
func (testName) run(*testing.T) {...}
Then instead of:
f("folded first", folded, casevar1)
f("folded second", casevar1, folded)
f("both unfolded", casevar1, casevar2)
You get:
t.Run("folded first", testName{folded, casevar1}.run)
t.Run("folded second",testName{casevar1, folded}.run)
t.Run("both unfolded", testname{casevar1, casevar2}.run)
Which reads nicely, and avoids the multi-level closure and indentation hell that sometimes comes with these sorts of tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mm yes, good point, i am being lazy with these. table-based declaration of them is more generally standard and does look better. i keep seeing your changes come in and thinking "i should remember to declare like that 😛 "
i'm backing away from the HFS analogy, as it isn't great. at the very least, it's misleading, as the operations we perform aren't terribly analogous to the ones performed by filesystems. in any case, i think we're now at a stable spot with this. the core checks are in place in the solver itself, both the testing harness and the in general, the new satisfiability check is just a big win. there are really no cases that were working before that it cuts out - it just prevents dep from accepting solutions which already weren't going to result in a compilable build. the only drawback is the performance cost: we now have to perform case folding on each external import root at each step in the solver. we don't have benchmarks (there's another TODO #896) to know the actual impact of that, and it's all masked now anyway by the constant-factor costs of network and disk interaction. i'm slightly less bullish on treating root portions of paths as case-insensitive. we are, of course, within the bounds of what the compiler enforces by doing it, but we're also reinterpreting that logic in a different domain (local disk vs. network). but this PR is effectively deciding that all code hosts treat the root portion of the code they host as case-insensitive. even if it's considered bad practice to vary only by case, i can't imagine all code hosts actually enforce this as a rule in the way that we now effectively assume they do. the reasons to do this anyway, despite that risk, are:
as the code is currently written, the risks are:
|
I'm very curious if this has ever happened intentionally. I can't think of any example, and so far, neither has Twitter. Also, if what dmorsing is saying is correct, this situation will already cause the Go compiler to fail to build, which would mean it's not necessary for dep to solve it. (In other words, if it's a build failure, it's unlikely to be the case with any packages that exist today, and if it happens tomorrow for a new package, well, even without dep, it'd still provide a build failure). We should confirm that, though! As for the case folding, which is what you pinged me about: My reading of the Unicode docs implies that this isn't a terrible idea (albeit not recommended):
In this case, the folded "text" that we're storing is actually the key in a key/value store, whereas the docs seem to be written with the "value" mostly in mind. (For example, according to the Unicode docs, we shouldn't fold the source code itself and store that - even aside from the fact that it'd completely break compilation in Go). That said, I'm not wholly convinced that the folding is necessary on-disk. Presumably the contents would be identical in both directories. While that's a minor waste of space, vendoring itself is a solution that only makes sense of we treat disk space as a resource abundant enough not to require minor optimizations. And from a version control perspective, git will deduplicate the files, so the amount of additional space needed in the repository is negligible. (I'm less familiar with other version control systems, to be honest, but I think most should handle this reasonably). As you mentioned, the logic for the solver (ensuring that these versions are treated identically) has to exist anyway, so other than saving a few bytes on disk, I don't see a strong benefit to dropping that piece of information. (That is, the information of which casing was used to access a library at the time it was vendored by dep). |
awesome, thanks for taking the time on this! 😄
he is indeed correct - that's what i reference earlier on in here as being the rule in the compiler. this is where that check is. and this comment has a user running into it: #433 (comment).
it's kinda the other way around, actually. in order to produce a depgraph that the compiler will find acceptable, this PR introduces checks that make it impossible for case-only differences in import paths to exist in any solution that it finds. (that was the original goal of this PR; everything related to filesystems and storage is actually just a knock-on effect of addressing this original problem in the solver). otherwise, dep will pick out a set of dependencies that won't actually work, and won't even be writable on a case-insensitive filesystem (e.g. #797). people end up having to resolve this crap manually, which has been an arduously difficult process for a number of users already. these changes can't fix it for them, but it at least tells people which of their dependencies are using problematic imports and need to be fixed, as well as attempting to find versions of those dependencies that don't have a problem. i should have an example of what the
coooool cool good, ok. yes, that makes a lot of sense, and assuages my concerns.
it may well not be. i opted for this approach mostly because uniformity seemed beneficial. but, some things to clarify:
indeed, i think the "avoiding waste" is not a terribly good argument for doing the on-disk folding. the crucial requirement here is rather that we strictly control there being only a single object (a but doing that doesn't necessarily entail using a folded case on the filesystem itself - only that in the in-memory maps, we keep the folded case on hand for lookup purposes, so that subsequent calls to fetch the the network activity is a tad more concerning than the disk usage, but still probably negligible. however, it may become more of a pain in the future - e.g., under #431, it may become useful for people to forcefully clear the caches for a particular dependency. (hopefully not, but...) in such cases, it seems to me it might be easier that if we generally follow the pattern of keying on the case-folded-form everywhere, we might avoid gotchas in that arena.
just to be totally clear, that's not actually the moment we're talking about here. what appears in your ¯\_(ツ)_/¯ |
here's some sample output from the tests introduced in the PR:
it's the there's a more verbose failure message that gets dumped than the one in the tracer, but...well, yeah, suggestions on the wording of these failure messages is also welcome 😄 |
This lifts the exact folding algorithm and use pattern followed by the toolchain up into gps. Not only does it solve the strings.ToLower() inadequacy, but it means we're using the exact same logic the go compiler does to decide this same question.
May or may not end up using this right away, but it'll be in place for when we have the slightly stronger failure case of a project being addressed with an incorrect case, as indicated by the project's way of referencing its own packages.
This effectively makes them case-insensitive, case-preserving.
Also add an up-front check for case variant map lookup misses, and update the map accordingly.
eaeb2a1
to
9e038e6
Compare
the rabbit hole kinda keeps going down with this one. one thing, for example, that i need to look at - must import comments be byte-literal matches, or do they case fold as well? this could end up mattering, as these casing rules start spreading their tendrils through dep :( |
ugh...actually, so, that latest change just unconditionally always operates on the folded form of URLs when interacting with remote services. that's truly assuming that they're case-insensitive, rather than the weaker case where they can be case-sensitive, but still disallow case-only variations in the data they host. the latter seems much safer. need to make another change tomorrow to accommodate that. |
Keeping track of what maps to what in the sourceGateway setup can be really tricky with all the combinations; in the event of failures in this test, this will show the mapping tables, which helps a lot with understanding the actual final state.
But, still preserve the rule that we record the canonical folded URL in memory, so that we can have non-canonical inputs come in first and still converge with subsequent canonical, or other-case-variant forms later.
OK, those issues are now ameliorated - we don't case-fold what we write to disk, but we do case-fold in memory. This being case sensitivity, I imagine there's still gremlins running around somewhere, but I think they've been banished sufficiently far underground that we won't hear from them for a while. |
var buf bytes.Buffer | ||
|
||
str := "Could not introduce %s due to a case-only variation: it depends on %q, but %q was already established as the case variant for that project root by the following other dependers:\n" | ||
fmt.Fprintf(&buf, str, e.goal.dep.Ident.ProjectRoot, e.current, a2vs(e.goal.depender)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The order of arguments here is wrong. They should match the arguments to the previous fmt.Sprintf call, with the a2vs first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is already merged, so a PR fixing it is preferred :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I ran into this deep into a concerted effort to port a project from govendor at the end of the work week, so I figured a quick note was better than nothing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for sure, better to get the note down when you have the eureka moment. i'll take a PR whenever you find some time 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this do / why do we need it?
This introduces a new satisfiability check in the solver that ensures we don't have any import paths that vary only by path. It's actually really just
ProjectRoots
, not import paths as a whole, because internal package paths are implicitly verified to be the right case by the package existence checker - that's grounded in the reality of a case-sensitive comparison against what's been read from disk and built into thePackageTree
.The effect of this is that dep will only allow one case-variant of any given import path in a solution/a
Gopkg.lock
. It will search until it finds combinations of versions of projects that maintain this invariant - as well as all other satisfiability criteria, like version constraints - or fail out with an informative message if no such combination exists.What should your reviewer look out for in this PR?
still need a handful more tests to cover the combinations
Do you need help or clarification on anything?
Which issue(s) does this PR fix?
several, at least.
fixes #433
fixes #797
fixes #806