Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lockfile concerns #579

Open
wycats opened this issue Oct 11, 2016 · 27 comments
Open

Lockfile concerns #579

wycats opened this issue Oct 11, 2016 · 27 comments

Comments

@wycats
Copy link
Member

wycats commented Oct 11, 2016

I wanted to open up this issue as a discussion item. I don't think there's anything blockery about this concern (more below), but I do think that the structure of the lockfile will become problematic in the long term.


First, here's a quick example of one of the consequences of the problem, from Yarn's own lockfile:

wordwrap@~0.0.2:
  version "0.0.3"
  resolved "https://registry.npmjs.org/wordwrap/-/wordwrap-0.0.3.tgz#a3d5da6cd5c0bc0008d37234bbaf1bed63059107"
wordwrap@0.0.2:
  version "0.0.2"
  resolved "https://registry.npmjs.org/wordwrap/-/wordwrap-0.0.2.tgz#b79669bb42ecb409f83d583cad52ca17eaa1643f"

There are two dependencies described here (wordwrap ~0.0.2 and wordwrap =0.0.2). Both of them are satisfied by wordwrap 0.0.2, but instead we get two different packages.

This doesn't always happen; there is some code that tries to "dedup" packages:

xtend@^4.0.0, "xtend@>=4.0.0 <4.1.0-0", xtend@~4.0.0:
  version "4.0.1"
  resolved "https://registry.npmjs.org/xtend/-/xtend-4.0.1.tgz#a5c6d532be656e23db820efb943a1f04998d63af"

But it's hardly the only example:

rimraf@^2.2.8, rimraf@^2.4.3, rimraf@^2.4.4, rimraf@^2.5.0, rimraf@~2.5.0, rimraf@~2.5.1, rimraf@2:
  version "2.5.4"
  resolved "https://registry.npmjs.org/rimraf/-/rimraf-2.5.4.tgz#96800093cbf1a0c86bd95b4625467535c29dfa04"
  dependencies:
    glob "^7.0.5"
rimraf@~2.2.6:
  version "2.2.8"
  resolved "https://registry.npmjs.org/rimraf/-/rimraf-2.2.8.tgz#e439be2aaee327321952730f99a8929e4fc50582"

The rimraf ^2.2.8 dependency could unify with rimraf ~2.2.6, but it doesn't.

Worse:

recast@^0.10.0, recast@^0.10.10:
  version "0.10.43"
  resolved "https://registry.npmjs.org/recast/-/recast-0.10.43.tgz#b95d50f6d60761a5f6252e15d80678168491ce7f"
  dependencies:
    ast-types "0.8.15"
    esprima-fb "~15001.1001.0-dev-harmony-fb"
    private "~0.1.5"
    source-map "~0.5.0"
recast@0.10.33:
  version "0.10.33"
  resolved "https://registry.npmjs.org/recast/-/recast-0.10.33.tgz#942808f7aa016f1fa7142c461d7e5704aaa8d697"
  dependencies:
    ast-types "0.8.12"
    esprima-fb "~15001.1001.0-dev-harmony-fb"
    private "~0.1.5"
    source-map "~0.5.0"

All of these versions of recast could unify to recast =0.10.33, but instead we get two copies of recast.

Another example:

once@^1.3.0, once@1.x:
  version "1.4.0"
  resolved "https://registry.npmjs.org/once/-/once-1.4.0.tgz#583b1aa775961d4b113ac17d9c50baef9dd76bd1"
  dependencies:
    wrappy "1"
once@~1.3.0, once@~1.3.3:
  version "1.3.3"
  resolved "https://registry.npmjs.org/once/-/once-1.3.3.tgz#b2e261557ce4c314ec8304f3fa82663e4297ca20"
  dependencies:
    wrappy "1"

All of these patterns can unify to once 1.3.3, but we get two copies.


Hopefully I've made my point 😉


The current yarn strategy successfully unifies dependencies opportunistically, but doesn't have any backtracking mechanism, which is necessary for the unification opportunities described above. It also doesn't provide any way for resolvers to participate in the necessary backtracking. This means that even if we successfully unify the graph we have now via post-processing, there are likely even better opportunities that we'll miss.

The major change in the lockfile format I propose is that the lockfile represents a graph of packages, where the entries are exact package identifiers. This is a change from the current format, where the lockfile is a list of patterns and associated packages.

The reason I think the current status isn't an urgent priority is that a list of seen patterns and associated packages is still deterministic, which is the number one priority of the lockfile. Lower priorities, like conservative updating (yarn update some-pkg changes the minimal necessary subgraph) depend on a more traditional lockfile, but we can work on it over time.


Ultimately, I think a simplified version of the Cargo strategy (the Cargo strategy supports a nice optional dependency mechanism called "features" which we don't support, and which complicates the algorithm somewhat) is the right way to go. The Cargo strategy is a refined version of the Bundler strategy, which was also used (and further refined) by Cocoapods. The Cargo strategy adds support for package duplication, which is necessary (and desired) in the Node ecosystem.

This would mean version-bumping the Yarn lockfile, but the good news is that Yarn's lockfile is already versioned, so this should be fine.

The Rough Strategy

Here's a rough description of the algorithm. (it's slightly modified from Cargo and Bundler's algorithm because of the amount of duplication expected as a matter of course in the npm ecosystem).

Dependency resolution always starts with a top-level package.json, which has a list of dependencies.

Each dependency (once@~1.3.0) expands to a set of possible packages, determined by querying all of the available registries. If a dependency describes a non-registry source (a git repo or tarball, for example), the dependency expands to a set of just one package (in bundler, this is called a "pinned" dependency).

The first step in the algorithm adds a list of all dependencies in the top-level package.json to a remaining dependencies list as a tuple of (dependency, candidate-list).

Initialize an all packages Map whose keys are package names and whose values are Sets of all available packages with that name.

Repeatedly activate a dependency by popping a dependency from the list of remaining dependencies:

  • if a dependency is already satisfied by the existing set of activated packages, continue
  • otherwise, pick a candidate for the dependency;
    • restrict the set of all packages with the name of the dependency to the set of packages that match the dependency.
    • add it to the set of activated packages
    • link the activated package in the graph to the parent package the dependency came from
    • recursively activate it

If a dependency cannot be unified with an existing instance of the same package, consider backtracking if:

  • the set of all packages still available contains packages that could satisfy the dependency
  • the dependency is a peer dependency
  • the package is marked flat

If the --flat flag was passed to yarn, always backtrack if a dependency cannot be satisfied with an existing package with the same name in the set of activated packages. This will effectively turn the algorithm into Bundler, which guarantees the "highlander rule" ("there can be only one" package per name in the graph).

Keep going until the set of remaining dependencies is empty. At this point, a complete graph of unique packages will exist, and be ready to be written out to a lockfile.


A note about backtracking: because dependency graphs with thousands of packages are common, it is not advisable to use recursion when implementing this algorithm in a language with limited stack depth. Instead, we would maintain a side stack (see the same algorithm in Cargo) that allows us to implement the recursive algorithm without blowing the stack.

@ide
Copy link
Contributor

ide commented Oct 11, 2016

I really like the proposal in #422 to use a weighted SAT solver assuming it runs fast enough on large dependency graphs. Then we could optimize for as much deduping as possible, or using the latest packages as much as possible, and so on. It might end up being impractical but modern SAT solvers run impressively fast and can find an optimal solution when a solution exists.

@wycats
Copy link
Member Author

wycats commented Oct 11, 2016

@ide the usual SAT solvers treat constraints as absolute: either two dependencies conflict or they don't. This is more-or-less Bundler's strategy. The strategy described above tries to unify as much as possible, while still allowing duplicates to exist. It's also a back-of-the-envelope strategy (Cargo does something similar, but the ecosystem is much more reliant on semver, defaults to the ^ operator, and ends up with fewer dupes as a matter of course). I'm sure, in practice, it would be tweaked for correctness and performance 😄

@wycats
Copy link
Member Author

wycats commented Oct 11, 2016

@ide your description implies that you know of existing SAT solvers that allow for weighted constraints. Links?

@wycats
Copy link
Member Author

wycats commented Oct 11, 2016

@ide I'd also like to add that the Cargo algorithm is a few hundred lines of code in total (including comments) and is performant enough in practice. The bundler algorithm is a similar size, is written in Ruby, and is performant enough in practice (again: written in Ruby!).

I don't think we need to wait to integrate an existing SAT solver library (unless someone is motivated to do the work and make sure it works reliably on the supported platforms) to do this improvement.

@cpojer
Copy link
Contributor

cpojer commented Oct 11, 2016

I'm excited about your proposal @wycats. Is this something you are planning to actively work on?

@ide
Copy link
Contributor

ide commented Oct 11, 2016

@wycats I was thinking about https://github.com/meteor/logic-solver (linked in that issue), which Meteor uses to compute its dependency graph. They've done the work of making it run on many systems via emscripten I think. There's a method called Logic.Solver#maximizeWeightedSum which computes a solution that maximizes a sum of weighted booleans. I haven't thought about the problem very deeply, I've just seen that many optimization problems can be reduced to SAT. Assuming we can solve that problem it would allow us to compute dependency graphs that are optimal for different needs. I don't feel strongly enough about it to do the work myself, I just think the computed dependency graph has nice properties.

@wycats
Copy link
Member Author

wycats commented Oct 11, 2016

@cpojer yes. I was working on it a bit last week, but decided to write up the proposal first to get feedback before sinking a bunch of time into it.

If people are enthusiastic, I think there are a few steps we can take to land the change incrementally.

@Daniel15
Copy link
Member

Daniel15 commented Oct 11, 2016

I don't really have anything useful to add and I'm probably not clever enough to help solve this, but I just wanted to mention that your description is fantastic and describes the problem very well 😄

For reference, here's how NuGet handles dependency resolution: https://docs.nuget.org/ndocs/consume-packages/dependency-resolution#dependency-resolution-rules. In general, it uses the lowest version that satisfies all version requirements (so in your example of ~0.0.2 and =0.0.2, it'd always use 0.0.2). NuGet is different in that it's mainly for C# and you can only load one version of a particular library. If the version constraint can't be resolved such that a single version wins, it throws an error. I'd actually be fairly interested in having an optional mode for that in Yarn, particularly for my personal webapps where I'm pretty stringent about dependencies and really don't want to load multiple versions of the same library.

@sdboyer
Copy link

sdboyer commented Oct 11, 2016

Just for reference, a basically-analogous algorithm to the one @wycats is describing is also used in dart's pub, and my own take on it for Go is moving towards adoption, as well.

@wycats
Copy link
Member Author

wycats commented Oct 11, 2016

@sdboyer <3 Some of my favorite conversations on this topic have been between the two of us, so I'm happy you recognized my description of the algorithm (it's a tricky algorithm to describe well).

@Jessidhia
Copy link

Not sure if quite matching this topic -- but the "resolved" field doesn't seem appropriate for a lockfile. I mean, you do need to store the resolved version / file hash, but storing even the absolute URL makes lockfiles generated in different machines (say, if a user is using a local sinopia cache) have different resolved URLs depending on user settings.

In the worst case, a lockfile could be generated entirely with localhost URLs, even if fetching from the public registry would have the same result (same downloaded hash); so nobody would be able to install from the lockfile without also setting up a local cache server.

The origin of the file is not important, what matters is that the hashes match.

@dxu
Copy link
Contributor

dxu commented Oct 12, 2016

Just saw this, I'm doing some work to implement a basic version of #422 that uses the aformentioned logic solver by meteor. Seems like both your proposed solution and what I'm implementing might be trying to solve the same problem.

The major change in the lockfile format I propose is that the lockfile represents a graph of packages, where the entries are exact package identifiers.

Possibly unimportant question - the algorithm will always output a single package that satisfies all constraints, but what is the advantage of having the graph of packages on top of this? At that point, couldn't you just have the list of installed exact packages? My understanding of the lockfile is just that it's a file to indicate that packages you already have installed, to compare with versions or ranges in the future, so definitely correct me if I'm wrong here.

The Cargo strategy adds support for package duplication, which is necessary (and desired) in the Node ecosystem.

Sorry, just to clarify, by "package duplication", do you just mean multiple versions of the same package?

Here's a rough description of the algorithm....

Cool! A few questions off the top of my head - it seems like this could potentially result in a "non-optimal" configuration. Given that in large repos, you could potentially have a lot of "valid" configurations due to occurrences like transitive dependencies, or overlapping/conflicting dependencies at various parts of the dependency graph, you could potentially end up in a situation where you've optimistically chosen the first dependency as the "most recent" version, but as a result multiple packages further down the line suffer as a result. A simple example would be something where you have 3 packages - A, B, C - each with versions 0.0.1 -> 0.0.3

A 0.0.1 depends on B*
A 0.0.3 depends on B 0.0.1

B 0.0.1 depends on C 0.0.1
B 0.0.3 depends on C 0.0.3

The top level package.json requires A*.

With the algorithm described above, it'd look at A, optimistically choose the most recent version 0.0.3 (i'm assuming), which will require Bv0.0.1 and transitively, Cv0.0.1. But choosing A 0.0.1 might potentially be a "better" choice, given that you have less out of date packages. The algorithm would then never consider backtracking and checking lower versions of A, because it satisfies the other constraints. At this point, one might argue that that's a pretty arbitrary problem that doesn't have a deterministic solution, but bringing it up in case it is worth discussing. Is this a concern?

Note that the example is very contrived, but hopefully my example gets my question across. You can extend this example with more packages with interwoven dependencies to simulate the backtracking checks - but the idea here is that packages wouldn't ever consider backtracking, because they're already in a "valid" state, because the dependencies are technically already satisfied by the existing set of activated packages - they just haven't explored the other options.

I'll talk about the approach that @yunxing and I discussed offline, and was going forward with for #422 - I think both approaches are trying to solve the same underlying problem. Though, note that this was intended to only be applied during the --flat installation.

With the SAT solver approach, the plan to implement a basic version of this is to model the package version resolution as a logic problem, and using the logic-solver to handle the actual solution-finding. If you take a look at the api for modeling the problem, you should be be able to model the problem as a series of require, or, and implies formulae to indicate explicit version, version ranges, and dependencies, respectively.

The algorithm for setting up the logic problem would be as follows:

  1. All packages that have explicit versions will be required as the explicit version.
  2. Any packages that have ranges of versions will be added as a group of packages or'ed together. So for example, ^0.0.1 for a package that has all available versions 0.0.1, 0.0.2, 0.0.3 would be modeled as or(['0.0.1', '0.0.2', '0.0.3'])
  3. For every dependency, you can set up an implies formula between the explicit versions, or ranged versions. For example, if A v0.0.1 has a dependency on B v0.0.2, you would add implies('A v0.0.1', 'B v0.0.2') as a constraint. You will have to do this for every single valid combination of parent->child dependency. This also includes all pairs formed as a result of a *.

Theoretically, setting this up and telling the solver to solve the problem should result in a set of valid "choices". At this point, you can define the "best choice" heuristic however you want, by adding weights (as @ide mentions above), and running another solver on this smaller set of choices (for example, weighting more recent versions higher). The nice part is, all of the complexity of generating/traversing up and down the dependency graph no longer needs to be handled by us, the logic solver will just "handle" all of that for us. Because we've already fetched all of the information anyway, we just need to keep track of the final list of packages and their versions, and use the manifests we already fetched to install only those packages.

Note that I'm still working on this, so I don't know the performance implications yet. It looks like Meteor is using this within their package manager, with more complex logic.

cc @yunxing, @jordwalke, correct me if I'm wrong here... Props also goes to them (and maybe others that I don't know) for thinking of using the logic solver!

@yunxing
Copy link
Contributor

yunxing commented Oct 12, 2016

Love this! Thanks for thinking about the problem. The proposed algorithm looks good to me, only concern is the performance of the algorithm with thousands of dependencies (this is NP-complete!). But even then we can add a separate command to generate the lock file first (so all subsequent installation can just read the lock file).

@dxu has already started working on the SAT solver approach. It may not be as flexible as the proposed algorithm, but it could be a great fit when in "--flat" mode. I think we should share our plans here to avoid duplicated efforts.

@alloy
Copy link
Contributor

alloy commented Oct 12, 2016

For reference, Bundler and CocoaPods’ resolver implemented in Ruby lives here.

@sdboyer
Copy link

sdboyer commented Oct 12, 2016

@dxu these are some great thoughts! lots to chew on. some quick responses to a couple points:

With the algorithm described above, it'd look at A, optimistically choose the most recent version (i'm assuming), which will require Bv0.0.1 and Cv0.0.1. But choosing A 0.0.1 might potentially be a "better" choice, given that you have less out of date packages (obviously, at this point, one might argue that that's a pretty arbitrary problem that doesn't have a deterministic solution), but bringing it up in case it is worth discussing. Is this a concern?

So yeah, I'm gonna make that argument 😸 But specifically in reference to having, or not having, "more out of date packages."

The algorithm @wycats has described should be able to guarantee that, if a solution is found, then the most recent version of each package is used that satisfies all mutual constraints. (Assuming highlander/no duplication - I haven't yet absorbed the implications of allowing duplicates into my thinking). If the goal were to find the solution with the "minimal" distance from selected to newest versions for the entire depgraph, that would require an extra layer where some sort of cost function (probably having a similar shape to meteor's version pricer) defines those distances, then we look for multiple solutions and pick the one with the minimum total cost. (Or, some entirely different approach.)

Either way, though, I think such an approach would necessarily have to explore the entire combinatorial version space to guarantee it had found "optimal distance." That would eliminate one of the main escape valves a solver can use to minimize running time: we only have to find one solution, not all possible ones. That makes it worth making strategic choices about the order in which deps that are likely to (further) minimize the amount of the space we have to explore.

In a broader sense, though, I'm sorta on the fence about whether it's a Good Thing™ to keep as close to the latest release as possible. I mean, as a goal within dev groups, I think that's probably reasonable, and it's certainly the way most teams I've been on have operated. But that doesn't necessarily mean tooling should operate in the same way. Absent user intent to the contrary, it might be better to default in a conservative direction - one that instead focuses on minimizing change, and thereby minimizing risk. (That's the motivation behind preferred versions, which I'm still experimenting with.)

@wycats
Copy link
Member Author

wycats commented Oct 14, 2016

@dxu @yunxing I've been making some progress on this issue but I definitely see that this overlaps with #422. Can we do a call early next week to sync up our efforts?

@wycats
Copy link
Member Author

wycats commented Oct 14, 2016

Either way, though, I think such an approach would necessarily have to explore the entire combinatorial version space to guarantee it had found "optimal distance." That would eliminate one of the main escape valves a solver can use to minimize running time: we only have to find one solution, not all possible ones.

For what it's worth I think this is extremely important. I have also found empirically that the longer it takes to find a solution, the more likely it is that the solution is conceptually wrong ("I found a version of Rails that is compatible with this version of ActiveMerchant. It's Rails 0.0.1, which didn't depend on ActiveSupport!")

@wycats
Copy link
Member Author

wycats commented Oct 14, 2016

Absent user intent to the contrary, it might be better to default in a conservative direction - one that instead focuses on minimizing change, and thereby minimizing risk

Bundler and Cargo both implement "conservative updates", which optimizes for minimal change rather than newest versions. I strongly prefer that in the absence of a signal to the contrary (yarn update pkgname or updating the package explicitly in the package.json).

@dxu
Copy link
Contributor

dxu commented Oct 14, 2016

Yeah, I should be mostly free. Perhaps we can coordinate through discord/email on a time that works for all of us.

Quick thoughts:

I have also found empirically that the longer it takes to find a solution, the more likely it is that the solution is conceptually wrong

This is true. with the logic solver approach, the heuristics pricing becomes the more important problem than the solution generation, and most likely a problem that is more easy to be improperly implemented.

Probably the main advantage to the solver approach is that it offloads the complexity of solving the dependency graph to the SAT solver and focuses on the heuristics, which allows for more flexibility compared to a deterministic algorithm. Practically, this might be a very rare need. I'm guessing performance with the algorithm probably isn't an issue for 99% of cases given that it also powers bundler and cargo.

@dxu
Copy link
Contributor

dxu commented Oct 17, 2016

a small update in case it's helpful for our upcoming conversation/for anyone interested, I implemented a quick and dirty POC for the sat solver, just for solving all combinations - without the pricing heuristics - you can find details on #422

@graydon
Copy link

graydon commented Oct 17, 2016

your description implies that you know of existing SAT solvers that allow for weighted constraints. Links?

Fwiw the semver solver I prototyped on top of z3 a while ago when this came up in discussion for cargo allows for this; in general z3 permits weighting and optimal-solving by user-defined metrics, as of at least whenever the "z3opt" branch was integrated. It's definitely on trunk these days.

Other MaxSAT solvers include ahmaxsat, toysat, qmaxsat, maxino, maxhs, openwbo ...

@dxu
Copy link
Contributor

dxu commented Oct 18, 2016

Fwiw the semver solver I prototyped on top of z3 a while ago when this came up in discussion for cargo...

For reference: rust-lang/cargo#2064

@yunxing
Copy link
Contributor

yunxing commented Dec 13, 2016

Relevant to this issue: https://research.swtch.com/version-sat
@rsc

@scinos
Copy link

scinos commented Jul 25, 2017

We face that problem as well, and wrote a post-process tool to identify and resolve duplicated packages: https://www.npmjs.com/package/yarn-tools.

@DanielSWolf
Copy link

The unreliable deduping is a real problem for us. We use a number of packages that misbehave if multiple versions run at the same time.

Has there been any progress?

instructure-gerrit pushed a commit to instructure/canvas-lms that referenced this issue Mar 6, 2018
because yarn is silly and frequently adds duplicate entries for things
into yarn.lock when it could satisfy everything the app needs with one
version; when you make any changes to package.json, run this script:
`yarn upgrade-and-dedupe` 
to make sure that yarn.lock doesn't add needless duplicates of things.

if yarn ever fixes this issue:
yarnpkg/yarn#579
then this can go away.


test plan:
1. check this out
2. run `yarn upgrade-and-dedupe`
3. push the change that commit makes
4. it should pass jenkins and not introduce any errors

Change-Id: I8c268cc4c7c6ac4474221ce65129a6ef8b261813
Reviewed-on: https://gerrit.instructure.com/142447
Tested-by: Jenkins
Reviewed-by: Clay Diffrient <cdiffrient@instructure.com>
Product-Review: Ryan Shaw <ryan@instructure.com>
QA-Review: Ryan Shaw <ryan@instructure.com>
ryankshaw referenced this issue in instructure/instructure-ui Jul 26, 2018
To prevent duplicates in consumer bundles

Change-Id: I5aceb0f151654efebd8d8b86e30e28cac11d37e2
Reviewed-on: https://gerrit.instructure.com/158300
Tested-by: Jenkins
Reviewed-by: Ryan Shaw <ryan@instructure.com>
Product-Review: Jennifer Stern <jstern@instructure.com>
QA-Review: Jennifer Stern <jstern@instructure.com>
@fbartho
Copy link

fbartho commented Dec 3, 2020

Just wanted to keep people from thinking this is stale -- this is still a concern today!

@wycats
Copy link
Member Author

wycats commented Dec 9, 2020

Whoa! Notifications about a 2016 comment === blast from the past!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests