Skip to content
This repository has been archived by the owner on Sep 9, 2020. It is now read-only.

The dep workflow: default constraints, minimal manifest? #213

Closed
nathany opened this issue Feb 1, 2017 · 14 comments
Closed

The dep workflow: default constraints, minimal manifest? #213

nathany opened this issue Feb 1, 2017 · 14 comments
Labels

Comments

@nathany
Copy link
Contributor

nathany commented Feb 1, 2017

This is a meta-issue after some discussion with @sdboyer on #vendor to better understand the dep tool and to both explain my understanding and present what I think the workflow could be like based on what I currently know. Currently I'm only tackling the simple case of building a new app.

The dep tool is unique compared to its contemporaries because it is aware of the import graph contained in source code. That means the import statements in your source code are the source of truth. Bundler, Hex, NPM, etc. look to a manifest file for a list dependencies you manually maintain. This impacts what the so-named (#168) "manifest" file is actually for.

So to use dep, you can start by just writing code. At some point you need to dep init to create the manifest and lock file. The lock file is a full transitive list of all imported code with commit SHAs that make reproducible builds possible. The "manifest" need not list all the dependencies and constraints, in fact it could be essentially empty†. Let me explain.

Say you import "foo/bar" and run dep ensure to ensure that everything imported is correctly vendored and tracked.

The dep tool can be smart enough to fetch the latest stable version v1.3.4 of "foo/bar" rather than pulling master (as with go get) and rather than pulling v2.0.0.beta.1. Pre-releases are always opt-in, as the solver gives preference to stable versions.

At this point there is no need for anything to exist in the manifest, but the lock file will contain a specific commit SHA for "foo/bar" which you commit along with your code (and optionally the vendored copy of "foo/bar" for the hermetic view of the world).

Now lets say you import "baz/qux". Even if "foo/bar" v1.3.5 is available, running dep ensure should vendor "baz/qux" but not touch "foo/bar" at all.

A separate command should exist to dep update foo/bar, which would upgrade to v1.3.5. If no constraints are manually specified, dep update x always gets you the latest stable version, so once v2.0.0 is out, dep update foo/bar would give you that.

IMO, this should be fine for most cases. Upgrading versions is always done in a controlled manner and builds are reproducible thanks to the lock file. Additionally, you can use dep status to see what's going to happen before running dep update x and simply choose not to update to new major versions immediately.

The "manifest" comes into play when the default behaviour isn't what you want.

  • To opt-in to a pre-release version of a dependency ^v2.0.0.beta.1
  • When two downstream dependencies have conflicting constraints, the manifest can specify an override to determine which one wins.
  • Or generally to override the preferences of a downstream dependency if you know better.
  • To lock to an older version due to an incompatibility, license change, or other consideration.
  • To use a temporary fork in place of a given dependency.

These are all exceptions to the general rule of latest stable version. In all these cases, it is helpful to include a comment in the manifest indicating why the constraint is specified.

[dependencies]
"foo/bar" = "^v1.3.5" # v2 is incompatible with baz/qux

Next time you run dep ensure, the constraints in the manifest will override the default behaviour (latest stable version) when solving. However, if v2.0.0 is already vendored, you may see an error, and need to run dep update to update to the older specified version.

How does this sound so far?

If this makes sense, I'd suggest a few changes that could help direct some open issues:

  • The "manifest" should be a human-editable file that is created on init and never modified by the tool. Comments and organization of the file are up to the humans maintaining it. The initial file should include a wall of comments as documentation on usage. To support comments and for readability, the format preferably wouldn't be JSON (Move to TOML for manifest and lock #119), but the lock file could remain JSON.
  • The dep ensure foo/bar variant should go away in favour of a separate update command. Neither command should alter the manifest. To add a dependency, write an import statement in your Go code, and run dep ensure to vendor the latest stable version.
  • The "manifest" probably shouldn't be called "manifest" (Discuss renaming lock.json and manifest.json #168). It provides a way to override default behaviours. It's not a full list of everything (the lock file is).
  • The manifest is for specifying constraints and overrides. It is an error if the manifest specifies a package that isn't ever imported (much like unused variable/import in Go).
  • If supported, dep update foo/bar v1.3.4 could override any constraints specified in the manifest too, vendoring v1.3.4. But assuming this doesn't update the manifest (not worth the trouble), maybe it's not worth supporting constraints on the CLI.

The end result of all this is a manifest designed for humans but that rarely needs editing. For the common case, just write your code, run dep ensure, and commit your work as usual. That's as simple as it gets.

† Yet to confirm that the solver can cope with a large number of unspecified constraints (empty manifest) with a preference for the latest stable version.

@sdboyer
Copy link
Member

sdboyer commented Feb 2, 2017

First, this is an awesome writeup. Thanks for doing this!

I still have some unspecified dread about it being recommended - or at least, standard operating procedure - that constraints are omitted from that manifest. I can't point to a concrete reason for it, but this is a new (to my mind) way of thinking about how we could operate given the unique properties of the tool we have. And in a complex space like package management, one with significant and permanent consequences, I view new things with some trepidation :)

That said, I do want to like this direction, and I think we should consider it heavily over the coming months of experimentation. With dep/gps, we/I made a very deliberate choice to still have the import graph be queen, rather than replicating it into the manifest. That's what makes an approach like this possible at all. It's also the single property that makes this tooling very idiomatically Go, as it arises naturally from the way that import paths allow for extensible, universal deduction of upstream source location.

As for the elephant in the room:

† Yet to confirm that the solver can cope with a large number of unspecified constraints (empty manifest) with a preference for the latest stable version.

Yeah, so, this is a big deal. The problem here is constraint solving, which is an NP-hard search (as Russ demonstrated). We have to be cognizant of how the design of the tool, and the recommendations we make about its use, could create an ecosystem with pathological solving cases.

I can say some simple things about this:

  • If everyone uses empty manifests, then that's a very fast solve - there's no possibility of (version constraint) conflicts, so the first solution that the solver visits is likely to work. And that's a great solution...if all the versions of code selected actually work with one another.

  • It's possible - and if it turns out to be computationally feasible, I would like to do this - that we may introduce additional satisfiability checks in gps' solver in the future. For example, we may do an intransitive type compatibility check (Implement package type/api compatibility checking sdboyer/gps#67). I think this would be fairly magical, especially if not specifying constraints is a norm - the solver could just figure out that two versions won't work together, absent any constraint info, and move on to the next one. But, version constraint checks are generally in the 10s of nanoseconds range, so we do them first, as these other checks are much more expensive. Turns out, magic ain't free.

  • The real pathological solving cases are ones where the solver has to work through a large number of variables in order to try one solution, discover than solution doesn't work, backtrack and work through another large set to find that one doesn't work. One way this might happen is if there's a shared/diamond dependency deep in the graph - with both conflicting parents at least two levels away from the root - but the interstitial dependencies have a large number of versions and no constraints on them. That could result in the dreaded exponential search path, as the solver explores all the interstitial versions but keeps getting bitten by the further-removed constraint.

    It's possible we could make this situation more likely to occur if the norm is to have unconstrained imports (the interstitials), only adding constraints when we need them. Of course, it's also possible that some changes in solver ordering logic might obviate this as a risk, too :)

  • I am somewhat concerned that if the user starts by working in the carefree wonderland of an empty manifest, and is then suddenly hit with a solve error, then they'd be setting constraints in anger, and may be more inclined to ratchet down the constraint more than they really need to. Ratcheting down constraints like that is locally optimal, but globally harmful - if everyone does it, the ecosystem can seize up, nobody can rely on anyone else's constraints anymore, and everyone has to use overrides (aka the "take my toys and go home" lever).

I've opened sdboyer/gps#165 as one way of trying to get out ahead of this problem a bit - if we can generate random graphs that help identify patterns that gps' solver has problems with, then it could give us a better empirical sense of where the risks may lie.

Beyond that, a couple things, mostly nits:

However, if v2.0.0 is already vendored, you may see an error, and need to run dep update to update to the older specified version.

First, a clarification - dep doesn't care what's vendored when it comes to making decisions about versions to select. It treats vendor as volatile - nothing more than a concrete instantiation of the transitively complete requirements specified in the lock.

So, assuming you meant "if v2.0.0 is what's in the lock": there's no error here, at least not from the solver. There's no flag that prevents incidental downgrades - as long as the version is still admissible according to constraints/other criteria, it's fair game. So If there's an error for this case, it's something dep would have to tack on, or that we'd have to add a new rule for within the solver. (Neither is impossible - I'm just noting the current state of affairs)

At this point there is no need for anything to exist in the manifest, but the lock file will contain a specific commit SHA for "foo/bar" which you commit along with your code

Anything the solver picks on its own is going to be a rev with a corresponding tag or branch (there's one possible exception to this, but not relevant here) - it will record both the tag or branch AND the rev.

Now lets say you import "baz/qux". Even if "foo/bar" v1.3.5 is available, running dep ensure should vendor "baz/qux" but not touch "foo/bar" at all.

In general, this is true. ensure tries to minimize the amount of not-explicitly-requested change that happens. But it doesn't fail/error if that's not possible - it'll search for other solutions. So, if there's a dependency shared between "foo/bar" and "baz/qux" and there's some conflict on it with v1.3.4 of "foo/bar", it is possible that "foo/bar"'s version might end up at v1.3.5 afterwards.

@nathany
Copy link
Contributor Author

nathany commented Feb 3, 2017

Thanks. I'll pretend that I understand half of that, but I think I get the gist, and I appreciate the experimentation in this direction.

The reason behind this line of though for me is the confusion I see when there is a CLI command to add dependencies to the manifest. That suggests a different workflow that is competing with what the import graph brings. Unions of the manifest and import graph, locking and vendoring things that aren't imported yet, I can feel it getting messy. I think dep can and should avoid that in one way or another. This isn't the only way...

Making the manifest minimal puts more emphasis on just writing code and adding imports.

The type compatibility check would be a pretty awesome way to enhance the solver by making the default behaviour smarter, which may not be possible if the everything is persisted to a manifest instead of allowing recalculation.

If the manifest file is only machine-modified, through some series of CLI commands, outputting everything to the file is okay because it can be updated with every run, but then why have a separate manifest and lock file?

It's a different story if the manifest file is intended to be a primary part of the dep UI with just a few simple CLI commands to enforce it.

In the magical world where it's both human & machine editable with comments and order preserved, would it be a too magical for the tool to muck with that manifest file's constraints on behalf of the user? Perhaps not.

But if so, making the manifest constraints optional seems like the best option.

Here's a simple case:

One of my projects uses import "golang.org/x/net/http2". This is one of several repositories without version tags. There is no latest stable version, so the fallback is branch: master. What could be written to a manifest file in this case? In the current version (e93d9b5) of dep:

"golang.org/x/net": {
    "branch": "master"
}

After dep becomes wildly successful and integrated into the go toolchain, everybody starts using semver so fast forward to the future where there is a v1.0.0 tag on golang.org/x/net.

What does dep update golang.org/x/net do? I sure want it to start using v1.0.0, but does that mean rewriting the manifest file? Do I need to go around changing constraints before running the update, and will I do the right thing tag: ^v1.0.0?

If we can get that magical world of a human/machine editable manifest that doesn't feel surprising in use, that may be best because it provides a visual of all the immediate dependencies without having to hand write it. Otherwise, I quite like the idea of specifying the minimum necessary, and using the CLI to provide visuals of the graph (status).

As far as update vs. ensure and reporting errors, what I'm trying to get at is providing a means to add a dependency in an isolated way. This isn't always possible because of shared dependencies, but I don't want to upset the lock file and vendor folder more than necessary. Gradual, controlled additions and upgrades are more my preference (keep it green!). It sounds like this is already the case 👍

@nathany nathany changed the title The dep workflow The dep workflow: default constraints, optional manifest? Feb 3, 2017
@nathany
Copy link
Contributor Author

nathany commented Feb 3, 2017

Apologies for the wall of text. I'm thinking "out loud" and had another thought.

Consider this not-so-uncommon diamond scenario where two packages have the same dependency. My project depends on food and power which both depend on sun v1.0.0. After the release of sun v2.0.0 the power library adopts the new version immediately, but food is still happily using the 1st generation sun library.

When I go to upgrade power, I get a conflict that requires some user input.

After inspecting the code, I notice that food will operate unchanged with sun v2.0.0. Whether it never used the modified APIs in the first place, or it wasn't the breaking change the version suggested. So I want to specify an override of some sort:

[overrides]
"sun" = "^v2.0.0"

This isn't here to say that my project depends on sun. It's an override to resolve a downstream conflict.

NOTE: I really like the idea of the type compatibility check. Might later versions of dep hit the conflict and determine to use sun v2 without user intervention?

What happens when things change? Say food begins depending on artificiallight and power begins using nuclear.

What does my sun override do now that sun doesn't exist in the import graph? I think it should become an error similar to imported and not used.

I don't know if I can explain it, but somehow the minimal manifest just feels better to me here. More organized, with less clutter. I'd go so far as to suggest that the manifest doesn't list [dependencies] at all -- instead it could just be a list of [overrides].

The tools in my past (Bundler, Hex) have this a huge list of dependencies, mostly immediate, but some that are hoisted up just to get around conflicts. That happens a lot when early adopting prerelease versions before the rest of the ecosystem catches up. It's a huge jumble, and it's never really clear whether a dependency can be removed.

The import graph promises to clean that up.

What I don't want here is a union of the manifest and import graph. I want to know when I can drop a dependency.

@nathany nathany changed the title The dep workflow: default constraints, optional manifest? The dep workflow: default constraints, minimal manifest? Feb 3, 2017
@nathany nathany closed this as completed Feb 3, 2017
@nathany nathany reopened this Feb 4, 2017
@peterbourgon
Copy link
Contributor

I'm sorry, I'm having trouble following all of the exposition. Do you think you could pare down the "walls of text" to the specific workflow[s] you have in mind?

@nathany
Copy link
Contributor Author

nathany commented Feb 5, 2017

This has been an exploration of the UX of using dep in one particular direction.

The gist of everything above is to have two inputs into the solver. The import graph and a human-managed "manifest". In the cases above, "manifest" is essentially optional, and can be used to override the default common case (which would be designed to be what people usually want).

This makes the manifest fairly minimal in use (just the exceptions) and the CLI surface minimal too (just enforce the manifest + import graph in a controlled manner).

Alternatives could include:

  • A machine-managed manifest modified exclusively by CLI commands (possibly a lot of them), serving essentially the same purpose. In this case, what the manifest contains isn't part of the UX. It should never, ever need to be modified by people and the name (Discuss renaming lock.json and manifest.json #168) and location (Location of manifest and lock file (document) #207) should reflect that. The UX discussion would focus around the CLI.
  • A magic rainbows human/machine-managed manifest where the machine doesn't stomp over things humans care about or make surprising changes. It is more difficult to get right (implementation), but it would allow a more complete human-editable manifest, instead of just the overrides.
  • Absolute and utter chaos where it's not clear if the user should edit the manifest or use a CLI command, CLI commands with unexpected behaviours (why didn't it vendor?), bizarre special cases like temporarily locking and vendoring things that haven't been used "yet", etc., etc. As in, poor UX that is the result of implementing a lot of cool stuff without giving enough attention to the experience of using it (which I assume is one of the driving reasons why this pre-alpha is available to try out).

Happy to collapse my mutterings and reframe them in terms of the projects direction if desired, preferably after I have a better understanding of what the team's direction with the UX actually is. Thanks!

@sdboyer
Copy link
Member

sdboyer commented Feb 5, 2017

The gist of everything above is to have two inputs into the solver. The import graph and a human-managed "manifest". In the cases above, "manifest" is essentially optional, and can be used to override the default common case (which would be designed to be what people usually want).

Just to be clear, this is how it actually works now. The current project's import graph is represented in SolveParameters.RootPackageTree, and manifest in SolveParameters.Manifest.

@nathany
Copy link
Contributor Author

nathany commented Feb 6, 2017

@sdboyer I'm concerned with how it feels to use and what expectations users are having when using the tool. How does the UX of working with the CLI and manifest (if applicable) convey how it actually works?

When users see dep ensure github.com/pkg/errors@^0.8.0 in the README and then wonder why running said command didn't lock & vendor pkg/errors, there is a disconnect between the UX as presented and what's actually going on.

When I run dep init and see a bunch of stuff in manifest.json, I don't know just-by-looking whether this manifest is getting overwritten all the time, whether I should be editing it or not, etc.

It took some explaining, which can be turned into documentation, that would be good. It would be great if the documentation wasn't necessary.

What dep is actually doing, thanks to the import graph, is wonderful stuff. I really ❤️ it. How can the tool's CLI best reflect it?

@freeformz
Copy link

I was the squeeky wheel here so I want to explain my reasons for championing having dep ensure foo.com/bar/baz@~1.2.3 be a thing. The main motivation was that of convenience. The thought being: Why have to edit the file, manually type in the constraints and then run a command to apply the changes when I can have the tool do the work for me? I've seen that this has caused a lot of concern because of the confusing signaling it's caused.

Moving forward we're going to pursue only having dep create an initial manifest during dep init, but otherwise not modify the manifest, leaving it to the domain of developers to hand modify as necessary. We believe that init provides useful introspection of dependencies in the existing Go ecosystem and as such should continue to create an initial manifest.

I think once these changes are implemented dep will move more towards the system you describe above.

@adg
Copy link
Contributor

adg commented Feb 10, 2017

Just wanted to add my thanks to @nathany for pushing in this direction. I think it's a good outcome.

@sdboyer
Copy link
Member

sdboyer commented Feb 10, 2017

Echoing @adg here - very glad you pushed through with this, @nathany 😄

Given this general change in direction in favor of one of the broader goals (hand-edited manifests), I'm going to close this issue. Now that we're heading at least in this general direction, this issue is too broad to be helpful for further discussion. So, let's replace it with some more specific follow ups, as needed. I've kicked that off with #233.

@myitcv
Copy link
Member

myitcv commented Apr 9, 2017

Related issues include #303 and #277.

Like @peterbourgon I’ve struggled to fully understand/follow this thread, largely because I’m struggling to pick up/use the language that describes various situations/cases.

Some preamble; no points here, just establishing context/whether I've understood things to this point:

  • I acknowledge (per discussion with @sdboyer) that the use of vendor is an implementation detail, so references to vendor below are simply because that’s the current implementation. In this regard I note Feedback: should be able to use go tool on a project's dependencies #368
  • At any point in time, it’s the code I have on disk that gives me a reproducible build or otherwise (assuming the Go tools are the only additional tools required so we don’t need to get into the realm of specifying what other tools are required...)
  • So we have the code for a reproducible build... But where did that code come from? That’s what the lock file should tell us. Insofar as the remote repos and commits are faithful and available... this will be useful information, i.e. we could at any point tell whether our code copy represents an unmodified copy of the stated lock
  • Version information, constraints etc are to my mind an additional level on top... a mapping onto repository concepts that ultimately resolve to the commits we see in the lock file. This is where manifest comes in to my mind.

So now an attempt to formalise the language (please excuse the rather loose/imprecise/inaccurate set notation) used to describe dep:

  • Let’s assume I decide to vendor some code somewhere. Let’s further assume that this location is d/, such that after dep init the files d/manifest.json and d/lock.json exist
  • There is Go code that exists in the subtree rooted at d/ (ignoring d/vendor/) and there are tools that I use in the course of writing/using that code (i.e. they could be development or runtime tools). Let’s ignore all non-Go dependencies for now... then we define:
    • Set A: a subset of the packages go list ./... when run from d/ (assuming the implementation as adopted in cmd/go: exclude vendor dir from matching ... go#19090, hence everything in vendor is ignored). In almost all cases A would be the entire set
    • Set B: the program dependencies we want/need; main packages (for example github.com/cespare/reflex). It doesn't matter why we need such a program dependency.
  • Our Go dependencies, the set D, is a set of Go packages. Specifically it is the transitive dependencies of the union A ⋃ B, or more formally D = transitiveDepsOf(A ⋃ B)
  • The set D_d is the subset of D composed of our program depedencies and the packages in D that are directly imported by the set of packages A
  • Vendored packages are the subset V of D which are then found within d/vendor/ and "controlled" by a tool like dep
  • Let’s define the set O such that O = D - V. The set O is by definition found elsewhere within our environment and is not relevant for this discussion
  • Let’s assume that the subset of D that are standard library packages are generally not vendored (note this does not preclude this being required/supportable). I.e. for simplicity we’ll ignore our standard library dependencies for now
  • Let’s define C as the set of packages such that C = go list ./vendor/... when run from d/ (note this definition is intentionally based on the files that are on disk, not the contents of manifest.json/lock.json)
  • Let’s then define E as the set of packages such that E = C - V, where E is then the packages that happen to exist within the subtree rooted at d/vendor/ but that we have no dependency upon
  • Each package in the set C belongs to a repository. The set R is the set of repositories that are the containers of the set of packages C. Every package in C belongs to exactly one repository in R.
  • Let’s define the function packagesIn(repo) such that it gives the set of packages contained in the repository repo
  • At any point in time there exists a possibly empty set Q of repositories that can be pruned from beneath d/vendor/, where { r ∈ Q | packagesIn(r) ⊆ E }, i.e. we have no dependencies on the packages contained within those repositories
  • lock.json describes how we got from the original repos at specific commits to what we have on disk
  • manifest.json describes some constraints/other that describe how we derived lock.json
  • A discrepancy between lock.json and what we have on disk beneath d/vendor/:
    • can only be determined by re-fetching the contents described by lock.json and seeing whether the resulting directory trees are identical or not
    • should be detected during CI to ensure consistency for example
    • can arise for a number of reasons (not trying to list them all here)
  • A discrepancy between manifest.json and lock.json:
    • ... is beyond what I’m trying to cover here... this is the “versioning” problem

Given the above and this comment:

“It is an error if the manifest specifies a package that isn't ever imported (much like unused variable/import in Go).”

To my mind it’s entirely within our gift whether we want to describe such a scenario as an error or not. We can detect the existence of such cases at any point in time, describe them in terms of these sets; how we choose to handle them is up to us/ the dep tool.

I think the above attempt to formalise the language then helps when defining the required user commands we want.

Notice I use specific commits intentionally here because I’m not interested for now in the version resolution step, I’m simply interested in lock.json

# initialise a directory as the root of a vendoring process
#
dep init

# acknowledging https://github.com/golang/dep/issues/303 because I agree
# it should be possible to add to the set C without having to have declared the 
# dependency in code first

# "control" a package we will import/have imported. This could (/should?) prompt 
# us to ask whether we want to specify:
#
# * whether the commits of the transitive dependencies of github.com/pkg/errors 
#   should also be controlled by dep
# * assuming yes, whether we want to specify the commit
#
dep add github.com/pkg/errors@ff09b135c25aae272398c51a07235b90a75aa4f0

# "control" a main package we will use at some point - notice the -p flag used
# to indicate it’s a program dependency, i.e.
#
dep add -p github.com/myitcv/gopherjs/cmd/reactGen@648bf1950ae20f0ad155e4faabc276252c7f3ff9

# let’s verify that the tree d/vendor/ is identical to the state that results from replaying 
# lock.json
#
dep verify

# let’s prune the repositories beneath d/vendor/ that are in the set Q (note that dep verify
# would have given us details of the superfluous repositories)
#
dep prune

A user is interested in various of the sets listed above, functions of those sets and manipulating these sets. The commands the dep exposes can be defined in terms of those sets.

Would such an approach help to pin down the different areas/concerns and make precise what we're trying to do with various commands, the separate areas of concern?

It would also help in formalising feedback.

@sdboyer
Copy link
Member

sdboyer commented Apr 10, 2017

Yes, this thread definitely introduces some new terms and injects some additional uncertainty into the discussion. I do have formal properties like these in my mind with respect to dep's operation, but have not written them out. I'd say that doing so would be helpful.

Some notes, some corrections:

At any point in time, it’s the code I have on disk that gives me a reproducible build or otherwise

"code I have on disk" is a bit ambiguous - it could refer to, at minimum, either what's in a project's vendor or in GOPATH more generally. Either way, though, from the point of view of dep, this is not what provides reproducibility - lock.json does. Committing your vendor dir also provides reproducibility, but that's outside dep's scope of responsibility.

Version information, constraints etc are to my mind an additional level on top... a mapping onto repository concepts that ultimately resolve to the commits we see in the lock file.

If we're being precise here, it's better not to think of these as mappings onto "repositories", but a more generalized concept of a "source" that can produce N code trees, each of which is a version of a "project."

But yes, the basic idea that what's in the manifest is "an additional level on top" is generally sound; the manifest is not strictly necessary for running a solve. That fact was largely what motivated @nathany to open this issue.

  • Set A: ...
  • Set B: ...

Yep, this is how they're defined in gps. The *rootdata.externalImportList() call performs what amounts to the the union op.

Vendored packages are the subset V of D which are then found within d/vendor/ and "controlled" by a tool like dep
Let’s define the set O such that O = D - V. The set O is by definition found elsewhere within our environment and is not relevant for this discussion

dep explicitly manages all of vendor, never a subset of it. By this definition, when system state is good, V = D, and O is the empty set. So, it's relevant to the extent that we consider the situation where O is non-empty to be indicative that vendor is out of sync with the lock. It's generally not a harmful situation, but it's still incorrect.

At any point in time there exists a possibly empty set Q of repositories that can be pruned from beneath d/vendor/, where { r ∈ Q | packagesIn(r) ⊆ E }, i.e. we have no dependencies on the packages contained within those repositories

Some disambiguation here is needed - there is no valid exit state of dep in which Q will be non-empty. The only pruning work that may need to be done is the removal of E, which may be non-empty. (That's the focus of #120)

A discrepancy between lock.json and what we have on disk beneath d/vendor/:

  • can only be determined by re-fetching the contents described by lock.json and seeing whether the resulting directory trees are identical or not

If the lock contains a hash digest of the code tree, then one-sided verification is possible. This is a useful property, and much of the focus of #121.

  • can arise for a number of reasons (not trying to list them all here)

This is a bit ambiguous, but may be a key point of departure: while such things can happen, none of them are an output state of dep. That is, there is no current dep command that can exit 0 and allow such a disk inconsistency to exist. We may bend a little on this for particular circumstances, but this is a cornerstone of dep's design. To that end:

A user is interested in various of the sets listed above, functions of those sets and manipulating these sets. The commands the dep exposes can be defined in terms of those sets.

Can and should! But it should also be emphasized that dep is expressly not a tool for arbitrary manipulation of these sets. Being opinionated is how dep can guarantee that, for example, Q and O are always empty, if the system is in a good state, and that it is always possible for dep to drive towards that good state during every command execution. That's a useful guarantee, because it is categorically unuseful for Q and O to be non-empty.

# "control" a package we will import/have imported. This could (/should?) prompt 
# us to ask whether we want to specify:
#
# * whether the commits of the transitive dependencies of github.com/pkg/errors 
#   should also be controlled by dep
# * assuming yes, whether we want to specify the commit
#

Having a notion of "control" like this would change dep from an opt-out model (via ignored packages in the manifest) to an opt-in model, where you have to explicitly specify the deps the tool should be managing. This would dramatically increase the amount of maintenance work the user needs to perform, and I'm not seeing the upside to it. I'm sorry, but this is a non-starter.

dep add github.com/pkg/errors@ff09b135c25aae272398c51a07235b90a75aa4f0

Now, there may be some benefit to an add command, separate from ensure, to make the specific case in #303 behave as users expect - but it does not entail a new property like "control" to achieve it. The required property exists in the manifest, and we can even make it work without explicitly writing that property to the manifest on disk - #36.

# "control" a main package we will use at some point - notice the -p flag used
# to indicate it’s a program dependency, i.e.
#
dep add -p github.com/myitcv/gopherjs/cmd/reactGen@648bf1950ae20f0ad155e4faabc276252c7f3ff9

Would such an approach help to pin down the different areas/concerns and make precise what we're trying to do with various commands, the separate areas of concern?

Yes, while these are the terms I think in, I do think it would benefit broader discussion to formalize them. As evidenced by my response here, though, I think we need to be cautious that in defining these terms, we do not get distracted by a sense that we have to give the user access to manipulate all the sets we can define.

@myitcv
Copy link
Member

myitcv commented Apr 11, 2017

@sdboyer - I very much appreciate the comprehensive reply, thank you!

I'd say that doing so would be helpful.

To that end I've moved my straw man to https://docs.google.com/document/d/1xlo-fKGt5oJq8z8yQSTPShN__obQFmdBzbBS6puUhXg/edit?usp=sharing to help try and move the formalisation forward. Feel free to take a copy if you want to take control of the doc (after all the dep team are running the process), my attempt is just a starting point.

"code I have on disk" is a bit ambiguous - it could refer to, at minimum, either what's in a project's vendor or in GOPATH more generally. Either way, though, from the point of view of dep, this is not what provides reproducibility - lock.json does. Committing your vendor dir also provides reproducibility, but that's outside dep's scope of responsibility.

The reason behind this choice of phrase is that the compiler or indeed any other Go tool consumes .go files, not lock.json. It's the compiler that ultimately gives me a reproducible build or otherwise (again, ignoring any other tools); from the compiler's perspective, the contents of lock.json are irrelevant, it's the .go files on disk at any point in time that matter.

dep and lock.json (plus remote repositories, assuming I don't have a local copy) might allow me to get my filesystem into a state such that the code on disk allows for a reproducible build, but dep (and lock.json by extension) do not by themselves give me a reproducible build. Indeed as you rightly pointed out, there are other means by which I can ensure a reproducible build: committing all my code being the easiest.

I agree it's beyond the remit of dep as a tool to be forcing the user to do anything in this regard; best practice would almost certainly dictate they commit the vendor/ directory, lock.json and manifest.json but that's just advice. And this doesn't even consider other tools/environment that might be required for a reproducible build... again that's beyond the remit of dep.

If we're being precise here, it's better not to think of these as mappings onto "repositories", but a more generalized concept of a "source" that can produce N code trees, each of which is a version of a "project."

Is there a reason to deviate from the term repository (as seen here); is it insufficient/inadequate in some way?

I'm not entirely sure what you mean by:

can produce N code trees, each of which is a version of a "project"

I've added to the document what I understand as the definition of "repository", "commit" and "version". Do you agree with the base definitions?

Vendored packages are the subset V of D which are then found within d/vendor/ and "controlled" by a tool like dep
Let’s define the set O such that O = D - V. The set O is by definition found elsewhere within our environment and is not relevant for this discussion

dep explicitly manages all of vendor, never a subset of it.

Agreed, that seems like a sensible decision.

By this definition, when system state is good, V = D, and O is the empty set.

By requiring that V = D for a "good state" you're saying it is an error from dep's perspective for me to satisfy the compiler by any means other than the contents of vendor/, correct? Is that the intention?

So, it's relevant to the extent that we consider the situation where O is non-empty to be indicative that vendor is out of sync with the lock. It's generally not a harmful situation, but it's still incorrect.

O can be non-empty in the case a dependency from either A or B is satisfied by a location other than vendor/. The contents of vendor/ can still be entirely consistent with lock.json in such a situation.

Just to be clear (and I've clarified this in the Google Doc), any dependency in O could be satisfied by a GOPATH entry or not be satisfied at all (i.e. constitute a (temporary) compile error)

But again, back to the question above, is this an error? From the compiler's perspective it is not.

From a user's perspective this may not be what was intended... but again dep status (or similar) could alert them to this and an appropriate follow up command could "move" a package from O into V by some means.

Some disambiguation here is needed - there is no valid exit state of dep in which Q will be non-empty. The only pruning work that may need to be done is the removal of E, which may be non-empty. (That's the focus of #120)

So to confirm I have understood the current position correctly:

  • dep will automatically prune/delete repositories in Q every time it runs
  • dep will (following Vendor pruning #120) automatically prune/delete packages in E every time it runs

I think this is where I see dep being too aggressive by default. I fully acknowledge this may be a decision that's been taken already, so would appreciate some pointers as to why the following was discounted as an option or where I'm missing something:

  • dep should have a prune command that prunes the repositories in Q. I don't want dep removing things for me except where I've explicitly requested it to do so... that would be too surprising; related to Installing not yet used packages #303, but also if I temporarily remove an import (as part of a refactor for example), happen to re-run dep whilst the import is not there, then re-add the import... then in certain situations dep will have removed a source... and I'll need to make a network request again? Or is a local cache being used?
  • dep should not by default do anything with the set E to my mind. If you want to provide an option to prune (or indeed another command) that removes the packages E then it should be opt in. Reason being, as a human I know that the repository golang.org/x/tools contains a whole load of packages that I'm going to use. Do I have to declare all of those imports up front? Or do I have to repeatedly re-run dep ensure as I go?

In any case, do we need to make this decision on pruning now? Can we not have a v1 that doesn't prune? Then take lessons from there?

If the lock contains a hash digest of the code tree, then one-sided verification is possible.

Excellent point.

can arise for a number of reasons (not trying to list them all here)

This is a bit ambiguous, but may be a key point of departure: while such things can happen, none of them are an output state of dep. That is, there is no current dep command that can exit 0 and allow such a disk inconsistency to exist. We may bend a little on this for particular circumstances, but this is a cornerstone of dep's design.

It was intentionally non-specific 👍 But I agree with your point: that if you discount anything other than dep (i.e. users, other tools) touching the contents of d/vendor/ then dep can ensure that lock.json is consistent with the files on disk. I've clarified this in the Google Doc.

But it should also be emphasized that dep is expressly not a tool for arbitrary manipulation of these sets.

Indeed. Just to be clear, the intention behind trying to formalise the language (clarified at the top of the Google Doc) was to make it easier to be precise about certain situations in either discussion, specification, feedback etc. The interface provided to the user via dep's commands is (potentially) another matter entirely.

... That's a useful guarantee, because it is categorically unuseful for Q and O to be non-empty.

As I'm sure has been discussed before, it's one thing telling the user when the situation occurs (and explaining how to correct it), it's quite another to automatically remove/prune by default. This is just reiterating my point from above though.

Having a notion of "control" like this would change dep from an opt-out model (via ignored packages in the manifest) to an opt-in model, where you have to explicitly specify the deps the tool should be managing. This would dramatically increase the amount of maintenance work the user needs to perform, and I'm not seeing the upside to it. I'm sorry, but this is a non-starter.

Except that these packages belong to repositories and it's the repository you're opting into. They are the ones referenced in the lock file in terms of their location and a commit.

Now, there may be some benefit to an add command, separate from ensure to make the specific case in #303 behave as users expect

Unless I've misunderstood things, the case brought up by #303 is a direct result of the decision to have dep automatically prune Q and E. For the reasons I outlined earlier, but also the case of #303, I'm not convinced by this choice of default behaviour.

As evidenced by my response here, though, I think we need to be cautious that in defining these terms, we do not get distracted by a sense that we have to give the user access to manipulate all the sets we can define.

Agreed (per my comments earlier in this response). No intention to distract here, just trying to help pin down the language/implementation.

@sdboyer
Copy link
Member

sdboyer commented Apr 11, 2017

Is there a reason to deviate from the term repository (as seen here); is it insufficient/inadequate in some way?

Yes. Introducing the notion of a "project" - a tree of code - has far-reaching effects, one of which is to make the linked documentation too loose. Lack of definition around the differences between what the different VCS types provide with respect to how a "version" is defined and encoded is already a problem. And, while the go tool, and dep, currently only support working with version control-backed sources, that is not guaranteed to be the case long term - e.g. #175.

The sooner we move to and define a more abstracted notion of a 'source', and get rigorous in terms of how it relates to import paths, the better.

I've added to the document what I understand as the definition of "repository", "commit" and "version". Do you agree with the base definitions?

They're missing a couple things (s/commit/revision/; revisions and branches/versions have a relation; differentiation between semver and non-semver versions) but I don't have the bandwidth to write up precise definitions right now.

dep will automatically prune/delete repositories in Q every time it runs

Yes.

dep will (following #120) automatically prune/delete packages in E every time it runs

If you look through the discussion there, you'll see I'm very hesitant about doing any pruning automatically beyond removing nested vendor dirs (as that's required for correctness). If dep takes direct responsibility for pruning at all, it'll be through a standalone command - #322.

if I temporarily remove an import (as part of a refactor for example), happen to re-run dep whilst the import is not there, then re-add the import... then in certain situations dep will have removed a source...

Yes, it will. And it'll re-add that dependency when the import graph indicates it should.

It's far more costly to invent new forms of state management that allow a user to mark a removal as "temporary" (what does that even mean? how is it different than a package being required? (it's not)) than it is to make the tool work in a straightforward way with simpler inputs.

and I'll need to make a network request again? Or is a local cache being used?

Usually yes to the network request (though for other reasons, and that could be eliminated); yes, a local cache is used.

Except that these packages belong to repositories and it's the repository you're opting into.

In what way is it the repository you're opting into? The opt-in is to packages contained within a repository. If you don't need packages from a source, you don't need the source.

For the reasons I outlined earlier, but also the case of #303, I'm not convinced by this choice of default behaviour.

Let me be more pointed in directing you to a particular comment, rather than just the issue: #36 (comment). It's possible for us to support something like the add behavior in the current paradigm, via crafty use of required. It's not possible, as far as I've been able to think, to avoid the absurdities and/or constant housekeeping that fall out from taking another approach; see this writeup.

krisnova pushed a commit to krisnova/dep that referenced this issue Apr 21, 2017
Move LockDiff from dep into gps
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

No branches or pull requests

6 participants