Allow gateway exec-ing into a failed solve with an exec op #1732

hinshun · 2020-10-13T01:33:31Z

- Adds an Evaluate bool to SolveRequest. When set, the ResultProxy is evaluated explicitly; results are otherwise lazy (until returned, or until a call like ref.ReadFile is made), so exec errors would only show up in the main solve request instead of inside runGatewayCB.
- Adds SolveError as a TypedErrorProto. Clients can use errors.As to extract it.
- Adds GetDefault() (worker.Worker, error) to frontend.WorkerInfos, which plumbs down the default worker.
- Refactors frontend.WorkerInfos to worker.Infos to avoid a cyclic dependency.
- Refactors gateway.NewContainer to share the same code as solver/llbsolver/ops/exec.go; previously about 90% was duplicated. The existing gateway.NewContainer also didn't allow scratch mounts, whereas solver/llbsolver/ops/exec.go did.

Example client code: https://gist.github.com/hinshun/e72f509121e022bc81ebba03fc2851c6

When a pb.ExecOp or pb.FileOp fails to solve, its protobuf definition, along with its solved inputs and outputs, is wrapped in a typed error and passed to the llbBridgeForwarder, which holds the temporary IDs for the references.
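
As a rough illustration of holding temporary IDs for references, here is a minimal sketch. The `registry` and `result` types, `newID`, and `lookup` are hypothetical stand-ins, and this `registerResultIDs` only borrows its name from the diff in this PR; it is not the forwarder's actual implementation.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"sync"
)

// result is a hypothetical stand-in for a solved input or output reference.
type result struct{ name string }

// registry is a hypothetical stand-in for the state that maps temporary IDs
// to the results of a failed solve.
type registry struct {
	mu      sync.Mutex
	results map[string]*result
}

func newRegistry() *registry {
	return &registry{results: make(map[string]*result)}
}

// newID returns 16 random bytes encoded as a hex string.
func newID() (string, error) {
	b := make([]byte, 16)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return hex.EncodeToString(b), nil
}

// registerResultIDs stores each non-nil result under a fresh ID and returns
// the IDs in argument order. A nil result keeps an empty ID so that positions
// stay aligned with the caller's slice.
func (r *registry) registerResultIDs(results ...*result) ([]string, error) {
	r.mu.Lock()
	defer r.mu.Unlock()
	ids := make([]string, len(results))
	for i, res := range results {
		if res == nil {
			continue
		}
		id, err := newID()
		if err != nil {
			return nil, err
		}
		r.results[id] = res
		ids[i] = id
	}
	return ids, nil
}

// lookup resolves an ID that a client sends back later.
func (r *registry) lookup(id string) (*result, bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	res, ok := r.results[id]
	return res, ok
}

func main() {
	reg := newRegistry()
	ids, err := reg.registerResultIDs(&result{"rootfs"}, nil, &result{"cache"})
	if err != nil {
		panic(err)
	}
	for i, id := range ids {
		_, ok := reg.lookup(id)
		fmt.Printf("slot %d registered: %v\n", i, ok)
	}
}
```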

The inputs and outputs are assigned temporary UUIDs, which are sent back in a TypedErrorProto. The client can unmarshal it to reconstruct the arguments needed to gateway exec back into a container process for the failed solve.

frontend/gateway/forwarder/forward.go

frontend/gateway/pb/gateway.proto

client/build_test.go

tonistiigi · 2020-10-20T01:31:22Z

frontend/gateway/forwarder/forward.go

 			if !ok {
 				return nil, errors.Errorf("unexpected Ref type: %T", m.Ref)
 			}
+
+			res, err := refProxy.Result(ctx)


wonder if this should be parallelized?

Done here: dfaf613

solver/llbsolver/ops/exec.go

tonistiigi · 2020-10-31T20:40:14Z

@hinshun Is this ready?

hinshun · 2020-11-01T05:44:03Z

@tonistiigi Yes, though I did punt on the errors before Run for ExecOp. I don't think it's worth the complexity for SolveError. I think they can also derive the op from the VertexError if they want metadata.

tonistiigi · 2020-11-01T05:45:28Z

VertexError only contains digest though, not the proto definition.

hinshun · 2020-11-01T05:47:13Z

Yeah, I understand clients don't always have access to the proto definition to look up the op via digest. Maybe introduce an OpError in the future?

tonistiigi · 2020-11-01T06:06:19Z

What is the difference between OpError and SolveError? I think returning the input refs would be useful for OpError as well.

hinshun · 2020-11-05T00:57:26Z

@tonistiigi I rebased and added a commit that returns exec errors for errors returned before the executor run now: 1bbc7c2

coryb · 2020-11-06T02:16:21Z

frontend/gateway/forwarder/forward.go

+		var err error
+		inputIDs, err = c.registerResultIDs(ee.Inputs...)
+		if err != nil {
+			return err


In the edge case where registerResultIDs returns the "unexpected type for result" error, we will lose the original solve error here. I wonder if these should be errors.Wrap(solveErr, err.Error()) instead, to make sure the solve error is preserved? I am not sure if we would ever practically get the "unexpected type for result" error though.

> I am not sure if we would ever practically get the "unexpected type for result" error though.

I think that would be an implementation error in BuildKit?

I'm okay with the errors.Wrap approach, but I'm unsure whether it will be useful if incomplete. You'll need a lot of safeguards on the client side for incomplete solve errors.

coryb · 2020-11-06T02:30:14Z

frontend/gateway/gateway.go

+		var err error
+		inputIDs, err = lbf.registerResultIDs(ee.Inputs...)
+		if err != nil {
+			return err


same issue here, we will lose solveErr

solver/llbsolver/solver.go

tonistiigi · 2020-11-06T04:56:04Z

client/build_test.go

+			mounts = append(mounts, client.Mount{
+				Selector:  mnt.Selector,
+				Dest:      mnt.Dest,
+				ResultID:  se.Solve.OutputIDs[mnt.Output],


Shouldn't this be mnt.Input ?

Both work, they could use mnt.Input but they would be the unmodified inputs. For this test, I was checking that I could access the mutated mounts rather than the original inputs.

In the LLB we have: echo %s > output && fail, so this test was to exec into a mount where echo %s > output succeeded but the fail failed the execop overall.

The logic should be that both se.Solve.OutputIDs and se.Solve.InputIDs are indexed by mnt.Input. mnt.Output is a link to the next vertex. A mount does not need to define an output in order for its error state to be debuggable.

tonistiigi · 2020-11-06T05:33:21Z

client/build_test.go

+		op := se.Solve.Op
+		opExec, ok := se.Solve.Op.Op.(*pb.Op_Exec)
+		require.True(t, ok)
+


Check the count of items in OutputIDs and InputIDs, as well as some meta properties like Args.

tonistiigi · 2020-11-06T05:42:24Z

solver/llbsolver/ops/file.go

+					}
+				}
+
+				err = errdefs.WithExecError(err, inputRes, outputRes)


Was quite confused about fileop returning ExecError. Maybe a better name if we want to keep this structure.

Exec as opposed to CacheMap, not the ExecOp. I'm not sure what's a better name... maybe ResultError?

Ok, I guess we don't need to block on that. It probably makes sense to rename the Exec function in Op as well to avoid confusion but that can be follow-up.

tonistiigi · 2020-11-12T02:10:04Z

@hinshun Any update? Mainly on the ref indexing, which I think needs changes as we discussed in Slack. If we can get this over the line we could do v0.8rc.

hinshun · 2020-11-12T04:40:35Z

@tonistiigi Sorry, I have been on my week-long on-call at work. I don't expect the ref indexing to take too long to implement, but I think I'll need to clarify some of the edge cases with you tomorrow.

Signed-off-by: Edgar Lee <edgarl@netflix.com>

- Plumb default worker by adding GetDefault() to frontend.WorkerInfos
- To avoid cyclic dependency, refactor frontend.WorkerInfos to worker.Infos
- Refactor gateway.NewContainer to share code with llbsolver/ops/exec.go

Signed-off-by: Edgar Lee <edgarl@netflix.com>

Signed-off-by: Edgar Lee <edgarl@netflix.com>

tonistiigi · 2020-11-15T07:58:58Z

frontend/gateway/container.go

-				}
+		// if mount is based on input validate and load it
+		if m.Input != opspb.Empty {
+			if int(m.Input) > len(refs) {


Fixed now, but note this is an existing bug:

buildkit/solver/llbsolver/ops/exec.go

Line 242 in 9369d53

if int(m.Input) > len(inputs) {
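
The thread does not spell out the bug. Assuming it is the off-by-one in the comparison (an index equal to the slice length passes the `>` check but is still out of range), a bounds check could be sketched as follows; `inputAt` is a hypothetical helper, not code from this PR.

```go
package main

import "fmt"

// inputAt returns the input at idx, rejecting any index outside the slice.
// Using >= rather than > also rejects idx == len(inputs).
func inputAt(inputs []string, idx int) (string, error) {
	if idx < 0 || idx >= len(inputs) {
		return "", fmt.Errorf("missing input %d", idx)
	}
	return inputs[idx], nil
}

func main() {
	inputs := []string{"rootfs"}
	for _, idx := range []int{0, 1} {
		v, err := inputAt(inputs, idx)
		fmt.Println(idx, v, err)
	}
}
```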

tonistiigi · 2020-11-15T08:04:22Z

frontend/gateway/container.go

+			}
+			mountable = active
+			p.Actives = append(p.Actives, active)
+			if m.Output != opspb.SkipOutput && ref != nil {


I wonder if having output in here is even supported. Maybe just error?

At least client protects against this

buildkit/client/llb/exec.go

Line 307 in c700580

if !m.noOutput && !m.readonly && m.cacheID == "" && !m.tmpfs {

Hmm I haven't changed this conditional, it's the same before refactoring here:

buildkit/solver/llbsolver/ops/exec.go

Line 294 in 9369d53

if m.Output != pb.SkipOutput && ref != nil {

Ok, yeah I missed that input is cloned to be output here. Still a weird case but no changes needed for this PR.

frontend/gateway/forwarder/forward.go

tonistiigi · 2020-11-15T08:31:33Z

solver/llbsolver/ops/exec.go

-			if mountable == nil {
-				continue
+			execOutputs := make([]solver.Result, len(e.op.Mounts))
+			for i, res := range results {


I still don't quite understand how this works. We are mapping only the mounts that have an output index set, instead of all the mounts (which would then point to either the mutable or, if readonly, the same input). I think the test works because something always fills in the output index there, even if it is unused.

Addressed here: 1240dd7

tonistiigi · 2020-11-15T08:32:14Z

client/build_test.go

+			"rootfs and readwrite mount",
+			llb.Image("busybox:latest").Run(
+				llb.Shlexf(`sh -c "echo %s > /data && echo %s > /rw/data && fail"`, id, id),
+				llb.AddMount("/rw", llb.Scratch().File(llb.Mkfile("foo", 0700, []byte(id)))),


If you would set llb.ForceNoOutput() in here I think this test would not work.

Confirmed it doesn't work, thanks. The MountID[1] becomes scratch when setting llb.ForceNoOutput.

Signed-off-by: Edgar Lee <edgarl@netflix.com>

hinshun · 2020-11-16T21:25:27Z

PR comments have been addressed.
Note that I changed when we wrap the OpError: 1240dd7#diff-2fc2e99eea0899fac8a14da1878d67aa8520bed09be4de598226d4d29450d6c8L201-L215

I noticed edge cases where the digest from VertexError doesn't correspond to any op inside the *pb.Definition. I took the safe route of wrapping the op in solver/jobs.go, where I know it will be the correct one, via the cast s.st.vtx.Sys().(*pb.Op). This lets all the tests pass.
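
Since Sys() may not hold an op in every setting (unit tests, for example), the cast is safest in its comma-ok form. A minimal sketch, with hypothetical `op`, `vertex`, `realVertex`, `testVertex`, and `opOf` standing in for the real types:

```go
package main

import "fmt"

// op is a hypothetical stand-in for the op's proto definition.
type op struct{ name string }

// vertex is a hypothetical stand-in for a solver vertex.
type vertex interface {
	Sys() interface{}
}

type realVertex struct{ op *op }

func (v realVertex) Sys() interface{} { return v.op }

// testVertex mimics a unit-test vertex that carries no op.
type testVertex struct{}

func (testVertex) Sys() interface{} { return nil }

// opOf returns the op if the vertex carries a non-nil one. The comma-ok
// assertion never panics, even when Sys() returns nil or another type.
func opOf(v vertex) (*op, bool) {
	o, ok := v.Sys().(*op)
	if !ok || o == nil {
		return nil, false
	}
	return o, true
}

func main() {
	for _, v := range []vertex{realVertex{&op{"exec"}}, testVertex{}} {
		o, ok := opOf(v)
		fmt.Println(ok, o)
	}
}
```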

Signed-off-by: Edgar Lee <edgarl@netflix.com>

coryb

Looks great! Sorry I missed the scratch mounting via NewContainer, thanks for fixing that.

hinshun force-pushed the exec-error branch 3 times, most recently from ee65f89 to d3b7dc6 Compare October 15, 2020 22:58

hinshun requested a review from coryb October 15, 2020 23:08

hinshun force-pushed the exec-error branch from d3b7dc6 to f384d74 Compare October 15, 2020 23:35

hinshun force-pushed the exec-error branch from f384d74 to d54f864 Compare October 16, 2020 19:43

hinshun marked this pull request as ready for review October 16, 2020 19:49

hinshun force-pushed the exec-error branch 4 times, most recently from 8219604 to ccf746b Compare October 16, 2020 22:06

hinshun requested a review from tonistiigi October 16, 2020 23:26

coryb reviewed Oct 18, 2020

tonistiigi added this to the v0.8.0 milestone Oct 19, 2020

hinshun force-pushed the exec-error branch from ccf746b to 758038b Compare October 19, 2020 22:00

tonistiigi reviewed Oct 20, 2020

hinshun mentioned this pull request Oct 26, 2020

gateway exec: set correct default path for OS #1748

Closed

hinshun force-pushed the exec-error branch from a799187 to 4618f9d Compare October 28, 2020 21:07

hinshun requested a review from tonistiigi November 1, 2020 05:44

hinshun force-pushed the exec-error branch from 6755763 to 1bbc7c2 Compare November 5, 2020 00:56

coryb reviewed Nov 6, 2020

tonistiigi reviewed Nov 6, 2020

hinshun added 7 commits November 13, 2020 22:05

Plumb op metadata to recreate failed ops with gateway exec

7ce58c3

Signed-off-by: Edgar Lee <edgarl@netflix.com>

Fix lint and unit tests for fileopsolver

2d23d0c

Signed-off-by: Edgar Lee <edgarl@netflix.com>

Refactor to file action indexed outputs

7e1dc9b

Signed-off-by: Edgar Lee <edgarl@netflix.com>

Improve file action client test by adding test matrix

bbc3f6d

Signed-off-by: Edgar Lee <edgarl@netflix.com>

Return exec error for errors returned before executor

a459eb4

Signed-off-by: Edgar Lee <edgarl@netflix.com>

Rename OutputIDs to MountIDs

c33bcd6

Signed-off-by: Edgar Lee <edgarl@netflix.com>

hinshun force-pushed the exec-error branch from 4a32663 to c33bcd6 Compare November 14, 2020 06:05

hinshun requested review from tonistiigi and coryb November 14, 2020 06:09

hinshun added 3 commits November 13, 2020 22:14

Parallelize unlazying ref proxy in the gateway forwarder

dfaf613

Signed-off-by: Edgar Lee <edgarl@netflix.com>

Fix container release not capturing closure of loop variable

4c0ca17

Signed-off-by: Edgar Lee <edgarl@netflix.com>

Fix ExecError.EachRef invoking callback with possibly nil solver.Results

3ba6cd7

Signed-off-by: Edgar Lee <edgarl@netflix.com>

tonistiigi reviewed Nov 15, 2020

Return committed readonly inputs and actives in exec error in MountIDs

1240dd7

Signed-off-by: Edgar Lee <edgarl@netflix.com>

hinshun requested a review from tonistiigi November 16, 2020 21:31

Fix optional cast for WithOp when unit testing

fa8a02c

Signed-off-by: Edgar Lee <edgarl@netflix.com>

hinshun force-pushed the exec-error branch from d5a6132 to fa8a02c Compare November 16, 2020 21:37

tonistiigi approved these changes Nov 17, 2020

coryb approved these changes Nov 17, 2020

tonistiigi merged commit 5e5f527 into moby:master Nov 17, 2020

sipsma mentioned this pull request Apr 12, 2022

Terminate an action dagger/dagger#1249

Closed
