Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow gateway exec-ing into a failed solve with an exec op #1732

Merged
merged 12 commits into from
Nov 17, 2020

Conversation

hinshun
Copy link
Collaborator

@hinshun hinshun commented Oct 13, 2020

For: #1472 (comment)

  • Adds Evaluate bool to the SolveRequest, this explicitly evaluates ResultProxy because they are currently lazy results (until returned, or a call like ref.ReadFile is called). Otherwise, exec errors will only show up in the main solve request instead of inside runGatewayCB.
  • Adds SolveError as a TypedErrorProto. Clients can use errors.As to extract it.
  • Adds GetDefault() (worker.Worker, nil) to frontend.WorkerInfos, that plumbs down the default worker.
  • Refactors frontend.WorkerInfos to worker.Infos because of the cyclic dependency
  • Refactors gateway.NewContainer to share the exact same code as solver/llbsolver/ops/exec.go, previously was 90% duplicated. The existing version of gateway.NewContainer also didn't allow for scratch mounts whereas solver/llbsolver/ops/exec.go did.

Example client code: https://gist.github.com/hinshun/e72f509121e022bc81ebba03fc2851c6

When an pb.ExecOp or pb.FileOp fails to be solved, its protobuf definition, along with its solved inputs / outputs are wrapped in a typed error to the lbfBridgeForwarder which holds the temporary IDs for references.

The inputs and outputs have temporary uuids generated and sent back over as a TypedErrorProto which allows it to unmarshaled on the client side to reconstruct the arguments necessary to gateway exec back into a container process for the failed solve.

@hinshun hinshun force-pushed the exec-error branch 3 times, most recently from ee65f89 to d3b7dc6 Compare October 15, 2020 22:58
@hinshun hinshun requested a review from coryb October 15, 2020 23:08
@hinshun

This comment has been minimized.

@hinshun hinshun marked this pull request as ready for review October 16, 2020 19:49
@hinshun hinshun force-pushed the exec-error branch 4 times, most recently from 8219604 to ccf746b Compare October 16, 2020 22:06
@hinshun hinshun requested a review from tonistiigi October 16, 2020 23:26
@tonistiigi tonistiigi added this to the v0.8.0 milestone Oct 19, 2020
frontend/gateway/pb/gateway.proto Show resolved Hide resolved
client/build_test.go Outdated Show resolved Hide resolved
client/build_test.go Outdated Show resolved Hide resolved
if !ok {
return nil, errors.Errorf("unexpected Ref type: %T", m.Ref)
}

res, err := refProxy.Result(ctx)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wonder if this should be parallelized?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

^

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done here: dfaf613

solver/llbsolver/ops/exec.go Outdated Show resolved Hide resolved
@tonistiigi
Copy link
Member

@hinshun Is this ready?

@hinshun
Copy link
Collaborator Author

hinshun commented Nov 1, 2020

@tonistiigi Yes, though I did punt on the errors before Run for ExecOp. I don't think its worth the complexity for SolveError. I think they can also derive the op from the VertexError if they wanted metadata.

@hinshun hinshun requested a review from tonistiigi November 1, 2020 05:44
@tonistiigi
Copy link
Member

VertexError only contains digest though, not the proto definition.

@hinshun
Copy link
Collaborator Author

hinshun commented Nov 1, 2020

Yeah, I understand clients doesn't always have access to the proto definition to lookup the op via digest. Maybe introduce an OpError in the future?

@tonistiigi
Copy link
Member

What is the difference between OpError and SolveError? I think returning the input refs would be useful for OpError as well.

@hinshun
Copy link
Collaborator Author

hinshun commented Nov 5, 2020

@tonistiigi I rebased and added a commit that returns exec errors for errors returned before the executor run now: 1bbc7c2

var err error
inputIDs, err = c.registerResultIDs(ee.Inputs...)
if err != nil {
return err
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the edge case that the registerResultIDs returns the unexpected type for result error then we will lose the original solve error here. I wonder if these should be errors.Wrap(solveErr, err.Error()) instead to make sure the solve error is preserved? I am not sure if we would ever practically get the unexpected type for result error though.

Copy link
Collaborator Author

@hinshun hinshun Nov 6, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if we would ever practically get the unexpected type for result error though.

I think that would be an implementation error in Buildkit?

I'm okay with the errors.Wrap approach, but unsure if it'll be useful if incomplete? You'll need a lot of safeguards in the client side for incomplete solve errors.

var err error
inputIDs, err = lbf.registerResultIDs(ee.Inputs...)
if err != nil {
return err
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same issue here, we will lose solveErr

solver/llbsolver/solver.go Show resolved Hide resolved
mounts = append(mounts, client.Mount{
Selector: mnt.Selector,
Dest: mnt.Dest,
ResultID: se.Solve.OutputIDs[mnt.Output],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be mnt.Input ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both work, they could use mnt.Input but they would be the unmodified inputs. For this test, I was checking that I could access the mutated mounts rather than the original inputs.

In the LLB we have: echo %s > output && fail, so this test was to exec into a mount where echo %s > output succeeded but the fail failed the execop overall.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic should be that both se.Solve.OutputIDs and se.Solve.InputIDs are indexed by mnt.Input. mnt.Output is a link to the next vertex. There is no need that output must be defined on a mount in order to be able to debug its error state.

op := se.Solve.Op
opExec, ok := se.Solve.Op.Op.(*pb.Op_Exec)
require.True(t, ok)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check the count of items in OutputIDs InputIDs, as well as some meta properties like Args

}
}

err = errdefs.WithExecError(err, inputRes, outputRes)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was quite confused about fileop returning ExecError. Maybe a better name if we want to keep this structure.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exec as opposed to CacheMap, not the ExecOp. I'm not sure what's a better name... maybe ResultError?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I guess we don't need to block on that. It probably makes sense to rename the Exec function in Op as well to avoid confusion but that can be follow-up.

@tonistiigi
Copy link
Member

@hinshun Any update? Mainly on the ref indexing that I think needs changes as we discussed in slack. If we can get this over the line we could do v0.8rc

@hinshun
Copy link
Collaborator Author

hinshun commented Nov 12, 2020

@tonistiigi Sorry have been on my week long oncall at work. I don't expect the ref indexing take too long to implement, but I think I'll need to clarify some of the edge cases with you tomorrow.

Signed-off-by: Edgar Lee <edgarl@netflix.com>
Signed-off-by: Edgar Lee <edgarl@netflix.com>
Signed-off-by: Edgar Lee <edgarl@netflix.com>
Signed-off-by: Edgar Lee <edgarl@netflix.com>
Signed-off-by: Edgar Lee <edgarl@netflix.com>
- Plumb default worker by adding GetDefault() to frontend.WorkerInfos
- To avoid cyclic dependency, refactor frontend.WorkerInfos to worker.Infos
- Refactor gateway.NewContainer to share code with llbsolver/ops/exec.go

Signed-off-by: Edgar Lee <edgarl@netflix.com>
Signed-off-by: Edgar Lee <edgarl@netflix.com>
Signed-off-by: Edgar Lee <edgarl@netflix.com>
Signed-off-by: Edgar Lee <edgarl@netflix.com>
}
// if mount is based on input validate and load it
if m.Input != opspb.Empty {
if int(m.Input) > len(refs) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

>=

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed now, but note this is an existing bug:

if int(m.Input) > len(inputs) {

}
mountable = active
p.Actives = append(p.Actives, active)
if m.Output != opspb.SkipOutput && ref != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if having output in here is even supported. Maybe just error?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least client protects against this

if !m.noOutput && !m.readonly && m.cacheID == "" && !m.tmpfs {

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm I haven't changed this conditional, it's the same before refactoring here:

if m.Output != pb.SkipOutput && ref != nil {

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, yeah I missed that input is cloned to be output here. Still a weird case but no changes needed for this PR.

frontend/gateway/forwarder/forward.go Show resolved Hide resolved
if mountable == nil {
continue
execOutputs := make([]solver.Result, len(e.op.Mounts))
for i, res := range results {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don't quite understand how this works. We are mapping only mounts that had set output index, instead of all the mounts(that would then point to either mutable or same input if readonly). I think the test works because something always fills up output index there, even if it is unused.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed here: 1240dd7

"rootfs and readwrite mount",
llb.Image("busybox:latest").Run(
llb.Shlexf(`sh -c "echo %s > /data && echo %s > /rw/data && fail"`, id, id),
llb.AddMount("/rw", llb.Scratch().File(llb.Mkfile("foo", 0700, []byte(id)))),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you would set llb.ForceNoOutput() in here I think this test would not work.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirmed it doesn't work, thanks. The MountID[1] becomes scratch when setting llb.ForceNoOutput.

@hinshun
Copy link
Collaborator Author

hinshun commented Nov 16, 2020

PR comments have been addressed.
Note that I changed when we wrap the OpError: 1240dd7#diff-2fc2e99eea0899fac8a14da1878d67aa8520bed09be4de598226d4d29450d6c8L201-L215

I noticed there were edge cases of some kind that the digest from VertexError doesn't correspond to any op inside the *pb.Definition. I took the safe route of wrapping the op where I know it will have the correct one in solver/jobs.go, by doing the cast s.st.vtx.Sys().(*pb.Op) and allowing me to pass all the tests.

@hinshun hinshun requested a review from tonistiigi November 16, 2020 21:31
Signed-off-by: Edgar Lee <edgarl@netflix.com>
Copy link
Collaborator

@coryb coryb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Sorry I missed the scratch mounting via NewContainer, thanks for fixing that.

@tonistiigi tonistiigi merged commit 5e5f527 into moby:master Nov 17, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants