feat: Exec arbitrary container commands for chain #173
Conversation
```go
StartWithGenesisFile(testName string, ctx context.Context, home string, pool *dockertest.Pool, networkID string, genesisFilePath string) error

// Exec runs an arbitrary command using Chain's docker environment.
// "env" are environment variables in the format "MY_ENV_VAR=value"
Exec(ctx context.Context, cmd []string, env []string) (stdout, stderr []byte, err error)
```
FWIW, I think we need to break this interface up into separate categories of chains. Such as:

```
BaseChain               // probably includes IBCTransfer
InterchainAccountChain
InterchainSecurityChain
QueryChain              // IBC queries
SmartContractChain
ExportableChain         // export and start from saved state
```
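As a rough illustration of how the split could compose (purely a sketch: only the names come from the list above; the package, method sets, and simplified signatures are assumptions):

```go
package ibc

import "context"

// BaseChain would be the minimal surface every chain implements.
type BaseChain interface {
	// Exec runs an arbitrary command using the chain's docker environment.
	Exec(ctx context.Context, cmd []string, env []string) (stdout, stderr []byte, err error)
}

// ExportableChain would add export/start-from-saved-state on top of BaseChain.
// Signature simplified here; the real method also takes a *dockertest.Pool.
type ExportableChain interface {
	BaseChain
	StartWithGenesisFile(ctx context.Context, testName, home, networkID, genesisFilePath string) error
}
```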
```diff
@@ -678,11 +682,6 @@ func (tn *ChainNode) CreateNodeContainer(ctx context.Context) error {
 	return nil
 }

-func (tn *ChainNode) StopContainer(ctx context.Context) error {
```
Never called.
This was also used for the juno halt recovery, to test taking down nodes, writing a new genesis file, then bringing them back up. It will be nice for one-off tests to still have the ability to do this.
```go
)

// Image is a docker image.
type Image struct {
```
This `Image` type and `Container` are the meat of this PR.
I've left a handful of comments, but nothing blocking merge -- any of the more involved details can be a separate change later.
This updated structure should make it easier to isolate the cause of the containers being stuck in created state, if it does continue to happen.
```diff
 }

 // at this point stdout should look like this:
 // interchain_account_address: cosmos1p76n3mnanllea4d3av0v0e42tjj03cae06xq8fwn9at587rqp23qvxsv0j
 // we split the string at the : and then just grab the address before returning.
-parts := strings.SplitN(stdout, ":", 2)
+parts := strings.SplitN(string(stdout), ":", 2)
```
I see it wasn't doing it before, but this should still check `len(parts)` before indexing into the slice.
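A minimal sketch of that guard (the helper name and error wording are hypothetical; only the `SplitN` call comes from the diff, and `fmt` and `strings` are assumed to be imported):

```go
func parseICAAddress(stdout []byte) (string, error) {
	parts := strings.SplitN(string(stdout), ":", 2)
	if len(parts) != 2 {
		return "", fmt.Errorf("unexpected interchain account query output: %q", stdout)
	}
	return strings.TrimSpace(parts[1]), nil
}
```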
```go
	var err error
	txResp, err = authTx.QueryTx(c.getFullNode().CliContext(), txHash)
	return err
})
```
It looks like `retry.Do` defaults to 10 attempts, with a 100ms delay between each. Is that appropriate here, or is there a better arbitrary count/delay we can use?
It's completely arbitrary. The package's defaults sounded good to me. It may make sense to tie it to the estimated block creation time.
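For reference, a sketch of pinning the policy explicitly instead of relying on the defaults, assuming `github.com/avast/retry-go` (whose defaults match the 10 × 100ms described above); the attempt count and delay here are placeholders:

```go
err := retry.Do(
	func() error {
		var err error
		txResp, err = authTx.QueryTx(c.getFullNode().CliContext(), txHash)
		return err
	},
	retry.Context(ctx),
	retry.Attempts(5),
	retry.Delay(2*time.Second), // placeholder for the estimated block creation time
	retry.DelayType(retry.FixedDelay),
)
```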
```diff
@@ -388,15 +389,15 @@ func (c *PenumbraChain) start(testName string, ctx context.Context, genesisFileP
 	eg, egCtx = errgroup.WithContext(ctx)
 	for _, n := range c.PenumbraNodes {
 		n := n
-		fmt.Printf("{%s} => starting container...\n", n.TendermintNode.Name())
+		c.log.Info("Staring tendermint container", zap.String("container", n.TendermintNode.Name()))
```
s/Staring/Starting/
👀
And once more below on L400.
```go
// start a chain with a provided genesis file. Will override validators for first 2/3 of voting power
StartWithGenesisFile(testName string, ctx context.Context, home string, pool *dockertest.Pool, networkID string, genesisFilePath string) error

// Exec runs an arbitrary command using Chain's docker environment.
// "env" are environment variables in the format "MY_ENV_VAR=value"
```
It would help me as a reader if this comment clarified whether this runs a new Docker container as a one-off command (which I think it is), as opposed to a `docker exec` in a long-lived container.
It could be either. Right now, it's always a one-off command. I'll clarify.
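Hypothetical wording for the clarified comment (not the merged text):

```go
// Exec runs an arbitrary command in a new short-lived container created from
// the chain's docker image; it is not a `docker exec` into a long-lived node
// container. "env" are environment variables in the format "MY_ENV_VAR=value".
Exec(ctx context.Context, cmd []string, env []string) (stdout, stderr []byte, err error)
```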
```go
}

func (image *Image) wrapErr(err error) error {
	return fmt.Errorf("image %s:%s: %w", image.repository, image.tag, err)
```
I'm not sure if this is a case where it would be better to have an `ImageError` struct that can capture more fields, such as the command. Easy enough to update later if the need arises.
```go
if exitCode != 0 {
	out := strings.Join([]string{stdoutBuf.String(), stderrBuf.String()}, " ")
	return nil, nil, fmt.Errorf("exit code %d: %s", exitCode, out)
```
In my previous experience with writing wrappers around the Docker API, and with wrappers around external processes, this is the case where we particularly want a structured error so that a caller can easily inspect the exit code, stdout, or stderr. We generally expect all invoked commands to exit 0, but sooner or later there will be one that may have a non-zero exit that we need to handle gracefully.
It could be

```go
type ExecError struct {
	Stdout, Stderr string
	Code           int
}
```

But since we have stdout and stderr as part of the return signature already, it could just be `type NonZeroExitError struct{ Code int }`, and we could keep returning stdout and stderr for that case.
I like it!
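A sketch of how that could look, including `errors.As` on the caller side; only the name `NonZeroExitError` comes from the thread, and the in-scope `chain`, `ctx`, and `cmd` (plus the `errors`, `fmt`, and `log` imports) are assumed:

```go
// NonZeroExitError reports a command that exited with a non-zero code.
type NonZeroExitError struct {
	Code int
}

func (e *NonZeroExitError) Error() string {
	return fmt.Sprintf("container exited with code %d", e.Code)
}

// Caller side: stdout and stderr still come back via the return signature.
stdout, stderr, err := chain.Exec(ctx, cmd, nil)
var exitErr *NonZeroExitError
if errors.As(err, &exitErr) {
	log.Printf("command exited %d; stdout: %s, stderr: %s", exitErr.Code, stdout, stderr)
}
```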
```go
	notRunning *docker.ContainerNotRunning
)

err := client.StopContainerWithContext(c.container.ID, 0, ctx)
```
> StopContainerWithContext stops a container, killing it after the given timeout (in seconds). The context can be used to cancel the stop container request.
This sounds like we would still want a non-zero timeout.
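e.g., a sketch with an arbitrary grace period (the 10 is a placeholder):

```go
// Give the container up to 10 seconds to stop gracefully before it is killed.
err := client.StopContainerWithContext(c.container.ID, 10, ctx)
```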
```go
nets, _ := pool.Client.ListNetworks()
for _, n := range nets {
	for k, v := range n.Labels {
		if k == CleanupLabel && v == testName {
			_ = pool.Client.RemoveNetwork(n.ID)
			break
		}
	}
}
```
I'm pretty sure this can be

```go
pool.Client.PruneNetworks(docker.PruneNetworksOptions{
	Filters: map[string][]string{
		"label": {CleanupLabel + "=" + testName},
	},
})
```
For the instability we have been seeing on M1 Mac machines, anything we can do to reduce the number of API calls seems like a win.
Yup, this was copy/pasted from the original implementation in the ibctest root level package. All that nesting didn't sit well with me either.
Great suggestion, btw!
```go
// dockerCleanup will clean up Docker containers, networks, and the other various config files generated in testing
func dockerCleanup(testName string, pool *dockertest.Pool) func() {
	return func() {
		cont, _ := pool.Client.ListContainers(docker.ListContainersOptions{All: true})
```
This should set the filters too, like the suggestion for `PruneNetworks`.

There is a `PruneContainers` method that only acts on stopped containers, so that isn't quite a good fit here -- unless this changes to `ListContainers` with a label filter and `All=false`, to call `StopContainer` on each running one, then a final `PruneContainers` since all matching the filter should be stopped.

In either case, it should also be possible to concurrently stop containers. But I don't know how often we match more than a single container for cleanup.
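A sketch of that list/stop/prune flow under those assumptions (go-dockerclient API; error handling elided, grace period arbitrary):

```go
filters := map[string][]string{"label": {CleanupLabel + "=" + testName}}

// All defaults to false, so this lists only running containers matching the label.
running, _ := pool.Client.ListContainers(docker.ListContainersOptions{Filters: filters})

var wg sync.WaitGroup
for _, c := range running {
	wg.Add(1)
	go func(id string) {
		defer wg.Done()
		_ = pool.Client.StopContainer(id, 10) // 10s grace period, arbitrary
	}(c.ID)
}
wg.Wait()

// Everything matching the label should be stopped now; prune them all at once.
_, _ = pool.Client.PruneContainers(docker.PruneContainersOptions{Filters: filters})
```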
```go
}

// dockerCleanup will clean up Docker containers, networks, and the other various config files generated in testing
func dockerCleanup(testName string, pool *dockertest.Pool) func() {
```
I'm mildly concerned that we may be hiding some important information by ignoring all the returned errors in here. WDYT about changing the signature to accept a `*testing.T` and simply calling `t.Logf` for non-nil errors, so that a user has more information about failed cleanups?
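A hypothetical reshaping along those lines (names and flow assumed), which would also sidestep the testName-threading concern raised in the reply below by deriving the label from `t.Name()`:

```go
func dockerCleanup(t *testing.T, pool *dockertest.Pool) func() {
	return func() {
		filters := map[string][]string{"label": {CleanupLabel + "=" + t.Name()}}
		conts, err := pool.Client.ListContainers(docker.ListContainersOptions{
			All:     true,
			Filters: filters,
		})
		if err != nil {
			t.Logf("cleanup: listing containers: %v", err)
			return
		}
		for _, c := range conts {
			if err := pool.Client.StopContainer(c.ID, 10); err != nil {
				t.Logf("cleanup: stopping container %s: %v", c.ID, err)
			}
		}
	}
}
```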
I agree.

What also doesn't sit well is the fact that the testName can easily change if the developer decides to refactor the name of a test, e.g. `TestChainFoo(t *testing.T)` -> `TestChainBar(t *testing.T)`. You also have to be sure to pass the testName correctly down the call stack.

The cleanup strategy is delicate and may be missing some key parts. It's largely copy/pasted from the original ibctest location.
Thanks for the review, Mark! To unblock you experimenting with the Docker SDK, I'll merge once tests pass. FYI, the "flush packets" conformance test seems to have gotten flaky. I'll address your comments in a quick follow-up PR.
Awesome cleanup 👏
Description, Motivation, and Context
This finishes ~90% of #158. The remaining 10% is exposing the HomeDir and container hostnames, which are often required for commands.
A job container is a short-lived container that runs a command and exits. I still think it's possible to exec into an already-running container vs. running one-off containers. The only edge case is needing to run a chain command BEFORE you start the chain. Locally and on CI, creating/destroying these job containers is quite performant.
This is an attempt to DRY up some of the docker code and tame it by abstracting it into our own package, `dockerutil`. The docker config and order of operations are delicate. Also, this fixes the problem of a non-zero status code omitting useful error info.
Known Limitations, Trade-offs, Tech Debt