
feat: Exec arbitrary container commands for chain #173

Merged 45 commits into main on Jun 27, 2022

Conversation

@DavidNix (Contributor) commented Jun 24, 2022

Description, Motivation, and Context

This finishes about 90% of #158. The remaining 10% is exposing the HomeDir and container hostnames, which are often required for commands.

A job container is a short-lived container that runs a command and exits. I still think it's possible to exec into an already running container vs. running one-off containers. The only edge case is needing to run a chain command BEFORE you start the chain. Locally and on CI creating/destroying these job containers is quite performant.

This is an attempt to DRY up some of the docker code and tame it by abstracting it into our own package dockerutil. The docker config and order of operations is delicate.

It also fixes the problem of a non-zero status code omitting useful error info.

Known Limitations, Trade-offs, Tech Debt

  • Chain needs to expose more docker information like host and home dir. See this example.
  • Bloats the ibc.Chain interface a little more.
  • Provides abstraction but does not use that abstraction for long lived containers yet.
  • Cleaning up via DockerSetup is still more brittle than I'd like because the test name is used as the label value. You have to ensure you use that name consistently, which is not obvious, and the test name could change.
  • Relayer probably needs exec as well.

StartWithGenesisFile(testName string, ctx context.Context, home string, pool *dockertest.Pool, networkID string, genesisFilePath string) error
// Exec runs an arbitrary command using Chain's docker environment.
// "env" are environment variables in the format "MY_ENV_VAR=value"
Exec(ctx context.Context, cmd []string, env []string) (stdout, stderr []byte, err error)
@DavidNix (Author) commented Jun 24, 2022:

FWIW, I think we need to break this interface up into separate categories of chains.

Such as:

BaseChain // probably includes IBCTransfer
InterchainAccountChain
InterchainSecurityChain
QueryChain // IBC queries
SmartContractChain
ExportableChain // export and start from saved state
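A rough Go sketch of what that split could look like. The interface names follow the list above; the method signatures, stubChain, and supportsExport are illustrative assumptions, not code from the PR.

```go
package main

import (
	"context"
	"fmt"
)

// BaseChain is the minimal capability every chain has.
type BaseChain interface {
	Exec(ctx context.Context, cmd []string, env []string) (stdout, stderr []byte, err error)
}

// ExportableChain adds export/start-from-saved-state, per the list above.
type ExportableChain interface {
	BaseChain
	ExportState(ctx context.Context, height int64) (string, error)
}

// stubChain implements only BaseChain.
type stubChain struct{}

func (stubChain) Exec(ctx context.Context, cmd, env []string) ([]byte, []byte, error) {
	return nil, nil, nil
}

// supportsExport shows how callers discover optional capabilities via
// a type assertion instead of one fat interface.
func supportsExport(c BaseChain) bool {
	_, ok := c.(ExportableChain)
	return ok
}

func main() {
	fmt.Println(supportsExport(stubChain{})) // false
}
```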

@DavidNix DavidNix marked this pull request as ready for review June 24, 2022 23:55
@DavidNix DavidNix requested a review from a team as a code owner June 24, 2022 23:55
@DavidNix DavidNix requested review from agouin and mark-rushakoff and removed request for a team June 24, 2022 23:55
@DavidNix DavidNix changed the title feat: Exec arbitrary container commands feat: Exec arbitrary container commands for chain Jun 24, 2022
@@ -678,11 +682,6 @@ func (tn *ChainNode) CreateNodeContainer(ctx context.Context) error {
return nil
}

func (tn *ChainNode) StopContainer(ctx context.Context) error {
DavidNix (Author):

Never called.

Member:

Was also used for the juno halt recovery to test taking down nodes, writing a new genesis file, then bringing them back up. It will be nice for one-off tests to still have this ability.

)

// Image is a docker image.
type Image struct {
@DavidNix (Author) commented Jun 27, 2022:

This Image type and Container are the meat of this PR.

@chatton chatton mentioned this pull request Jun 27, 2022
@mark-rushakoff (Member) left a comment:

I've left a handful of comments, but nothing blocking merge -- any of the more involved details can be a separate change later.

This updated structure should make it easier to isolate the cause of the containers being stuck in created state, if it does continue to happen.

}

// at this point stdout should look like this:
// interchain_account_address: cosmos1p76n3mnanllea4d3av0v0e42tjj03cae06xq8fwn9at587rqp23qvxsv0j
// we split the string at the : and then just grab the address before returning.
parts := strings.SplitN(stdout, ":", 2)
parts := strings.SplitN(string(stdout), ":", 2)
Member:

I see it wasn't doing it before, but this should still check len(parts) before indexing into the slice.
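The suggested guard might look like this. parseICAAddress is a hypothetical helper name, and the sample input is illustrative; the check on len(parts) is the substance of the review comment.

```go
package main

import (
	"fmt"
	"strings"
)

// parseICAAddress adds the len(parts) check the review asks for before
// indexing into the SplitN result.
func parseICAAddress(stdout []byte) (string, error) {
	parts := strings.SplitN(string(stdout), ":", 2)
	if len(parts) != 2 {
		return "", fmt.Errorf("malformed address output: %q", stdout)
	}
	return strings.TrimSpace(parts[1]), nil
}

func main() {
	// Hypothetical output shape; the real output is shown in the diff above.
	addr, err := parseICAAddress([]byte("interchain_account_address: cosmos1abc"))
	fmt.Println(addr, err) // cosmos1abc <nil>
}
```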

var err error
txResp, err = authTx.QueryTx(c.getFullNode().CliContext(), txHash)
return err
})
Member:

It looks like retry.Do defaults to 10 attempts, with a 100ms delay between each. Is that appropriate here, or is there a better arbitrary count/delay we can use?

DavidNix (Author):

It's completely arbitrary. The package's defaults sounded good to me. It may make sense to tie it to the estimated block creation time.

@@ -388,15 +389,15 @@ func (c *PenumbraChain) start(testName string, ctx context.Context, genesisFileP
eg, egCtx = errgroup.WithContext(ctx)
for _, n := range c.PenumbraNodes {
n := n
fmt.Printf("{%s} => starting container...\n", n.TendermintNode.Name())
c.log.Info("Staring tendermint container", zap.String("container", n.TendermintNode.Name()))
Member:

s/Staring/Starting/ 👀
And once more below on L400.

// start a chain with a provided genesis file. Will override validators for first 2/3 of voting power
StartWithGenesisFile(testName string, ctx context.Context, home string, pool *dockertest.Pool, networkID string, genesisFilePath string) error
// Exec runs an arbitrary command using Chain's docker environment.
// "env" are environment variables in the format "MY_ENV_VAR=value"
Member:

It would help me as a reader if this comment clarified whether this was a new Docker container as a one-off command (which I think it is), as opposed to a docker exec in a long-lived container.

DavidNix (Author):

It could be either. Right now, it's always a one-off command. I'll clarify.

}

func (image *Image) wrapErr(err error) error {
return fmt.Errorf("image %s:%s: %w", image.repository, image.tag, err)
Member:

I'm not sure if this is a case where it would be better to have an ImageError struct that can capture more fields, such as the command. Easy enough to update later if the need arises.


if exitCode != 0 {
out := strings.Join([]string{stdoutBuf.String(), stderrBuf.String()}, " ")
return nil, nil, fmt.Errorf("exit code %d: %s", exitCode, out)
Member:

In my previous experience with writing wrappers around the Docker API, and with wrappers around external processes, this is the case where we particularly want a structured error so that a caller can easily inspect the exit code, stdout, or stderr. We generally expect all invoked commands to exit 0, but sooner or later there will be one that may have a non-zero exit that we need to handle gracefully.

It could be

type ExecError struct {
  Stdout, Stderr string
  Code int
}

But since we have stdout and stderr as part of the return signature already, it could just be type NonZeroExitError{Code int} and we could keep returning stdout and stderr for that case.

DavidNix (Author):

I like it!

notRunning *docker.ContainerNotRunning
)

err := client.StopContainerWithContext(c.container.ID, 0, ctx)
Member:

StopContainerWithContext stops a container, killing it after the given timeout (in seconds). The context can be used to cancel the stop container request.

This sounds like we would still want a non-zero timeout.

Comment on lines +69 to +77
nets, _ := pool.Client.ListNetworks()
for _, n := range nets {
for k, v := range n.Labels {
if k == CleanupLabel && v == testName {
_ = pool.Client.RemoveNetwork(n.ID)
break
}
}
}
Member:

I'm pretty sure this can be

pool.Client.PruneNetworks(docker.PruneNetworksOptions{
  Filters: map[string][]string{
    "label": CleanupLabel + "=" + testName,
  },
})

For the instability we have been seeing on M1 Mac machines, anything we can do to reduce the number of API calls seems like a win.

DavidNix (Author):

Yup, this was copy/pasted from the original implementation in the ibctest root level package. All that nesting didn't sit well with me either.

DavidNix (Author):

Great suggestion, btw!

// dockerCleanup will clean up Docker containers, networks, and the other various config files generated in testing
func dockerCleanup(testName string, pool *dockertest.Pool) func() {
return func() {
cont, _ := pool.Client.ListContainers(docker.ListContainersOptions{All: true})
Member:

This should set the filters too like the suggestion for PruneNetworks.

There is a PruneContainers method that only acts on stopped containers, so that isn't quite a good fit here -- unless this changes to ListContainers with label filter and All=false, to call StopContainer on each running one; then a final PruneContainers since all matching the filter should be stopped.

In either case, it should also be possible to concurrently stop containers. But I don't know how often we match more than a single container for cleanup.

}

// dockerCleanup will clean up Docker containers, networks, and the other various config files generated in testing
func dockerCleanup(testName string, pool *dockertest.Pool) func() {
Member:

I'm mildly concerned that we may be hiding some important information by ignoring all the returned errors in here. WDYT about changing the signature to accept a *testing.T and simply calling t.Logf for non-nil errors, so that a user has more information about failed cleanups?

DavidNix (Author):

I agree.

What also doesn't sit well is that the testName can easily change if a developer renames a test, e.g. TestChainFoo(t *testing.T) -> TestChainBar(t *testing.T).

You also have to be sure to pass the testName correctly down the call stack.

The cleanup strategy is delicate and may be missing some key parts. It's largely copy/pasted from the original ibctest location.

@DavidNix (Author) commented:
Thanks for the review Mark! To unblock you experimenting with the Docker SDK, I'll merge once tests pass.

FYI, the "flush packets" conformance test seems to have gotten flaky.

I'll address your comments in a quick followup PR.

@DavidNix DavidNix merged commit 143b5ad into main Jun 27, 2022
@DavidNix DavidNix deleted the nix/feat/chain-exec branch June 27, 2022 17:57
@agouin (Member) left a comment:

Awesome cleanup 👏
