
feat: Exec arbitrary container commands for chain #173

Merged 45 commits into main on Jun 27, 2022

Conversation

@DavidNix (Contributor) commented Jun 24, 2022

Description, Motivation, and Context

This finishes about 90% of #158. The remaining 10% is exposing the HomeDir and container hostnames, which are often required for commands.

A job container is a short-lived container that runs a command and exits. I still think it's possible to exec into an already running container vs. running one-off containers. The only edge case is needing to run a chain command BEFORE you start the chain. Locally and on CI creating/destroying these job containers is quite performant.

This is an attempt to DRY up some of the docker code and tame it by abstracting it into our own package dockerutil. The docker config and order of operations is delicate.

It also fixes the problem of a non-zero status code omitting useful error info.

Known Limitations, Trade-offs, Tech Debt

  • Chain needs to expose more docker information like host and home dir. See this example.
  • Bloats the ibc.Chain interface a little more.
  • Provides abstraction but does not use that abstraction for long lived containers yet.
  • Cleaning up via DockerSetup is still more brittle than I'd like because the test name is used as the label value. You have to ensure you use that name consistently, which is not obvious, and the test name could change.
  • Relayer probably needs exec as well.

StartWithGenesisFile(testName string, ctx context.Context, home string, pool *dockertest.Pool, networkID string, genesisFilePath string) error
// Exec runs an arbitrary command using Chain's docker environment.
// "env" are environment variables in the format "MY_ENV_VAR=value"
Exec(ctx context.Context, cmd []string, env []string) (stdout, stderr []byte, err error)
@DavidNix (Author) commented Jun 24, 2022:

FWIW, I think we need to break this interface up into separate categories of chains.

Such as:

BaseChain // probably includes IBCTransfer
InterchainAccountChain
InterchainSecurityChain
QueryChain // IBC queries
SmartContractChain
ExportableChain // export and start from saved state
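A rough Go sketch of what that split could look like. The interface names follow the list above; the method signatures, stubChain, and supportsExport are illustrative assumptions, not code from the PR.

```go
package main

import (
	"context"
	"fmt"
)

// BaseChain is the minimal capability every chain has.
type BaseChain interface {
	Exec(ctx context.Context, cmd []string, env []string) (stdout, stderr []byte, err error)
}

// ExportableChain adds export/start-from-saved-state, per the list above.
type ExportableChain interface {
	BaseChain
	ExportState(ctx context.Context, height int64) (string, error)
}

// stubChain implements only BaseChain.
type stubChain struct{}

func (stubChain) Exec(ctx context.Context, cmd, env []string) ([]byte, []byte, error) {
	return nil, nil, nil
}

// supportsExport shows how callers discover optional capabilities via
// a type assertion instead of one fat interface.
func supportsExport(c BaseChain) bool {
	_, ok := c.(ExportableChain)
	return ok
}

func main() {
	fmt.Println(supportsExport(stubChain{})) // false
}
```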

@DavidNix DavidNix marked this pull request as ready for review June 24, 2022 23:55
@DavidNix DavidNix requested a review from a team as a code owner June 24, 2022 23:55
@DavidNix DavidNix requested review from agouin and mark-rushakoff and removed request for a team June 24, 2022 23:55
@DavidNix DavidNix changed the title feat: Exec arbitrary container commands feat: Exec arbitrary container commands for chain Jun 24, 2022
@@ -678,11 +682,6 @@ func (tn *ChainNode) CreateNodeContainer(ctx context.Context) error {
return nil
}

func (tn *ChainNode) StopContainer(ctx context.Context) error {
DavidNix (Author):

Never called.

Member:

Was also used for the juno halt recovery to test taking down nodes, writing a new genesis file, then bringing them back up. It will be nice for one-off tests to still have this ability.

)

// Image is a docker image.
type Image struct {
@DavidNix (Author) commented Jun 27, 2022:

This Image type and Container are the meat of this PR.

@chatton chatton mentioned this pull request Jun 27, 2022
@mark-rushakoff (Member) left a comment:

I've left a handful of comments, but nothing blocking merge -- any of the more involved details can be a separate change later.

This updated structure should make it easier to isolate the cause of the containers being stuck in created state, if it does continue to happen.

}

// at this point stdout should look like this:
// interchain_account_address: cosmos1p76n3mnanllea4d3av0v0e42tjj03cae06xq8fwn9at587rqp23qvxsv0j
// we split the string at the : and then just grab the address before returning.
parts := strings.SplitN(stdout, ":", 2)
parts := strings.SplitN(string(stdout), ":", 2)
Member:

I see it wasn't doing it before, but this should still check len(parts) before indexing into the slice.
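The suggested guard might look like this. parseICAAddress is a hypothetical helper name, and the sample input is illustrative; the check on len(parts) is the substance of the review comment.

```go
package main

import (
	"fmt"
	"strings"
)

// parseICAAddress adds the len(parts) check the review asks for before
// indexing into the SplitN result.
func parseICAAddress(stdout []byte) (string, error) {
	parts := strings.SplitN(string(stdout), ":", 2)
	if len(parts) != 2 {
		return "", fmt.Errorf("malformed address output: %q", stdout)
	}
	return strings.TrimSpace(parts[1]), nil
}

func main() {
	// Hypothetical output shape; the real output is shown in the diff above.
	addr, err := parseICAAddress([]byte("interchain_account_address: cosmos1abc"))
	fmt.Println(addr, err) // cosmos1abc <nil>
}
```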

var err error
txResp, err = authTx.QueryTx(c.getFullNode().CliContext(), txHash)
return err
})
Member:

It looks like retry.Do defaults to 10 attempts, with a 100ms delay between each. Is that appropriate here, or is there a better arbitrary count/delay we can use?

DavidNix (Author):

It's completely arbitrary. The package's defaults sounded good to me. It may make sense to tie it to the estimated block creation time.

@@ -388,15 +389,15 @@ func (c *PenumbraChain) start(testName string, ctx context.Context, genesisFileP
eg, egCtx = errgroup.WithContext(ctx)
for _, n := range c.PenumbraNodes {
n := n
fmt.Printf("{%s} => starting container...\n", n.TendermintNode.Name())
c.log.Info("Staring tendermint container", zap.String("container", n.TendermintNode.Name()))
Member:

s/Staring/Starting/ 👀
And once more below on L400.

// start a chain with a provided genesis file. Will override validators for first 2/3 of voting power
StartWithGenesisFile(testName string, ctx context.Context, home string, pool *dockertest.Pool, networkID string, genesisFilePath string) error
// Exec runs an arbitrary command using Chain's docker environment.
// "env" are environment variables in the format "MY_ENV_VAR=value"
Member:

It would help me as a reader if this comment clarified whether this was a new Docker container as a one-off command (which I think it is), as opposed to a docker exec in a long-lived container.

DavidNix (Author):

It could be either. Right now, it's always a one-off command. I'll clarify.

}

func (image *Image) wrapErr(err error) error {
return fmt.Errorf("image %s:%s: %w", image.repository, image.tag, err)
Member:

I'm not sure if this is a case where it would be better to have an ImageError struct that can capture more fields, such as the command. Easy enough to update later if the need arises.


if exitCode != 0 {
out := strings.Join([]string{stdoutBuf.String(), stderrBuf.String()}, " ")
return nil, nil, fmt.Errorf("exit code %d: %s", exitCode, out)
Member:

In my previous experience with writing wrappers around the Docker API, and with wrappers around external processes, this is the case where we particularly want a structured error so that a caller can easily inspect the exit code, stdout, or stderr. We generally expect all invoked commands to exit 0, but sooner or later there will be one that may have a non-zero exit that we need to handle gracefully.

It could be

type ExecError struct {
  Stdout, Stderr string
  Code int
}

But since we have stdout and stderr as part of the return signature already, it could just be type NonZeroExitError{Code int} and we could keep returning stdout and stderr for that case.

DavidNix (Author):

I like it!

notRunning *docker.ContainerNotRunning
)

err := client.StopContainerWithContext(c.container.ID, 0, ctx)
Member:

StopContainerWithContext stops a container, killing it after the given timeout (in seconds). The context can be used to cancel the stop container request.

This sounds like we would still want a non-zero timeout.

Comment on lines +69 to +77
nets, _ := pool.Client.ListNetworks()
for _, n := range nets {
for k, v := range n.Labels {
if k == CleanupLabel && v == testName {
_ = pool.Client.RemoveNetwork(n.ID)
break
}
}
}
Member:

I'm pretty sure this can be

pool.Client.PruneNetworks(docker.PruneNetworksOptions{
  Filters: map[string][]string{
    "label": CleanupLabel + "=" + testName,
  },
})

For the instability we have been seeing on M1 Mac machines, anything we can do to reduce the number of API calls seems like a win.

DavidNix (Author):

Yup, this was copy/pasted from the original implementation in the ibctest root level package. All that nesting didn't sit well with me either.

DavidNix (Author):

Great suggestion, btw!

// dockerCleanup will clean up Docker containers, networks, and the other various config files generated in testing
func dockerCleanup(testName string, pool *dockertest.Pool) func() {
return func() {
cont, _ := pool.Client.ListContainers(docker.ListContainersOptions{All: true})
Member:

This should set the filters too like the suggestion for PruneNetworks.

There is a PruneContainers method that only acts on stopped containers, so that isn't quite a good fit here -- unless this changes to ListContainers with label filter and All=false, to call StopContainer on each running one; then a final PruneContainers since all matching the filter should be stopped.

In either case, it should also be possible to concurrently stop containers. But I don't know how often we match more than a single container for cleanup.

}

// dockerCleanup will clean up Docker containers, networks, and the other various config files generated in testing
func dockerCleanup(testName string, pool *dockertest.Pool) func() {
Member:

I'm mildly concerned that we may be hiding some important information by ignoring all the returned errors in here. WDYT about changing the signature to accept a *testing.T and simply calling t.Logf for non-nil errors, so that a user has more information about failed cleanups?

DavidNix (Author):

I agree.

What also doesn't sit well is that the testName can easily change if a developer renames a test, e.g. TestChainFoo(t *testing.T) -> TestChainBar(t *testing.T).

You also have to be sure to pass the testName correctly down the call stack.

The cleanup strategy is delicate and may be missing some key parts. It's largely copy/pasted from the original ibctest location.

@DavidNix (Author) commented:
Thanks for the review Mark! To unblock you experimenting with the Docker SDK, I'll merge once tests pass.

FYI, the "flush packets" conformance test seems to have gotten flaky.

I'll address your comments in a quick followup PR.

@DavidNix DavidNix merged commit 143b5ad into main Jun 27, 2022
@DavidNix DavidNix deleted the nix/feat/chain-exec branch June 27, 2022 17:57
@agouin (Member) left a comment:

Awesome cleanup 👏
