roachtest: move node-kill operation to pkill/pgrep-based kill approach
For some reason, `StopServiceForVirtualCluster` fails with this error on
drt clusters:

```
20:23:41 node_kill.go:51: operation status: killing node 1  with signal 15
20:23:41 cluster.go:2148: stoping virtual cluster
20:23:41 operation_impl.go:128: operation failure #1: no service for virtual cluster ""
```

The debug message has a bug: the virtual cluster is set to "system", but it
seems like the service discovery process isn't able to determine the cockroach
process based on DNS settings in the drt project. This change makes the
node-kill operation more DNS-agnostic by looking for the cockroach process.

Epic: none

Release note: None
itsbilal committed May 2, 2024
1 parent 00dd860 commit 9db386a
Showing 2 changed files with 23 additions and 7 deletions.
1 change: 0 additions & 1 deletion pkg/cmd/roachtest/operations/BUILD.bazel
@@ -21,7 +21,6 @@ go_library(
         "//pkg/cmd/roachtest/roachtestflags",
         "//pkg/cmd/roachtest/roachtestutil",
         "//pkg/roachprod",
-        "//pkg/roachprod/install",
         "//pkg/util/randutil",
     ],
 )
29 changes: 23 additions & 6 deletions pkg/cmd/roachtest/operations/node_kill.go
@@ -13,14 +13,14 @@ package operations
 import (
 	"context"
 	"fmt"
+	"strings"
 	"time"

 	"github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster"
 	"github.com/cockroachdb/cockroach/pkg/cmd/roachtest/operation"
 	"github.com/cockroachdb/cockroach/pkg/cmd/roachtest/option"
 	"github.com/cockroachdb/cockroach/pkg/cmd/roachtest/registry"
 	"github.com/cockroachdb/cockroach/pkg/roachprod"
-	"github.com/cockroachdb/cockroach/pkg/roachprod/install"
 	"github.com/cockroachdb/cockroach/pkg/util/randutil"
 )

@@ -79,11 +79,28 @@ func runNodeKill(
 	}
 	o.Status(fmt.Sprintf("killing node %s with signal %d", node.NodeIDsString(), signal))

-	stopOpts := option.StopVirtualClusterOpts(install.SystemInterfaceName, node)
-	stopOpts.RoachprodOpts.Sig = signal
-	stopOpts.RoachprodOpts.Wait = true
-	stopOpts.RoachprodOpts.MaxWait = 300 // 5 minutes
-	c.StopServiceForVirtualCluster(ctx, o.L(), stopOpts)
+	err := c.RunE(ctx, option.WithNodes(node), "pkill", fmt.Sprintf("-%d", signal), "-f", "cockroach\\ start")
+	if err != nil {
+		o.Fatal(err)
+	}
+	o.Status(fmt.Sprintf("sent signal %d to node %s, waiting for process to exit", signal, node.NodeIDsString()))
+
+	for {
+		if err := ctx.Err(); err != nil {
+			o.Fatal(err)
+		}
+		err := c.RunE(ctx, option.WithNodes(node), "pgrep", "-f", "cockroach\\ start")
+		if err != nil {
+			if strings.Contains(err.Error(), "status 1") {
+				// pgrep returns error code 1 if no processes are found.
+				break
+			}
+			o.Fatal(err)
+		}
+
+		time.Sleep(1 * time.Second)
+	}
+
 	o.Status(fmt.Sprintf("killed node %s with signal %d", node.NodeIDsString(), signal))

 	return &cleanupNodeKill{
