Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: ryuk restart/create after first instance has timed out fails #1671

Closed
srenatus opened this issue Sep 22, 2023 · 9 comments · Fixed by #2084
Closed

[Bug]: ryuk restart/create after first instance has timed out fails #1671

srenatus opened this issue Sep 22, 2023 · 9 comments · Fixed by #2084
Labels
bug An issue with the library

Comments

@srenatus
Copy link
Contributor

srenatus commented Sep 22, 2023

Testcontainers version

0.24.1

Using the latest Testcontainers version?

Yes

Host OS

macos, linux

Host arch

x86

Go version

1.21

Docker version

Client:
 Version:           24.0.6
 API version:       1.43
 Go version:        go1.20.7
 Git commit:        ed223bc
 Built:             Mon Sep  4 12:28:49 2023
 OS/Arch:           darwin/amd64
 Context:           orbstack

Server: Docker Engine - Community
 Engine:
  Version:          24.0.6
  API version:      1.43 (minimum version 1.12)
  Go version:       go1.20.7
  Git commit:       1a79695
  Built:            Mon Sep  4 12:32:17 2023
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          v1.7.5
  GitCommit:        fe457eb99ac0e27b3ce638175ef8e68a7d2bc373
 runc:
  Version:          1.1.9
  GitCommit:        82f18fe0e44a59034f3e1f45e475fa5636e539aa
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

Docker info

Client:
 Version:    24.0.6
 Context:    orbstack
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.11.1
    Path:     /Users/stephan/.docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.21.0
    Path:     /Users/stephan/.docker/cli-plugins/docker-compose
  dev: Docker Dev Environments (Docker Inc.)
    Version:  v0.0.5
    Path:     /Users/stephan/.docker/cli-plugins/docker-dev
  extension: Manages Docker extensions (Docker Inc.)
    Version:  v0.2.17
    Path:     /Users/stephan/.docker/cli-plugins/docker-extension
  sbom: View the packaged-based Software Bill Of Materials (SBOM) for an image (Anchore Inc.)
    Version:  0.6.0
    Path:     /Users/stephan/.docker/cli-plugins/docker-sbom
  scan: Docker Scan (Docker Inc.)
    Version:  v0.23.0
    Path:     /Users/stephan/.docker/cli-plugins/docker-scan

Server:
 Containers: 202
  Running: 0
  Paused: 0
  Stopped: 202
 Images: 120
 Server Version: 24.0.6
 Storage Driver: overlay2
  Backing Filesystem: btrfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: fe457eb99ac0e27b3ce638175ef8e68a7d2bc373
 runc version: 82f18fe0e44a59034f3e1f45e475fa5636e539aa
 init version: de40ad0
 Security Options:
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 6.4.16-orbstack-00105-g14094bfeec09
 Operating System: OrbStack
 OSType: linux
 Architecture: x86_64
 CPUs: 16
 Total Memory: 7.754GiB
 Name: orbstack
 ID: 8f3602a6-fb4a-4318-9f3f-22677f7a23df
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
 Product License: Community Engine
 Default Address Pools:
   Base: 192.168.215.0/24, Size: 24
   Base: 192.168.228.0/24, Size: 24
   Base: 192.168.247.0/24, Size: 24
   Base: 192.168.207.0/24, Size: 24
   Base: 192.168.167.0/24, Size: 24
   Base: 192.168.107.0/24, Size: 24
   Base: 192.168.237.0/24, Size: 24
   Base: 192.168.148.0/24, Size: 24
   Base: 192.168.214.0/24, Size: 24
   Base: 192.168.165.0/24, Size: 24
   Base: 192.168.227.0/24, Size: 24
   Base: 192.168.181.0/24, Size: 24
   Base: 192.168.158.0/24, Size: 24
   Base: 192.168.117.0/24, Size: 24
   Base: 192.168.155.0/24, Size: 24
   Base: 192.168.147.0/24, Size: 24
   Base: 192.168.229.0/24, Size: 24
   Base: 192.168.183.0/24, Size: 24
   Base: 192.168.156.0/24, Size: 24
   Base: 192.168.97.0/24, Size: 24
   Base: 192.168.171.0/24, Size: 24
   Base: 192.168.186.0/24, Size: 24
   Base: 192.168.216.0/24, Size: 24
   Base: 192.168.242.0/24, Size: 24
   Base: 192.168.166.0/24, Size: 24
   Base: 192.168.239.0/24, Size: 24
   Base: 192.168.223.0/24, Size: 24
   Base: 192.168.164.0/24, Size: 24
   Base: 192.168.163.0/24, Size: 24
   Base: 192.168.172.0/24, Size: 24
   Base: 172.17.0.0/16, Size: 16
   Base: 172.18.0.0/16, Size: 16
   Base: 172.19.0.0/16, Size: 16
   Base: 172.20.0.0/14, Size: 16
   Base: 172.24.0.0/14, Size: 16
   Base: 172.28.0.0/14, Size: 16

What happened?

With this test definition,

package foo

import (
	"context"
	"os"
	"testing"
	"time"

	"github.com/testcontainers/testcontainers-go/modules/redpanda"
)

func TestA(t *testing.T) {
	ctx := context.Background()
	tc, err := redpanda.RunContainer(ctx)
	if err != nil {
		t.Fatal(err)
	}
	t.Cleanup(func() { tc.Terminate(ctx) })
}

func TestB(t *testing.T) {
	d := os.Getenv("SLEEP")
	if d == "" {
		return
	}
	dur, err := time.ParseDuration(d)
	if err != nil {
		t.Fatal(err)
	}
	time.Sleep(dur)
}

func TestC(t *testing.T) {
	ctx := context.Background()
	tc, err := redpanda.RunContainer(ctx)
	if err != nil {
		t.Fatal(err)
	}
	t.Cleanup(func() { tc.Terminate(ctx) })
}

I noticed that with no sleep (or a sleep below 10 seconds, the default ryuk reconnection timeout), the tests pass. With a sleep of 10s or more, it'll fail. Both runs are in the log output below 👇

Relevant log output

$ SLEEP=1s go test ./foo -v 
=== RUN   TestA
2023/09/22 20:37:56 github.com/testcontainers/testcontainers-go - Connected to docker: 
  Server Version: 24.0.6
  API Version: 1.43
  Operating System: OrbStack
  Total Memory: 7940 MB
  Resolved Docker Host: unix:///var/run/docker.sock
  Resolved Docker Socket Path: /var/run/docker.sock
  Test SessionID: 7f4e4184606722bddeedc93c95afad24affc09b3b01f3cc76db453d060a91753
  Test ProcessID: 5081b87f-894e-4521-b79d-2be908622f0b
2023/09/22 20:37:56 🐳 Creating container for image docker.io/testcontainers/ryuk:0.5.1
2023/09/22 20:37:56 ✅ Container created: ad8cd8e64d90
2023/09/22 20:37:56 🐳 Starting container: ad8cd8e64d90
2023/09/22 20:37:57 ✅ Container started: ad8cd8e64d90
2023/09/22 20:37:57 🚧 Waiting for container id ad8cd8e64d90 image: docker.io/testcontainers/ryuk:0.5.1. Waiting for: &{Port:8080/tcp timeout:<nil> PollInterval:100ms}
2023/09/22 20:37:57 🐳 Creating container for image docker.redpanda.com/redpandadata/redpanda:v23.1.7
2023/09/22 20:37:57 ✅ Container created: 4d24a10613f8
2023/09/22 20:37:57 🐳 Starting container: 4d24a10613f8
2023/09/22 20:37:57 ✅ Container started: 4d24a10613f8
2023/09/22 20:37:58 🐳 Terminating container: 4d24a10613f8
2023/09/22 20:37:58 🚫 Container terminated: 4d24a10613f8
--- PASS: TestA (1.32s)
=== RUN   TestB
--- PASS: TestB (1.00s)
=== RUN   TestC
2023/09/22 20:37:59 🐳 Creating container for image docker.redpanda.com/redpandadata/redpanda:v23.1.7
2023/09/22 20:37:59 ✅ Container created: 992bcf3663ca
2023/09/22 20:37:59 🐳 Starting container: 992bcf3663ca
2023/09/22 20:37:59 ✅ Container started: 992bcf3663ca
2023/09/22 20:38:00 🐳 Terminating container: 992bcf3663ca
2023/09/22 20:38:00 🚫 Container terminated: 992bcf3663ca
--- PASS: TestC (1.00s)
PASS
ok      github.com/myorg/mystuff/pkg/foo      3.498s
$ SLEEP=10s go test ./foo -v
=== RUN   TestA
2023/09/22 20:37:20 github.com/testcontainers/testcontainers-go - Connected to docker: 
  Server Version: 24.0.6
  API Version: 1.43
  Operating System: OrbStack
  Total Memory: 7940 MB
  Resolved Docker Host: unix:///var/run/docker.sock
  Resolved Docker Socket Path: /var/run/docker.sock
  Test SessionID: 63e89548a1f5a097d41c77c6206b6fc9c59c2672e40610c077e57659b70364b5
  Test ProcessID: fe3f072d-3099-4eb0-8e25-4de472a4fb98
2023/09/22 20:37:20 🐳 Creating container for image docker.io/testcontainers/ryuk:0.5.1
2023/09/22 20:37:20 ✅ Container created: 6270d9bba7df
2023/09/22 20:37:20 🐳 Starting container: 6270d9bba7df
2023/09/22 20:37:20 ✅ Container started: 6270d9bba7df
2023/09/22 20:37:20 🚧 Waiting for container id 6270d9bba7df image: docker.io/testcontainers/ryuk:0.5.1. Waiting for: &{Port:8080/tcp timeout:<nil> PollInterval:100ms}
2023/09/22 20:37:20 🐳 Creating container for image docker.redpanda.com/redpandadata/redpanda:v23.1.7
2023/09/22 20:37:20 ✅ Container created: de7650b6eb59
2023/09/22 20:37:20 🐳 Starting container: de7650b6eb59
2023/09/22 20:37:21 ✅ Container started: de7650b6eb59
2023/09/22 20:37:21 🐳 Terminating container: de7650b6eb59
2023/09/22 20:37:21 🚫 Container terminated: de7650b6eb59
--- PASS: TestA (1.33s)
=== RUN   TestB
--- PASS: TestB (10.00s)
=== RUN   TestC
    foo_test.go:37: Error response from daemon: No such container: 6270d9bba7dfc44cead74463883767627801f927563f1048d76514c5a1182c5d: creating reaper failed: failed to create container
--- FAIL: TestC (0.01s)
FAIL
FAIL    github.com/myorg/mystuff/pkg/foo      11.626s

I'm using redpanda here, but that's for illustration only. I don't believe it has anything to do with the module that's used.

Additional information

Increasing the reconnection timeout (#1668) was a workaround for this.

@srenatus srenatus added the bug An issue with the library label Sep 22, 2023
@mdelapenya
Copy link
Member

Hi @srenatus I think this is the expected behavior, as Ryuk will check the configured timeout and die if no connection is made during that time. The workaround is actually the way to increase the time Ryuk waits for connections, so I'd say with a lot of caution this is a feature not a bug 🪲

@srenatus
Copy link
Contributor Author

srenatus commented Nov 8, 2023

Is it? I mean, checking if there is still an instance and reusing it if possible makes sense, but if there is no instance running anymore, it could start a new one. Similar to how it starts one at the very start. (Extending the timeout would then be an optimisation.)

@mdelapenya
Copy link
Member

Oh I'm sorry I probably misunderstood the issue, as it's not about Ryuk not receiving connections and dying, but about the need of Ryuk recreation if for some reason the test took more time than the reconnection timeout, so it would need to be recreated, right?

If the latter, I think the work in #1904 could help in making the issue more visible.

@mdelapenya
Copy link
Member

@srenatus given #1904 was merged, could you try with the main branch instead of the release one? I'm preparing a new release, and would like to include this big as fixed if possible.

Thanks!

@srenatus
Copy link
Contributor Author

I'm out for the rest of the week but I can give it a try next week!

@srenatus
Copy link
Contributor Author

Update -- I've run our CI bits with testcontainers-go from main, a9f0ac8, and it failed again with reaper-related errors

 === RUN   TestDecisionLogsS3Output
    s3_test.go:301: Error response from daemon: No such container: 1769a2c2c0ec375f576f1e4d92fd8dbc8175b360f042ec634ac935f180fa1d7e: creating reaper failed: failed to create container
--- FAIL: TestDecisionLogsS3Output (0.00s)
=== RUN   TestDecisionLogsS3Batching
    s3_test.go:301: Error response from daemon: No such container: 1769a2c2c0ec375f576f1e4d92fd8dbc8175b360f042ec634ac935f180fa1d7e: creating reaper failed: failed to create container
--- FAIL: TestDecisionLogsS3Batching (0.00s)

@Mathew-Estafanous
Copy link
Contributor

Mathew-Estafanous commented Dec 28, 2023

Looking into this, it appears that the reaper is supposed to be recreated but fails during the state check because the container is already gone by then.

https://github.com/testcontainers/testcontainers-go/blob/9060ff78823e0b6f1689016efa2164ffd12e9fec/reaper.go#L139C2-L151

We could check if it's an ErrNotFound state error, and if so then treat that differently. I propose something a solution like this.

// 1. if the reaper instance has been already created, return it
if reaperInstance != nil {
	// Verify this instance is still running by checking state.
	// Can't use Container.IsRunning because the bool is not updated when Reaper is terminated
	state, err := reaperInstance.container.State(ctx)
	if err != nil {
		if !errdefs.IsNotFound(err) {
			return nil, err
		}
	} else {
		if state.Running {
			return reaperInstance, nil
		}
	}
	// else: the reaper instance has been terminated, so we need to create a new one
	reaperOnce = sync.Once{}
}

@mdelapenya
Copy link
Member

Looking into this, it appears that the reaper is supposed to be recreated but fails during the state check because the container is already gone by then.

https://github.com/testcontainers/testcontainers-go/blob/9060ff78823e0b6f1689016efa2164ffd12e9fec/reaper.go#L139C2-L151

We could check if it's an ErrNotFound state error, and if so then treat that differently. I propose something a solution like this.

// 1. if the reaper instance has been already created, return it
if reaperInstance != nil {
	// Verify this instance is still running by checking state.
	// Can't use Container.IsRunning because the bool is not updated when Reaper is terminated
	state, err := reaperInstance.container.State(ctx)
	if err != nil {
		if !errdefs.IsNotFound(err) {
			return nil, err
		}
	} else {
		if state.Running {
			return reaperInstance, nil
		}
	}
	// else: the reaper instance has been terminated, so we need to create a new one
	reaperOnce = sync.Once{}
}

@Mathew-Estafanous would you mind submitting a PR with your fix? And just to confirm, the last line, where the reaperOnce is reset, is needed because a previous call was executed, and for some reason Ryuk died in between, therefore the once state is protecting that block. Is this assumption correct?

@Mathew-Estafanous
Copy link
Contributor

Mathew-Estafanous commented Jan 9, 2024

would you mind submitting a PR with your fix?

Here you go! #2084

And just to confirm, the last line, where the reaperOnce is reset, is needed because a previous call was executed, and for some reason Ryuk died in between, therefore the once state is protecting that block. Is this assumption correct?

Yup!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug An issue with the library
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants