fix: don't panic when logs waits for more than 5 seconds #1777
Conversation
The consumer's Accept method runs in a different goroutine than the main test, so we need to synchronize access to the messages slice. Some tests waited for Ack before accessing Msgs and were properly synchronized, but TestContainerLogWithErrClosed accessed Msgs while the consumer was still running, so the memory access was not synchronized. Test_StartStop did not wait for the expected lines to appear before stopping the log producer; nothing guaranteed that at least one line would be read. The test passed only because the producer currently does not interrupt outstanding HTTP requests to fetch logs, so it implicitly waited for some logs to be received. We now wait for the mlem line to appear before stopping the log producer.
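For illustration, a minimal sketch of a mutex-guarded test consumer; the type and method names mirror the test helpers discussed above but are illustrative, not the PR's exact code:

```go
package logs_test

import (
	"sync"

	"github.com/testcontainers/testcontainers-go"
)

// TestLogConsumer collects log lines. The mutex makes it safe to call Msgs
// from the main test goroutine while Accept runs on the producer goroutine.
type TestLogConsumer struct {
	mu   sync.Mutex
	msgs []string
}

// Accept is invoked from the log producer's goroutine for every log line.
func (c *TestLogConsumer) Accept(l testcontainers.Log) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.msgs = append(c.msgs, string(l.Content))
}

// Msgs returns a copy of the collected lines; copying under the lock avoids
// the unsynchronized read that TestContainerLogWithErrClosed was hitting.
func (c *TestLogConsumer) Msgs() []string {
	c.mu.Lock()
	defer c.mu.Unlock()
	return append([]string(nil), c.msgs...)
}
```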
This removes the panic when the logs endpoint takes more than 5 seconds to respond. The panic happened at least with podman when no new logs appear while using the follow and since parameters. We keep retrying until the context is canceled (the retry request would fail anyway with a canceled context) or the producer is stopped, whichever comes first. This makes the retry behavior consistent with the handling of closed connections. Outstanding HTTP calls to fetch logs are now interrupted when a producer is stopped; previously, the consumer and StopProducer() waited for the HTTP call to complete. This should fix testcontainers#946
If the context is done, we close the log producer. That is not an error; the context cancellation signals that the consumer should stop. If there is a non-context error during the HTTP call or while reading the response, we retry the HTTP request after 1 second. Previously, the error handling was inconsistent:
- an error while reading the HTTP response headers would retry the HTTP request
- but an error while reading the body would simply end the log producer
With this commit, the error handling is consistent.
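As a sketch of the retry flow described here — fetchLogs and consume are hypothetical stand-ins for the HTTP call and the consumer dispatch inside the real producer, not the library's actual functions:

```go
package logs

import (
	"context"
	"io"
	"log"
	"time"
)

// produceLogs keeps fetching and consuming logs until the context is
// canceled, which signals a normal shutdown rather than an error.
func produceLogs(ctx context.Context, fetchLogs func(context.Context) (io.ReadCloser, error), consume func(io.Reader) error) {
	for {
		r, err := fetchLogs(ctx)
		if err == nil {
			err = consume(r)
			r.Close()
		}
		if ctx.Err() != nil {
			// A canceled context means the producer was stopped;
			// exit quietly, this is not an error.
			return
		}
		if err != nil {
			// Errors from the request, the headers, or the body are
			// all treated the same way: log and retry in one second.
			log.Printf("fetching container logs: %v, retrying", err)
		}
		select {
		case <-ctx.Done():
			return
		case <-time.After(time.Second):
		}
	}
}
```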
docker.go
Outdated
select {
case <-ctx.Done():
	return
default:
	c.logger.Printf("read log header: %+v. %s", err, logRestartedForOutOfSyncMessage)
}
You'd have to excuse me for my previous comments. While this is totally OK, the select statement isn't really necessary here. There's a very slim chance of hitting the case <-ctx.Done() branch here, so we might as well remove the whole select statement.
Do you want the default case to be moved to the second select, or should it simply be removed completely?
We can basically do this:
-select {
-case <-ctx.Done():
-	return
-default:
-	c.logger.Printf("read log header: %+v. %s", err, logRestartedForOutOfSyncMessage)
-}
+c.logger.Printf("read log header: %+v. %s", err, logRestartedForOutOfSyncMessage)
OK cool! We are going to roll it back to the initial state, where the print was outside the select 😅
In any case, thanks for your assistance with this part of the code. Sometimes it's difficult to keep track of all of it.
docker.go
Outdated
// we need a separate select here in order to be able to print the error
// and this cannot be done in the same select that checks for the timeout.
select {
case <-ctx.Done():
	return
default:
	c.logger.Printf("cannot get logs for container %q: %v", c.ID, err)
}
Ditto as elsewhere, no point in the separate select; there's a very slim chance the context gets done between the check at line 634 and here.
I guess we can extract the log out and remove the select, right?
Yup
docker.go
Outdated
select {
case <-ctx.Done():
	return
default:
	c.logger.Printf("read log message: %+v. %s", err, logRestartedForOutOfSyncMessage)
}
ditto
Sure, and I do appreciate your help here. Many times, debugging concurrent tasks is extremely difficult, so having more eyes on it is really helpful 🙇
err := r.Close()
if err != nil {
	// we can't close the read closer, this should never happen
	panic(err)
}
Why are we panicking here? If you just want to log the error and return, then isn't it better to do it consistently, in the same way as the other error paths?
If the purpose is to print the stack trace, then maybe panic is best, but why here?
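For reference, the non-panicking variant the reviewer is hinting at might look like this; the logger field and the message text are assumptions mirroring the other error paths, not the PR's actual code:

```go
if err := r.Close(); err != nil {
	// Closing the response body should essentially never fail, but if it
	// does, log it like the other producer errors instead of panicking
	// and taking down the whole test binary.
	c.logger.Printf("close log response body: %v", err)
}
```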
I would say this change is still absolutely needed after #1701. I'm still observing …
Hey @pckhoi 👋 I've pushed #1911, which should fix this very issue. Still, as you've suggested, some refactor of the internal structure of …
I think #1971 supersedes this one. Closing it.
Disclaimer
Credits to @martin-sucha, as the original author of this PR. 🙌
I resolved the conflicts manually, reverting the changes to the logconsumer_test.go file. I sent a different PR because I could not push to the upstream of #1278. All the commits have been preserved to keep history and authorship untouched.
What does this PR do?
This should fix #946
There are two commits.
The first commit fixes synchronization issues in producer tests.
The second commit removes the panic when the logs endpoint takes more than 5 seconds to respond. The panic happened at least with podman when no new logs appear while using the follow and since parameters.
We keep retrying until the context is canceled (the retry request would fail anyway) or the producer is stopped, whichever comes first. This makes the retry behavior consistent. The HTTP calls to fetch logs are interrupted immediately (previously the producer waited for in-progress calls to complete).
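A sketch of how stopping the producer can interrupt an in-flight request: derive a context that is canceled when a stop channel closes. The function and channel names here are illustrative, not the library's actual API:

```go
package logs

import "context"

// stoppableContext returns a context that ends either when the parent is
// done or when stop is closed. Passing it to the logs HTTP request lets
// StopProducer interrupt an outstanding call instead of waiting for it.
func stoppableContext(parent context.Context, stop <-chan struct{}) (context.Context, context.CancelFunc) {
	ctx, cancel := context.WithCancel(parent)
	go func() {
		select {
		case <-stop: // closed by StopProducer
		case <-ctx.Done():
		}
		cancel() // aborts any outstanding logs request immediately
	}()
	return ctx, cancel
}
```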
Why is it important?
A panic takes down the whole test suite, breaking all my tests whenever the container ends early or there is some temporary error fetching the logs.
Related issues