Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: don't panic when logs waits for more than 5 seconds #1777

Closed
wants to merge 14 commits into from

Conversation

mdelapenya
Copy link
Member

@mdelapenya mdelapenya commented Oct 16, 2023

Disclaimer

Credits to @martin-sucha, as the original author of this PR. 🙌

I resolved the conflicts manually, reverting the changes on the logconsumer_test.go file. I sent a different PR because I could not push to the upstream of #1278. But all the commits have been preserved to keep history and authorship untouched.

What does this PR do?

This should fix #946

There are two commits.

The first commit fixes synchronization issues in producer tests.

The second commit removes panic when logs endpoint takes more than 5 seconds to respond. The panic happened at least with podman when no new logs appear when using follow and since parameters.

We keep retrying until the context is canceled (the retry request would fail anyway) or the producer is stopped, whichever comes first. This makes the retry behavior consistent. The HTTP calls to fetch logs are interrupted immediately (previously the producer waited for in-progress calls to complete).

Why is it important?

Panicking the whole test suite breaks all my tests if the container ends early or there is some temporary error fetching the logs.

Related issues

martin-sucha and others added 6 commits June 8, 2023 16:40
The consumer's Accept method is running in a different gorutine than
the main test.
We need to synchronize access to the messages slice.

Some tests waited for Ack before accessing Msgs and were properly
synchronized.

TestContainerLogWithErrClosed accessed Msgs while the consumer was
running and the memory access was not synchronized.

Test_StartStop did not wait for the expected lines to appear before
stopping the log producer, there was nothing to guarantee that
at least one line would be read.
The test passed because currently the producer does not interrupt
outstanding HTTP requests to fetch logs, so it implicitly waited for
some logs to be received.
We now wait for the mlem line to appear before stopping the log
producer.
This removes panic when logs endpoint takes more than 5 seconds to respond.
The panic happened at least with podman when no new logs appear when using follow and since parameters.

We keep retrying until the context is canceled
(the retry request would fail anyway with canceled context)
or the producer is stopped,
whichever comes first.
This makes the retry behavior consistent with closed connections
handling.

Outstanding HTTP calls for fetching logs are now interrupted when
a producer is stopped.
Previously the consumer and StopProducer() waited for the HTTP call
to complete.

This should fix testcontainers#946
If the context is done, we close the log producer.
That is not an error, the context cancellation signals that the consumer
should stop.

If there is a non-context error during the HTTP call or while reading
the response, retry the HTTP request in 1 second again.

Previously, the error handling was inconsistent:
- an error while reading HTTP response headers would retry
  the HTTP request
- but an error while reading the body would just end the log producer

With this commit, the error handling should be more consistent.
* main: (137 commits)
  Fix wrong module names (testcontainers#1776)
  docs: add default options to k6 module (testcontainers#1744)
  fix race condition in Test_StartStop (testcontainers#1700)
  chore(deps): bump github.com/aws/aws-sdk-go-v2/{service/s3,credentials,config} in /modules/localstack (testcontainers#1773)
  chore(deps): bump cloud.google.com/go/{datastore,bigtable,spanner} in /modules/gcloud (testcontainers#1774)
  chore(deps): bump golang.org/x/net from 0.15.0 to 0.17.0 (testcontainers#1772)
  docs: Fix typo and mention the relevant function name in doc (testcontainers#1745)
  DOCKER_HOST var typo (testcontainers#1743)
  feat: Add Cassandra module (testcontainers#1726)
  K6 module (testcontainers#1721)
  Rancher Desktop instructions (testcontainers#1724)
  chore(deps): bump github.com/shirou/gopsutil/v3 from 3.23.8 to 3.23.9 (testcontainers#1720)
  chore(deps): bump urllib3 from 2.0.5 to 2.0.6 (testcontainers#1718)
  chore(deps): bump github.com/twmb/franz-go/pkg/kadm in /modules/redpanda (testcontainers#1714)
  chore(deps): bump github.com/couchbase/gocb/v2 in /modules/couchbase (testcontainers#1704)
  chore(deps): bump github.com/neo4j/neo4j-go-driver/v5 in /modules/neo4j (testcontainers#1713)
  chore(deps): bump github.com/nats-io/nats.go in /modules/nats (testcontainers#1705)
  chore(deps): bump cloud.google.com/go/firestore from 1.12.0 to 1.13.0, google.golang.org/api from 0.142.0 to 0.143.0 and cloud.google.com/ge, google.golang.org/api from 0.142.0 to 0.143.0 and cloud.google.com/go/bigquery from 1.53.0 to 1.55 in /modules/gcloud (testcontainers#1716)
  chore(deps): bump github.com/aws/aws-sdk-go-v2/service/s3 from 1.39.0 to 1.40.0 and github.com/aws/aws-sdk-go from 1.45.15 to 1.45.19 in /modules/localstack (testcontainers#1717)
  chore: prepare for next minor development cycle (0.26.0)
  ...
@mdelapenya mdelapenya added the bug An issue with the library label Oct 16, 2023
@mdelapenya mdelapenya self-assigned this Oct 16, 2023
@mdelapenya mdelapenya requested a review from a team as a code owner October 16, 2023 08:06
@netlify
Copy link

netlify bot commented Oct 16, 2023

Deploy Preview for testcontainers-go ready!

Name Link
🔨 Latest commit 51d6f8b
🔍 Latest deploy log https://app.netlify.com/sites/testcontainers-go/deploys/6538d66a826a6500088ebb07
😎 Deploy Preview https://deploy-preview-1777--testcontainers-go.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

docker.go Outdated Show resolved Hide resolved
docker.go Outdated Show resolved Hide resolved
* main:
  fix: data race in docker client `Info()` (testcontainers#1779)
  Use correct formatting directive for errors in lifecycle logs (testcontainers#1780)
  chore(deps): bump golang.org/x/mod from 0.12.0 to 0.13.0 in /modules/{elasticsearch,kafka} and /modulegen (testcontainers#1778)
  chore(deps): bump github.com/rabbitmq/amqp091-go in /modules/rabbitmq (testcontainers#1728)
  chore(deps): bump github.com/ClickHouse/clickhouse-go/v2 (testcontainers#1732)
  ignore patterns defined in dockerignore (testcontainers#1725)
docker.go Outdated Show resolved Hide resolved
docker.go Outdated Show resolved Hide resolved
docker.go Outdated
Comment on lines 684 to 689
select {
case <-ctx.Done():
return
default:
c.logger.Printf("read log header: %+v. %s", err, logRestartedForOutOfSyncMessage)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'd have to excuse me for my previous comments. While this is totally OK, the select statement isn't really necessary here. There's a very slim chance of hitting the case <-ctx.Done() branch here so we might as well remove the whole select statement here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want the default case to be moved to the second select, or simply remove it completely?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can basically do this:

Suggested change
select {
case <-ctx.Done():
return
default:
c.logger.Printf("read log header: %+v. %s", err, logRestartedForOutOfSyncMessage)
}
c.logger.Printf("read log header: %+v. %s", err, logRestartedForOutOfSyncMessage)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok cool! we are going to roll it back to the initial state where the print was out of the select 😅

In any case, thanks for your assistance in this part of the code, some time you find it difficult to control all of them

docker.go Outdated Show resolved Hide resolved
docker.go Outdated Show resolved Hide resolved
docker.go Outdated
Comment on lines 639 to 646
// we need a separate select here in order to be able to print the error
// and this cannot be done in the same select that checks for the timeout.
select {
case <-ctx.Done():
return
default:
c.logger.Printf("cannot get logs for container %q: %v", c.ID, err)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto as elsewhere, no point in the separate select, there's a very slim chance the context gets done between check in line 634 and here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we can extract the log out and remove the select, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup

docker.go Outdated
Comment on lines 739 to 744
select {
case <-ctx.Done():
return
default:
c.logger.Printf("read log message: %+v. %s", err, logRestartedForOutOfSyncMessage)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

* main:
  chore(deps): combine and bump compose dependencies (testcontainers#1787)
  feat: support for replacing images with custom substitutions (testcontainers#1719)
…operation that does not do anything with context
…operation that does not do anything with context
@mdelapenya mdelapenya added the hacktoberfest Pull Requests accepted for Hacktoberfest. label Oct 24, 2023
@mdelapenya
Copy link
Member Author

@pmalek we merged #1701 yesterday, which is causing the conflicts here. Do you think that this PR is still needed after that?

@pmalek
Copy link
Contributor

pmalek commented Oct 25, 2023

@pmalek we merged #1701 yesterday, which is causing the conflicts here. Do you think that this PR is still needed after that?

I'm not sure, haven't compared #1701 with this one in detail. Was just commenting out of good will on this PR :)

@mdelapenya
Copy link
Member Author

I'm not sure, haven't compared #1701 with this one in detail. Was just commenting out of good will on this PR :)

Sure, and I do appreciate your help here. Many times, debugging concurrent tasks is extremely difficult, so having more eyes on it is really helpful 🙇

err := r.Close()
if err != nil {
// we can't close the read closer, this should never happen
panic(err)
Copy link

@pckhoi pckhoi Nov 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we panicking here? If you just want to log the error and return then isn't it better to consistently do it in the same way?

If the purpose is to print the stack trace then maybe panic is best, but why here?

@pckhoi
Copy link

pckhoi commented Nov 9, 2023

I would say this change is still absolutely needed after #1701. I'm still observing StopLogProducer hanging when the go routine panic. We should replace the stopProducer chan bool with the stopProducer context.CancelFunc sooner or later, if not in this PR then in a separate PR.

@pmalek
Copy link
Contributor

pmalek commented Nov 11, 2023

I would say this change is still absolutely needed after #1701. I'm still observing StopLogProducer hanging when the go routine panic. We should replace the stopProducer chan bool with the stopProducer context.CancelFunc sooner or later, if not in this PR then in a separate PR.

Hey @pckhoi 👋

I've pushed #1911 which should fix this very issue. Still, as you've suggested some refactor of internal structure of DockerContainer might be desirable.

@mdelapenya
Copy link
Member Author

I think #1971 supersedes this one. Closing it

@mdelapenya mdelapenya closed this Jan 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug An issue with the library hacktoberfest Pull Requests accepted for Hacktoberfest.
Projects
None yet
4 participants