Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: don't panic when logs waits for more than 5 seconds (v2) #1278

Closed
wants to merge 5 commits into from

Conversation

martin-sucha
Copy link
Contributor

This should fix #946

What does this PR do?

There are two commits.

The first commit fixes synchronization issues in producer tests.

The second commit removes panic when logs endpoint takes more than 5 seconds to respond. The panic happened at least with podman when no new logs appear when using follow and since parameters.

We keep retrying until the context is canceled (the retry request would fail anyway) or the producer is stopped, whichever comes first. This makes the retry behavior consistent. The HTTP calls to fetch logs are interrupted immediately (previously the producer waited for in-progress calls to complete).

Why is it important?

Panicking the whole test suite breaks all my tests if the container ends early or there is some temporary error fetching the logs.

Related issues

The consumer's Accept method is running in a different gorutine than
the main test.
We need to synchronize access to the messages slice.

Some tests waited for Ack before accessing Msgs and were properly
synchronized.

TestContainerLogWithErrClosed accessed Msgs while the consumer was
running and the memory access was not synchronized.

Test_StartStop did not wait for the expected lines to appear before
stopping the log producer, there was nothing to guarantee that
at least one line would be read.
The test passed because currently the producer does not interrupt
outstanding HTTP requests to fetch logs, so it implicitly waited for
some logs to be received.
We now wait for the mlem line to appear before stopping the log
producer.
This removes panic when logs endpoint takes more than 5 seconds to respond.
The panic happened at least with podman when no new logs appear when using follow and since parameters.

We keep retrying until the context is canceled
(the retry request would fail anyway with canceled context)
or the producer is stopped,
whichever comes first.
This makes the retry behavior consistent with closed connections
handling.

Outstanding HTTP calls for fetching logs are now interrupted when
a producer is stopped.
Previously the consumer and StopProducer() waited for the HTTP call
to complete.

This should fix testcontainers#946
@martin-sucha martin-sucha requested a review from a team as a code owner June 8, 2023 14:54
@netlify
Copy link

netlify bot commented Jun 8, 2023

Deploy Preview for testcontainers-go ready!

Name Link
🔨 Latest commit bc8715e
🔍 Latest deploy log https://app.netlify.com/sites/testcontainers-go/deploys/64e5cb40f704190008655394
😎 Deploy Preview https://deploy-preview-1278--testcontainers-go.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@sonarqubecloud
Copy link

sonarqubecloud bot commented Jun 8, 2023

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 3 Code Smells

No Coverage information No Coverage information
13.2% 13.2% Duplication

Copy link
Member

@mdelapenya mdelapenya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm testing this PR with the same script as in #947:

for x in {1..20}; do go test -timeout 600s -run ^Test_StartStop$ github.com/testcontainers/testcontainers-go -count=1; done

And in general I still see certain flakiness, although different to the one described in that ticket: instead of receiving inconsistent messages, I get a context deadline exceeded error.

Could you take a look at it?

Apart from that, the code at this moment LGTM

@@ -31,7 +42,41 @@ func (g *TestLogConsumer) Accept(l Log) {
return
}

g.Msgs = append(g.Msgs, s)
g.mu.Lock()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add a comment here explaining why the synchronisation must be done at the test level? You already added it in #947 :)

@mdelapenya
Copy link
Member

mdelapenya commented Aug 11, 2023

Hey @martin-sucha I came back from parental leave and I'm revisiting this PR. Did you see my previous comments?

@martin-sucha
Copy link
Contributor Author

Hey @mdelapenya yes, thank you for the comments! I've been quite busy recently, this keeps falling through the cracks. I hope I can revisit this soon.

If the context is done, we close the log producer.
That is not an error, the context cancellation signals that the consumer
should stop.

If there is a non-context error during the HTTP call or while reading
the response, retry the HTTP request in 1 second again.

Previously, the error handling was inconsistent:
- an error while reading HTTP response headers would retry
  the HTTP request
- but an error while reading the body would just end the log producer

With this commit, the error handling should be more consistent.
@sonarqubecloud
Copy link

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 3 Code Smells

No Coverage information No Coverage information
11.2% 11.2% Duplication

@martin-sucha
Copy link
Contributor Author

I'm testing this PR with the same script as in #947:

for x in {1..20}; do go test -timeout 600s -run ^Test_StartStop$ github.com/testcontainers/testcontainers-go -count=1; done

And in general I still see certain flakiness, although different to the one described in that ticket: instead of receiving inconsistent messages, I get a context deadline exceeded error.

I fixed the error handling so that context canceled is not treated as an error. It is merely a signal that the log producer should stop. Also updated the error handling for other errors to always retry until the log producer is stopped.

@Tofel
Copy link
Contributor

Tofel commented Oct 30, 2023

wdyt about a different approach:

  • allow the caller to pass an optional timeout duration
  • instead of panicking return an error if log producer cannot be started?

In this way caller would have bigger control over what it wants to do:

  • retry
  • pass the error higher up the call stack
  • panic, etc

Why?:

  • retrying until success might not always be what we want (what if it never succeeds and we retry it until test timeouts?)
  • panicking inside a different goroutine makes it impossible to handle it in any way and some programs might be fine with running without logs

@mdelapenya
Copy link
Member

wdyt about a different approach:

  • allow the caller to pass an optional timeout duration
  • instead of panicking return an error if log producer cannot be started?

In this way caller would have bigger control over what it wants to do:

  • retry
  • pass the error higher up the call stack
  • panic, etc

Why?:

  • retrying until success might not always be what we want (what if it never succeeds and we retry it until test timeouts?)
  • panicking inside a different goroutine makes it impossible to handle it in any way and some programs might be fine with running without logs

@Tofel I like this. Would you be able to work on it? We are thinking about an eventual v1 version, so adding a breaking change in the signature/behaviour, if done properly, wouldn't be that bad.

@Tofel
Copy link
Contributor

Tofel commented Oct 30, 2023

@mdelapenya I will be able to work on that, yes, but I need to confirm with me team when. Will circle back once I know it.

@Tofel
Copy link
Contributor

Tofel commented Nov 13, 2023

@mdelapenya it's me again... so my team is okay with me taking the time to propose an alternative solution (although I guess it will have to be a bit different, since we cannot return the error from that goroutine, but why not have an error channel on which the caller could listen in a dedicated goroutine?).

Before I do start with it, could you tell me when you're thinking that v1 might see the light of the day?

@mdelapenya
Copy link
Member

@mdelapenya it's me again... so my team is okay with me taking the time to propose an alternative solution (although I guess it will have to be a bit different, since we cannot return the error from that goroutine, but why not have an error channel on which the caller could listen in a dedicated goroutine?).

Indeed, I admit that we should totally rewrite the log consumer implementation for the v1, as I think it has grown in a non-organic manner.

Before I do start with it, could you tell me when you're thinking that v1 might see the light of the day?

Not sure about it. I already have a draft doc with the things that I'd like to do/fix, which I'd make it public right after completed, in the form of a GH discussion. From there, I'll start creating issues in case the community wants to participate.

@mdelapenya
Copy link
Member

This PR has been superseded by #1971, please take a look at #1971 (comment)

I'd like to thank you for your time elaborating this PR. It has not been merged this time, but came with a lot of light for the final implementation. Thanks!

@mdelapenya mdelapenya closed this Jan 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug An issue with the library
Projects
None yet
3 participants