fix: don't panic when logs waits for more than 5 seconds (v2) #1278
Conversation
The consumer's Accept method runs in a different goroutine than the main test, so we need to synchronize access to the messages slice. Some tests waited for Ack before accessing Msgs and were properly synchronized, but TestContainerLogWithErrClosed accessed Msgs while the consumer was still running, so the memory access was not synchronized. Test_StartStop did not wait for the expected lines to appear before stopping the log producer; nothing guaranteed that at least one line would be read. The test only passed because the producer currently does not interrupt outstanding HTTP requests to fetch logs, so it implicitly waited for some logs to be received. We now wait for the mlem line to appear before stopping the log producer.
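As a rough sketch of the synchronization pattern these fixes rely on (the Log shape is trimmed down, and the Messages accessor is a hypothetical helper of ours, not code from the PR):

```go
package logtest

import "sync"

// Log mirrors the shape of testcontainers-go's Log type, reduced to
// the one field this sketch needs.
type Log struct {
	Content []byte
}

// TestLogConsumer records every line the producer delivers. Accept
// runs on the producer's goroutine, so Msgs is guarded by a mutex.
type TestLogConsumer struct {
	mu   sync.Mutex
	Msgs []string
}

func (g *TestLogConsumer) Accept(l Log) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.Msgs = append(g.Msgs, string(l.Content))
}

// Messages hands the test a snapshot, so the test never reads Msgs
// while Accept is appending to it. The accessor name is invented
// for this sketch.
func (g *TestLogConsumer) Messages() []string {
	g.mu.Lock()
	defer g.mu.Unlock()
	return append([]string(nil), g.Msgs...)
}
```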
This removes the panic when the logs endpoint takes more than 5 seconds to respond. The panic happened at least with Podman when no new logs appear while using the follow and since parameters. We keep retrying until the context is canceled (the retry request would fail anyway with a canceled context) or the producer is stopped, whichever comes first. This makes the retry behavior consistent with the handling of closed connections. Outstanding HTTP calls for fetching logs are now interrupted when a producer is stopped; previously the consumer and StopProducer() waited for the HTTP call to complete. This should fix testcontainers#946
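A minimal sketch of the retry shape described above, assuming a placeholder fetchLogs for the HTTP call and an explicit stop channel (neither identifier comes from the PR):

```go
package logtest

import (
	"context"
	"errors"
	"time"
)

// run keeps fetching logs until the context is canceled or the
// producer is stopped, instead of panicking on a slow endpoint.
func run(ctx context.Context, stop <-chan struct{}, fetchLogs func(context.Context) error) {
	// Cancel the in-flight request as soon as the producer is
	// stopped, so stopping never waits for a slow logs endpoint.
	ctx, cancel := context.WithCancel(ctx)
	defer cancel()
	go func() {
		select {
		case <-stop:
			cancel()
		case <-ctx.Done():
		}
	}()

	for {
		err := fetchLogs(ctx)
		switch {
		case err == nil:
			return // stream ended normally
		case errors.Is(err, context.Canceled):
			return // cancellation is a stop signal, not a failure
		}
		// Any other error: retry after a short pause.
		select {
		case <-ctx.Done():
			return
		case <-time.After(time.Second):
		}
	}
}
```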
I'm testing this PR with the same script as in #947:

```sh
for x in {1..20}; do go test -timeout 600s -run ^Test_StartStop$ github.com/testcontainers/testcontainers-go -count=1; done
```

And in general I still see some flakiness, although different from the one described in that ticket: instead of receiving inconsistent messages, I get a context deadline exceeded error.

Could you take a look at it?

Apart from that, the code at this moment LGTM
```diff
@@ -31,7 +42,41 @@ func (g *TestLogConsumer) Accept(l Log) {
 		return
 	}
 
+	g.mu.Lock()
 	g.Msgs = append(g.Msgs, s)
```
Could we add a comment here explaining why the synchronisation must be done at the test level? You already added it in #947 :)
Hey @martin-sucha I came back from parental leave and I'm revisiting this PR. Did you see my previous comments?
Hey @mdelapenya yes, thank you for the comments! I've been quite busy recently, and this keeps falling through the cracks. I hope I can revisit this soon.
If the context is done, we close the log producer. That is not an error; the context cancellation signals that the consumer should stop. If there is a non-context error during the HTTP call or while reading the response, we retry the HTTP request after 1 second. Previously, the error handling was inconsistent:

- an error while reading HTTP response headers would retry the HTTP request
- but an error while reading the body would just end the log producer

With this commit, the error handling should be more consistent.
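To make that symmetry concrete, here is a hedged sketch (do and accept are placeholders, and the real producer parses Docker's multiplexed log stream rather than raw bytes): a failure before the headers arrive and a failure while reading the body now surface identically, so a single retry path handles both.

```go
package logtest

import (
	"context"
	"errors"
	"io"
)

// readLogs performs one fetch of the logs endpoint. An error from
// issuing the request and an error from reading the body are
// returned the same way, so the caller retries both alike.
func readLogs(ctx context.Context, do func(context.Context) (io.ReadCloser, error), accept func([]byte)) error {
	body, err := do(ctx)
	if err != nil {
		return err // header-phase failure: caller retries
	}
	defer body.Close()

	buf := make([]byte, 4096)
	for {
		n, err := body.Read(buf)
		if n > 0 {
			accept(buf[:n])
		}
		if errors.Is(err, io.EOF) {
			return nil // clean end of stream
		}
		if err != nil {
			return err // body-phase failure: retried just the same
		}
	}
}
```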
I fixed the error handling so that context canceled is not treated as an error; it is merely a signal that the log producer should stop. I also updated the handling of other errors to always retry until the log producer is stopped.
wdyt about a different approach:

In this way the caller would have more control over what it wants to do:

Why?:
@Tofel I like this. Would you be able to work on it? We are thinking about an eventual v1 version, so adding a breaking change in the signature/behaviour, if done properly, wouldn't be that bad.
@mdelapenya I will be able to work on that, yes, but I need to confirm with my team when. I'll circle back once I know.
@mdelapenya it's me again... so my team is okay with me taking the time to propose an alternative solution (although I guess it will have to be a bit different, since we cannot return the error from that goroutine; but why not have an error channel on which the caller could listen in a dedicated goroutine?). Before I start on it, could you tell me when you think v1 might see the light of day?
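A hypothetical shape for that error-channel idea (every identifier here is invented for illustration; this API was never part of the PR):

```go
package logtest

// LogProducer sketches the proposal: instead of panicking inside its
// goroutine, the producer publishes failures on a channel that the
// caller drains on its own terms.
type LogProducer struct {
	errCh chan error
}

// Errors returns the channel the caller listens on; closing it when
// the producer shuts down ends the caller's range loop below.
func (p *LogProducer) Errors() <-chan error { return p.errCh }

// A caller could then decide per test what a failure means:
//
//	go func() {
//	    for err := range producer.Errors() {
//	        t.Logf("log producer error: %v", err)
//	    }
//	}()
```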
Indeed, I admit that we should totally rewrite the log consumer implementation for v1, as I think it has grown in a non-organic manner.

Not sure about it. I already have a draft doc with the things I'd like to do/fix, which I'll make public right after it's completed, in the form of a GH discussion. From there, I'll start creating issues in case the community wants to participate.
This PR has been superseded by #1971; please take a look at #1971 (comment). I'd like to thank you for the time you put into this PR. It has not been merged this time, but it shed a lot of light on the final implementation. Thanks!
This should fix #946
What does this PR do?
There are two commits.
The first commit fixes synchronization issues in producer tests.
The second commit removes the panic when the logs endpoint takes more than 5 seconds to respond. The panic happened at least with Podman when no new logs appear while using the follow and since parameters.
We keep retrying until the context is canceled (the retry request would fail anyway) or the producer is stopped, whichever comes first. This makes the retry behavior consistent. The HTTP calls to fetch logs are now interrupted immediately when the producer stops (previously the producer waited for in-progress calls to complete).
Why is it important?
A panic takes down the whole test suite, so all my tests break whenever the container exits early or there is a temporary error while fetching the logs.
Related issues

- #946