Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[test][exporter/prometheus] TestPrometheusExporter failures #36139

Closed
pjanotti opened this issue Nov 1, 2024 · 6 comments · Fixed by #36164
Closed

[test][exporter/prometheus] TestPrometheusExporter failures #36139

pjanotti opened this issue Nov 1, 2024 · 6 comments · Fixed by #36164

Comments

@pjanotti
Copy link
Contributor

pjanotti commented Nov 1, 2024

Component(s)

exporter/prometheus

Describe the issue you're reporting

A few instances of failures:

=== FAIL: . TestPrometheusExporter (0.00s)
    prometheus_test.go:87: 
        	Error Trace:	D:/a/opentelemetry-collector-contrib/opentelemetry-collector-contrib/exporter/prometheusexporter/prometheus_test.go:87
        	Error:      	Received unexpected error:
        	            	close tcp 127.0.0.1:8999: use of closed network connection
        	Test:       	TestPrometheusExporter

=== FAIL: . TestPrometheusExporter (re-run 1) (0.01s)
    prometheus_test.go:87: 
        	Error Trace:	D:/a/opentelemetry-collector-contrib/opentelemetry-collector-contrib/exporter/prometheusexporter/prometheus_test.go:87
        	Error:      	Received unexpected error:
        	            	close tcp 127.0.0.1:8999: use of closed network connection
        	Test:       	TestPrometheusExporter

DONE 2 runs, 73 tests, 2 failures in 361.711s

In quick glance this seems likely because the server using the port is actually launched in a goroutine and the port can be closed by shutdown before the server starts.

@pjanotti pjanotti added os:windows needs triage New item requiring triage labels Nov 1, 2024
Copy link
Contributor

github-actions bot commented Nov 1, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@dashpole dashpole self-assigned this Nov 1, 2024
@dashpole dashpole removed the needs triage New item requiring triage label Nov 1, 2024
@dashpole
Copy link
Contributor

dashpole commented Nov 1, 2024

@Argannor this seems likely. to be related to #35465.

I'm OOO for the next few weeks. Are you able to investigate + fix or alternatively revert your PR?

Thanks!

@Argannor
Copy link
Contributor

Argannor commented Nov 1, 2024

@dashpole I can have a closer look on monday and come up with a new PR. If that's not fast enough for you feel free to revert and I'll open a new PR reintroducing the changes + fix.

In quick glance this seems likely because the server using the port is actually launched in a goroutine and the port can be closed by shutdown before the server starts.

I think that's a likely explanation. I'll do some experimentation to confirm your hypothesis. If I'm not mistaken the error only occurs sporadically, so it's likely to be a race condition.

My apologies.

@dashpole
Copy link
Contributor

dashpole commented Nov 1, 2024

Monday should be fine. I won't be able to review for a while, but other approvers are free to do so.

@Argannor
Copy link
Contributor

Argannor commented Nov 4, 2024

I can confirm that this error happens when srv.Serve(ln) is called after ln.Close() has already been called. Will follow up with PR.

Edit: While this sequence causes an error like this, I don't think this is the root cause, since the error returned by srv.Serve(ln) is ignored, and the test fails during a call to prometheusExporter.Shutdown().

Argannor pushed a commit to Argannor/opentelemetry-collector-contrib that referenced this issue Nov 4, 2024
Argannor added a commit to Argannor/opentelemetry-collector-contrib that referenced this issue Nov 4, 2024
@pjanotti
Copy link
Contributor Author

pjanotti commented Nov 8, 2024

@Argannor correct, that first glance was wrong. Looking again at the error message it says close tcp 127.0.0.1:8999: use of closed network connection the first "close" confirms the operation attempted, so as you saw it was on shutdown.

pull bot pushed a commit to abaguas/opentelemetry-collector-contrib that referenced this issue Nov 9, 2024
…pen-telemetry#36164)

#### Description

Adjusted the Start and Shutdown sequence of the prometheusexporter to
prevent a race condition causing the `close tcp 127.0.0.1:8999: use of
closed network connection` error as observed in open-telemetry#36139.

The behaviour was changed in the following ways:

- If an error occurs during the creation of the server, close the
listener immediately, leaving shutdown a noop
- Since `srv.Shutdown` will close all open listeners, the explicit call
to `ln.Close` in the shutdown routine was removed

#### Link to tracking issue
Fixes open-telemetry#36139

#### Testing

Unit testing, I temporarily increased the number of iterations on
`TestPrometheusExporter` to 2000. The error did no longer occur. However
sporadically another error occured:

```
=== RUN   TestPrometheusExporter
    prometheus_test.go:103: 
        	Error Trace:	C:/development/code/opentelemetry-collector-contrib/exporter/prometheusexporter/prometheus_test.go:84
        	Error:      	Received unexpected error:
        	            	listen tcp 127.0.0.1:8999: bind: Only one usage of each socket address (protocol/network address/port) is normally permitted.
        	Test:       	TestPrometheusExporter
--- FAIL: TestPrometheusExporter (1.16s)
```

I assume that this is because the OS (in my case Windows) won't always
close the underlying sockets immediately, blocking it for some time
afterwards. I'm not sure there is anything we can do about that.

---------

Signed-off-by: Argannor <arga@argannor.com>
sbylica-splunk pushed a commit to sbylica-splunk/opentelemetry-collector-contrib that referenced this issue Dec 17, 2024
…pen-telemetry#36164)

#### Description

Adjusted the Start and Shutdown sequence of the prometheusexporter to
prevent a race condition causing the `close tcp 127.0.0.1:8999: use of
closed network connection` error as observed in open-telemetry#36139.

The behaviour was changed in the following ways:

- If an error occurs during the creation of the server, close the
listener immediately, leaving shutdown a noop
- Since `srv.Shutdown` will close all open listeners, the explicit call
to `ln.Close` in the shutdown routine was removed

#### Link to tracking issue
Fixes open-telemetry#36139

#### Testing

Unit testing, I temporarily increased the number of iterations on
`TestPrometheusExporter` to 2000. The error did no longer occur. However
sporadically another error occured:

```
=== RUN   TestPrometheusExporter
    prometheus_test.go:103: 
        	Error Trace:	C:/development/code/opentelemetry-collector-contrib/exporter/prometheusexporter/prometheus_test.go:84
        	Error:      	Received unexpected error:
        	            	listen tcp 127.0.0.1:8999: bind: Only one usage of each socket address (protocol/network address/port) is normally permitted.
        	Test:       	TestPrometheusExporter
--- FAIL: TestPrometheusExporter (1.16s)
```

I assume that this is because the OS (in my case Windows) won't always
close the underlying sockets immediately, blocking it for some time
afterwards. I'm not sure there is anything we can do about that.

---------

Signed-off-by: Argannor <arga@argannor.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants