[BUG] Continuous SSL exceptions post upgrade from 2.11 to 2.15 #4718

blueish-eyez · 2024-08-16T10:05:48Z

Describe the bug

I had a working cluster free of errors, however post upgrade to 2.15 (also tested 2.16) I'm getting a ton of the following error on worker nodes:

[2024-08-12T16:48:40,662][ERROR][o.o.h.n.s.SecureNetty4HttpServerTransport] [opensearch-0] Exception during establishing a SSL connection: java.net.SocketException: Connection reset
java.net.SocketException: Connection reset
	at java.base/sun.nio.ch.SocketChannelImpl.throwConnectionReset(SocketChannelImpl.java:401) ~[?:?]
	at java.base/sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:434) ~[?:?]
	at org.opensearch.transport.CopyBytesSocketChannel.readFromSocketChannel(CopyBytesSocketChannel.java:156) ~[transport-netty4-client-2.16.0.jar:2.16.0]
	at org.opensearch.transport.CopyBytesSocketChannel.doReadBytes(CopyBytesSocketChannel.java:141) ~[transport-netty4-client-2.16.0.jar:2.16.0]
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:151) [netty-transport-4.1.111.Final.jar:4.1.111.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788) [netty-transport-4.1.111.Final.jar:4.1.111.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:689) [netty-transport-4.1.111.Final.jar:4.1.111.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:652) [netty-transport-4.1.111.Final.jar:4.1.111.Final]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562) [netty-transport-4.1.111.Final.jar:4.1.111.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:994) [netty-common-4.1.111.Final.jar:4.1.111.Final]
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.111.Final.jar:4.1.111.Final]
	at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]

This message is always identical and I cannot pinpoint it to a specific action performed by OpenSearch. The only thing that's also repeatedly logged is:

[INFO ][o.o.j.s.JobSweeper       ] [opensearch-0] Running full sweep

But I feel it's a long shot.

I did not see this problem on 2.11; it only started appearing after upgrading. Has anyone else experienced this? Is there any direction you could recommend other than verifying certificates (already done: none expired, CAs are there, as well as the keys, etc.)? Is there perhaps a way to connect the SSL exception with the specific task that caused it? Was it communication from a specific node? From the master node, maybe? Was it related to snapshots or something else?

Any help is highly appreciated!

Related component

Other

To Reproduce

Set up a working cluster on 2.11 with the Security plugin
Upgrade the cluster to 2.15
Check worker node logs for SSL exceptions

Expected behavior

No SSL exceptions post upgrade

Additional Details

Plugins
Security plugin

Host/Environment (please complete the following information):

Docker image based OpenSearch 2.15

zane-neo · 2024-08-19T03:03:53Z

@blueish-eyez I tried this locally but couldn't reproduce it. Here is what I did:

Create a 2.11 OpenSearch cluster with a docker-compose file and create several documents in an index.
Change the OpenSearch version in the docker-compose file to 2.16 and restart.
Can you share steps that can consistently reproduce the issue?

blueish-eyez · 2024-08-19T09:25:45Z

Thinking about the larger picture, there will be plenty of additional points of influence here. It's a production-grade cluster (a sandbox, but with the same principles of access and usage). There are Java applications writing directly to the cluster. I tried setting up docker-compose and upgrading, but I could not reproduce it in that scenario either. I also cannot reproduce it on another regular production-grade cluster that uses the traditional filebeat-logstash-opensearch write pipeline.

Is it possible to decode [2024-08-12T16:48:40,662][ERROR][o.o.h.n.s.SecureNetty4HttpServerTransport], specifically the "o.o.h.n.s" bit, as mentioned in https://opensearch.org/docs/latest/install-and-configure/configuring-opensearch/logs/?
For example, o.o.i.r stands for logger.org.opensearch.index.reindex. Perhaps if I knew where the call was made from, I could pinpoint the faulty module?

zane-neo · 2024-08-20T03:49:56Z

The o.o.h.n.s stands for org.opensearch.http.netty4.ssl. You can try changing the log level of this package, or of the class org.opensearch.http.netty4.ssl.SecureNetty4HttpServerTransport, to see if more details can be found.
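
To make the mapping concrete, here is a minimal sketch of the abbreviation, assuming the log layout shortens every package segment to its first letter and keeps the class name (which is consistent with the mapping given above). The function names are made up for illustration. The abbreviation cannot be reversed on its own, so the practical use is to check candidate logger names against what the log shows.

```python
def abbreviate_logger(name: str) -> str:
    """Shorten every package segment to its first character; keep the class name."""
    *packages, cls = name.split(".")
    return ".".join([p[0] for p in packages if p] + [cls])


def matches_abbreviation(abbrev: str, full_name: str) -> bool:
    """Check whether a full logger name would be rendered as `abbrev` in the log."""
    return abbreviate_logger(full_name) == abbrev


if __name__ == "__main__":
    full = "org.opensearch.http.netty4.ssl.SecureNetty4HttpServerTransport"
    print(abbreviate_logger(full))
```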

machenity · 2024-09-02T04:33:32Z

Hi. I'm experiencing the same issue.
In my OpenSearch cluster on version 2.16.0, the above issue is repeating on coordinating nodes.

The strange thing is that it's happening exactly once every 10 hours, and almost exclusively on the coordinating nodes.

This happens even when no indexing or query requests are being sent.

I've created and tested clusters with different versions, and the same issue doesn't happen with 2.13.0 and earlier,
but only with 2.14.0 and later.

2.12.0: not occurred
2.13.0: not occurred
2.14.0: occurred
2.16.0: occurred

So, I'm not sure, but I suspect it's related to this commit.

10000-ki · 2024-09-02T05:06:05Z

Hi
The same thing is happening to me, and the issue seems to exist from v2.14.0 onward.

dblock · 2024-09-02T14:42:10Z

We still need to narrow this down and get an easy repro, maybe @stephen-crawford can help us here?

reta · 2024-09-03T12:47:14Z

@dblock I will also take a look

reta · 2024-09-03T17:03:35Z

> In my OpenSearch cluster on version 2.16.0, the above issue is repeating on coordinating nodes.

@machenity do you have any proxy / gateway in front of the cluster?

> I had a working cluster free of errors, however post upgrade to 2.15 (also tested 2.16) I'm getting a ton of the following error on worker nodes:

@blueish-eyez do you see a periodic pattern as well, or is it very random?

reta · 2024-09-03T17:33:41Z

OK folks, I think the mystery is resolved. TL;DR: no functional regressions have been introduced.

So pre-2.14.0, the secure HTTP transport didn't log any exceptions (please see https://github.com/opensearch-project/security/blob/2.13.0.0/src/main/java/org/opensearch/security/ssl/OpenSearchSecuritySSLPlugin.java#L270 where the error handler was set to NOOP).

In 2.14.0 and later, the handler was switched from NOOP to one that logs (please see https://github.com/opensearch-project/security/blob/main/src/main/java/org/opensearch/security/OpenSearchSecurityPlugin.java#L2125), and that is why the exceptions are appearing now.

The takeaway is that those were present before but swallowed.

dblock · 2024-09-04T13:12:38Z

@reta Thanks! There's still something causing these errors that's not supposed to, no?

reta · 2024-09-04T13:17:44Z

> @reta Thanks! There's still something causing these errors that's not supposed to, no?

Thanks @dblock, correct, these errors are caused by clients closing the connection. It is not possible to pinpoint the exact reasons, but here are a few possibilities:

proxy / gateways in the middle may close the connection
application / service crash that abruptly closes the connection
...
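
For illustration, here is a minimal sketch of how an abrupt client-side close surfaces as "Connection reset" on the server side. It uses plain TCP on loopback with no TLS and no OpenSearch involved, so it only demonstrates the socket-level behavior, not the exact log entry. The function name is made up, and the `SO_LINGER` struct layout assumes Linux or macOS.

```python
import socket
import struct


def simulate_abrupt_close() -> str:
    """Return the name of the exception the server side sees, or 'clean close'."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # ephemeral port on loopback
    server.listen(1)

    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.connect(server.getsockname())
    conn, _ = server.accept()
    conn.settimeout(5)  # avoid hanging forever if no reset arrives

    # SO_LINGER with l_onoff=1 and l_linger=0 makes close() send a TCP RST
    # instead of the normal FIN, similar to a crashed client or a proxy
    # dropping the connection.
    client.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
    client.close()

    try:
        data = conn.recv(1024)
        return "clean close" if data == b"" else "data"
    except ConnectionResetError as exc:
        return type(exc).__name__
    finally:
        conn.close()
        server.close()


if __name__ == "__main__":
    print(simulate_abrupt_close())
```

A client that closes normally (without the linger option) would instead produce an empty read on the server, which is why only abrupt disconnects show up as reset errors.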

machenity · 2024-09-04T14:29:27Z

@reta thanks for the debug!

> In my OpenSearch cluster on version 2.16.0, the above issue is repeating on coordinating nodes.
>
> @machenity do you have any proxy / gateway in front of the cluster?

My clusters were provisioned on top of our private cloud, in-house Kubernetes, so all coordinating nodes were behind the LoadBalancer service. I asked our infrastructure team if there is any monitoring job involved.

Thanks again 🙇

machenity · 2024-09-05T06:22:23Z

I checked it again, and the 10-hour pattern appears only on lightly used clusters.
On heavily used clusters, these log entries occur randomly.

blueish-eyez · 2024-09-05T10:50:10Z

Hi @reta
I restarted one of the worker nodes and collected sample stdout for roughly 30 minutes. The SSL exception appears roughly every 5-30 seconds. Unfortunately, this is to the point where logs become unusable after days of uptime. Is there a way to suppress this?
Please also note that while others mentioned they are seeing this on master nodes, I'm seeing this only on worker nodes.
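
For anyone wanting to check whether the exceptions follow a periodic pattern, here is a minimal sketch that extracts the gaps between consecutive SSL exception entries. It assumes the timestamp and level format of the log lines quoted in this thread; the function and constant names are made up for illustration.

```python
import re
from datetime import datetime

# Timestamp and level layout taken from the log lines quoted in this thread.
LINE_RE = re.compile(r"^\[(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2},\d{3})\]\[ERROR\]")
MARKER = "Exception during establishing a SSL connection"


def ssl_error_intervals(lines):
    """Return the seconds elapsed between consecutive SSL exception log entries."""
    stamps = []
    for line in lines:
        match = LINE_RE.match(line)
        if match and MARKER in line:
            stamps.append(datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S,%f"))
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]
```

Passing an open log file (any iterable of lines) returns a list of gaps in seconds. Near-constant gaps would point to a scheduled client such as a health check, while widely scattered gaps would suggest ordinary client traffic.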

reta · 2024-09-05T12:03:55Z

> The SSL exception appears roughly every 5-30 seconds. Unfortunately, this is to the point where logs become unusable after days of uptime. Is there a way to suppress this?

Thanks @blueish-eyez, yes, I will be working on restoring the previous behavior where such exceptions were swallowed. @dblock could you please transfer this ticket to the security plugin?

dblock · 2024-09-09T16:06:23Z

[Catch All Triage - 1, 2, 3, 4, 5]

reta · 2024-09-10T19:47:31Z

@blueish-eyez the problem with sslExceptionHandler should be fixed in 2.18.0, but it may not get rid of these exceptions in the logs; e.g., I could reproduce them (by simulation) on 2.11.0:

[2024-09-10T19:45:25,650][ERROR][o.o.s.s.h.n.SecuritySSLNettyHttpServerTransport] [21140679ffc1] Exception during establishing a SSL connection: java.net.SocketException: Connection reset
java.net.SocketException: Connection reset
        at sun.nio.ch.SocketChannelImpl.throwConnectionReset(SocketChannelImpl.java:394) ~[?:?]
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:426) ~[?:?]
        at org.opensearch.transport.CopyBytesSocketChannel.readFromSocketChannel(CopyBytesSocketChannel.java:156) ~[transport-netty4-client-2.11.0.jar:2.11.0]
        at org.opensearch.transport.CopyBytesSocketChannel.doReadBytes(CopyBytesSocketChannel.java:141) ~[transport-netty4-client-2.11.0.jar:2.11.0]
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:151) [netty-transport-4.1.100.Final.jar:4.1.100.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788) [netty-transport-4.1.100.Final.jar:4.1.100.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:689) [netty-transport-4.1.100.Final.jar:4.1.100.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:652) [netty-transport-4.1.100.Final.jar:4.1.100.Final]
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562) [netty-transport-4.1.100.Final.jar:4.1.100.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) [netty-common-4.1.100.Final.jar:4.1.100.Final]
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.100.Final.jar:4.1.100.Final]
        at java.lang.Thread.run(Thread.java:833) [?:?]

By any chance, did you (re)configure logging on your clusters? One notable difference between 2.11.x and 2.15.x is that the transports are now in a different package.

blueish-eyez · 2024-09-11T13:15:33Z

@reta thanks for the update. Logging is configured at info level and I believe this was the baseline. No specific packages have elevated / suppressed messaging (except for the test that @zane-neo suggested in the first response, which however did not show any relevant messages around SSL exceptions). For the time being I can set org.opensearch.http.netty4.ssl to FATAL until I can get the 2.18 image to test with.
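
As a minimal sketch of that workaround, the snippet below builds the JSON body for a cluster settings update (`PUT _cluster/settings`) that raises one logger's threshold. It assumes the `logger.<package>` setting key described on the logs documentation page linked earlier in this thread; the function name is made up, and the request itself is left to whatever HTTP client is in use.

```python
import json


def logger_settings_body(logger_name: str, level: str, persistent: bool = True) -> str:
    """Build the JSON body for PUT _cluster/settings that sets one logger's level."""
    allowed = {"OFF", "FATAL", "ERROR", "WARN", "INFO", "DEBUG", "TRACE"}
    if level.upper() not in allowed:
        raise ValueError(f"unknown log level: {level}")
    scope = "persistent" if persistent else "transient"
    return json.dumps({scope: {f"logger.{logger_name}": level.upper()}})


if __name__ == "__main__":
    print(logger_settings_body("org.opensearch.http.netty4.ssl", "FATAL"))
```

Note that raising the threshold to FATAL hides every ERROR from that package, not just connection resets, so it is best treated as a temporary measure.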

rlueckl · 2024-10-16T11:29:19Z

FYI: we see the same exceptions (currently running 2.17, didn't check when the exceptions were introduced in the logs).

A very simple way to reproduce - and how we noticed the exceptions in the first place: try the Nagios check_http plugin.

# /usr/lib/nagios/plugins/check_http -H opensearch-host01.int.lan --port=9200 --ssl --certificate=15
OK - Certificate 'opensearch-host01.int.lan' will expire on Sun 17 Nov 2024 06:05:05 AM GMT +0000.

Running the check will immediately trigger an exception. We didn't change anything in our monitoring and the check_http script wasn't updated for a long time either.

reta · 2024-10-16T19:53:46Z

@rlueckl thanks, could you reproduce it with a Docker setup? I cannot; the log is clean:

$ docker run -it -p 9200:9200 -p 9600:9600 -e OPENSEARCH_INITIAL_ADMIN_PASSWORD=_ad0m#Ns_ -e "discovery.type=single-node"  opensearchproject/opensearch:2.17.1

$ /usr/local/nagios/libexec/check_http  -H localhost --port=9200 --ssl --certificate=15
SSL OK - Certificate 'node-0.example.com' will expire in 3410 days on 2034-02-17 12:03 -0500/EST.

rlueckl · 2024-10-17T06:23:05Z

Hi @reta,
which version of check_http are you using? I've tested the docker image on my workstation but could not reproduce the issue, but my Linux Mint has a slightly older version of check_http than the servers.

check_http 2.3.1 (workstation) -> docker 2.17.0 (workstation) -> no exception
check_http 2.3.1 (workstation) -> docker 2.17.1 (workstation) -> no exception
check_http 2.3.1 (workstation) -> deb install 2.17.0 (server) -> no exception
check_http 2.3.3 (server) -> deb install 2.17.0 (server) -> connection reset exception
check_http 2.3.3 (server) -> docker 2.17.x (workstation) -> can't test, firewall between the server and my workstation.

The last test would be interesting if you could test with check_http 2.3.3 and the docker image to see if there's an exception. Unfortunately I'm somewhat limited because of our company firewalls and can't test this case.

EDIT: also tested with check_http 2.4.9 -> no exception. So it seems that the exception is (only) triggered by check_http 2.3.3 (maybe other versions, I want to test 2.3.5 as soon as I have some time).

reta · 2024-10-17T13:06:42Z

> Hi @reta,
> which version of check_http are you using?

Thanks @rlueckl

$ /usr/local/nagios/libexec/check_http  --version
check_http v2.4.11 (nagios-plugins 2.4.11)

I will try this configuration now, will update you shortly:

check_http 2.3.3 (server) -> deb install 2.17.0 (server) -> connection reset exception

UPD: Debian Bookworm + OpenSearch 2.17.1 (on Docker):

root@c2d1f85d6cd6:/tmp/nagios-plugins-2.3.3# /usr/local/nagios/libexec/check_http -H c2d1f85d6cd6 --port=9200 --ssl --certificate=15
SSL OK - Certificate 'node-0.example.com' will expire in 3410 days on 2034-02-17 17:03 +0000/UTC.

Log is clean :(

rlueckl · 2024-10-18T07:15:26Z

Interesting. I can't reproduce the exception with 2.3.3 from my workstation, but running the check locally on the server I get the exception.

check_http 2.3.3 (server) -> deb install 2.17.0 (server) -> connection reset exception
check_http 2.3.3 (workstation) -> deb install 2.17.0 (server) -> no exception

kkoki-ctrl · 2024-11-08T02:30:36Z

I encountered the same error, and after checking, I found the following:

The error occurs sporadically 0 to 3 times per hour, but only on the master node.
The error happens regardless of the version (OpenSearch 2.13.0, 2.14.0, 2.16.0).
The error does not occur when the elasticsearch-exporter:v1.8.0 is stopped.

https://forum.opensearch.org/t/ssl-exception-connection-reset-error-on-master-nodes/22281/5

blueish-eyez added the bug and untriaged labels Aug 16, 2024

dblock mentioned this issue Aug 23, 2024

[BUG] Exception during establishing a SSL connection opensearch-project/OpenSearch#15332

Closed

reta self-assigned this Sep 5, 2024

dblock removed the untriaged label Sep 9, 2024

dblock transferred this issue from opensearch-project/OpenSearch Sep 9, 2024

github-actions bot added the untriaged label Sep 9, 2024

dblock removed the untriaged label Sep 9, 2024

reta added the v3.0.0 and v2.18.0 labels Sep 9, 2024

reta mentioned this issue Sep 10, 2024

Use evaluateSslExceptionHandler() when constructing OpenSearchSecureSettingsFactory #4725

Merged

5 tasks
