
[BUG] opensearch crashes on closed client connection before search reply #3557

Closed
dbonf opened this issue Jun 10, 2022 · 3 comments · Fixed by #3626
Labels: bug (Something isn't working), v2.1.0 (Issues and PRs related to version 2.1.0), v3.0.0 (Issues and PRs related to version 3.0.0)

Comments


dbonf commented Jun 10, 2022

Describe the bug
In some circumstances, when running a search operation, if the client TCP connection is closed before the search operation completes, OpenSearch crashes with:

[2022-06-09T16:54:39,763][ERROR][org.opensearch.bootstrap.OpenSearchUncaughtExceptionHandler] fatal error in thread [opensearch[mycluster-elasticsearch-5][search][T#10]], exiting
java.lang.AssertionError: unexpected higher total ops [18] compared to expected [17]

To Reproduce
Steps to reproduce the behaviour:

We can consistently reproduce this behaviour in several environments.

  1. Run a long-running search query, e.g. curl -s "http://localhost:9200/myindex*/_search" (the same happens with any client/library)
  2. Kill the client connection after a couple of seconds, e.g. stop the curl command with Ctrl+C (equivalent to a connection closed by the client after a timeout, before the reply from the OpenSearch node)
  3. The OpenSearch node crashes
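The steps above can be condensed into a short script. This is a sketch, assuming a local node on port 9200 and an index pattern myindex* large enough that the search runs for several seconds; timeout(1) stands in for the manual Ctrl+C:

```shell
#!/bin/sh
# Kill the client side of the search after 2 seconds, before the node
# has replied -- the same effect as stopping curl with Ctrl+C.
timeout 2 curl -s "http://localhost:9200/myindex*/_search" > /dev/null
echo "client exit code: $?"   # 124 indicates timeout(1) killed curl

# With the bug present, the node process has by now exited fatally,
# so a follow-up request fails to connect.
curl -s -o /dev/null -w '%{http_code}\n' "http://localhost:9200/_cluster/health"
```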

Expected behavior
No crashes.

Plugins

  • repository-s3
  • repository-gcs

Host/Environment (please complete the following information):

  • OS: Red Hat Enterprise Linux 8 (in a container, UBI minimal)
  • Version: OpenSearch 1.2.4

Additional context
Relevant logs:

Jun 9 18:54:18 mycluster-elasticsearch-1 elasticsearch fatal fatal error in thread [opensearch[mycluster-elasticsearch-1][search][T#6]], exiting
Jun 9 18:54:18 mycluster-elasticsearch-1 elasticsearch java.lang.AssertionError: unexpected higher total ops [19] compared to expected [18]
	at org.opensearch.action.search.AbstractSearchAsyncAction.onShardFailure(AbstractSearchAsyncAction.java:461)
	at org.opensearch.action.search.AbstractSearchAsyncAction.lambda$performPhaseOnShard$2(AbstractSearchAsyncAction.java:303)
	at org.opensearch.action.search.AbstractSearchAsyncAction$2.doRun(AbstractSearchAsyncAction.java:338)
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:50)
	at org.opensearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:57)
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:792)
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:50)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
	at java.base/java.lang.Thread.run(Thread.java:832)
Caused by: Failed to execute phase [query], Shard failures; shardFailures {[L_zfhb4RRHm2BRsxkzqkig][myindex-2022.06.09][0]: TransportException[failure to send]; nested: TaskCancelledException[The parent task was cancelled, shouldn't start any child tasks]; }{[L_zfhb4RRHm2BRsxkzqkig][myindex-2022.06.08][0]: TransportException[failure to send]; nested: TaskCancelledException[The parent task was cancelled, shouldn't start any child tasks]; }{[RtTqNO27TSitOPJX41HW5g][myindex-2022.06.07][0]: TransportException[failure to send]; nested: TaskCancelledException[The parent task was cancelled, shouldn't start any child tasks]; }{[L_zfhb4RRHm2BRsxkzqkig][myindex-2022.06.06][0]: TransportException[failure to send]; nested: TaskCancelledException[The parent task was cancelled, shouldn't start any child tasks]; }{[RtTqNO27TSitOPJX41HW5g][myindex-2022.06.05][0]: TransportException[failure to send]; nested: TaskCancelledException[The parent task was cancelled, shouldn't start any child tasks]; }{[RtTqNO27TSitOPJX41HW5g][myindex-2022.06.04][0]: TransportException[failure to send]; nested: TaskCancelledException[The parent task was cancelled, shouldn't start any child tasks]; }{[RtTqNO27TSitOPJX41HW5g][myindex-2022.06.03][0]: TransportException[failure to send]; nested: TaskCancelledException[The parent task was cancelled, shouldn't start any child tasks]; }{[RtTqNO27TSitOPJX41HW5g][myindex-2022.06.02][0]: TransportException[failure to send]; nested: TaskCancelledException[The parent task was cancelled, shouldn't start any child tasks]; }{[RtTqNO27TSitOPJX41HW5g][myindex-2022.06.01][0]: TaskCancelledException[The parent task was cancelled, shouldn't start any child tasks]}
	... 10 more
Caused by: TaskCancelledException[The parent task was cancelled, shouldn't start any child tasks]
	at org.opensearch.tasks.TaskManager$CancellableTaskHolder.registerChildNode(TaskManager.java:534)
	at org.opensearch.tasks.TaskManager.registerChildNode(TaskManager.java:226)

See attachment for complete logs.
log.txt

dbonf added the bug (Something isn't working) and untriaged labels on Jun 10, 2022

dbonf commented Jun 10, 2022

This could be similar to elastic/elasticsearch#71453.


reta commented Jun 17, 2022

I was looking into this one; so far I have not been able to reproduce it, but it seems the problem is that onShardFailure is called more than once for the same shardIndex. This happens when the number of requests per node is throttled:

Jun 9 18:54:18 mycluster-elasticsearch-1 elasticsearch ERROR [org.opensearch.bootstrap.OpenSearchUncaughtExceptionHandler] fatal error in thread [opensearch[mycluster-elasticsearch-1][search][T#7]], exiting
Jun 9 18:54:18 mycluster-elasticsearch-1 elasticsearch java.lang.AssertionError: unexpected higher total ops [19] compared to expected [18]
	at org.opensearch.action.search.AbstractSearchAsyncAction.onShardFailure(AbstractSearchAsyncAction.java:461) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.search.AbstractSearchAsyncAction.lambda$performPhaseOnShard$2(AbstractSearchAsyncAction.java:303) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.search.AbstractSearchAsyncAction$2.doRun(AbstractSearchAsyncAction.java:338) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:50) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:57) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:792) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:50) ~[opensearch-1.2.4.jar:1.2.4]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) ~[?:?]
	at java.lang.Thread.run(Thread.java:832) [?:?]
Caused by: org.opensearch.action.search.SearchPhaseExecutionException: Shard failures
	... 10 more
Caused by: org.opensearch.tasks.TaskCancelledException: The parent task was cancelled, shouldn't start any child tasks
	at org.opensearch.tasks.TaskManager$CancellableTaskHolder.registerChildNode(TaskManager.java:534) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.tasks.TaskManager.registerChildNode(TaskManager.java:226) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.transport.TransportService.sendRequest(TransportService.java:731) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.transport.TransportService.sendChildRequest(TransportService.java:838) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.transport.TransportService.sendChildRequest(TransportService.java:826) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.search.SearchTransportService.sendExecuteQuery(SearchTransportService.java:196) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.search.SearchQueryThenFetchAsyncAction.executePhaseOnShard(SearchQueryThenFetchAsyncAction.java:124) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.search.AbstractSearchAsyncAction.lambda$performPhaseOnShard$3(AbstractSearchAsyncAction.java:278) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.search.AbstractSearchAsyncAction$PendingExecutions.tryRun(AbstractSearchAsyncAction.java:773) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.search.AbstractSearchAsyncAction.performPhaseOnShard(AbstractSearchAsyncAction.java:310) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.search.AbstractSearchAsyncAction.onShardFailure(AbstractSearchAsyncAction.java:465) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.search.AbstractSearchAsyncAction.access$100(AbstractSearchAsyncAction.java:82) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.search.AbstractSearchAsyncAction$1.onFailure(AbstractSearchAsyncAction.java:291) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.search.SearchExecutionStatsCollector.onFailure(SearchExecutionStatsCollector.java:84) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:72) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.search.SearchTransportService$ConnectionCountingHandler.handleException(SearchTransportService.java:593) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.transport.TransportService.sendRequest(TransportService.java:772) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.transport.TransportService.sendChildRequest(TransportService.java:838) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.transport.TransportService.sendChildRequest(TransportService.java:826) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.search.SearchTransportService.sendExecuteQuery(SearchTransportService.java:196) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.search.SearchQueryThenFetchAsyncAction.executePhaseOnShard(SearchQueryThenFetchAsyncAction.java:124) ~[opensearch-1.2.4.jar:1.2.4]


reta commented Jun 17, 2022

It seems like I have gotten a 100% reproducing test for it .... assigning to myself. The failure only manifests when the number of requests per node is throttled and an exception occurs during onPhaseDone() execution. In that case onShardFailure is unexpectedly called again, mismatching the operations counter.
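The assertion in AbstractSearchAsyncAction guards an operations counter that every shard result, success or failure, is expected to bump exactly once, up to the expected total. A minimal, simplified model of how a repeated onShardFailure call for the same shard trips that invariant (hypothetical class and method names; not the actual OpenSearch code):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class OpsCounter {
    // Simplified model: a search over expectedOps shards; each shard result,
    // success or failure, must be counted exactly once.
    private final int expectedOps = 2;
    private final AtomicInteger totalOps = new AtomicInteger();

    public void onShardResult() {
        int ops = totalOps.incrementAndGet();
        // AbstractSearchAsyncAction asserts this invariant; tripping it on a
        // search thread is what brought the node down in this issue.
        if (ops > expectedOps) {
            throw new AssertionError(
                "unexpected higher total ops [" + ops + "] compared to expected [" + expectedOps + "]");
        }
    }

    public static void main(String[] args) {
        OpsCounter action = new OpsCounter();
        action.onShardResult(); // shard 0 reported once: ops = 1
        action.onShardResult(); // shard 1 reported once: ops = 2, all accounted for
        try {
            action.onShardResult(); // same shard reported a second time: ops = 3 > 2
        } catch (AssertionError e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Counting a shard twice, rather than any problem with the shard results themselves, is what produces the "unexpected higher total ops" message from the logs above.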
