
NoSuchRemoteClusterException in docs checks #47718

Closed
DaveCTurner opened this issue Oct 8, 2019 · 9 comments
Labels
:Distributed Indexing/CCR Issues around the Cross Cluster State Replication features >test-failure Triaged test failures from CI

Comments

@DaveCTurner
Contributor

This PR build failed:

https://gradle-enterprise.elastic.co/s/qt3vwzsysfowc/console-log?task=:docs:integTestRunner

org.elasticsearch.smoketest.DocsClientYamlTestSuiteIT > test {yaml=reference/ccr/apis/auto-follow/put-auto-follow-pattern/line_88} FAILED
java.lang.AssertionError: Failure at [reference/ccr/apis/auto-follow/put-auto-follow-pattern:74]: expected [2xx] status code but api [raw[method=PUT path=_ccr/auto_follow/my_auto_follow_pattern]] returned [400 Bad Request] [{"error":{"root_cause":[{"type":"no_such_remote_cluster_exception","reason":"no such remote cluster: [remote_cluster]","stack_trace":"org.elasticsearch.transport.NoSuchRemoteClusterException: no such remote cluster: [remote_cluster]...

I see other exceptions about remote clusters in the logs near the bottom:

»    ↓ errors and warnings from /dev/shm/elastic+elasticsearch+pull-request+docs-check/docs/build/testclusters/integTest-0/logs/es.stdout.log ↓
» WARN ][o.e.d.FileBasedSeedHostsProvider] [node-0] expected, but did not find, a dynamic hosts list at [/dev/shm/elastic+elasticsearch+pull-request+docs-check/docs/build/testclusters/integTest-0/config/unicast_hosts.txt]
» WARN ][o.e.t.RemoteConnectionStrategy] [node-0] fetching nodes from external cluster [cluster_two] failed
»  org.elasticsearch.transport.ConnectTransportException: [][127.0.0.1:9301] connect_exception
»  	at org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onFailure(TcpTransport.java:989) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
»  	at org.elasticsearch.action.ActionListener.lambda$toBiConsumer$3(ActionListener.java:162) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
»  	at org.elasticsearch.common.concurrent.CompletableContext.lambda$addListener$0(CompletableContext.java:42) ~[elasticsearch-core-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
»  	at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859) ~[?:?]
»  	at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837) ~[?:?]
»  	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) ~[?:?]
»  	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088) ~[?:?]
»  	at org.elasticsearch.common.concurrent.CompletableContext.completeExceptionally(CompletableContext.java:57) ~[elasticsearch-core-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
»  	at org.elasticsearch.transport.netty4.Netty4TcpChannel.lambda$addListener$0(Netty4TcpChannel.java:68) ~[?:?]
»  	at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:500) ~[?:?]
»  	at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:493) ~[?:?]
»  	at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:472) ~[?:?]
»  	at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:413) ~[?:?]
»  	at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:538) ~[?:?]
»  	at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:531) ~[?:?]
»  	at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:111) ~[?:?]
»  	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:323) ~[?:?]
»  	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:339) ~[?:?]
»  	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:685) ~[?:?]
»  	at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:597) ~[?:?]
»  	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:551) ~[?:?]
»  	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:511) ~[?:?]
»  	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:918) ~[?:?]
»  	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[?:?]
»  	at java.lang.Thread.run(Thread.java:834) [?:?]
»  Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /127.0.0.1:9301
»  Caused by: java.net.ConnectException: Connection refused
»  	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:?]
»  	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:779) ~[?:?]
»  	at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:327) ~[?:?]
»  	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:336) ~[?:?]
»  	... 7 more
»   ↑ repeated 2 times ↑

Build stats shows a handful of failures like this daily across PRs, 7.x and master since 2019-09-27. I suspect this is a build issue and not something CCR-specific, so I'm labelling it accordingly.

@DaveCTurner DaveCTurner added :Delivery/Build Build or test infrastructure >test-failure Triaged test failures from CI labels Oct 8, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-core-infra (:Core/Infra/Build)

@rjernst
Member

rjernst commented Oct 10, 2019

I don't see anything obvious in the changelog around that time. This seems like something specific to the CCR documentation testing setup, not a build infrastructure issue.

@rjernst rjernst added :Distributed Indexing/CCR Issues around the Cross Cluster State Replication features and removed :Delivery/Build Build or test infrastructure labels Oct 10, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/CCR)

@ywelsch
Contributor

ywelsch commented Oct 11, 2019

This is a race in the doc tests, unfortunately one that we can't handle with the current YAML test infrastructure. The problem is that defining a remote cluster is currently done through the settings infrastructure, and there's no guarantee that when

PUT _cluster/settings
{
  "persistent": {
    "cluster": {
      "remote": {
        "cluster_one": {

returns, the nodes have actually connected to the remote cluster.

This means that subsequent requests to the cluster can fail if a connection has not been established yet by the time a request comes in.
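One way to observe this race from a client is to check the `connected` flag that `GET _remote/info` reports per remote cluster alias. A minimal sketch of that check follows; the sample payload mirrors the documented response shape, but the values are illustrative and not taken from this build:

```python
# Check whether a remote cluster alias reports as connected in a
# GET _remote/info response. The payload shape follows the documented
# response format; the values below are illustrative only.

def is_remote_connected(remote_info: dict, alias: str) -> bool:
    """Return True if the given remote cluster alias is connected."""
    info = remote_info.get(alias)
    return bool(info) and info.get("connected", False)

# Illustrative response: the settings update has been acked, but the
# node has not yet established a connection to cluster_one.
sample = {
    "cluster_one": {
        "seeds": ["127.0.0.1:9300"],
        "connected": False,
        "num_nodes_connected": 0,
    }
}

print(is_remote_connected(sample, "cluster_one"))    # False: race window
print(is_remote_connected(sample, "remote_cluster"))  # False: alias not defined
```

A test that polls this flag before issuing CCR requests would not hit the 400 above.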

In an ideal world we would have a proper API for defining remote clusters, similar to what we have for defining repositories. We could then, as with repositories, make sure that the remote clusters are correctly connected before acking to the user. Lacking this, I see only these options here:

  • Add random activity (akin to a sleep) to the YML test before having it use the connection.
  • Add some kind of wait-for / assertBusy logic to the yml tests (which would have been handy in many other situations as well).
  • Adapt the connection behavior for remote clusters to briefly block the cluster state update thread on the first definition of a cluster, for up to 3 seconds or so, to establish an initial connection (i.e. something similar to what we do with NodeConnectionsService).
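The second option amounts to an assertBusy-style retry loop. A self-contained sketch of that pattern in Python — the helper name and timings are mine, not part of the YAML test runner:

```python
import time

def assert_busy(predicate, timeout=10.0, interval=0.1):
    """Retry `predicate` until it returns True or `timeout` elapses.

    Mirrors the assertBusy idiom from the Elasticsearch test framework:
    poll, back off briefly, and fail only after the deadline passes.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return
        if time.monotonic() >= deadline:
            raise AssertionError(f"condition not met within {timeout}s")
        time.sleep(interval)

# Simulate a remote-cluster connection that becomes ready asynchronously,
# some time after the settings update has already been acknowledged.
ready_at = time.monotonic() + 0.3
assert_busy(lambda: time.monotonic() >= ready_at, timeout=2.0)
print("connected")
```

Wrapped around the first CCR request of a doc test, this would absorb the race window instead of failing on the first attempt.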

I see other exceptions about remote clusters in the logs near the bottom:

These are unrelated; the docs define remote clusters in remote-clusters.asciidoc that can't be connected to. As an enhancement, we could (similar to cluster_one) have those point at an actual cluster.

@tlrx
Member

tlrx commented Oct 22, 2019

Add some kind of wait-for / assertBusy logic to the yml tests (which would have been handy in many other situations as well).

I opened #48353 for this but I'm not sure it will be accepted.

romseygeek added a commit that referenced this issue Oct 23, 2019
This test is failing frequently due to #47718
romseygeek added a commit that referenced this issue Oct 23, 2019
This test is failing frequently due to #47718
romseygeek added a commit that referenced this issue Oct 23, 2019
This test is failing frequently due to #47718
@romseygeek
Contributor

get-ccr-stats was failing reasonably often, so I have muted it in master, 7.x and 7.5

@dliappis
Contributor

Another occurrence in https://gradle-enterprise.elastic.co/s/6effjdor5vkzg on the 7.x branch.

This time it's put-auto-follow-pattern.

@Tim-Brooks
Contributor

#47891 should fix this issue on master and 7.x.

Tim-Brooks added a commit to Tim-Brooks/elasticsearch that referenced this issue Oct 24, 2019
This is related to elastic#47718. It introduces a 10-second wait for a
connection to complete when remote cluster settings introduce a new
remote cluster connection.
Tim-Brooks added a commit that referenced this issue Oct 25, 2019
This is related to #47718. It introduces a 10-second wait for a
connection to complete when remote cluster settings introduce a new
remote cluster connection.
Tim-Brooks added a commit that referenced this issue Oct 25, 2019
This is related to #47718. It introduces a 10-second wait for a
connection to complete when remote cluster settings introduce a new
remote cluster connection.
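The fix referenced above blocks the settings update briefly until the new connection completes or a timeout elapses. A rough model of that behavior, where the class name and the timer standing in for the background connect are my own:

```python
import threading

class RemoteConnection:
    """Models a remote cluster connection that completes asynchronously."""

    def __init__(self, connect_delay: float):
        self._connected = threading.Event()
        # The actual connect happens on another thread at the transport
        # layer; a timer stands in for it in this sketch.
        threading.Timer(connect_delay, self._connected.set).start()

    def await_connected(self, timeout: float) -> bool:
        """Block up to `timeout` seconds for the connection to complete,
        as the fix does when settings introduce a new remote cluster."""
        return self._connected.wait(timeout)

# A settings update introduces a new remote cluster; instead of acking
# immediately, wait briefly for the initial connection to be established.
conn = RemoteConnection(connect_delay=0.2)
print(conn.await_connected(timeout=10.0))  # True: connected before timeout
```

This closes the race for the common case while still acking (without a connection) once the timeout is exceeded.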
@ywelsch
Contributor

ywelsch commented Oct 30, 2019

get-ccr-stats was failing reasonably often, so I have muted it in master, 7.x and 7.5

With this now fixed, I've unmuted get-ccr-stats on these branches.

@ywelsch ywelsch closed this as completed Oct 30, 2019