
NoSuchRemoteClusterException in docs checks #47718

Closed
DaveCTurner opened this issue Oct 8, 2019 · 9 comments
Labels
:Distributed Indexing/CCR Issues around the Cross Cluster State Replication features >test-failure Triaged test failures from CI

Comments

@DaveCTurner
Contributor

This PR build failed:

https://gradle-enterprise.elastic.co/s/qt3vwzsysfowc/console-log?task=:docs:integTestRunner

org.elasticsearch.smoketest.DocsClientYamlTestSuiteIT > test {yaml=reference/ccr/apis/auto-follow/put-auto-follow-pattern/line_88} FAILED
java.lang.AssertionError: Failure at [reference/ccr/apis/auto-follow/put-auto-follow-pattern:74]: expected [2xx] status code but api [raw[method=PUT path=_ccr/auto_follow/my_auto_follow_pattern]] returned [400 Bad Request] [{"error":{"root_cause":[{"type":"no_such_remote_cluster_exception","reason":"no such remote cluster: [remote_cluster]","stack_trace":"org.elasticsearch.transport.NoSuchRemoteClusterException: no such remote cluster: [remote_cluster]...

I see other exceptions about remote clusters in the logs near the bottom:

»    ↓ errors and warnings from /dev/shm/elastic+elasticsearch+pull-request+docs-check/docs/build/testclusters/integTest-0/logs/es.stdout.log ↓
» WARN ][o.e.d.FileBasedSeedHostsProvider] [node-0] expected, but did not find, a dynamic hosts list at [/dev/shm/elastic+elasticsearch+pull-request+docs-check/docs/build/testclusters/integTest-0/config/unicast_hosts.txt]
» WARN ][o.e.t.RemoteConnectionStrategy] [node-0] fetching nodes from external cluster [cluster_two] failed
»  org.elasticsearch.transport.ConnectTransportException: [][127.0.0.1:9301] connect_exception
»  	at org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onFailure(TcpTransport.java:989) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
»  	at org.elasticsearch.action.ActionListener.lambda$toBiConsumer$3(ActionListener.java:162) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
»  	at org.elasticsearch.common.concurrent.CompletableContext.lambda$addListener$0(CompletableContext.java:42) ~[elasticsearch-core-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
»  	at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859) ~[?:?]
»  	at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837) ~[?:?]
»  	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) ~[?:?]
»  	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088) ~[?:?]
»  	at org.elasticsearch.common.concurrent.CompletableContext.completeExceptionally(CompletableContext.java:57) ~[elasticsearch-core-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
»  	at org.elasticsearch.transport.netty4.Netty4TcpChannel.lambda$addListener$0(Netty4TcpChannel.java:68) ~[?:?]
»  	at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:500) ~[?:?]
»  	at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:493) ~[?:?]
»  	at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:472) ~[?:?]
»  	at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:413) ~[?:?]
»  	at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:538) ~[?:?]
»  	at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:531) ~[?:?]
»  	at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:111) ~[?:?]
»  	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:323) ~[?:?]
»  	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:339) ~[?:?]
»  	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:685) ~[?:?]
»  	at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:597) ~[?:?]
»  	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:551) ~[?:?]
»  	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:511) ~[?:?]
»  	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:918) ~[?:?]
»  	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[?:?]
»  	at java.lang.Thread.run(Thread.java:834) [?:?]
»  Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /127.0.0.1:9301
»  Caused by: java.net.ConnectException: Connection refused
»  	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:?]
»  	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:779) ~[?:?]
»  	at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:327) ~[?:?]
»  	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:336) ~[?:?]
»  	... 7 more
»   ↑ repeated 2 times ↑

Build stats shows a handful of failures like this daily across PRs, 7.x and master since 2019-09-27. I suspect this is a build issue and not something CCR-specific, so I'm labelling it accordingly.

@DaveCTurner DaveCTurner added :Delivery/Build Build or test infrastructure >test-failure Triaged test failures from CI labels Oct 8, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-core-infra (:Core/Infra/Build)

@rjernst
Member

rjernst commented Oct 10, 2019

I don't see anything obvious in the changelog around that time. This seems like something specific to the CCR documentation testing setup, not a build infrastructure issue.

@rjernst rjernst added :Distributed Indexing/CCR Issues around the Cross Cluster State Replication features and removed :Delivery/Build Build or test infrastructure labels Oct 10, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/CCR)

@ywelsch
Contributor

ywelsch commented Oct 11, 2019

This is a race in the doc tests, unfortunately one that we can't handle with the current YAML test infrastructure. The problem is that defining a remote cluster is currently done through the settings infrastructure, and there's no guarantee that when

PUT _cluster/settings
{
  "persistent": {
    "cluster": {
      "remote": {
        "cluster_one": {

returns, the nodes have actually connected to the remote cluster.

This means that subsequent requests to the cluster can fail if a connection has not been established yet by the time a request comes in.
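One way to observe this race from a client is to check the `connected` flag that `GET _remote/info` reports per remote cluster alias. A minimal sketch of that check follows; the sample payload mirrors the documented response shape, but the values are illustrative and not taken from this build:

```python
# Check whether a remote cluster alias reports as connected in a
# GET _remote/info response. The payload shape follows the documented
# response format; the values below are illustrative only.

def is_remote_connected(remote_info: dict, alias: str) -> bool:
    """Return True if the given remote cluster alias is connected."""
    info = remote_info.get(alias)
    return bool(info) and info.get("connected", False)

# Illustrative response: the settings update has been acked, but the
# node has not yet established a connection to cluster_one.
sample = {
    "cluster_one": {
        "seeds": ["127.0.0.1:9300"],
        "connected": False,
        "num_nodes_connected": 0,
    }
}

print(is_remote_connected(sample, "cluster_one"))    # False: race window
print(is_remote_connected(sample, "remote_cluster"))  # False: alias not defined
```

A test that polls this flag before issuing CCR requests would not hit the 400 above.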

In an ideal world we would have a proper API for defining remote clusters, similar to what we have for defining repositories. We could then, as with repositories, make sure that the remote clusters are correctly connected before acking to the user. Lacking this, I see only these options here:

  • Add random activity (akin to a sleep) to the YML test before having it use the connection.
  • Add some kind of wait-for / assertBusy logic to the yml tests (which would have been handy in many other situations as well).
  • Adapt the connection behavior for remote clusters to briefly block the cluster state update thread on the first definition of a cluster, for up to 3 seconds or so, to establish an initial connection (i.e. something similar to what we do with NodeConnectionsService).
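The second option amounts to an assertBusy-style retry loop. A self-contained sketch of that pattern in Python — the helper name and timings are mine, not part of the YAML test runner:

```python
import time

def assert_busy(predicate, timeout=10.0, interval=0.1):
    """Retry `predicate` until it returns True or `timeout` elapses.

    Mirrors the assertBusy idiom from the Elasticsearch test framework:
    poll, back off briefly, and fail only after the deadline passes.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return
        if time.monotonic() >= deadline:
            raise AssertionError(f"condition not met within {timeout}s")
        time.sleep(interval)

# Simulate a remote-cluster connection that becomes ready asynchronously,
# some time after the settings update has already been acknowledged.
ready_at = time.monotonic() + 0.3
assert_busy(lambda: time.monotonic() >= ready_at, timeout=2.0)
print("connected")
```

Wrapped around the first CCR request of a doc test, this would absorb the race window instead of failing on the first attempt.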

I see other exceptions about remote clusters in the logs near the bottom:

These are unrelated; the docs define remote clusters in remote-clusters.asciidoc that can't be connected to. As an enhancement, we could (similar to cluster_one) have those point at an actual cluster.

@tlrx
Member

tlrx commented Oct 22, 2019

Add some kind of wait-for / assertBusy logic to the yml tests (which would have been handy in many other situations as well).

I opened #48353 for this but I'm not sure it will be accepted.

romseygeek added a commit that referenced this issue Oct 23, 2019
This test is failing frequently due to #47718
romseygeek added a commit that referenced this issue Oct 23, 2019
This test is failing frequently due to #47718
romseygeek added a commit that referenced this issue Oct 23, 2019
This test is failing frequently due to #47718
@romseygeek
Contributor

get-ccr-stats was failing reasonably often, so I have muted it in master, 7.x and 7.5

@dliappis
Contributor

Another occurrence in https://gradle-enterprise.elastic.co/s/6effjdor5vkzg on the 7.x branch.

This time it's put-auto-follow-pattern.

@Tim-Brooks
Contributor

#47891 should fix this issue on master and 7.x.

Tim-Brooks added a commit to Tim-Brooks/elasticsearch that referenced this issue Oct 24, 2019
This is related to elastic#47718. It introduces a 10-second wait for a
connection to complete when remote cluster settings introduce a new
remote cluster connection.
Tim-Brooks added a commit that referenced this issue Oct 25, 2019
This is related to #47718. It introduces a 10-second wait for a
connection to complete when remote cluster settings introduce a new
remote cluster connection.
Tim-Brooks added a commit that referenced this issue Oct 25, 2019
This is related to #47718. It introduces a 10-second wait for a
connection to complete when remote cluster settings introduce a new
remote cluster connection.
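The fix referenced above blocks the settings update briefly until the new connection completes or a timeout elapses. A rough model of that behavior, where the class name and the timer standing in for the background connect are my own:

```python
import threading

class RemoteConnection:
    """Models a remote cluster connection that completes asynchronously."""

    def __init__(self, connect_delay: float):
        self._connected = threading.Event()
        # The actual connect happens on another thread at the transport
        # layer; a timer stands in for it in this sketch.
        threading.Timer(connect_delay, self._connected.set).start()

    def await_connected(self, timeout: float) -> bool:
        """Block up to `timeout` seconds for the connection to complete,
        as the fix does when settings introduce a new remote cluster."""
        return self._connected.wait(timeout)

# A settings update introduces a new remote cluster; instead of acking
# immediately, wait briefly for the initial connection to be established.
conn = RemoteConnection(connect_delay=0.2)
print(conn.await_connected(timeout=10.0))  # True: connected before timeout
```

This closes the race for the common case while still acking (without a connection) once the timeout is exceeded.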
@ywelsch
Contributor

ywelsch commented Oct 30, 2019

get-ccr-stats was failing reasonably often, so I have muted it in master, 7.x and 7.5

With this now fixed, I've unmuted get-ccr-stats on these branches.

@ywelsch ywelsch closed this as completed Oct 30, 2019