[BUG] org.opensearch.cluster.coordination.CoordinationStateRejectedException upgrading 2.0.1 to 3.0 #3615

Closed
cliu123 opened this issue Jun 16, 2022 · 47 comments
Labels
backwards-compatibility bug Something isn't working distributed framework v3.0.0 Issues and PRs related to version 3.0.0

Comments

@cliu123
Member

cliu123 commented Jun 16, 2022

Describe the bug
Nodes cannot join back to the cluster after upgrading from 2.0.1 to 3.0.0.

To Reproduce
Failing GHA: https://github.com/cliu123/security/runs/6926965970?check_suite_focus=true

Expected behavior
A clear and concise description of what you expected to happen.

Plugins
Please list all plugins currently enabled.

Error logs

 WARN ][o.o.c.NodeConnectionsService] [securityBwcCluster0-2] failed to connect to {securityBwcCluster0-1}{Us2JKk3aSU-lIJVCnUQ02w}{5Wnx4OM1RkSweVKL6lkfdA}{127.0.0.1}{127.0.0.1:43131}{dimr}{testattr=test, shard_indexing_pressure_enabled=true} (tried [1] times)
»  org.opensearch.transport.ConnectTransportException: [securityBwcCluster0-1][127.0.0.1:43131] connect_exception
»  	at org.opensearch.transport.TcpTransport$ChannelsConnectedListener.onFailure(TcpTransport.java:1076) ~[opensearch-2.0.1.jar:2.0.1]
»  	at org.opensearch.action.ActionListener.lambda$toBiConsumer$2(ActionListener.java:215) ~[opensearch-2.0.1.jar:2.0.1]
»  	at org.opensearch.common.concurrent.CompletableContext.lambda$addListener$0(CompletableContext.java:55) ~[opensearch-core-2.0.1.jar:2.0.1]
»  	at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863) ~[?:?]
»  	at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841) ~[?:?]
»  	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510) ~[?:?]
»  	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2162) ~[?:?]
»  	at org.opensearch.common.concurrent.CompletableContext.completeExceptionally(CompletableContext.java:70) ~[opensearch-core-2.0.1.jar:2.0.1]
»  	at org.opensearch.transport.netty4.Netty4TcpChannel.lambda$addListener$0(Netty4TcpChannel.java:81) ~[?:?]
»  	at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:578) ~[?:?]
»  	at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:571) ~[?:?]
»  	at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:550) ~[?:?]
»  	at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:491) ~[?:?]
»  	at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:616) ~[?:?]
»  	at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:609) ~[?:?]
»  	at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117) ~[?:?]
»  	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:321) ~[?:?]
»  	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:337) ~[?:?]
»  	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:710) ~[?:?]
»  	at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:623) ~[?:?]
»  	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:586) ~[?:?]
»  	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496) ~[?:?]
»  	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986) ~[?:?]
»  	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[?:?]
»  	at java.lang.Thread.run(Thread.java:833) [?:?]
»  Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 127.0.0.1/127.0.0.1:43131
»  Caused by: java.net.ConnectException: Connection refused
»  	at sun.nio.ch.Net.pollConnect(Native Method) ~[?:?]
»  	at sun.nio.ch.Net.pollConnectNow(Net.java:672) ~[?:?]
»  	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:946) ~[?:?]
»  	at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:330) ~[?:?]
»  	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334) ~[?:?]
»  	... 7 more
»   ↓ last 40 non error or warning messages from /home/runner/work/security/security/bwc-test/build/testclusters/securityBwcCluster0-2/logs/opensearch.stdout.log ↓
» [2022-06-16T22:10:14,761][INFO ][o.o.n.Node               ] [securityBwcCluster0-2] node name [securityBwcCluster0-2], node ID [jjX5m3WkSLe8eAwhuQhn7A], cluster name [securityBwcCluster0], roles [cluster_manager, remote_cluster_client, data, ingest]
» [2022-06-16T22:10:19,058][INFO ][o.o.t.NettyAllocator     ] [securityBwcCluster0-2] creating NettyAllocator with the following configs: [name=unpooled, suggested_max_allocation_size=256kb, factors={opensearch.unsafe.use_unpooled_allocator=null, g1gc_enabled=true, g1gc_region_size=1mb, heap_size=512mb}]
» [2022-06-16T22:10:19,129][INFO ][o.o.d.DiscoveryModule    ] [securityBwcCluster0-2] using discovery type [zen] and seed hosts providers [settings, file]
» [2022-06-16T22:10:19,577][INFO ][o.o.n.Node               ] [securityBwcCluster0-2] initialized
» [2022-06-16T22:10:19,577][INFO ][o.o.n.Node               ] [securityBwcCluster0-2] starting ...
» [2022-06-16T22:10:19,727][INFO ][o.o.t.TransportService   ] [securityBwcCluster0-2] publish_address {127.0.0.1:42529}, bound_addresses {[::1]:37963}, {127.0.0.1:42529}
» [2022-06-16T22:10:19,932][DEBUG][o.o.c.c.Coordinator      ] [securityBwcCluster0-2] startInitialJoin: coordinator becoming CANDIDATE in term 0 (was null, lastKnownLeader was [Optional.empty])
» [2022-06-16T22:10:19,960][INFO ][o.o.h.AbstractHttpServerTransport] [securityBwcCluster0-2] publish_address {127.0.0.1:46681}, bound_addresses {[::1]:41935}, {127.0.0.1:46681}
» [2022-06-16T22:10:19,962][INFO ][o.o.n.Node               ] [securityBwcCluster0-2] started
» [2022-06-16T22:10:19,962][INFO ][o.o.s.OpenSearchSecurityPlugin] [securityBwcCluster0-2] Node started
» [2022-06-16T22:10:19,968][INFO ][o.o.s.OpenSearchSecurityPlugin] [securityBwcCluster0-2] 0 OpenSearch Security modules loaded so far: []
» [2022-06-16T22:10:20,876][INFO ][o.o.c.c.Coordinator      ] [securityBwcCluster0-2] setting initial configuration to VotingConfiguration{jjX5m3WkSLe8eAwhuQhn7A,Us2JKk3aSU-lIJVCnUQ02w,{bootstrap-placeholder}-securityBwcCluster0-0}
» [2022-06-16T22:10:21,093][DEBUG][o.o.c.c.ElectionSchedulerFactory] [securityBwcCluster0-2] scheduling scheduleNextElection{gracePeriod=0s, thisAttempt=0, maxDelayMillis=100, delayMillis=22, ElectionScheduler{attempt=1, ElectionSchedulerFactory{initialTimeout=100ms, backoffTime=100ms, maxTimeout=10s}}}
» [2022-06-16T22:10:21,103][DEBUG][o.o.c.c.Coordinator      ] [securityBwcCluster0-2] joinLeaderInTerm: for [{securityBwcCluster0-1}{Us2JKk3aSU-lIJVCnUQ02w}{5Wnx4OM1RkSweVKL6lkfdA}{127.0.0.1}{127.0.0.1:43131}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}] with term 1
» [2022-06-16T22:10:21,104][DEBUG][o.o.c.c.CoordinationState] [securityBwcCluster0-2] handleStartJoin: leaving term [0] due to StartJoinRequest{term=1,node={securityBwcCluster0-1}{Us2JKk3aSU-lIJVCnUQ02w}{5Wnx4OM1RkSweVKL6lkfdA}{127.0.0.1}{127.0.0.1:43131}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}}
» [2022-06-16T22:10:21,129][DEBUG][o.o.c.c.ElectionSchedulerFactory] [securityBwcCluster0-2] scheduleNextElection{gracePeriod=0s, thisAttempt=0, maxDelayMillis=100, delayMillis=22, ElectionScheduler{attempt=1, ElectionSchedulerFactory{initialTimeout=100ms, backoffTime=100ms, maxTimeout=10s}}} starting election
» [2022-06-16T22:10:21,131][DEBUG][o.o.c.c.ElectionSchedulerFactory] [securityBwcCluster0-2] scheduling scheduleNextElection{gracePeriod=500ms, thisAttempt=1, maxDelayMillis=200, delayMillis=645, ElectionScheduler{attempt=2, ElectionSchedulerFactory{initialTimeout=100ms, backoffTime=100ms, maxTimeout=10s}}}
» [2022-06-16T22:10:21,134][DEBUG][o.o.c.c.PreVoteCollector ] [securityBwcCluster0-2] PreVotingRound{preVotesReceived={}, electionStarted=false, preVoteRequest=PreVoteRequest{sourceNode={securityBwcCluster0-2}{jjX5m3WkSLe8eAwhuQhn7A}{haqxPg4aSa23pHJFt9_XUA}{127.0.0.1}{127.0.0.1:42529}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}, currentTerm=1}, isClosed=false} requesting pre-votes from [{securityBwcCluster0-2}{jjX5m3WkSLe8eAwhuQhn7A}{haqxPg4aSa23pHJFt9_XUA}{127.0.0.1}{127.0.0.1:42529}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}, {securityBwcCluster0-1}{Us2JKk3aSU-lIJVCnUQ02w}{5Wnx4OM1RkSweVKL6lkfdA}{127.0.0.1}{127.0.0.1:43131}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}, {securityBwcCluster0-0}{MntY5LcxSzCwoWnBav5SUA}{QagsblRtSNihbQhNLBZukQ}{127.0.0.1}{127.0.0.1:46477}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}]
» [2022-06-16T22:10:21,157][DEBUG][o.o.c.c.PreVoteCollector ] [securityBwcCluster0-2] PreVotingRound{preVotesReceived={{securityBwcCluster0-0}{MntY5LcxSzCwoWnBav5SUA}{QagsblRtSNihbQhNLBZukQ}{127.0.0.1}{127.0.0.1:46477}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}=PreVoteResponse{currentTerm=1, lastAcceptedTerm=0, lastAcceptedVersion=0}}, electionStarted=false, preVoteRequest=PreVoteRequest{sourceNode={securityBwcCluster0-2}{jjX5m3WkSLe8eAwhuQhn7A}{haqxPg4aSa23pHJFt9_XUA}{127.0.0.1}{127.0.0.1:42529}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}, currentTerm=1}, isClosed=false} added PreVoteResponse{currentTerm=1, lastAcceptedTerm=0, lastAcceptedVersion=0} from {securityBwcCluster0-0}{MntY5LcxSzCwoWnBav5SUA}{QagsblRtSNihbQhNLBZukQ}{127.0.0.1}{127.0.0.1:46477}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}, no quorum yet
» [2022-06-16T22:10:21,159][DEBUG][o.o.c.c.PreVoteCollector ] [securityBwcCluster0-2] PreVotingRound{preVotesReceived={{securityBwcCluster0-0}{MntY5LcxSzCwoWnBav5SUA}{QagsblRtSNihbQhNLBZukQ}{127.0.0.1}{127.0.0.1:46477}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}=PreVoteResponse{currentTerm=1, lastAcceptedTerm=0, lastAcceptedVersion=0}, {securityBwcCluster0-2}{jjX5m3WkSLe8eAwhuQhn7A}{haqxPg4aSa23pHJFt9_XUA}{127.0.0.1}{127.0.0.1:42529}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}=PreVoteResponse{currentTerm=1, lastAcceptedTerm=0, lastAcceptedVersion=0}}, electionStarted=false, preVoteRequest=PreVoteRequest{sourceNode={securityBwcCluster0-2}{jjX5m3WkSLe8eAwhuQhn7A}{haqxPg4aSa23pHJFt9_XUA}{127.0.0.1}{127.0.0.1:42529}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}, currentTerm=1}, isClosed=false} added PreVoteResponse{currentTerm=1, lastAcceptedTerm=0, lastAcceptedVersion=0} from {securityBwcCluster0-2}{jjX5m3WkSLe8eAwhuQhn7A}{haqxPg4aSa23pHJFt9_XUA}{127.0.0.1}{127.0.0.1:42529}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}, no quorum yet
» [2022-06-16T22:10:21,150][DEBUG][o.o.c.c.PreVoteCollector ] [securityBwcCluster0-2] TransportResponseHandler{PreVoteCollector{state=Tuple [v1=null, v2=PreVoteResponse{currentTerm=1, lastAcceptedTerm=0, lastAcceptedVersion=0}]}, node={securityBwcCluster0-1}{Us2JKk3aSU-lIJVCnUQ02w}{5Wnx4OM1RkSweVKL6lkfdA}{127.0.0.1}{127.0.0.1:43131}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}} failed
»  org.opensearch.transport.RemoteTransportException: [securityBwcCluster0-1][127.0.0.1:43131][internal:cluster/request_pre_vote]
»  Caused by: org.opensearch.cluster.coordination.CoordinationStateRejectedException: rejecting PreVoteRequest{sourceNode={securityBwcCluster0-2}{jjX5m3WkSLe8eAwhuQhn7A}{haqxPg4aSa23pHJFt9_XUA}{127.0.0.1}{127.0.0.1:42529}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}, currentTerm=1} as there is already a leader
»  	at org.opensearch.cluster.coordination.PreVoteCollector.handlePreVoteRequest(PreVoteCollector.java:162) ~[opensearch-2.0.1.jar:2.0.1]
»  	at org.opensearch.cluster.coordination.PreVoteCollector.lambda$new$0(PreVoteCollector.java:100) ~[opensearch-2.0.1.jar:2.0.1]
»  	at org.opensearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:103) ~[opensearch-2.0.1.jar:2.0.1]
»  	at org.opensearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:453) ~[opensearch-2.0.1.jar:2.0.1]
»  	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:798) ~[opensearch-2.0.1.jar:2.0.1]
»  	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[opensearch-2.0.1.jar:2.0.1]
»  	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
»  	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
»  	at java.lang.Thread.run(Thread.java:833) [?:?]
@cliu123 cliu123 added bug Something isn't working untriaged labels Jun 16, 2022
@cliu123 cliu123 changed the title from "[BUG] OpenSearch 3.0 is not backward compatible with OpenSearch 2.0.1" to "[BUG] OpenSearch 3.0 is backward incompatible with OpenSearch 2.0.1" Jun 16, 2022
@dblock dblock changed the title from "[BUG] OpenSearch 3.0 is backward incompatible with OpenSearch 2.0.1" to "[BUG] org.opensearch.cluster.coordination.CoordinationStateRejectedException upgrading 2.0.1 to 3.0" Jun 17, 2022
@dblock
Member

dblock commented Jun 17, 2022

Is this consistently reproducible? Did you try to debug it? The error looks suspicious:

rejecting PreVoteRequest{sourceNode={securityBwcCluster0-2}{jjX5m3WkSLe8eAwhuQhn7A}{haqxPg4aSa23pHJFt9_XUA}{127.0.0.1}{127.0.0.1:42529}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}, currentTerm=1} as there is already a leader

could be related to master/clustermanager/leader renaming?

@saratvemulapalli
Member

@cliu123 did you try running opensearch-min without security plugin?
OpenSearch 3.0.0 is passing bwc tests with 2.0.1.
Ref: https://github.com/opensearch-project/OpenSearch/blob/main/.ci/bwcVersions

@cliu123
Member Author

cliu123 commented Jun 21, 2022

@cliu123 did you try running opensearch-min without security plugin? OpenSearch 3.0.0 is passing bwc tests with 2.0.1. Ref: https://github.com/opensearch-project/OpenSearch/blob/main/.ci/bwcVersions

That's a good point! I haven't tried that as I don't see any security specific errors. But I'll try.

@cliu123
Member Author

cliu123 commented Jun 21, 2022

@saratvemulapalli Are there any passing BWC test runs from 2.0.1 to 3.0.0? I don't see any BWC runs in GitHub actions in this repo. Would you please share a pointer to the passing test run? I'd like to use it to investigate the failures in the security repo.

@saratvemulapalli
Member

Sure, here is our testing process for bwc: https://github.com/opensearch-project/OpenSearch/blob/main/TESTING.md#testing-backwards-compatibility

We don't use GitHub workflows for testing; instead we use Jenkins infra.
Here is one successful run on a PR: #3618 (comment)
Logs: https://ci.opensearch.org/logs/ci/workflow/OpenSearch_CI/PR_Checks/Gradle_Check/gradle_check_6087.log

@cliu123
Member Author

cliu123 commented Jun 21, 2022

@saratvemulapalli Thanks! But I don't see any tests upgrading to 3.0.0 in the reports/logs. The error included in the issue description shows that while the upgraded node tries to rejoin the cluster, something goes wrong during master election (Caused by: org.opensearch.cluster.coordination.CoordinationStateRejectedException: rejecting PreVoteRequest), which suggests something is wrong in the core engine.
Besides, without installing security plugin, I see the same error.

@saratvemulapalli
Member

saratvemulapalli commented Jun 22, 2022

@cliu123 ./gradlew bwcTest should run all the tests.
I did run it locally, 3.0.0 node was able to join a 2.0.1 cluster.
How are your tests set up? Also, looking at your change, there are some changes with inclusive naming.

Sure let me take a stab at this.

@saratvemulapalli
Member

I see the same error

The setup has problems with settings; very likely it didn't disable security.

»  java.lang.IllegalArgumentException: unknown setting [plugins.security.disabled] please check that any required plugins are installed, or check the breaking changes documentation for removed settings
»  	at org.opensearch.common.settings.AbstractScopedSettings.validate(AbstractScopedSettings.java:591)
»  	at org.opensearch.common.settings.AbstractScopedSettings.validate(AbstractScopedSettings.java:532)
»  	at org.opensearch.common.settings.AbstractScopedSettings.validate(AbstractScopedSettings.java:502)
»  	at org.opensearch.common.settings.AbstractScopedSettings.validate(AbstractScopedSettings.java:472)
»  	at org.opensearch.common.settings.SettingsModule.<init>(SettingsModule.java:170)
»  	at org.opensearch.node.Node.<init>(Node.java:479)
»  	at org.opensearch.node.Node.<init>(Node.java:339)
»  	at org.opensearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:242)
»  	at org.opensearch.bootstrap.Bootstrap.setup(Bootstrap.java:242)
»  	at org.opensearch.bootstrap.Bootstrap.init(Bootstrap.java:404)
»  	at org.opensearch.bootstrap.OpenSearch.init(OpenSearch.java:180)
»  	at org.opensearch.bootstrap.OpenSearch.execute(OpenSearch.java:171)
»  	at org.opensearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:104)
»  	at org.opensearch.cli.Command.mainWithoutErrorHandling(Command.java:138)
»  	at org.opensearch.cli.Command.main(Command.java:101)
»  	at org.opensearch.bootstrap.OpenSearch.main(OpenSearch.java:137)
»  	at org.opensearch.bootstrap.OpenSearch.main(OpenSearch.java:103)
»  For complete error details, refer to the log at /home/runner/work/security/security/bwc-test/build/testclusters/securityBwcCluster1-0/logs/securityBwcCluster1.log

@cliu123
Member Author

cliu123 commented Jun 23, 2022

@saratvemulapalli Would you please check or provide the logs for the upgrade from 2.0.1 to 3.0.0 without the security plugin? I don't see the related logs in https://ci.opensearch.org/logs/ci/workflow/OpenSearch_CI/PR_Checks/Gradle_Check/gradle_check_6087.log. Did I miss anything?

@saratvemulapalli
Member

@Rishikesh1159 could you take a look at this.

@Rishikesh1159
Member

@Rishikesh1159 could you take a look at this.

Sure

@andrross
Member

andrross commented Jul 8, 2022

@saratvemulapalli Would you please check or provide the logs for the upgrade from 2.0.1 to 3.0.0 without the security plugin? I don't see the related logs in https://ci.opensearch.org/logs/ci/workflow/OpenSearch_CI/PR_Checks/Gradle_Check/gradle_check_6087.log. Did I miss anything?

@cliu123 You can see the backward compatibility tests being run as a part of that gradle check:

$ grep 'bwcTest' ~/Downloads/gradle_check_6087.log
> Task :qa:verify-version-constants:v2.1.0#bwcTest
> Task :qa:verify-version-constants:v2.0.2#bwcTest
> Task :qa:verify-version-constants:bwcTestSnapshots
> Task :qa:repository-multi-version:v2.0.2#bwcTest
> Task :qa:repository-multi-version:v2.1.0#bwcTest
> Task :qa:repository-multi-version:bwcTestSnapshots
> Task :qa:rolling-upgrade:v2.1.0#bwcTest
> Task :qa:rolling-upgrade:bwcTestSnapshots
> Task :qa:full-cluster-restart:v2.0.2#bwcTest
> Task :qa:full-cluster-restart:v2.1.0#bwcTest
> Task :qa:full-cluster-restart:bwcTestSnapshots
> Task :qa:mixed-cluster:v2.1.0#bwcTest
> Task :qa:mixed-cluster:bwcTestSnapshots

Are there specific logs that you're looking for? I don't think the test output logs a whole lot when the tasks succeed.
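
The task list above can be turned into a per-suite summary of which versions the gradle check exercised. A minimal sketch, assuming the log uses the `> Task :qa:<suite>:v<version>#bwcTest` shape shown in the grep output; the function name `bwc_versions_by_suite` and the regular expression are illustrative and not part of the OpenSearch build:

```python
import re

# Matches Gradle task lines such as "> Task :qa:rolling-upgrade:v2.1.0#bwcTest".
# Aggregate tasks like ":qa:rolling-upgrade:bwcTestSnapshots" carry no version
# and are intentionally not matched.
BWC_TASK = re.compile(r"^> Task :qa:([\w-]+):v(\d+\.\d+\.\d+)#bwcTest$")


def bwc_versions_by_suite(log_lines):
    """Group the versions named in :qa:<suite>:v<version>#bwcTest task lines."""
    suites = {}
    for line in log_lines:
        match = BWC_TASK.match(line.strip())
        if match:
            suite, version = match.groups()
            suites.setdefault(suite, []).append(version)
    return suites


# A subset of the grep output quoted in this comment.
SAMPLE_LOG = [
    "> Task :qa:verify-version-constants:v2.1.0#bwcTest",
    "> Task :qa:verify-version-constants:bwcTestSnapshots",
    "> Task :qa:rolling-upgrade:v2.1.0#bwcTest",
    "> Task :qa:full-cluster-restart:v2.0.2#bwcTest",
    "> Task :qa:full-cluster-restart:v2.1.0#bwcTest",
    "> Task :qa:mixed-cluster:v2.1.0#bwcTest",
]

for suite, versions in bwc_versions_by_suite(SAMPLE_LOG).items():
    print(suite, sorted(versions))
```

In this sample the only versions named are 2.0.2 and 2.1.0; none of the tasks targets 2.0.1.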

@ankitkala
Member

Can someone please help with this issue? It's blocking the security plugin version bump to 3.0.
We want to build cross-cluster-replication for 3.0, and since we have a dependency on the security plugin, we're blocked as well.

@cliu123
Member Author

cliu123 commented Jul 20, 2022

The BWC test failure still persists: https://github.com/cliu123/security/runs/7434973804?check_suite_focus=true.
Looks like node failed to join back to the cluster after upgrading. @CEHENKLE Could anyone take a look?

@cliu123
Member Author

cliu123 commented Jul 21, 2022

@amitgalitz also got BWC test failures when upgrading to 3.0.0 with job-scheduler plugin installed but without security plugin installed: https://github.com/opensearch-project/job-scheduler/runs/7415033030?check_suite_focus=true.

@naveentatikonda
Member

The k-NN plugin also has the same issue with BWC tests when we try to upgrade from 2.1.0 to 3.0.0-SNAPSHOT. Interestingly, the Restart Upgrade BWC tests are working while the Rolling Upgrade BWC tests are failing.
The link to GitHub Action - https://github.com/naveentatikonda/k-NN/runs/7508740179?check_suite_focus=true

@naveentatikonda
Member

@amitgalitz also got BWC test failures when upgrading to 3.0.0 with job-scheduler plugin installed but without security plugin installed: https://github.com/opensearch-project/job-scheduler/runs/7415033030?check_suite_focus=true.

@cliu123 @amitgalitz In the above logs it says that you are trying to upgrade from 7.10.2 to 3.0.0. I think we cannot upgrade directly from 7.x to 3.0.0. Could you pls try to upgrade from 2.x to 3.0.0-SNAPSHOT?

java.lang.IllegalStateException: cannot upgrade a node from version [7.10.2] directly to version [3.0.0]
»  	at org.opensearch.env.NodeMetadata.upgradeToCurrentVersion(NodeMetadata.java:101)
»  	at org.opensearch.env.NodeEnvironment.loadNodeMetadata(NodeEnvironment.java:476)
»  	at org.opensearch.env.NodeEnvironment.<init>(NodeEnvironment.java:369)
»  	at org.opensearch.node.Node.<init>(Node.java:433)
»  	at org.opensearch.node.Node.<init>(Node.java:342)
»  	at org.opensearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:242)
»  	at org.opensearch.bootstrap.Bootstrap.setup(Bootstrap.java:242)
»  	at org.opensearch.bootstrap.Bootstrap.init(Bootstrap.java:404)
»  	at org.opensearch.bootstrap.OpenSearch.init(OpenSearch.java:180)
»  	at org.opensearch.bootstrap.OpenSearch.execute(OpenSearch.java:171)
»  	at org.opensearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:104)
»  	at org.opensearch.cli.Command.mainWithoutErrorHandling(Command.java:138)
»  	at org.opensearch.cli.Command.main(Command.java:101)
»  	at org.opensearch.bootstrap.OpenSearch.main(OpenSearch.java:137)
»  	at org.opensearch.bootstrap.OpenSearch.main(OpenSearch.java:103) 

@sbcd90
Contributor

sbcd90 commented Jul 28, 2022

Hi @naveentatikonda, yes, for job-scheduler it's the same issue. I tried upgrading from 2.x to 3.0.0 and hit the issue I linked.
Today I'll test again to see if the issue is fixed.

@dblock
Member

dblock commented Sep 1, 2022

@adnapibar Any progress here? Need help?

@adnapibar
Contributor

adnapibar commented Sep 2, 2022

@adnapibar Any progress here? Need help?

@dblock Unfortunately no, also got busy with another issue. If I can't find anything by today, will need more eyes on it.

@adnapibar
Contributor

adnapibar commented Sep 7, 2022

It looks like the BWC tests in job scheduler plugin started failing with this commit - 2d716ad

@dblock
Member

dblock commented Sep 7, 2022

@nknize

@dblock
Member

dblock commented Sep 9, 2022

We haven't made progress here since this was reported in June and it's blocking having a complete distribution build for 3.0. @nknize lmk if you don't have time to look into it and I can dig in.

@cliu123
Member Author

cliu123 commented Sep 9, 2022

@cliu123 The BWC tests for OpenSearch seem to be having no issues. From the logs, it's not clear what is going on? k-NN fails on the rolling upgrade tests :qa:rolling-upgrade:knnBwcCluster-rolling while security plugin fails on the mixed cluster tests, :securityBwcCluster#mixedClusterTask with different errors. The underlying cause may be the same, but it's not obvious.

I tried to reproduce the issue locally from the branch https://github.com/cliu123/security/tree/bump_version_to_3.0.0.0 but the build fails. Can you rebase the branch so we can build and reproduce the errors locally.

I quickly tried building. These are compilation errors caused by new breaking changes in OpenSearch core 3.0. They are renaming changes, class signature changes etc.
@adnapibar kindly offered help with this as he has much more context on those breaking changes in OpenSearch core. Thank you so much @adnapibar !

@adnapibar
Contributor

adnapibar commented Sep 13, 2022

It looks like the BWC tests in job scheduler plugin started failing with this commit - 2d716ad

For job scheduler the issue seems to be the version used for bwc - https://github.com/opensearch-project/job-scheduler/blob/main/sample-extension-plugin/build.gradle#L142 - I tried changing this to 2.2.1 but I'm getting a different issue:

 Exception in thread "main" java.lang.IllegalArgumentException: property [opensearch.version] is missing for plugin [opendistro-job-scheduler]

I think because it's downloading the job-scheduler plugin artifact from https://github.com/opendistro-for-elasticsearch/job-scheduler/releases/download/v1.13.0.0/job-scheduler-artifacts.zip

@dblock
Member

dblock commented Sep 14, 2022

Two 2.2.0 nodes.

wget https://ci.opensearch.org/ci/dbc/distribution-build-opensearch/2.4.0/latest/linux/x64/tar/builds/opensearch/dist/opensearch-min-2.4.0-linux-x64.tar.gz
tar vfxz opensearch-min-2.4.0-linux-x64.tar.gz
cp -R opensearch-2.4.0 node1
cp -R opensearch-2.4.0 node2

node1/config/opensearch.yml

cluster.name: dblock
node.name: node-1
http.port: 9200
cluster.initial_master_nodes: ["node-1"]

node2/config/opensearch.yml

cluster.name: dblock
node.name: node-2
http.port: 9201
cluster.initial_master_nodes: ["node-1"]

curl http://localhost:9200/_cat/nodes
127.0.0.1  9 15 0 0.45 0.28 0.19 dimr cluster_manager,data,ingest,remote_cluster_client - node-2
127.0.0.1 11 15 0 0.45 0.28 0.19 dimr cluster_manager,data,ingest,remote_cluster_client * node-1
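
The `_cat/nodes` output above marks the elected cluster manager with `*`. A small sketch for pulling that out of the text, assuming the default column layout shown here (roles, marker, and node name as the last three columns) and node names without spaces; `parse_cat_nodes` is a hypothetical helper, not an OpenSearch API:

```python
def parse_cat_nodes(text):
    """Parse default `_cat/nodes` text output into a list of dicts."""
    nodes = []
    for line in text.strip().splitlines():
        fields = line.split()
        nodes.append(
            {
                "ip": fields[0],
                # Full role names, e.g. "cluster_manager,data,ingest,...".
                "roles": fields[-3].split(","),
                # "*" marks the elected cluster manager, "-" everything else.
                "elected": fields[-2] == "*",
                "name": fields[-1],
            }
        )
    return nodes


# The two lines returned by the curl call above.
CAT_NODES = """
127.0.0.1  9 15 0 0.45 0.28 0.19 dimr cluster_manager,data,ingest,remote_cluster_client - node-2
127.0.0.1 11 15 0 0.45 0.28 0.19 dimr cluster_manager,data,ingest,remote_cluster_client * node-1
"""

for node in parse_cat_nodes(CAT_NODES):
    print(node["name"], "elected" if node["elected"] else "follower")
```

Seeing both nodes listed with exactly one `*` is the quick sign that the second node joined instead of forming its own cluster.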

I tried with 2.2.0 + 3.0.0, failed with

java.lang.IllegalStateException: Received message from unsupported version: [2.2.0] minimal compatible version is: [2.4.0]
        at org.opensearch.transport.InboundDecoder.ensureVersionCompatibility(InboundDecoder.java:231) ~[opensearch-3.0.0.jar:3.0.0]
        at org.opensearch.transport.InboundDecoder.readHeader(InboundDecoder.java:197) ~[opensearch-3.0.0.jar:3.0.0]
        at org.opensearch.transport.InboundDecoder.internalDecode(InboundDecoder.java:95) ~[opensearch-3.0.0.jar:3.0.0]
        at org.opensearch.transport.InboundDecoder.decode(InboundDecoder.java:73) ~[opensearch-3.0.0.jar:3.0.0]
        at org.opensearch.transport.InboundPipeline.doHandleBytes(InboundPipeline.java:134) ~[opensearch-3.0.0.jar:3.0.0]
        at org.opensearch.transport.InboundPipeline.handleBytes(InboundPipeline.java:115) ~[opensearch-3.0.0.jar:3.0.0]

I tried with latest 2.4.0 (https://ci.opensearch.org/ci/dbc/distribution-build-opensearch/2.4.0/latest/linux/x64/tar/builds/opensearch/dist/opensearch-min-2.4.0-linux-x64.tar.gz) and 3.0.0 (https://ci.opensearch.org/ci/dbc/distribution-build-opensearch/3.0.0/latest/linux/x64/tar/builds/opensearch/dist/opensearch-min-3.0.0-linux-x64.tar.gz) and that worked.

/tmp$ curl http://localhost:9200/
{
  "name" : "node-1",
  "cluster_name" : "dblock",
  "cluster_uuid" : "7PBHy2tSQwKsyJVxvD6t3A",
  "version" : {
    "distribution" : "opensearch",
    "number" : "2.4.0",
    "build_type" : "tar",
    "build_hash" : "3080dfd11da697d783960630432c6cb31d3f758d",
    "build_date" : "2022-09-14T01:39:07.814439285Z",
    "build_snapshot" : false,
    "lucene_version" : "9.3.0",
    "minimum_wire_compatibility_version" : "7.10.0",
    "minimum_index_compatibility_version" : "7.0.0"
  },
  "tagline" : "The OpenSearch Project: https://opensearch.org/"
}
/tmp$ curl http://localhost:9200/_cat/nodes
127.0.0.1 10 17 0 0.35 0.31 0.22 dimr cluster_manager,data,ingest,remote_cluster_client * node-1
127.0.0.1  9 17 0 0.35 0.31 0.22 dimr cluster_manager,data,ingest,remote_cluster_client - node-2
/tmp$ curl http://localhost:9201/
{
  "name" : "node-2",
  "cluster_name" : "dblock",
  "cluster_uuid" : "7PBHy2tSQwKsyJVxvD6t3A",
  "version" : {
    "distribution" : "opensearch",
    "number" : "3.0.0",
    "build_type" : "tar",
    "build_hash" : "51a529fc52ddcc79e84f44dc9b610043b0a6c495",
    "build_date" : "2022-09-14T01:55:39.367159293Z",
    "build_snapshot" : false,
    "lucene_version" : "9.4.0",
    "minimum_wire_compatibility_version" : "2.4.0",
    "minimum_index_compatibility_version" : "2.0.0"
  },
  "tagline" : "The OpenSearch Project: https://opensearch.org/"
}
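
The two responses above carry enough information to see why 2.4.0 + 3.0.0 works while 2.2.0 + 3.0.0 is rejected: the 3.0.0 node reports a `minimum_wire_compatibility_version` of 2.4.0. A rough sketch of that comparison, using trimmed copies of the JSON above. The helper names are hypothetical, and only the newer node's floor is checked, because the 2.4.0 node reports a legacy-style floor (7.10.0) that a plain numeric comparison cannot handle:

```python
import json

# Trimmed copies of the root-endpoint responses shown above.
NODE_1 = json.loads(
    '{"name": "node-1", "version": {"number": "2.4.0",'
    ' "minimum_wire_compatibility_version": "7.10.0"}}'
)
NODE_2 = json.loads(
    '{"name": "node-2", "version": {"number": "3.0.0",'
    ' "minimum_wire_compatibility_version": "2.4.0"}}'
)


def parse_version(text):
    """Turn '2.4.0' or '3.0.0-SNAPSHOT' into a comparable tuple of ints."""
    return tuple(int(part) for part in text.split("-")[0].split("."))


def meets_wire_floor(newer_node, older_number):
    """Check an older version against the floor the newer node advertises."""
    floor = parse_version(newer_node["version"]["minimum_wire_compatibility_version"])
    return parse_version(older_number) >= floor


# 2.4.0 meets the floor advertised by the 3.0.0 node; 2.2.0 does not.
print(meets_wire_floor(NODE_2, NODE_1["version"]["number"]))
print(meets_wire_floor(NODE_2, "2.2.0"))
```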

@dblock
Member

dblock commented Sep 14, 2022

So for bwc, do we (A) need to be testing against the latest build of 2.4.0 (I don't think this exists yet), and keep moving that number up every time, or (B) allow 3.0 to join any 2.x cluster?

@dblock
Member

dblock commented Sep 14, 2022

@adnapibar The other issue you're running into is this one. See opensearch-project/job-scheduler#242

@adnapibar
Contributor

adnapibar commented Sep 15, 2022

@adnapibar The other issue you're running into is this one. See opensearch-project/job-scheduler#242

Thanks @dblock!

@nknize
Collaborator

nknize commented Sep 15, 2022

I tried with 2.2.0 + 3.0.0, failed with: java.lang.IllegalStateException: Received message from unsupported version: [2.2.0] minimal compatible version is: [2.4.0]

This is expected behavior. The next major version is only API compat w/ the last minor of the previous major. In this case it's 2.4 because the 2.3 branch was cut, so it tests against the 2.x branch (which is a 2.4 staged). That's why you'll only find gradle task :qa:mixed-cluster [v2.4.0#bwcTest] and qa:rolling-upgrade [v2.4.0#bwcTest] and nothing < 2.4.x

> Task :distribution:bwc:minor:checkoutBwcBranch
Performing checkout of opensearch-project/2.x...
Checkout hash for :distribution:bwc:minor is 3612b24729b07232414471d2629a59b616232a4f

> Task :server:compileJava

> Task :distribution:bwc:minor:buildBwcLinuxTar
> Task :buildSrc:reaper:compileJava UP-TO-DATE
 [2.4.0] > Task :buildSrc:reaper:processResources NO-SOURCE
 [2.4.0] > Task :buildSrc:reaper:classes UP-TO-DATE
 [2.4.0] > Task :buildSrc:reaper:jar UP-TO-DATE
 [2.4.0] > Task :buildSrc:reaper:assemble UP-TO-DATE
.
.
.
> Task :distribution:bwc:minor:buildBwcLinuxTar
 [2.4.0] > Task :distribution:archives:buildLinuxTar
 [2.4.0] > Task :distribution:archives:linux-tar:assemble
 [2.4.0] 
 [2.4.0] BUILD SUCCESSFUL in 42s
 [2.4.0] 169 actionable tasks: 14 executed, 155 up-to-date

> Task :qa:mixed-cluster:v2.4.0#mixedClusterTest
Test cluster endpoints are: [::1]:36577,127.0.0.1:43469,[::1]:32769,127.0.0.1:37333,[::1]:43355,127.0.0.1:38047,[::1]:46871,127.0.0.1:46001
Upgrading one node to create a mixed cluster
Upgrade complete, endpoints are: [::1]:45123,127.0.0.1:44939,[::1]:32769,127.0.0.1:37333,[::1]:43355,127.0.0.1:38047,[::1]:46871,127.0.0.1:46001
Upgrading another node to create a mixed cluster
Upgrading complete, endpoints are: [::1]:45123,127.0.0.1:44939,[::1]:35035,127.0.0.1:44375,[::1]:43355,127.0.0.1:38047,[::1]:46871,127.0.0.1:46001

So for bwc, do we (A) need to be testing against the latest build of 2.4.0 (I don't think this exists yet), and keep moving that number up every time, or (B) allow 3.0 to join any 2.x cluster?

BWC testing is already configured to test against the appropriate branches. There is no API / wire compatibility between 3.0.0 and anything less than 2.4.0. If a BWC test in a 3.0.0 version bumped plugin is trying to test against <= 2.3.x cluster, then expect InboundDecoder to fail w/ a mincompat error! This is why users have to rolling upgrade to the last minor of the previous major release before rolling upgrading to the next major version. It's the cost for no downtime...
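
The rule described above (a new major is wire compatible only with the last minor of the previous major) can be written down as a small check for picking a plugin's BWC source version. This is a sketch under stated assumptions, not the logic in `InboundDecoder`: the list of known 2.x versions is hypothetical, version strings are plain `major.minor.patch`, and legacy 7.x numbering is not modeled.

```python
def parse_version(text):
    """Turn '2.4.0' into (2, 4, 0)."""
    return tuple(int(part) for part in text.split("."))


def min_wire_compat(target, known_versions):
    """First patch of the highest known minor in the major before `target`."""
    target_major = parse_version(target)[0]
    previous = [
        parse_version(v)
        for v in known_versions
        if parse_version(v)[0] == target_major - 1
    ]
    if not previous:
        raise ValueError(f"no known versions for major {target_major - 1}")
    return (target_major - 1, max(v[1] for v in previous), 0)


def can_rolling_upgrade(source, target, known_versions):
    """True if `source` is at or above the wire-compat floor of `target`."""
    return parse_version(source) >= min_wire_compat(target, known_versions)


# Hypothetical list of 2.x versions; 2.4.0 stands in for the staged 2.x branch.
KNOWN_2X = ["2.0.1", "2.1.0", "2.2.0", "2.3.0", "2.4.0"]

for source in ("2.0.1", "2.2.0", "2.4.0"):
    print(source, can_rolling_upgrade(source, "3.0.0", KNOWN_2X))
```

Under these assumptions, a plugin BWC job that targets 3.0.0 would need to start from 2.4.0 rather than 2.0.1 or 2.2.0, and the floor moves whenever a new 2.x minor is staged.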

@dblock
Member

dblock commented Sep 15, 2022

If what you say is true @nknize then we can close this by design. However:

  1. 2.4.0 hasn't shipped yet. So right now a user that wants to try upgrading to 3.0.0-SNAPSHOT to try it, would have to upgrade to 2.4.0-SNAPSHOT first, then to 3.0.0-SNAPSHOT, which seems ... odd.
  2. 2.4.0 may never ship and be abandoned in favor of 3.0. In which case we would have been testing bwc against the wrong version all along.

@nknize
Collaborator

nknize commented Sep 15, 2022

  • 2.4.0 hasn't shipped yet. So right now a user that wants to try upgrading to 3.0.0-SNAPSHOT to try it, would have to upgrade to 2.4.0-SNAPSHOT first, then to 3.0.0-SNAPSHOT, which seems ... odd.

For rolling upgrade scenarios that's correct. But if a user is upgrading to an unstable snapshot build to test something out then they'd probably be just as fine using full cluster restart process to upgrade. The rolling upgrade process is really for those that don't want downtime, which usually means they're upgrading in production.

2. 2.4.0 may never ship and be abandoned in favor of 3.0. In which case we would have been testing bwc against the wrong version all along.

This isn't a problem because bwc testing is transitive (e.g., 3.0 tests against 2.x, which tests against 2.x-1), and our backport process requires that changes go main -> 2.x -> 2.{minor versions}.{next bugfix} regardless of whether we release 2.x or not.

@dblock
Member

dblock commented Sep 15, 2022

Thanks, going to close this by design.

@dblock dblock closed this as completed Sep 15, 2022
@dblock
Member

dblock commented Sep 20, 2022

The job scheduler bwc problem was resolved in opensearch-project/job-scheduler#242; it had nothing to do with this.

@peternied
Member

For the folks involved in this issue: while resolving this as by design makes sense, it still leaves our plugin stuck without a path to run CI with BWC tests enabled as we look to migrate to OpenSearch v3.0. I've filed opensearch-project/opensearch-plugins#167 to track this question and what our recommendation is. I am going to disable these tests as they are a roadblock to progress, but I feel like this is going to become a surprise during our project lifecycle.
