[BUG] org.opensearch.cluster.coordination.CoordinationStateRejectedException upgrading 2.0.1 to 3.0 #3615
Comments
Is this consistently reproducible? Did you try to debug it? The error looks suspicious:
could be related to master/clustermanager/leader renaming? |
@cliu123 did you try running opensearch-min without security plugin? |
That's a good point! I haven't tried that as I don't see any security specific errors. But I'll try. |
@saratvemulapalli Are there any passing BWC test runs from 2.0.1 to 3.0.0? I don't see any BWC runs in GitHub actions in this repo. Would you please share a pointer to the passing test run? I'd like to use it to investigate the failures in the security repo. |
Sure, here is our testing process for BWC: https://github.com/opensearch-project/OpenSearch/blob/main/TESTING.md#testing-backwards-compatibility We don't use GitHub workflows for testing; instead we use Jenkins infra. |
@saratvemulapalli Thanks! But I don't see any tests upgrading to 3.0.0 in the reports/logs. The error included in the issue description shows that when the upgraded node tries to rejoin the cluster, problems occur during master election( |
@cliu123 Sure let me take a stab at this. |
The setup has problems with settings; very likely it didn't disable security.
|
@saratvemulapalli Would you please check or provide the logs for the upgrade from 2.0.1 to 3.0.0 without the security plugin? I don't see the related logs in https://ci.opensearch.org/logs/ci/workflow/OpenSearch_CI/PR_Checks/Gradle_Check/gradle_check_6087.log. Did I miss anything? |
@Rishikesh1159 could you take a look at this. |
Sure |
@cliu123 You can see the backward compatibility tests being run as part of that gradle check:
Are there specific logs that you're looking for? I don't think the test output logs a whole lot when the tasks succeed. |
Can someone please help with this issue? It's blocking the security plugin version bump to |
The BWC test failure still persists: https://github.com/cliu123/security/runs/7434973804?check_suite_focus=true. |
@amitgalitz also got BWC test failures when upgrading to 3.0.0 with job-scheduler plugin installed but without security plugin installed: https://github.com/opensearch-project/job-scheduler/runs/7415033030?check_suite_focus=true. |
The k-NN plugin also has the same issue with BWC tests when we try to upgrade from 2.1.0 to 3.0.0-SNAPSHOT. Interestingly, the Restart Upgrade BWC tests pass while the Rolling Upgrade BWC tests fail. |
@cliu123 @amitgalitz The above logs say that you are trying to upgrade from 7.10.2 to 3.0.0. I think we cannot upgrade directly from 7.x to 3.0.0. Could you please try to upgrade from 2.x to 3.0.0-SNAPSHOT?
|
hi @naveentatikonda , yes for |
@adnapibar Any progress here? Need help? |
@dblock Unfortunately no, also got busy with another issue. If I can't find anything by today, will need more eyes on it. |
It looks like the BWC tests in job scheduler plugin started failing with this commit - 2d716ad |
We haven't made progress here since this was reported in June and it's blocking having a complete distribution build for 3.0. @nknize lmk if you don't have time to look into it and I can dig in. |
I quickly tried building. These are compilation errors caused by new breaking changes in OpenSearch core 3.0. They are renaming changes, class signature changes etc. |
For job scheduler, the issue seems to be the version used for BWC - https://github.com/opensearch-project/job-scheduler/blob/main/sample-extension-plugin/build.gradle#L142 - I tried changing this to 2.2.1 but hit a different issue
I think because it's downloading the job-scheduler plugin artifact from https://github.com/opendistro-for-elasticsearch/job-scheduler/releases/download/v1.13.0.0/job-scheduler-artifacts.zip |
Two 2.2.0 nodes.
node1/config/opensearch.yml
node2/config/opensearch.yml
I tried with 2.2.0 + 3.0.0, failed with
I tried with the latest 2.4.0 (https://ci.opensearch.org/ci/dbc/distribution-build-opensearch/2.4.0/latest/linux/x64/tar/builds/opensearch/dist/opensearch-min-2.4.0-linux-x64.tar.gz) and 3.0.0 (https://ci.opensearch.org/ci/dbc/distribution-build-opensearch/3.0.0/latest/linux/x64/tar/builds/opensearch/dist/opensearch-min-3.0.0-linux-x64.tar.gz), and that worked.
|
So for BWC, do we (A) need to test against the latest build of 2.4.0 (I don't think this exists yet) and keep moving that number up every time, or (B) allow 3.0 to join any 2.x cluster? |
@adnapibar The other issue you're running into is this one. See opensearch-project/job-scheduler#242 |
Thanks @dblock! |
This is expected behavior. The next major version is only API compat w/ the last minor of the previous major. In this case it's 2.4 because the 2.3 branch was cut, so it tests against the 2.x branch (which is a 2.4 staged). That's why you'll only find gradle task
BWC testing is already configured to test against the appropriate branches. There is no API / wire compatibility between 3.0.0 and anything less than 2.4.0. If a BWC test in a 3.0.0 version bumped plugin is trying to test against |
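A minimal sketch of the rule described above, for plugin authors checking which BWC source versions are worth testing. This is not OpenSearch's implementation: the `Version` helper, the `LAST_MINOR_OF_PREVIOUS_MAJOR` table, and `can_rolling_upgrade` are hypothetical names, and the 2.4.0 floor is taken from this thread only.

```python
from typing import NamedTuple


class Version(NamedTuple):
    major: int
    minor: int
    patch: int

    @classmethod
    def parse(cls, text: str) -> "Version":
        # Drop a qualifier such as "-SNAPSHOT" before splitting on dots.
        core = text.split("-", 1)[0]
        major, minor, patch = (int(part) for part in core.split("."))
        return cls(major, minor, patch)


# Assumption from this thread: 2.4 is the last minor of the 2.x line,
# so it is the oldest version a 3.x node is wire compatible with.
LAST_MINOR_OF_PREVIOUS_MAJOR = {3: Version(2, 4, 0)}


def can_rolling_upgrade(current: str, target: str) -> bool:
    """Return True if a `target` node is expected to join a `current` cluster."""
    cur, tgt = Version.parse(current), Version.parse(target)
    if cur.major == tgt.major:
        # Assumption: upgrades within one major version are compatible.
        return cur <= tgt
    floor = LAST_MINOR_OF_PREVIOUS_MAJOR.get(tgt.major)
    # Across a major boundary, only the last minor (or later) qualifies.
    return floor is not None and floor <= cur < tgt


if __name__ == "__main__":
    for source in ("2.0.1", "2.2.0", "2.4.0", "7.10.2"):
        print(source, "->", "3.0.0", can_rolling_upgrade(source, "3.0.0"))
```

Under these assumptions, 2.0.1 and 2.2.0 are rejected as rolling-upgrade sources for 3.0.0 while 2.4.0 is accepted, which matches the failures and the successful manual test reported earlier in this thread.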
If what you say is true @nknize then we can close this by design. However:
|
For rolling upgrade scenarios that's correct. But if a user is upgrading to an unstable snapshot build to test something out then they'd probably be just as fine using full cluster restart process to upgrade. The rolling upgrade process is really for those that don't want downtime, which usually means they're upgrading in production.
This isn't a problem because BWC testing is transitive (e.g., 3.0 tests against 2.x, which tests against 2.x-1) and our backport process which requires changes go |
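The transitive testing chain mentioned above can be pictured with a small sketch. The release-line list and the `bwc_pairs` helper are hypothetical and only illustrate the idea that each line is tested against its immediate predecessor; it does not imply that 3.0 can join a cluster older than 2.4.

```python
def bwc_pairs(release_lines):
    """Pair each release line with its immediate predecessor (oldest first input)."""
    return list(zip(release_lines[1:], release_lines))


# Hypothetical ordering of release lines, oldest first.
lines = ["2.2", "2.3", "2.4", "3.0"]
for newer, older in bwc_pairs(lines):
    print(f"{newer} is tested against {older}")
```

Each adjacent pair gets direct coverage, so a change backported through every line is exercised at each step, even though non-adjacent pairs such as 3.0 and 2.2 are never tested against each other directly.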
Thanks, going to close this by design. |
The job scheduler BWC problem was resolved in opensearch-project/job-scheduler#242; it had nothing to do with this. |
For the folks involved in this issue: while resolving this as by design makes sense, it still leaves our plugin stuck without a path to run CI with BWC tests enabled as we look to migrate to OpenSearch v3.0. I've filed opensearch-project/opensearch-plugins#167 to track this question and what our recommendation is. I am going to disable these tests as they are a roadblock to progress, but I feel like this is going to become a surprise during our project lifecycle. |
Describe the bug
Nodes cannot join back to the cluster after upgrading from 2.0.1 to 3.0.0.
To Reproduce
Failing GHA: https://github.com/cliu123/security/runs/6926965970?check_suite_focus=true
Expected behavior
A clear and concise description of what you expected to happen.
Plugins
Please list all plugins currently enabled.
Error logs