Adds plugin version sweep background job #434

downsrob
Contributor

Issue #, if available:
#207

Description of changes:
Index Management currently skips all job executions when two differing versions of the Index Management plugin are present on the cluster. The plugin does this by issuing a NodesInfoRequest to fetch and compare plugin versions whenever a node is added or a new cluster is formed, and it sets a flag, SkipExecution, to true when multiple plugin versions are found. We have seen cases where the SkipExecution flag remains true even though the upgrade (early ES 7.x to later ES 7.x) has finished and every node in the cluster runs the same, latest version of the IM plugin.

Analyzing the code reveals race conditions that allow multiple requests to overwrite each other in the wrong order. Although the cluster changed events arrive in order, the NodesInfoRequests they trigger may complete out of order, so a stale response can overwrite the flag after a newer one has already set it.
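
The ordering problem can be reproduced in isolation. The following is a minimal sketch, not the plugin's code: `FlagHolder`, `VersionResponse`, `applyResponse`, and the version strings are hypothetical names and values chosen for illustration. It assumes only what the description states, namely that the flag is set to true when a response reports more than one plugin version.

```kotlin
// Hypothetical stand-in for the plugin's flag holder; not the real SkipExecution class.
class FlagHolder {
    @Volatile
    var flag: Boolean = false
}

// A response carries the set of plugin versions that one request observed.
data class VersionResponse(val requestId: Int, val versions: Set<String>)

// Assumed rule from the PR description: more than one version means "skip execution".
fun applyResponse(holder: FlagHolder, response: VersionResponse) {
    holder.flag = response.versions.size > 1
}

fun main() {
    val holder = FlagHolder()

    // Request 1 was sent mid-upgrade and saw two versions (illustrative values).
    val stale = VersionResponse(requestId = 1, versions = setOf("1.0.0", "1.1.0"))
    // Request 2 was sent after the upgrade finished and saw a single version.
    val fresh = VersionResponse(requestId = 2, versions = setOf("1.1.0"))

    // The responses complete out of order: the fresh one lands first, the stale one last.
    applyResponse(holder, fresh)
    applyResponse(holder, stale)

    // The stale response wins, so the flag is stuck at true.
    println(holder.flag) // prints true
}
```

Nothing in this sketch ever re-checks the versions once the last response has been applied, which is why the flag can stay true indefinitely.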

To resolve this race condition, this PR adds a background job that runs every five minutes and polls the plugin versions while the flag is set to true.
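
As a rough illustration of the polling idea, here is a sketch built on `java.util.concurrent` rather than on the plugin's own thread pool. `PluginVersionPoller` and `fetchVersions` are hypothetical names; `fetchVersions` stands in for the NodesInfoRequest round trip, and the 50 ms period replaces the five-minute period only to keep the demo short.

```kotlin
import java.util.concurrent.Executors
import java.util.concurrent.TimeUnit
import java.util.concurrent.atomic.AtomicBoolean

// Hypothetical poller; fetchVersions stands in for the NodesInfoRequest round trip.
class PluginVersionPoller(private val fetchVersions: () -> Set<String>) {
    // true means "skip job execution because plugin versions differ".
    val skipFlag = AtomicBoolean(true)

    // One polling tick: re-check the versions only while the flag is still set.
    fun tick() {
        if (skipFlag.get()) {
            skipFlag.set(fetchVersions().size > 1)
        }
    }
}

fun main() {
    // Simulated cluster: mixed versions on the first poll, uniform afterwards.
    val observations = ArrayDeque(listOf(setOf("1.0.0", "1.1.0"), setOf("1.1.0")))
    val poller = PluginVersionPoller { observations.removeFirstOrNull() ?: setOf("1.1.0") }

    val scheduler = Executors.newSingleThreadScheduledExecutor()
    // The PR describes a five-minute period; 50 ms is used here for demonstration only.
    scheduler.scheduleWithFixedDelay({ poller.tick() }, 0L, 50L, TimeUnit.MILLISECONDS)

    Thread.sleep(500)
    scheduler.shutdownNow()
    println("skip flag after polling: ${poller.skipFlag.get()}")
}
```

Because each tick re-reads the versions, a flag left at true by a stale response is corrected on the next tick instead of persisting until another cluster change happens.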

This is an alternative strategy to #423 and, like that PR, was written entirely by Stevan Buzejic (@stevanbz); I am just raising the PR for an early review.

CheckList:

  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@downsrob downsrob requested a review from a team July 28, 2022 18:07
codecov-commenter commented Jul 28, 2022

Codecov Report

Merging #434 (47b7a24) into main (39be4e3) will increase coverage by 0.00%.
The diff coverage is 78.26%.

@@            Coverage Diff            @@
##               main     #434   +/-   ##
=========================================
  Coverage     75.94%   75.95%           
- Complexity     2480     2492   +12     
=========================================
  Files           315      316    +1     
  Lines         14500    14547   +47     
  Branches       2243     2248    +5     
=========================================
+ Hits          11012    11049   +37     
- Misses         2239     2246    +7     
- Partials       1249     1252    +3     
| Impacted Files | Coverage Δ |
|---|---|
| ...exstatemanagement/PluginVersionSweepCoordinator.kt | 69.69% <69.69%> (ø) |
| ...pensearch/indexmanagement/IndexManagementPlugin.kt | 90.00% <100.00%> (+0.11%) ⬆️ |
| ...exmanagement/indexstatemanagement/SkipExecution.kt | 61.29% <100.00%> (-5.38%) ⬇️ |
| ...exstatemanagement/settings/ManagedIndexSettings.kt | 98.49% <100.00%> (+0.05%) ⬆️ |
| ...ment/indexstatemanagement/util/RestHandlerUtils.kt | 88.88% <0.00%> (-11.12%) ⬇️ |
| ...arch/indexmanagement/rollup/RollupSearchService.kt | 57.40% <0.00%> (-3.71%) ⬇️ |
| ...exstatemanagement/resthandler/RestExplainAction.kt | 100.00% <0.00%> (ø) |
| .../opensearch/indexmanagement/rollup/model/Rollup.kt | 86.04% <0.00%> (+0.46%) ⬆️ |
| ...management/rollup/interceptor/RollupInterceptor.kt | 80.15% <0.00%> (+0.79%) ⬆️ |
| ...t/resthandler/RestRetryFailedManagedIndexAction.kt | 88.00% <0.00%> (+1.04%) ⬆️ |
| ... and 3 more | |

Member

@bowenlan-amzn bowenlan-amzn left a comment

A general question:

  1. Can we disable the trigger logic in SkipExecution now that we have this background loop? The trigger logic I am referring to, in SkipExecution, is:

override fun clusterChanged(event: ClusterChangedEvent) {
    if (event.nodesChanged() || event.isNewCluster) {
        sweepISMPluginVersion()
    }
}

Comment on lines 111 to 118
val SWEEP_SKIP_PERIOD: Setting<TimeValue> = Setting.timeSetting(
    "opendistro.index_state_management.coordinator.sweep_skip_period",
    TimeValue.timeValueMinutes(10),
    TimeValue.timeValueMinutes(5),
    Setting.Property.NodeScope,
    Setting.Property.Dynamic,
    Setting.Property.Deprecated
)

Member

No need to have this if we are adding a new setting

Contributor

Makes sense. Thanks!

Comment on lines +65 to +80
if (!skipExecution.flag) {
    logger.info("Canceling sweep ism plugin version job")
    scheduledSkipExecution?.cancel()
} else {
    skipExecution.sweepISMPluginVersion()
}
Member

Do we need to cancel this job or let it run forever?

…he case of version discrepancy

Signed-off-by: Stevan Buzejic <buzejic.stevan@gmail.com>
…r scheduling the skip execution task. Annotated tests in order to prevent thread leak error during integrational tests

Signed-off-by: Stevan Buzejic <buzejic.stevan@gmail.com>
@stevanbz stevanbz force-pushed the bugfix/207-skip-execution-not-properly-set-job-scheduler-solution branch 2 times, most recently from 027e78e to 151fec9 Compare September 21, 2022 15:10
private fun isIndexStateManagementEnabled(): Boolean = indexStateManagementEnabled == true

companion object {
private const val RETRY_PERIOD_IN_MINUTES = 5L
Member

Is this the same as sweepSkipPeriod? If so, should we use sweepSkipPeriod instead?

Contributor

stevanbz commented Sep 22, 2022

A general question:

1. Can we disable the trigger logic in SkipExecution now that we have this background loop? The trigger logic I am referring to, in SkipExecution, is:

override fun clusterChanged(event: ClusterChangedEvent) {
    if (event.nodesChanged() || event.isNewCluster) {
        sweepISMPluginVersion()
    }
}

Good question, and you are right; I was thinking the same. The SkipExecution class should only perform sweepISMPluginVersion, while the caller class should be responsible for triggering the request.

So, my proposal is:

The caller class, PluginVersionSweepCoordinator, will listen for cluster changed events and will be responsible for calling sweepISMPluginVersion. This class already has a scheduled job that can optionally be canceled (i.e., if the skip flag is being set to true).

For example:

override fun clusterChanged(event: ClusterChangedEvent) {
    if (event.nodesChanged() || event.isNewCluster) {
        skipExecution.sweepISMPluginVersion()
        initBackgroundSweepISMPluginVersionExecution()
    }
}
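
The split of responsibilities discussed in this thread might look roughly like the sketch below. It is an assumption-laden illustration, not the merged code: `Sweeper`, `SweepCoordinator`, `onClusterChanged`, `backgroundTick`, and `isJobActive` are hypothetical names, a plain `ScheduledExecutorService` replaces the plugin's thread pool, and the job cancels itself once the flag is false, following the reviewed snippet that checks `!skipExecution.flag`.

```kotlin
import java.util.concurrent.Executors
import java.util.concurrent.ScheduledExecutorService
import java.util.concurrent.ScheduledFuture
import java.util.concurrent.TimeUnit

// Hypothetical stand-in for SkipExecution: it only knows how to sweep.
class Sweeper(private val fetchVersions: () -> Set<String>) {
    @Volatile
    var flag: Boolean = false
        private set

    fun sweep() {
        flag = fetchVersions().size > 1
    }
}

// Hypothetical coordinator: owns both triggers (cluster events and the background job).
class SweepCoordinator(
    private val sweeper: Sweeper,
    private val scheduler: ScheduledExecutorService,
    private val periodMillis: Long,
) {
    private var scheduled: ScheduledFuture<*>? = null

    @Synchronized
    fun isJobActive(): Boolean = scheduled?.isCancelled == false

    // Called when the cluster reports a change; sweeps now and (re)starts the job.
    @Synchronized
    fun onClusterChanged(nodesChanged: Boolean, isNewCluster: Boolean) {
        if (nodesChanged || isNewCluster) {
            sweeper.sweep()
            scheduled?.cancel(false)
            scheduled = scheduler.scheduleWithFixedDelay(
                { backgroundTick() }, periodMillis, periodMillis, TimeUnit.MILLISECONDS
            )
        }
    }

    // One run of the background job: stop once the flag is clear, otherwise sweep again.
    @Synchronized
    fun backgroundTick() {
        if (!sweeper.flag) {
            scheduled?.cancel(false)
        } else {
            sweeper.sweep()
        }
    }
}

fun main() {
    var versions = setOf("1.0.0", "1.1.0") // illustrative values
    val scheduler = Executors.newSingleThreadScheduledExecutor()
    val sweeper = Sweeper { versions }
    // A long period so the demo drives the ticks by hand.
    val coordinator = SweepCoordinator(sweeper, scheduler, 60_000L)

    coordinator.onClusterChanged(nodesChanged = true, isNewCluster = false)
    println("flag=${sweeper.flag} jobActive=${coordinator.isJobActive()}")

    versions = setOf("1.1.0")    // the upgrade finishes
    coordinator.backgroundTick() // sweeps and clears the flag
    coordinator.backgroundTick() // sees a clear flag and cancels the job
    println("flag=${sweeper.flag} jobActive=${coordinator.isJobActive()}")

    scheduler.shutdownNow()
}
```

Keeping the sweep logic free of triggers makes it callable from either path, and restarting the job on every cluster change means a canceled job resumes the next time the versions could have diverged.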

…lag up to 5 mins

Signed-off-by: Stevan Buzejic <buzejic.stevan@gmail.com>
@stevanbz stevanbz force-pushed the bugfix/207-skip-execution-not-properly-set-job-scheduler-solution branch from 85cca3c to 47b7a24 Compare September 22, 2022 21:10
Signed-off-by: Stevan Buzejic <buzejic.stevan@gmail.com>
@Angie-Zhang Angie-Zhang merged commit 4d844fa into opensearch-project:main Oct 4, 2022
opensearch-trigger-bot bot pushed a commit that referenced this pull request Oct 4, 2022
* [207]: Added 5 min scheduled job for sweeping ISM plugin version in the case of version discrepancy

Signed-off-by: Stevan Buzejic <buzejic.stevan@gmail.com>

* [207]: Created pluginVersionSweepCoordinator component responsible for scheduling the skip execution task. Annotated tests in order to prevent thread leak error during integrational tests

Signed-off-by: Stevan Buzejic <buzejic.stevan@gmail.com>

* [207]: Increased retry period for background job that sets the skip flag up to 5 mins

Signed-off-by: Stevan Buzejic <buzejic.stevan@gmail.com>

* Empty-Commit

Signed-off-by: Stevan Buzejic <buzejic.stevan@gmail.com>

Signed-off-by: Stevan Buzejic <buzejic.stevan@gmail.com>
Co-authored-by: Stevan Buzejic <buzejic.stevan@gmail.com>
(cherry picked from commit 4d844fa)
Angie-Zhang pushed a commit that referenced this pull request Oct 4, 2022
(cherry picked from commit 4d844fa)

Co-authored-by: Clay Downs <downsrob@amazon.com>
opensearch-trigger-bot bot pushed a commit that referenced this pull request Oct 6, 2022
(cherry picked from commit 4d844fa)
Angie-Zhang added a commit that referenced this pull request Oct 14, 2022
* initial framework

Signed-off-by: Joanne Wang <jowg@amazon.com>

* Removed recursion from Explain Action to avoid stackoverflow in some situations (#419)

Signed-off-by: Petar Dzepina <petar.dzepina@gmail.com>
Signed-off-by: Joanne Wang <jowg@amazon.com>

* enabled by default integrated

Signed-off-by: Joanne Wang <jowg@amazon.com>

* cleaned up comments and logs, created unit test and updated previous integration tests

Signed-off-by: Joanne Wang <jowg@amazon.com>

* added delete validation logic

Signed-off-by: Joanne Wang <jowg@amazon.com>

* fixed rollover validation unit tests

Signed-off-by: Joanne Wang <jowg@amazon.com>

* added validation info field to ManagedIndexMetaData

Signed-off-by: Joanne Wang <jowg@amazon.com>

* removed step context as input

Signed-off-by: Joanne Wang <jowg@amazon.com>

* added validationmetadata class

Signed-off-by: Joanne Wang <jowg@amazon.com>

* restored old integration tests and changed validation service output

Signed-off-by: Joanne Wang <jowg@amazon.com>

* before integrated validation meta data into managed index meta data

Signed-off-by: Joanne Wang <jowg@amazon.com>

* integrated validation meta data

Signed-off-by: Joanne Wang <jowg@amazon.com>

* working version

Signed-off-by: Joanne Wang <jowg@amazon.com>

* added validation mapping

Signed-off-by: Joanne Wang <jowg@amazon.com>

* fixed integ tests

Signed-off-by: Joanne Wang <jowg@amazon.com>

* renamed some values

Signed-off-by: Joanne Wang <jowg@amazon.com>

* before removing from managed index meta data

Signed-off-by: Joanne Wang <jowg@amazon.com>

* created validation result object in explain

Signed-off-by: Joanne Wang <jowg@amazon.com>

* testing

Signed-off-by: Joanne Wang <jowg@amazon.com>

* run fails

Signed-off-by: Joanne Wang <jowg@amazon.com>

* integration test for delete + added framework for force merge

Signed-off-by: Joanne Wang <jowg@amazon.com>

* removed step validation metadata and still testing explain results

Signed-off-by: Joanne Wang <jowg@amazon.com>

* before removing from managed index runner

Signed-off-by: Joanne Wang <jowg@amazon.com>

* removed from managed index runner

Signed-off-by: Joanne Wang <jowg@amazon.com>

* clean up and tests

Signed-off-by: Joanne Wang <jowg@amazon.com>

* all validation tests pass

Signed-off-by: Joanne Wang <jowg@amazon.com>

* removed validation result from all managed index meta data

Signed-off-by: Joanne Wang <jowg@amazon.com>

* restored old IT tests

Signed-off-by: Joanne Wang <jowg@amazon.com>

* fixed it tests, set explain validation to false

Signed-off-by: Joanne Wang <jowg@amazon.com>

* clean up

Signed-off-by: Joanne Wang <jowg@amazon.com>

* Change test page size to avoid index/search TimeInMillis < 1 issue. (#460)

* Change test page size to avoid indexTimeInMillis < 1 issue.

Signed-off-by: Angie Zhang <langelzh@amazon.com>

* Change test page size to avoid indexTimeInMillis < 1 issue.

Signed-off-by: Angie Zhang <langelzh@amazon.com>

Signed-off-by: Angie Zhang <langelzh@amazon.com>

* Transform maxclauses fix (#477)

* transform maxClauses fix

Signed-off-by: Petar Dzepina <petar.dzepina@gmail.com>

* added bucket log to track processed buckets

Signed-off-by: Petar Dzepina <petar.dzepina@gmail.com>

* various renames/changes

Signed-off-by: Petar Dzepina <petar.dzepina@gmail.com>

* fixed detekt issues

Signed-off-by: Petar Dzepina <petar.dzepina@gmail.com>

* added comments to test

Signed-off-by: Petar Dzepina <petar.dzepina@gmail.com>

* removed debug logging

Signed-off-by: Petar Dzepina <petar.dzepina@gmail.com>

* empty commit to trigger checks

Signed-off-by: Petar Dzepina <petar.dzepina@gmail.com>

* reduced pageSize to 1 in few ITs to avoid flaky tests; fixed bug where pagesProcessed was calculated incorrectly

Signed-off-by: Petar Dzepina <petar.dzepina@gmail.com>

* reverted pagesProcessed change; fixed few ITs

Signed-off-by: Petar Dzepina <petar.dzepina@gmail.com>

Signed-off-by: Petar Dzepina <petar.dzepina@gmail.com>

* 483: Updated detekt plugin and snakeyaml dependency. Updated a code t… (#485)

* 483: Updated detekt plugin and snakeyaml dependency. Updated a code to reduce the number of issues after static analysis

Signed-off-by: Stevan Buzejic <buzejic.stevan@gmail.com>

* 483: Updated snakeyaml version to use the latest

Signed-off-by: Stevan Buzejic <buzejic.stevan@gmail.com>

Signed-off-by: Stevan Buzejic <buzejic.stevan@gmail.com>

* Remove HOST_DENY_LIST usage as Notification plugin will own it (#471)

(#107)

Signed-off-by: Xuesong Luo <lxuesong@amazon.com>

Signed-off-by: Xuesong Luo <lxuesong@amazon.com>

* Disable detekt because of the CVE (#497)

Signed-off-by: bowenlan-amzn <bowenlan23@gmail.com>

Signed-off-by: bowenlan-amzn <bowenlan23@gmail.com>

* Deprecate Master nonmenclature (#501)

Signed-off-by: bowenlan-amzn <bowenlan23@gmail.com>

Signed-off-by: bowenlan-amzn <bowenlan23@gmail.com>

* [AUTO] Increment version to 2.3.0-SNAPSHOT (#484) (#503)

* fix#921-README-forum-link-index_mgmnt (#499)

Signed-off-by: cwillum <cwmmoore@amazon.com>

Signed-off-by: cwillum <cwmmoore@amazon.com>

* 64: Added rounding when using aggreagate script for avg metric. Added… (#490)

* 64: Added rounding when using aggreagate script for avg metric. Added unit tests for checking average aggregations against the target rollup index

Signed-off-by: Stevan Buzejic <buzejic.stevan@gmail.com>

* 64: Rollup job renamed

Signed-off-by: Stevan Buzejic <buzejic.stevan@gmail.com>

* 64: Removed unrelevant metrics for the avg calculation test

Signed-off-by: Stevan Buzejic <buzejic.stevan@gmail.com>

Signed-off-by: Stevan Buzejic <buzejic.stevan@gmail.com>

* Revert Disable detekt and force choose snakeyml 1.32 (#528)

* Revert Disable detekt: 50ac1e9

Signed-off-by: Siddhant Deshmukh <deshsid@amazon.com>

* Remove force choosing snakeyml 1.31

Signed-off-by: Siddhant Deshmukh <deshsid@amazon.com>

* Force snakeyaml 1.32

Signed-off-by: Siddhant Deshmukh <deshsid@amazon.com>

* Empty commit

Signed-off-by: Siddhant Deshmukh <deshsid@amazon.com>

Signed-off-by: Siddhant Deshmukh <deshsid@amazon.com>

* Added 2.3 release note (#507) (#515) (#517)

* Update 2.3 release note

Signed-off-by: Angie Zhang <langelzh@amazon.com>

* Update 2.3 release note

Signed-off-by: Angie Zhang <langelzh@amazon.com>

* Update 2.3 release note

Signed-off-by: Angie Zhang <langelzh@amazon.com>

* Update 2.3 release note

Signed-off-by: Angie Zhang <langelzh@amazon.com>

* Update 2.3 release note

Signed-off-by: Angie Zhang <langelzh@amazon.com>

Signed-off-by: Angie Zhang <langelzh@amazon.com>
(cherry picked from commit d9793ac)
Signed-off-by: Angie Zhang <langelzh@amazon.com>

Signed-off-by: Angie Zhang <langelzh@amazon.com>
(cherry picked from commit 7217b5b)

Co-authored-by: Angie Zhang <langelzh@amazon.com>

* Add 2.2 release note (#450) (#452) (#516)

* Add 2.2 release note

Signed-off-by: Angie Zhang <langelzh@amazon.com>

* Add 2.2 release note

Signed-off-by: Angie Zhang <langelzh@amazon.com>

Co-authored-by: Angie Zhang <langelzh@amazon.com>
(cherry picked from commit 8eb5da6)
Signed-off-by: Angie Zhang <langelzh@amazon.com>

Signed-off-by: Angie Zhang <langelzh@amazon.com>
Co-authored-by: Ashish Agrawal <ashisagr@amazon.com>

* Adds plugin version sweep background job (#434)


* flaky transform test fix attempt (#542)

* flaky transform test fix attempt

Signed-off-by: Petar Dzepina <petar.dzepina@vroom.com>

* accidental paste fix

Signed-off-by: Petar Dzepina <petar.dzepina@vroom.com>

Signed-off-by: Petar Dzepina <petar.dzepina@vroom.com>
Co-authored-by: Petar Dzepina <petar.dzepina@vroom.com>

Signed-off-by: Joanne Wang <jowg@amazon.com>
Signed-off-by: Petar Dzepina <petar.dzepina@gmail.com>
Signed-off-by: Angie Zhang <langelzh@amazon.com>
Signed-off-by: Stevan Buzejic <buzejic.stevan@gmail.com>
Signed-off-by: Xuesong Luo <lxuesong@amazon.com>
Signed-off-by: bowenlan-amzn <bowenlan23@gmail.com>
Signed-off-by: cwillum <cwmmoore@amazon.com>
Signed-off-by: Siddhant Deshmukh <deshsid@amazon.com>
Signed-off-by: Petar Dzepina <petar.dzepina@vroom.com>
Co-authored-by: Petar <petar.dzepina@gmail.com>
Co-authored-by: Angie Zhang <98716549+Angie-Zhang@users.noreply.github.com>
Co-authored-by: Stevan Buzejic <30922513+stevanbz@users.noreply.github.com>
Co-authored-by: xluo-aws <109580118+xluo-aws@users.noreply.github.com>
Co-authored-by: bowenlan-amzn <bowenlan23@gmail.com>
Co-authored-by: opensearch-trigger-bot[bot] <98922864+opensearch-trigger-bot[bot]@users.noreply.github.com>
Co-authored-by: Chris Moore <107723039+cwillum@users.noreply.github.com>
Co-authored-by: Siddhant Deshmukh <deshsid@amazon.com>
Co-authored-by: Angie Zhang <langelzh@amazon.com>
Co-authored-by: Ashish Agrawal <ashisagr@amazon.com>
Co-authored-by: Clay Downs <downsrob@amazon.com>
Co-authored-by: Stevan Buzejic <buzejic.stevan@gmail.com>
Co-authored-by: Petar Dzepina <petar.dzepina@vroom.com>
wuychn pushed a commit to ochprince/index-management that referenced this pull request Mar 16, 2023
…ensearch-project#539)

(cherry picked from commit 4d844fa)

Co-authored-by: Clay Downs <downsrob@amazon.com>
ronnaksaxena pushed a commit to ronnaksaxena/index-management that referenced this pull request Jul 19, 2023
…ensearch-project#539)

(cherry picked from commit 4d844fa)

Co-authored-by: Clay Downs <downsrob@amazon.com>
Signed-off-by: Ronnak Saxena <ronsax@amazon.com>
5 participants