[Optimization] Cluster State Update Optimization #7853

Merged (24 commits) on Jul 10, 2023

@sandeshkr419 sandeshkr419 commented May 31, 2023

Description

This draft PR is to discuss optimizing ClusterState computation by limiting the number of times the index-related objects and their lookups are recomputed inside Metadata.java.

I will add relevant test cases, reduce excessive logging, and update the changelog in later commits.

The build() method in Metadata.java is an expensive operation because it computes indicesLookup.

There are certain places where we know for sure that computing all these values is unnecessary. For instance, when a template is modified, none of the index-related objects change; likewise, in MasterService, only the version is incremented after the Metadata has been computed.

Note that when a new ClusterState is created, the Metadata object is built twice. With these changes, we essentially skip the second Metadata creation entirely, which alone saves ~40% of the ClusterState computation time.

This is done by introducing variables in the Builder to hold local copies of the metadata-related objects. Additionally, a new variable, rebuildIndicesLookups, can be set to false (the default is true, which retains the original workflow) to manually skip re-creating the objects related to indicesLookup (the open, closed, visible, and hidden index lists and the indicesLookup map). The objects that are otherwise recomputed in build() can then be skipped and reused directly from the copies taken from the previous Metadata, provided indices and customs are unchanged. Checking these two objects for changes is a much cheaper operation.
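The builder mechanism described above can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: MetadataSketch, its fields, putIndex, putTemplate, and the lookupBuilds counter are hypothetical stand-ins, the lookup is reduced to a sorted copy of the index map, and the customs check is left out for brevity. Only the flag name rebuildIndicesLookups and its default of true come from the description.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Simplified stand-in for Metadata; hypothetical, NOT the OpenSearch class.
final class MetadataSketch {
    static int lookupBuilds = 0; // counts expensive rebuilds (illustration only)

    final long version;
    final Map<String, String> indices;   // index name -> state, e.g. "open"
    final Map<String, String> templates; // template name -> index pattern
    final SortedMap<String, String> indicesLookup; // expensive derived view

    private MetadataSketch(long version, Map<String, String> indices,
                           Map<String, String> templates,
                           SortedMap<String, String> indicesLookup) {
        this.version = version;
        this.indices = indices;
        this.templates = templates;
        this.indicesLookup = indicesLookup;
    }

    static Builder builder() { return new Builder(null); }
    static Builder builder(MetadataSketch previous) { return new Builder(previous); }

    static final class Builder {
        private long version;
        private final Map<String, String> indices = new HashMap<>();
        private final Map<String, String> templates = new HashMap<>();
        private final MetadataSketch previous;        // source of the reusable copies
        private boolean rebuildIndicesLookups = true; // default: original behavior

        private Builder(MetadataSketch previous) {
            this.previous = previous;
            if (previous != null) {
                version = previous.version;
                indices.putAll(previous.indices);
                templates.putAll(previous.templates);
            }
        }

        Builder version(long v) { version = v; return this; }
        Builder putIndex(String name, String state) { indices.put(name, state); return this; }
        Builder putTemplate(String name, String pattern) { templates.put(name, pattern); return this; }
        Builder rebuildIndicesLookups(boolean rebuild) { rebuildIndicesLookups = rebuild; return this; }

        MetadataSketch build() {
            // Reuse only when the caller opted in AND the cheap equality check passes.
            boolean reuse = !rebuildIndicesLookups
                && previous != null
                && indices.equals(previous.indices);
            SortedMap<String, String> lookup =
                reuse ? previous.indicesLookup : buildLookup(indices);
            return new MetadataSketch(version, Map.copyOf(indices), Map.copyOf(templates), lookup);
        }

        private static SortedMap<String, String> buildLookup(Map<String, String> indices) {
            lookupBuilds++; // stands in for the costly indicesLookup computation
            return Collections.unmodifiableSortedMap(new TreeMap<>(indices));
        }
    }

    public static void main(String[] args) {
        MetadataSketch first = builder().putIndex("logs-1", "open").build();
        // Template-only change: opt out of rebuilding the lookup.
        MetadataSketch second = builder(first).putTemplate("t22", "logs-*")
            .version(first.version + 1).rebuildIndicesLookups(false).build();
        System.out.println("lookup reused: " + (second.indicesLookup == first.indicesLookup));
    }
}
```

In this sketch the flag is only a request: if the index map differs from the previous one, build() recomputes the lookup even when the flag is false. Whether the real build() falls back in the same way is an assumption here.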

The idea is that in these scenarios we can save ~40% of the time taken to create the ClusterState in all API calls that modify cluster state. For certain operations, such as master re-election and template-related APIs, we should be able to save ~95% of the time required for ClusterState computation.

Is it possible to set rebuildIndicesLookups to false implicitly (by deciding from indices, etc.) instead of setting it explicitly?
The complication is that multiple objects inherently depend on IndexMetadata/IndexAbstraction. The implicit logic would have to be driven by checking and analyzing every value (open, closed, hidden, etc.) that can change. The effort is not worth the gain and would be error-prone to begin with. By introducing the new flag, we only modify the workflows where we know for sure that the index-related parts of Metadata are not changed. The default behavior stays unchanged: Metadata is computed from scratch, as before, for all other workflows.

Changes noted on:

  • M1 Pro processor
  • main branch, 3.0, no plugins
  • Local cluster setup with 1 manager, 1 data node
  • ~50k aliases, 10 indices

Before (new alias creation):

[2023-05-31T14:17:45,992][DEBUG][o.o.c.s.MasterService    ] [master1] took [125ms] to compute cluster state update for [index-aliases]
[2023-05-31T14:17:47,219][DEBUG][o.o.c.s.MasterService    ] [master1] took [129ms] to compute cluster state update for [index-aliases]
[2023-05-31T14:17:48,427][DEBUG][o.o.c.s.MasterService    ] [master1] took [132ms] to compute cluster state update for [index-aliases]

After (new alias creation):

[2023-05-31T14:36:44,821][DEBUG][o.o.c.s.MasterService    ] [master1] took [71ms] to compute cluster state update for [index-aliases]
[2023-05-31T14:36:46,506][DEBUG][o.o.c.s.MasterService    ] [master1] took [69ms] to compute cluster state update for [index-aliases]
[2023-05-31T14:36:48,143][DEBUG][o.o.c.s.MasterService    ] [master1] took [70ms] to compute cluster state update for [index-aliases]
[2023-05-31T14:36:49,675][DEBUG][o.o.c.s.MasterService    ] [master1] took [71ms] to compute cluster state update for [index-aliases]
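As a rough sanity check, the samples above can be averaged and compared. The snippet below is only arithmetic over the seven timings shown in these logs; the class and method names are made up for illustration. With so few samples the result is indicative at best, but it comes out at roughly 45%, in line with the ~40% estimate.

```java
import java.util.Arrays;

// Hypothetical helper: averages the logged timings and reports the reduction.
final class TimingComparison {
    static double mean(int... millis) {
        return Arrays.stream(millis).average().getAsDouble();
    }

    static double reductionPercent(double before, double after) {
        return (before - after) / before * 100.0;
    }

    public static void main(String[] args) {
        double before = mean(125, 129, 132);   // "Before" samples, in ms
        double after = mean(71, 69, 70, 71);   // "After" samples, in ms
        System.out.printf("before=%.1f ms, after=%.1f ms, reduction=%.0f%%%n",
            before, after, reductionPercent(before, after));
    }
}
```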

For template creation, indicesLookup is not computed even once, so the time drops by ~95%, since the index-related objects are only deep-copied from the previous cluster state.

[2023-05-31T14:50:34,003][DEBUG][o.o.c.s.MasterService    ] [master1] took [6ms] to compute cluster state update for [create-index-template [t22], cause [api]]
[2023-05-31T14:50:54,948][DEBUG][o.o.c.s.MasterService    ] [master1] took [8ms] to compute cluster state update for [create-index-template [23], cause [api]]

Note: The time taken for the above operations is usually 2-3x higher when run on smaller instance types, such as m4.large or c4.large.

Related Issues

Resolves #7002

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Sandesh Kumar <sandeshkr419@gmail.com>

@amkhar amkhar left a comment

Thanks @sandeshkr419 for working on the POC to gauge the impact of the existing re-computation. I've left a few questions; ideally, I'd like to understand the actual logic for skipping the computation.

@prudhvigodithi

Hey @sandeshkr419, I'm noticing a lot of Gradle check errors. Are you able to reproduce them locally?

sandeshkr419 commented Jun 8, 2023

Hey @sandeshkr419, I'm noticing a lot of Gradle check errors. Are you able to reproduce them locally?

@prudhvigodithi Yes, I was able to reproduce and fix the majority of these failures, except for the following in the latest CI run:

org.opensearch.smoketest.SmokeTestMultiNodeClientYamlTestSuiteIT.test - This also fails locally on the main branch (with the latest changes pulled) using both JDK 19 and JDK 20. I tried multiple times to rule out flakiness, with no luck.

Test/system config:

  Gradle Version        : 8.1.1
  OS Info               : Mac OS X 13.4 (aarch64)
  JDK Version           : 20 (Oracle JDK)
  JAVA_HOME             : /Library/Java/JavaVirtualMachines/jdk-20.jdk/Contents/Home
  Random Testing Seed   : 46906AD73A60AEB
  In FIPS 140 mode      : false

org.opensearch.search.backpressure.SearchBackpressureIT.testSearchTaskCancellationWithHighCpu - This passes locally with both JDK 19 and JDK 20; judging by the logs, the CI failure is probably caused by a network issue.

@github-actions
Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      2 org.opensearch.remotestore.SegmentReplicationUsingRemoteStoreIT.testNodeDropWithOngoingReplication
      1 org.opensearch.remotestore.SegmentReplicationUsingRemoteStoreIT.testPrimaryStopped_ReplicaPromoted
      1 org.opensearch.cluster.allocation.ClusterRerouteIT.testDelayWithALargeAmountOfShards

@sandeshkr419

@shwetathareja Did you happen to get a chance to review the recent changes after I pulled the PR out of draft mode?

@github-actions
github-actions bot commented Jul 6, 2023

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.remotestore.RemoteStoreIT.testStaleCommitDeletionWithoutInvokeFlush

@github-actions
github-actions bot commented Jul 7, 2023

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.cluster.allocation.AwarenessAllocationIT.testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness

@github-actions
github-actions bot commented Jul 7, 2023

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.indices.replication.SegmentReplicationIT.testScrollCreatedOnReplica

@shwetathareja shwetathareja merged commit cb0d13b into opensearch-project:main Jul 10, 2023
@sandeshkr419 sandeshkr419 deleted the indicesLookup branch July 10, 2023 09:10
@sandeshkr419

Thanks @shwetathareja for approving & merging this. Please add the backport 2.x label to the issue as well to auto-backport it to 2.x branch.

@shwetathareja shwetathareja added the backport 2.x Backport to 2.x branch label Jul 11, 2023
@opensearch-trigger-bot

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 128

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/backport-2.x 2.x
# Navigate to the new working tree
pushd ../.worktrees/backport-2.x
# Create a new branch
git switch --create backport/backport-7853-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 cb0d13b9cab950b39269eb28691e6075cd9cf1aa
# Push it to GitHub
git push --set-upstream origin backport/backport-7853-to-2.x
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-7853-to-2.x.

sandeshkr419 added a commit to sandeshkr419/OpenSearch that referenced this pull request Jul 11, 2023
…7853)

* Cluster State Update Optimization - Optimize Metadata build() to skip redundant computations of indicesLookup as part of ClusterState build

Signed-off-by: Sandesh Kumar <sandeshkr419@gmail.com>

---------

Signed-off-by: Sandesh Kumar <sandeshkr419@gmail.com>
(cherry picked from commit cb0d13b)
@sandeshkr419

Manually backported to resolve conflicts in CHANGELOG (no other conflicts): #8644

shwetathareja pushed a commit that referenced this pull request Jul 12, 2023
* Cluster State Update Optimization - Optimize Metadata build() to skip redundant computations of indicesLookup as part of ClusterState build

Signed-off-by: Sandesh Kumar <sandeshkr419@gmail.com>

---------

Signed-off-by: Sandesh Kumar <sandeshkr419@gmail.com>
(cherry picked from commit cb0d13b)
Labels
backport 2.x Backport to 2.x branch
Development

Successfully merging this pull request may close these issues.

[Optimization] Optimize the creation/updation of Cluster state indices/aliases lookup map
5 participants