Concurrent Search Operational Readiness #12118
Comments
Thanks @andrross, this is great. Can we generalize this as an issue template and start enforcing operational readiness early in the feature delivery cycle?
Yes, absolutely! I hope this can serve as an example of a generalized template.
This point is what I would love to start a conversation on with the broader community. I definitely don't want to impose more processes that are difficult for a newbie to discover and follow. On the other hand, OpenSearch is a complex distributed system and I think something like this can help folks build safe, correct, and performant features.
Thanks @andrross for creating this template. Few suggestions that I can think of:
Closing this issue as concurrent search has been released.
Note: The intent of this issue is twofold: to collect and document in a single place all the operational readiness work that went into the concurrent search feature in order to demonstrate that it is ready for general availability, and to dry run a more generic checklist that can be extracted and used as a process for releasing large features. In the future, this checklist/process would be referenced and incrementally completed throughout development, but in this particular case I'm using the concurrent search feature as a sort of dry-run of this operational readiness procedure. I'm looking for feedback on this process itself (in addition to the specific concurrent search content).
Dependencies
Enumerate all your dependencies, highlighting any new dependencies.
No new dependencies are added by concurrent search.
Are any plugins impacted by your change?
Concurrent search is backward compatible with plugins. However, plugins that implement an aggregator must make changes in order to support concurrent search. If an aggregator plugin does not support concurrent search, the system falls back to the non-concurrent behavior to preserve compatibility. See the documentation for more details.
Additionally, plugins that query indexes may utilize concurrent search if the indexes have been configured to use concurrent search, but search behavior should be identical to the non-concurrent case. System indexes (i.e. indexes created by plugins for internal system usage) will not have concurrent search enabled by default.
Can your feature independently be disabled?
Yes. This feature is enabled by either a per-index dynamic setting or a cluster-wide dynamic setting.
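As a rough illustration of the two toggles described above, the sketch below only builds the request payloads and sends nothing over the network. The setting keys are assumptions taken from the linked documentation rather than from this issue, and `build_toggle_request` is a hypothetical helper name.

```python
import json

# Assumed setting keys (from the linked documentation, not from this issue).
CLUSTER_KEY = "search.concurrent_segment_search.enabled"
INDEX_KEY = "index.search.concurrent_segment_search.enabled"


def build_toggle_request(enabled, index=None):
    """Return (method, path, body) for toggling concurrent search.

    With index=None the request targets the cluster-wide dynamic setting;
    otherwise it targets the per-index dynamic setting of the named index.
    """
    if index is None:
        return "PUT", "/_cluster/settings", {"persistent": {CLUSTER_KEY: enabled}}
    return "PUT", f"/{index}/_settings", {INDEX_KEY: enabled}


if __name__ == "__main__":
    # Print one cluster-wide request and one per-index request.
    requests_to_show = (
        build_toggle_request(True),
        build_toggle_request(False, index="my-index"),
    )
    for method, path, body in requests_to_show:
        print(method, path, json.dumps(body))
```

Because both settings are dynamic, disabling the feature would be the same request with `enabled=False`, with no restart implied.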
Have you added comprehensive user documentation on the documentation website?
Yes: https://opensearch.org/docs/latest/search-plugins/concurrent-segment-search/
Have you documented any expert-level settings that can be exercised by an operator?
Yes: https://opensearch.org/docs/latest/search-plugins/concurrent-segment-search/#slicing-mechanisms
Failure Modes
Enumerate the list of failure modes or threats for the feature. Consider thinking of threats as unknown failures, e.g. anything that could possibly go wrong and lead to availability or durability loss. For each failure mode, list all available mitigations.
Concurrent search does not introduce any new dependencies or inter-node interactions. However, it is a significant change to a mission-critical code path (search). High-level failure modes fall into two camps: performance and correctness.
Testing
Integration Tests
Have you added comprehensive integration tests that are run by default as a part of the check gradle task?
Yes: #7440
Do you have any tests that are currently labeled as flaky?
Yes. They are tracked on the Flaky Test Project Board, and the fixes will be completed prior to release (one outstanding issue remains as of this writing).
Do you have any tests that rely on the test-retry plugin to retry on flaky failures?
No.
Do you have any tests that are disabled with the @AwaitsFix annotation?
No.
Scaling Tests
Have you tested with large clusters (100+ nodes)?
No. Concurrent search is a shard-level feature, so the scaling properties of a multi-node cluster do not fundamentally change with this feature.
Have you tested with variable shard numbers and sizes?
Results to be published here
Chaos Tests
Have you considered simulating faults to mimic hardware failures and network failures (packet loss) in each of your critical request paths?
Concurrent search does not introduce any new dependencies or failure points. Node behavior is not expected to be any different in the case of hardware failures. Many existing integration tests that focus on edge cases have been parameterized to run with concurrent search.
Performance Tests
Have you enabled your feature to be tested in OpenSearch Benchmark?
Yes. OpenSearch Benchmark has the ability to provide index settings, which is how concurrent search is enabled.
Have you enabled your feature to be a part of nightly benchmarking runs?
Yes
Share all performance data
Overall performance meta issue.
Regression/Rollback
Can your feature cause a functionality or feature regression if the feature is disabled?
No. When the setting is disabled, the existing non-concurrent search path is exercised. Concurrent search has been present but disabled via the feature flag mechanism for multiple minor releases and has been shown to not cause a regression.
How have you validated that the feature rollback works?
Integration tests swap between concurrent and non-concurrent search cases by using the dynamic setting. A "rollback" in this context is to simply change the index setting to disable concurrent search.
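The swap-and-compare idea can be sketched with a toy model. This is not OpenSearch code: segments are plain lists of `(doc_id, text)` pairs, the `concurrent_enabled` flag stands in for the dynamic setting, and all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def search_segment(segment, term):
    """Return the ids of documents in one segment whose text contains the term."""
    return [doc_id for doc_id, text in segment if term in text]


def search_shard(segments, term, concurrent_enabled):
    """Search every segment, in parallel when the (hypothetical) flag is set."""
    if concurrent_enabled:
        with ThreadPoolExecutor(max_workers=4) as pool:
            per_segment = list(pool.map(lambda seg: search_segment(seg, term), segments))
    else:
        per_segment = [search_segment(seg, term) for seg in segments]
    # Merge and sort so the result does not depend on the execution path.
    return sorted(doc_id for hits in per_segment for doc_id in hits)


def check_rollback(segments, term):
    """Toggle the flag on, off, and on again; every run must match the baseline."""
    baseline = search_shard(segments, term, concurrent_enabled=False)
    return all(
        search_shard(segments, term, flag) == baseline
        for flag in (True, False, True)
    )
```

A "rollback" in this model is just another call with the flag set to `False`, which mirrors the point above that no state needs to be migrated when the setting changes.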
Diagnostics
When a failure (known or unknown) happens on the system, do you have sufficient instrumentation to debug?
Yes. Logging exists at various levels and has been utilized during development to find bugs.
User Facing API/settings
What are the new REST APIs to be added by this feature?
https://opensearch.org/docs/latest/search-plugins/concurrent-segment-search/#api-changes
What are the new settings to be added by this feature?
https://opensearch.org/docs/latest/search-plugins/concurrent-segment-search/#disabling-concurrent-search-at-the-index-or-cluster-level
Metrics, Notifications & Visibility
What are the user facing metrics?
What actions need to be taken by the customer if those metrics exceed a threshold? Are there recommended alarms users need to set up?
Guidance around general search performance monitoring remains unchanged and there are no new recommended alarms. The new stats provide concurrent search-specific insights into query performance and will be useful when doing intensive search performance tuning and analysis.
What is the metrics granularity (node level, index level, cluster level, etc.)?
There are index level and node level stats for concurrent search, which is appropriate for this feature.
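For tuning work, one way such stats might be consumed is to derive an average latency per concurrent query. The sketch below assumes the search stats section exposes fields named `concurrent_query_total` and `concurrent_query_time_in_millis`; those names come from the linked API-changes documentation as best recalled, not from this issue, so they should be checked against an actual stats response.

```python
def average_concurrent_query_millis(search_stats):
    """Average time per concurrent query, or None if none have run.

    `search_stats` is a dict shaped like the "search" section of an index
    or node stats response; the field names used here are assumptions.
    """
    total = search_stats.get("concurrent_query_total", 0)
    if total == 0:
        return None
    return search_stats.get("concurrent_query_time_in_millis", 0) / total
```

The same function would apply to either granularity, since only the enclosing object (index or node) differs.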
Related component
Search:Performance