[RFC] Search Query Sandboxing: User Experience #12342

kiranprakash154 · 2024-02-15T21:46:38Z

#8879 - [RFC] High Level Vision for Core Search in OpenSearch
#11061 - [RFC] Query Sandboxing for search requests

A common challenge with managing resources on OpenSearch clusters has been keeping runaway queries in check. With Search Backpressure, the ability to avoid running out of resources is now available, but there is no capability of protecting tenants who might be unfairly penalized. The goal of this RFC is to propose a mechanism for how admin users will be able to organize tenants into different groups (aka Sandboxes), and limit the cumulative resource utilization of these groups. We will mention an idea of how the sandbox is enforced, but will likely be a separate RFC due to its complexity.

What is a Sandbox ?

A sandbox is a logic construct designed for running search requests within the virtual limits. The sandboxing service will track and aggregate a node's resource usage for different sandboxes, and based on the mode of containment, limit or terminate tasks of a sandbox that's exceeding its limits.
A sandbox's definition is stored in cluster state, hence the limits that need to be enforced are available to nodes across the cluster.

Tenets

A sandbox can provide constraints on one or more resources.
A user may hint preference for a sandbox
A request can map to only one sandbox
A request that maps to multiple sandboxes will be assigned to a sandbox with the highest similarity score
A request maps to no sandbox or multiple sandboxes with the same similarity score will be assigned to a system-defined catchAll sandbox called genPop (general population).
The mapping attributes of a sandbox must be unique as compared to other sandboxes
The sandbox thresholds for a resource cumulatively cannot exceed 100%

Schemas

Sandbox Definition

The following is an abstract example of a sandbox’s definition, and is broadly broken into 4 essential elements within the document

{
    "name": "<nameOfSandbox>",
    "attributes": {
        "attributeA": "<condition1,condition2>",
        ...
    },
    "resources": {
        "resourceX": {
            "allocation": "<floatValue:(0-1)>", 
            "threshold": "<floatValue:(0-1)>" 
        },
        ...
    },    
    "enforcement": "<monitor|soft|hard>"
}

name : defines a unique String identifier. Internally a UUID is used
attributes : a collection of unique attributes that need to be matched by a request to be governed by that sandbox. We are considering user_name, indices_name, role, and custom_id
resources : The set of unique resources for which this sandbox is being defined. We are considering jvm and cpu
enforcement : How is the sandbox being enforced. This can be either monitor or soft or hard

Resource Definition

For each resource, a cluster level schema is also required, and the following is an abstract example

{
    "resources": {
        "resourceX": {
            "threshold": {
                "rejection" : "<floatValue:(0-1)>",
                "cancellation" : "<floatValue:(0-1)>"
            },
            "enforcement": "<monitor|soft|hard>"
        },
        ...
    }
}

resources : The limits of a resource within which all the sandboxes must cumulatively operate before requests get rejected or cancelled. rejectionThreshold <= cancellationThreshold
enforcement : How the containment of sandboxes for this resource will be managed. This can be either monitor or soft or hard , and can potentially override a sandbox’s enforcement.

High Level Flow

A request landing on a coordinator node will first need to be mapped to a sandbox, as per the tenets. Once a sandbox has been mapped, all child tasks spawned from the request will also inherit the sandbox allocation, irrespective on which node it runs.

Sandbox Resolution

The sandbox resolution happens on the coordinator node and will persist with all the tasks (and child tasks) of that request for the entirety of its lifecycle.

Thresholds Enforcement

As the sandboxes are enforced at a node level, for each resource, there are 2 thresholds defined cluster-wide:

rejection — Reject new tasks if this threshold is breached
cancellation — Cancel inflight tasks if this threshold is breached

Each Sandbox level thresholds are always proportional to the node level thresholds

The following is an example where

2 sandboxes have 20% and 25% thresholds for a resource, and the rest (100% - 20% - 25% = 55% ) is part of the general population (aka genPop) sandbox.
If a node exceeds 85% usage for a resource, it will actively reject new tasks destined for the overflowing sandboxes.
If the resource usage exceeds 95%, the node will actively cancel inflight tasks on the overflowing sandboxes

Additional Context

This RFC aims to discuss ideas at a high level. More details are provided in this google doc. Anyone with the link has comment access and we would love to gather feedback.

The text was updated successfully, but these errors were encountered:

kiranprakash154 · 2024-02-15T21:55:13Z

@msfroh @dblock @reta @sohami @Bukhtawar @nknize would love your thoughts !

msfroh · 2024-02-19T20:17:21Z

Could we piggy-back on the idea of "views" to enforce sandboxes? A view could be associated with a sandbox.

Alternatively, if we wanted to be more flexible, a view could enforce a search pipeline, then the search pipeline could (conditionally) modify the request to associate it with a sandbox.

peternied · 2024-02-21T16:37:49Z

[Triage - attendees 1 2 3 4 5]
@kiranprakash154 Thanks for creating this issue

msfroh · 2024-02-21T17:15:06Z

My next few messages contain notes from Search Backlog and Triage meeting: https://www.meetup.com/opensearch/events/298411954/. The @ references refer to the people who said the respective lines.

@kkhatua: This is a feature aimed at sysadmins. We want to allow them to segregate their user classes and ration resources per class.
@kiranprakash154: Enforcement has three levels: Monitor, soft, and hard enforcement. Monitor just gives you stats. Soft lets queries in until the node is under duress. Hard limit starts canceling queries when the sandbox is exceeded, whether the node is in duress or not.

Example from the Google doc that helps clarify things more than the example above:

Two Sandboxes defined for memory and CPU

[
 { 
   "name": "analysts",
   "attributes": {
       "role": "ba-*"
   },
   "resources": {
       "jvm": {
           "allocation": "0.5"
       },
       "cpu": {
           "allocation": "0.3"
       },
   },   
   "enforcement": "soft"
},
{
   "name": "operations",
   "attributes": {
       "role": "tellers",
       "indices_name": "transactions-*"
   },
   "resources": {
       "jvm": {
           "allocation": "0.25"
       }
   },   
   "enforcement": "hard"
 }
]

reta · 2024-02-21T17:27:32Z

So I think the usage of the sandbox concept is confusing here: we are not sandboxing anything (we cannot limit resources) but track the resource usage and react on that. Better names would be "resource limits group", "query prioritization group" or something along these lines.

andrross · 2024-02-21T17:34:40Z

Linking #1017 as it seems somewhat related

msfroh · 2024-02-21T17:34:51Z

Stavros: Is this here to prevent long-running queries? What if I have a long-running query and want to run it without impacting others. Can I suspend the query?
- @kkhatua -- Not really. Existing search back pressure will cancel the oldest query if the system is under duress. If another user comes in and spams a lot of queries, we may want to stop those queries and let the long-running query finish.
@ansjcy: What do you mean by "usage"? Queries span multiple nodes. Which node do we measure usage from?
- @kiranprakash154: We assign a query ID when it comes in and track the query across all nodes.
- @jainankitk: We don't need to track resource usage across all nodes. If we have a 10% limit on CPU usage, we enforce that on every node.
@reta: Sandboxing probably isn't the best name. It's more around limits and monitoring. @getsaurabh02: Should probably refer to it as query prioritization.
@kiranprakash154: We map requests to sandboxes based on attributes. If a query maps to more than one sandbox, we map it to a default sandbox.
@msfroh: Putting sandboxing preference in the hands of users seems unnecessarily complex. Only sysadmins should need to worry about resource utilization.
- @peternied: Users didn't want the rules in place. Operators want the rules in place.
- @kkhatua: What if users want to run a load test or something?

msfroh · 2024-02-21T17:39:34Z

@macrakis: This seems like a very low level mechanism. If I'm a sysadmin, I would like to guarantee some performance characteristics, like some queuing system. Classic operating system behavior. How are users supposed to know when they can run their long-running queries effectively?
- @peternied: Do we want to separate individual users?
- @macrakis: This is a bottom-up mechanism. We need to think about this top-down. How are users going to manage long-running queries? We don't have a way of putting queries to sleep, but maybe we need to think about how to do that in future. This feels like a very implementation-oriented approach, rather than a functionality-oriented approach.

msfroh · 2024-02-21T17:47:38Z

@kkhatua: Example usage is folks who change their dashboard to refresh every 30 seconds, down from 5 minutes.
- @macrakis: We can do that by limiting users to different levels. Allocate resources equitably between users. Each of 5 users should be treated equally well. What does equally mean? We can debate that. Maybe the 5th user runs a long-running query that can wait. We try to give them equal access to the system.
- @kkhatua: This is focused on containing users who can impact the system. Big users don't seem to be concerned with fair allocation of resources, but containing users who are overusing the system.

msfroh · 2024-02-21T17:56:01Z

@kkhatua: Expectation is that sysadmins will first run in monitoring mode, and then set soft/hard limits based on the observed steady state.

getsaurabh02 · 2024-02-21T18:01:48Z

Thanks for the RFC @kiranprakash154 and notes @msfroh . Regarding the term "Sandbox" used in the proposal, I'd like to offer another perspective. The term might not fully capture the essence of our approach, as we are not actually pausing or decelerating query execution to manage system load, which is often associated with a sandbox environment. Instead, we're implementing a more granular strategy on the lines of query-level circuit breakers for various resources, such as CPU and memory. This method ensures more precise control over resource allocation.

Additionally, incorporating a priority system for query execution, along with the capability to cancel queries when resource usage exceeds certain thresholds, seems to align with the principles outlined in this proposal.

peternied · 2024-02-21T19:32:14Z

Thanks for writing this all up and taking the time to present the document today. The level of detail and consideration for low level scenarios shows through. In the context of advancing this proposal - I would recommend creating a proof of concept and publishing it as a draft pull request. This approach will not only help us visualize the proposed mechanisms but also enable a hands-on exploration of scenarios that significantly impact users.

As we moving forward I would suggest probing into the following areas:

How much effort and understanding is needed by a sysadmin to enable and manage sandbox? The impact of a poorly tuned threshold can make the cluster unavailable, its important to experiment with how intuitive this process is especially with dynamically changing workloads.
Clusters evolve over time by scaling horizontally (additional nodes) or vertically (larger instances), how does changing these constraints impact sandboxing. What needs to be built so that after a scaling event sandboxing works without manual intervention?
There is overhead with additional monitoring systems and keeping 'head room' in thresholds that directly impacting billing. How can we make sure that the cluster is well tuned to be responsive without underutilization of resources?
When an enforcement an action is performed (queries are rejected or canceled), the feedback loop for the query author is crucial. How are users informed of these actions and what guidance are they provided to help them adjust their queries?
Specifically in OpenSearch Dashboards, how are these enforcement actions are communicated to the end-user is key to a smooth user experience. What methods will be used for graceful handling of these actions that avoiding user confusion?

jainankitk · 2024-02-21T20:50:12Z

@peternied - Thank you for reviewing the proposal and providing detailed feedback. Please find response to some of the questions below:

Clusters evolve over time by scaling horizontally (additional nodes) or vertically (larger instances), how does changing these constraints impact sandboxing. What needs to be built so that after a scaling event sandboxing works without manual intervention?

Scaling the cluster horizontally / vertically will not impact the sandboxing as constraints are such that they are agnostic to such events.

There is overhead with additional monitoring systems and keeping 'head room' in thresholds that directly impacting billing. How can we make sure that the cluster is well tuned to be responsive without underutilization of resources?

Underutilization is an interesting aspect and something we have talked about at length while coming up with the proposal. Every sandbox has option of soft enforcement mode that allows it to exceed the allocated quota assuming the node is not under duress. If the node is under duress, the sandbox will be hard enforced to its pre-allocated quota, since we want to prevent node from going down at any cost.

When an enforcement an action is performed (queries are rejected or canceled), the feedback loop for the query author is crucial. How are users informed of these actions and what guidance are they provided to help them adjust their queries?

While we will not provide any immediate guidance around adjusting the queries, the error message should inform the users clearly about the reason for rejection. In the later phases, we can leverage query insights for providing actionable feedback on making query more efficient.

macrakis · 2024-02-21T22:07:52Z

Thanks for the presentation today. I would like to understand how a sys admin would use sandboxes for the following scenarios: Classic search on the web, e.g., e-commerce (see the Atlassian comment): * I have a steady stream of small live queries (Web users) where I want to bound latency (mean < 100ms, P99 < 1000ms, Pmax < 2000ms). * I mix in some longer-running, lower-priority queries which may use a lot of resources, but which can be re-run if they fail. * I am also performing updates and I don’t want my search catalog to lag more than 10 sec behind the data feed. Fairness in updating dashboards: * I have 10 dashboards, where the user can specify update interval. If I can’t update them all within the update interval, I somehow need to fairly slow them all down – maybe by increasing the floor on the update interval on all of them. (But I also raise an alert.) Mix of batch jobs: * I run all batch jobs, and want them to run first-in, first-out. I assume that I’m running several in parallel at any one time, but when I need to abort one because others need more resources, I want to abort the most recent one. Let me admit that – except for the first case – I have no practical experience with these scenarios. Are they realistic? Can we handle them? -s

Bukhtawar · 2024-03-05T15:20:42Z

I echo my thoughts with @reta and had similar comments on labelling this as "sandbox" since this is more a tenant specific resource limit. While this is a good starting point, we need to see how we tie the bigger picture with other multi-tenant capabilities like

Index level encryption
Tenant specific shard placement for index patterns where users can choose to allocate indices on certain node groups for better query isolation
Prioritised queue to run some queries in the background based on constraints.

kaushalmahi12 · 2024-04-02T17:17:25Z

Thanks @macrakis for providing these scenarios

I have a steady stream of small live queries (Web users) where I want to bound latency (mean < 100ms, P99 < 1000ms, Pmax < 2000ms).

1st use case is more around end to end latency which gets affected due to multiple reasons and involve load characteristics on multiple nodes, while with this feature we are introducing node level resiliency so this doesn't come under the purview of this RFC.

I mix in some longer-running, lower-priority queries which may use a lot of resources, but which can be re-run if they fail.

This one hints more towards async completion of cancellable queries at node level, this is something we can support as 2nd iteration of this. Piggybacking too many things on this RFC will only clutter up and complicate the delivery of this.

I am also performing updates and I don’t want my search catalog to lag more than 10 sec behind the data feed.

This is something controlled using index.refresh_interval setting if I am right, So this is already present.

kaushalmahi12 · 2024-04-02T17:58:56Z

Thanks @Bukhtawar! for taking time to provide your insightful thoughts on this. Naming convention is something we call can come to an agreement to use throughout.

Regarding tying this feature with other multi-tenant capabilities

Index level encryption

Not sure how does this one correlates with node level resiliency, can you be more specific ?

Tenant specific shard placement for index patterns where users can choose to allocate indices on certain node groups for better query isolation

But given this is a node level resiliency feature and given sharding mechanism distributes the data fairly well. I think this can potentially create the skewness in data distribution, Since we will track and limit the resource usage at node level for search workload then This will work fine as long as those query groups are not oversubscribing the resources. We can only isolate the search queries when the indexing traffic is not shared with search traffic.

Prioritised queue to run some queries in the background based on constraints.

I suppose you are hinting towards async completion of cancellable queries or do you mean something else ?

Since we are not introducing the priority concept for these query groups in the first iteration, We can take this up later.

Regarding naming convention instead of Sandbox for this feature, what should we go with (I can make few initial suggestions)

QueryGroupResourceBucket
QuerySquadron
QueryConvoy

stephen-crawford · 2024-04-04T19:29:09Z

Hi @kaushalmahi12, @jainankitk, @kiranprakash154 I just wanted to check in over here.

I know we've chatted a bit about this and I want to share that I am neither for/nor against making this or any of the related changes from a code perspective at the moment. I went ahead and left some basic comments on @kaushalmahi12's draft and don't have too many suggestions at this point.

It seems like there is some confusion around the use case for this feature; while I appreciate how detailed this RFC is (wow!), I don't know if you can point to some other issues or comments which ask for something like this?

smacrakis · 2024-04-04T21:26:53Z

Looking this over again, several of us have pointed out that "sandbox" is probably the wrong name. Besides implying incorrectly that it has something to do with testing or with security, it doesn't explicitly mention that it is about resources or groups of users.
I like @reta's suggestion of "resource limits group".

kaushalmahi12 · 2024-06-19T22:28:28Z

Based on the discussion with folks we had decided to move the APIs to a plugin due to following reasons

Not to pollute the core code with feature specific APIs

We will make it as a core plugin due to following reasons

Since this feature sits majorly in core as the tracking and cancellation logic will reside in the core.
Fundamental constructs for this feature QueryGroup is also residing in core because of the tracking and cancellation.
It would be hard to correlate the feature if the partial changes are in core and partial in plugin.
We don;t expect iterative deleopment happening on the plugin side.

andrross · 2024-06-20T23:02:13Z

@kaushalmahi12 Should it be a module? If the reasons to separate it from the core code are for modularity/architectural reasons, but the majority of the feature actually sits in the core, then it sounds like a module might be a better fit. There are two major differences between a module and a plugin:

Modules use the same classloader of core, whereas each plugin gets its own classloader
Modules are installed by default with no reasonable way for users to opt-out of them

This plugin appears to have no new dependencies, so I'm not sure the classloader difference is important. The other point is an important difference though. Are there cases where you would not want this feature to be installed?

kaushalmahi12 · 2024-06-21T19:09:13Z

@andrross These are important facts to consider when making the decision to move the feature into a plugin/module. I think it might not make sense for all users since this feature will need enablement so the code might sit dead in the artifact.
If that is fine with the community then It should be fine to move it to a module.
@andrross What are your thoughts on this ?

andrross · 2024-06-24T17:05:45Z

code might sit dead in the artifact

Core plugins are in the min distribution artifact. They aren't loaded into the JVM unless installed, but the overhead of loading these classes is likely negligible.

feature will need enablement

If the act of installing the plugin is the enablement mechanism, then a plugin is probably the right place because then you'd have to build some other enabling mechanism.

@msfroh What do you think regarding plugin vs module here?

kiranprakash154 added enhancement Enhancement or improvement to existing feature or request untriaged labels Feb 15, 2024

github-actions bot added the Search Search query, autocomplete ...etc label Feb 15, 2024

github-project-automation bot added this to Search Project Board Feb 15, 2024

github-project-automation bot moved this to 🆕 New in Search Project Board Feb 15, 2024

kiranprakash154 self-assigned this Feb 15, 2024

kiranprakash154 added the RFC Issues requesting major changes label Feb 15, 2024

kiranprakash154 added the discuss Issues intended to help drive brainstorming and decision making label Feb 15, 2024

kkhatua added the Search:Resiliency label Feb 16, 2024

peternied removed the untriaged label Feb 21, 2024

kaushalmahi12 mentioned this issue Apr 2, 2024

Qsb framework changes #13007

Closed

8 tasks

github-project-automation bot moved this to Planned work items in OpenSearch Roadmap May 31, 2024

ruai0511 mentioned this issue Jun 19, 2024

Feature/QueryGroup Add Create API in plugin #14459

Closed

8 tasks

ruai0511 mentioned this issue Jun 21, 2024

Add QueryGroup API documentation opensearch-project/opensearch-api-specification#356

Closed

kaushalmahi12 mentioned this issue Jul 1, 2024

Add changes to propagate queryGroupId across child requests and nodes #14614

Merged

3 tasks

ruai0511 mentioned this issue Jul 9, 2024

Add Create QueryGroup API Logic #14680

Merged

8 tasks

kaushalmahi12 mentioned this issue Jul 10, 2024

Add queryGroupId to search workload tasks at co-ordinator and data node level #14708

Merged

3 tasks

This was referenced Jul 10, 2024

Add Get QueryGroup API Logic #14709

Merged

Add Delete QueryGroup API Logic #14735

Merged

Add Update QueryGroup API Logic #14775

Merged

ruai0511 mentioned this issue Jul 30, 2024

[Workload Management] Add QueryGroupServiceSettings #15028

Merged

3 tasks

kaushalmahi12 mentioned this issue Aug 5, 2024

[META] QueryGroup level stats structure and api #15120

Closed

This was referenced Aug 21, 2024

Add query group stats related structures #15337

Closed

Add query group stats related structures #15338

Closed

add query group stats constructs #15343

Merged

Add query group level rejection logic #15428

Merged

kkhatua moved this from New to In Progress in OpenSearch Roadmap Aug 26, 2024

kkhatua added the v2.17.0 label Aug 26, 2024

kkhatua assigned kaushalmahi12 Aug 26, 2024

ruai0511 mentioned this issue Sep 5, 2024

[Workload Management] QueryGroup Stats API Logic #15777

Merged

8 tasks

kaushalmahi12 mentioned this issue Sep 13, 2024

Add wlm resiliency orchestrator (query group service) #15925

Merged

3 tasks

ruai0511 mentioned this issue Sep 16, 2024

Add Integration Tests for Workload Management CRUD APIs #15955

Merged

3 tasks

kkhatua added the v2.18.0 Issues and PRs related to version 2.18.0 label Oct 7, 2024

kkhatua unassigned kiranprakash154 Oct 7, 2024

This was referenced Oct 15, 2024

[Workload Management] Add Workload Management Stats IT #16341

Open

[Workload Management] Add Workload Management IT #16359

Merged

kkhatua added the experimental Marks issues as experimental label Oct 28, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[RFC] Search Query Sandboxing: User Experience #12342

[RFC] Search Query Sandboxing: User Experience #12342

kiranprakash154 commented Feb 15, 2024 •

edited

Loading

kiranprakash154 commented Feb 15, 2024 •

edited

Loading

msfroh commented Feb 19, 2024

peternied commented Feb 21, 2024

msfroh commented Feb 21, 2024 •

edited

Loading

reta commented Feb 21, 2024 •

edited

Loading

andrross commented Feb 21, 2024 •

edited

Loading

msfroh commented Feb 21, 2024

msfroh commented Feb 21, 2024

msfroh commented Feb 21, 2024

msfroh commented Feb 21, 2024

getsaurabh02 commented Feb 21, 2024

peternied commented Feb 21, 2024

jainankitk commented Feb 21, 2024

macrakis commented Feb 21, 2024 via email

Bukhtawar commented Mar 5, 2024

kaushalmahi12 commented Apr 2, 2024

kaushalmahi12 commented Apr 2, 2024 •

edited

Loading

stephen-crawford commented Apr 4, 2024

smacrakis commented Apr 4, 2024

kaushalmahi12 commented Jun 19, 2024 •

edited

Loading

andrross commented Jun 20, 2024

kaushalmahi12 commented Jun 21, 2024

andrross commented Jun 24, 2024

[RFC] Search Query Sandboxing: User Experience #12342

[RFC] Search Query Sandboxing: User Experience #12342

Comments

kiranprakash154 commented Feb 15, 2024 • edited Loading

What is a Sandbox ?

Tenets

Schemas

Sandbox Definition

Resource Definition

High Level Flow

Sandbox Resolution

Thresholds Enforcement

Additional Context

kiranprakash154 commented Feb 15, 2024 • edited Loading

msfroh commented Feb 19, 2024

peternied commented Feb 21, 2024

msfroh commented Feb 21, 2024 • edited Loading

reta commented Feb 21, 2024 • edited Loading

andrross commented Feb 21, 2024 • edited Loading

msfroh commented Feb 21, 2024

msfroh commented Feb 21, 2024

msfroh commented Feb 21, 2024

msfroh commented Feb 21, 2024

getsaurabh02 commented Feb 21, 2024

peternied commented Feb 21, 2024

jainankitk commented Feb 21, 2024

macrakis commented Feb 21, 2024 via email

Bukhtawar commented Mar 5, 2024

kaushalmahi12 commented Apr 2, 2024

kaushalmahi12 commented Apr 2, 2024 • edited Loading

stephen-crawford commented Apr 4, 2024

smacrakis commented Apr 4, 2024

kaushalmahi12 commented Jun 19, 2024 • edited Loading

andrross commented Jun 20, 2024

kaushalmahi12 commented Jun 21, 2024

andrross commented Jun 24, 2024

kiranprakash154 commented Feb 15, 2024 •

edited

Loading

kiranprakash154 commented Feb 15, 2024 •

edited

Loading

msfroh commented Feb 21, 2024 •

edited

Loading

reta commented Feb 21, 2024 •

edited

Loading

andrross commented Feb 21, 2024 •

edited

Loading

kaushalmahi12 commented Apr 2, 2024 •

edited

Loading

kaushalmahi12 commented Jun 19, 2024 •

edited

Loading