[RFC] Query Sandboxing for search requests #11061

kaushalmahi12 · 2023-11-02T17:34:15Z

Co-author - jainankitk
Parent RFC - #8879

Introduction

In most information retrieval (IR) systems it is common for some tenants to suffer a performance impact caused by other tenants, because a few tenants can consume a significant share of system resources and leave the others deprived. It is therefore critical for an IR system to minimize this performance impact for all tenants. One way to do that is to let users define and configure limits and priorities for the tenants of the system.
Node drops and poor performance caused by badly behaving tenant queries are a common pain point in OpenSearch clusters. Tenant-based performance isolation in OpenSearch is therefore critical for improving the resiliency and stability of the cluster, with Cx value as a bonus.

Here I take a tenant to be a user or an index, but the term is not limited to these.

Problem

OpenSearch currently has no mechanism for tenant-based performance isolation of search workloads. We want to enable admin users of an OpenSearch cluster to manage tenant-based Sandboxes that enforce resource-based limits on tenant queries. Each Sandbox will have a priority that determines the cancellation order in node duress situations.

Scope

Since we want to partition resources among the tenants on a node, it makes more sense to confine this feature to the node level so that node-level resiliency is achieved.

Use cases

User-based performance isolation.
Index-based resource usage enforcement for search workloads, e.g. hot and warm.
Skewed search traffic throttling, i.e. if one index is receiving most of the queries on a node and causing other search requests to be throttled or rejected, we can avoid this by confining that index's queries to a sandbox.

Proposal

We propose a reactive mechanism that actively tracks resource usage per tenant and cancels queries when one or more tenants oversubscribe system resources. This will help us identify and cancel rogue queries and maintain node stability.

As part of this proposal we will introduce a new software construct called a Sandbox, which will be attribute based and which admin users of an OpenSearch cluster can manage (CRUD) at node level. The attributes we select will be generic across all users (domains/clusters). For system resources we will track jvmAllocations (chosen because of a JDK API limitation on thread-level current JVM usage) and cpuUtilisation.
We are not considering other system resources such as network I/O and disk I/O, for the following reasons:

There is no Java API for getting these per thread.
We could get them from /proc, but that data is loaded from kernel data structures that store it in binary form, and reading per-thread I/O stats costs three system calls (open, read, close).
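
The two metrics the proposal does plan to track can be read per thread from the JVM. Below is a minimal sketch, assuming a HotSpot-based JVM where the `com.sun.management.ThreadMXBean` extension is available. `ThreadResourceSampler` is a hypothetical helper name, not an OpenSearch class. Note that the allocation figure is cumulative bytes allocated by the thread, not its current heap usage.

```java
import java.lang.management.ManagementFactory;

// Hypothetical helper (not part of OpenSearch): samples per-thread
// allocation and CPU counters exposed by the JVM.
final class ThreadResourceSampler {
    // Assumption: the platform bean implements the HotSpot-specific
    // com.sun.management.ThreadMXBean extension; on other JVMs this cast fails.
    private static final com.sun.management.ThreadMXBean THREADS =
        (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();

    /** Cumulative bytes allocated by the thread, or -1 if not supported/enabled. */
    static long allocatedBytes(long threadId) {
        if (!THREADS.isThreadAllocatedMemorySupported()
            || !THREADS.isThreadAllocatedMemoryEnabled()) {
            return -1L;
        }
        return THREADS.getThreadAllocatedBytes(threadId);
    }

    /** Cumulative CPU time of the thread in nanoseconds, or -1 if not supported/enabled. */
    static long cpuTimeNanos(long threadId) {
        if (!THREADS.isThreadCpuTimeSupported() || !THREADS.isThreadCpuTimeEnabled()) {
            return -1L;
        }
        return THREADS.getThreadCpuTime(threadId);
    }
}
```

Because both counters are cumulative, a tracker would sample them when a request starts on a thread and again later, and attribute the difference to that request.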

A Sandbox will track the resources used by all requests associated with it and will try to enforce its resource usage limits. In node duress scenarios we will cancel queries from low-priority sandboxes. Since tracking resource usage for too many sandboxes could itself become an overhead, we can use a cluster-level setting to cap the number of sandboxes per node.
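
The tracking and cancellation order described above can be pictured with a small sketch. Everything here is hypothetical: the RFC does not define a concrete API, so `Sandbox`, `CancellationPlanner`, the single allocation limit, and the convention that a lower `priority` value means lower priority are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical model of a sandbox: a priority, one resource limit, and a usage counter.
final class Sandbox {
    final String name;
    final int priority;              // assumption: lower value = lower priority
    final long allocationLimitBytes; // assumption: one limit, on allocated bytes
    private final AtomicLong allocatedBytes = new AtomicLong();

    Sandbox(String name, int priority, long allocationLimitBytes) {
        this.name = name;
        this.priority = priority;
        this.allocationLimitBytes = allocationLimitBytes;
    }

    /** Adds usage reported by a request associated with this sandbox. */
    void record(long bytes) {
        allocatedBytes.addAndGet(bytes);
    }

    boolean breached() {
        return allocatedBytes.get() > allocationLimitBytes;
    }
}

final class CancellationPlanner {
    /**
     * Returns the sandboxes whose queries are cancellation candidates.
     * Under node duress every sandbox is a candidate; otherwise only those
     * over their limit. Candidates are ordered lowest priority first.
     */
    static List<Sandbox> candidates(List<Sandbox> sandboxes, boolean nodeInDuress) {
        List<Sandbox> result = new ArrayList<>();
        for (Sandbox s : sandboxes) {
            if (nodeInDuress || s.breached()) {
                result.add(s);
            }
        }
        result.sort(Comparator.comparingInt((Sandbox s) -> s.priority));
        return result;
    }
}
```

For example, with a "hot" sandbox (priority 10) over its limit and a "warm" sandbox (priority 1) within its limit, only "hot" is a candidate in normal operation, while under node duress "warm" is listed first because it has the lower priority.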

We plan to start with a reactive mechanism, i.e. track and cancel on contention or threshold breaches. Going forward, we want to build a robust search query cost estimation framework so that the majority of such search queries can be cancelled upfront.

Future Improvements

Hard cancellation - This feature depends on hard cancellation to be highly effective, since the most it can do on its own is hint towards cancellation.
Search Query Cost Estimation - This component will estimate the resource usage of search queries, which can help reject queries upfront based on their estimated cost.
Async completion of cancellable queries - We can punt these queries to a sandbox dedicated to async queries, where they can complete at some later point in time.
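
To illustrate why a cancellation that is only a hint has limited effect, here is a minimal sketch of cooperative cancellation. `CancellableWork` is a hypothetical class, and the model (a flag checked between units of work) is an assumption for illustration rather than a description of OpenSearch's task framework.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.IntConsumer;

// Hypothetical task type: cancellation only sets a flag, and the running
// code decides when (or whether) to look at it.
final class CancellableWork {
    private final AtomicBoolean cancelled = new AtomicBoolean(false);

    /** Requests cancellation; this is a hint, it does not interrupt anything. */
    void cancel() {
        cancelled.set(true);
    }

    /**
     * Runs up to `units` units of work, checking the flag before each unit.
     * A unit that has already started always runs to completion.
     * Returns the number of units completed.
     */
    int run(int units, IntConsumer unitOfWork) {
        int done = 0;
        for (int i = 0; i < units; i++) {
            if (cancelled.get()) {
                break; // the hint is only honored at this checkpoint
            }
            unitOfWork.accept(i);
            done++;
        }
        return done;
    }
}
```

In this model, resources are released only when the task reaches its next checkpoint, so a single long-running unit keeps consuming CPU and memory after `cancel()` is called. That gap is what hard cancellation would close.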


getsaurabh02 · 2023-11-02T21:42:29Z

Thanks @kaushalmahi12 for proposing this. This is going to be super useful for maintaining the resiliency of clusters, especially the large ones with multiple tenants or users.

I like the idea of a construct called Sandbox that is attribute based. However, instead of making it generic across all users, can we introduce a concept of user-account/tenant-id that admins can define and configure, with multiple sandboxing configurations attached to it? That way, queries passed with a given user-account/tenant-id automatically map to one of the sandboxing configurations, and node limits are applied to them dynamically.

It will then allow system/cluster admins to create/maintain multiple sandboxing configurations and map them to the group of internal users (or tenants). This can then further be integrated with the Security Plugin in near future to associate these ids with user-roles and cluster permissions, making it a more concrete construct. Thoughts?

This will also provide common extension points for associating these user-account/tenant-id values with Top-N queries, which is focused on extended visibility into individual queries, while allowing admins to map rogue or slow queries back to users.

reta · 2023-11-03T01:18:29Z

I think this idea was discussed in #8879

kaushalmahi12 · 2023-11-03T17:44:35Z

@getsaurabh02 Thanks for your suggestions!

However, instead of making it generic across all users, can we introduce a concept of user-account/tenant-id which admins can use to define/configure and have multiple sandboxing configurations attached to them.

What I meant to convey is that these attributes should be available across all OpenSearch clusters, since they will be part of either authN/authZ or the request attributes. Within a cluster we can always define new entities and associate them with sandboxes. Does this make sense, or am I missing something?

It will then allow system/cluster admins to create/maintain multiple sandboxing configurations and map them to the group of internal users (or tenants). This can then further be integrated with the Security Plugin in near future to associate these ids with user-roles and cluster permissions, making it a more concrete construct. Thoughts?

I agree. Since the feature inherently limits access to resources, the Security Plugin can help provide a more concrete and robust mechanism for resource access.

ansjcy · 2024-01-16T07:31:54Z

@kaushalmahi12 Thanks for the proposal! As @getsaurabh02 mentioned, I think that with the combination of Query Sandboxing and Top N Queries, OpenSearch admin users can potentially have both better control over and better visibility into rogue queries by tenant.

Regarding getting the resource usage data from /proc: the performance analyzer plugin already has that data. Have you explored whether it is possible to associate this data with specific queries?

kaushalmahi12 added enhancement Enhancement or improvement to existing feature or request untriaged Search:Resiliency labels Nov 2, 2023

getsaurabh02 assigned kaushalmahi12 Nov 3, 2023

kkhatua added v3.0.0 Issues and PRs related to version 3.0.0 and removed untriaged labels Nov 8, 2023

kaushalmahi12 mentioned this issue Nov 17, 2023

[PROPOSAL][Query Sandboxing] Query Sandboxing high level approach. #11173

Open

jainankitk mentioned this issue Dec 28, 2023

[RFC] High Level Vision for Core Search in OpenSearch #8879

Open

kaushalmahi12 mentioned this issue Jan 10, 2024

[RFC][QSB] Approached to enforcement of system resource limits #11846

Open

kaushalmahi12 mentioned this issue Jan 16, 2024

[QSB Meta Issue] Association and Accounting of Requests in QSB #11900

Open

peternied mentioned this issue Feb 14, 2024

[Feature Request] Ability to (filter) trace the bulk/search requests with high latency or resource consumption #12315

Open

kiranprakash154 mentioned this issue Feb 15, 2024

[RFC] Search Query Sandboxing: User Experience #12342

Open

peternied mentioned this issue Feb 28, 2024

[RFC] Isolating CPU for specific workloads(search or others) #12483

Open

mvanderlee mentioned this issue Mar 1, 2024

[BUG] Shards stuck initializing #12398

Closed

andrross added the Roadmap:Search Project-wide roadmap label label May 29, 2024

getsaurabh02 assigned jainankitk Aug 8, 2024

Bukhtawar mentioned this issue Aug 19, 2024

Adding @jainankitk as a Maintainer #15304

Merged

kkhatua added the v2.17.0 label Sep 9, 2024

kkhatua mentioned this issue Sep 10, 2024

[DOC] Workload management at tenant level to improve resiliency opensearch-project/documentation-website#8194

Open

4 tasks
