[Design proposal] Point in time search #3960
Comments
@bharath-techie : As today Sep 07, is the code freeze date for OpenSearch. Is there anything pending on this issue ?
@bharath-techie is this still on track for 2.4 release? code freeze on 11/3
@anasalkouz Yes this is on track for 2.4 release and will be closing it by next week (before 11/3)
Overview
Today, in OpenSearch, if you run the same search query at different points in time, you will likely get different results because the data is constantly changing. However, in real-world scenarios, such as analysing data or providing a consistent experience to your end users, you may want the results of a query to stay the same while the context remains the same, and to control when changes appear in the result set. You want to be able to run multiple queries on the same data set and paginate through it with consistent results. This is not possible with the options currently available in OpenSearch.
OpenSearch currently supports the following options to achieve pagination, each having a certain limitation:
Point in time search can be used to address some of the above shortcomings.
What is Point in Time Search
A user can create a Point In Time for a set of indices. A Point In Time is a set of segments frozen as they were at the moment of creation. We lock the segments of those indices’ shards and create contexts to access and query only those shards.
The Create PIT API returns a PIT Id. This PIT ID can be used as part of multiple search queries to get results on that view of data. The indices might ingest more documents or process modification/deletion of documents, but the Point In Time created will be immune to that and provide a view of data which remains unchanged from the time of creation. This allows users to run multiple search queries on a given Point In Time and gives the same result every time.
For feature proposal please refer to - #1147
A PIT only takes into account the data that exists at the moment it is created. At a lower level, this implies that none of the resources required to return the data from the initial request are modified or deleted. Segments are kept around even if they have already been merged away and are no longer needed for the live data set. This results in keeping more data around than just the live data set. More data means more segments, more file handles, and more heap to hold segment metadata.
Requirements
Functional Requirements
Non functional Requirements
Scope
The scope of this document is limited to the core OpenSearch changes needed to support PIT. For V1, we are considering the following functionality:
Point In Time Constructs
Point In Time Id
The identifier for a Point In Time, base64 URL encoded. It chiefly encapsulates the following information: for each shard that is part of the PIT, it records which node holds the PIT segments, i.e. it contains a map of shard ID to node routing. Not every node that holds a copy of a given shard will have segments marked for the PIT.
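To make the idea of an ID that carries its own routing concrete, here is a minimal sketch. It assumes a JSON payload and hypothetical shard and node names purely for illustration; the actual serialization used for the PIT ID is not described in this document and will differ.

```python
import base64
import json
from typing import Dict


def encode_pit_id(shard_to_node: Dict[str, str]) -> str:
    """Pack a shard -> node routing map into a URL-safe base64 string.

    JSON is an assumption made for readability; it is not the real wire format.
    """
    payload = json.dumps(shard_to_node, sort_keys=True).encode("utf-8")
    return base64.urlsafe_b64encode(payload).decode("ascii")


def decode_pit_id(pit_id: str) -> Dict[str, str]:
    """Recover the routing map so a search can be sent to the right nodes."""
    raw = base64.urlsafe_b64decode(pit_id.encode("ascii"))
    return json.loads(raw)


# Hypothetical shard and node names.
routing = {"logs[0]": "node-a", "logs[1]": "node-b"}
pit_id = encode_pit_id(routing)
assert decode_pit_id(pit_id) == routing
```

Because the routing travels inside the ID, a coordinating node needs no extra lookup to find the shard copies that hold the PIT segments.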
Point In Time Reader Context
For one copy of each shard in the PIT, a PIT Reader Context is created. A Reader Context holds a reference to a Lucene IndexSearcher. An IndexSearcher implements search over a single IndexReader. An IndexReader provides an interface for accessing a point-in-time view of an index. Any changes made to the index via IndexWriter will not be visible until a new IndexReader is opened. Hence, by holding onto the searcher, we are able to query the shards exclusively as of that Point In Time.
The PIT Reader Context also holds information about the segments it retains for that shard. This helps us in monitoring the disk utilization of PITs.
What happens when we delete the PIT while a Point In Time Search is running:
When a PIT is marked for deletion, its reference count (refCounted in above POJO) is decremented by 1. But every search query that is using this PIT Reader Context will increment the reference by 1. The PIT Reader Context is only deleted when the refCounted hits 0. Hence even if a PIT is requested to be deleted, it is reaped from memory and resources held are released only once all searches are completed and no references are being held.
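The delete-while-searching behaviour can be sketched as follows. This is a simplified Python stand-in, not the actual implementation; the class name `PitReaderContext` and its methods are hypothetical and only mirror the reference-counting rules described above.

```python
import threading


class PitReaderContext:
    """Hypothetical stand-in for a PIT Reader Context with reference counting."""

    def __init__(self, pit_id: str) -> None:
        self.pit_id = pit_id
        self._ref_count = 1  # the reference held on behalf of the PIT itself
        self._lock = threading.Lock()
        self.closed = False

    def try_inc_ref(self) -> bool:
        """Take a reference for a search; fails once the context is closed."""
        with self._lock:
            if self._ref_count == 0:
                return False
            self._ref_count += 1
            return True

    def dec_ref(self) -> None:
        """Drop a reference; the context is closed when the count hits 0."""
        with self._lock:
            if self._ref_count == 0:
                raise RuntimeError("context already closed")
            self._ref_count -= 1
            if self._ref_count == 0:
                self.closed = True  # segments and file handles would be released here


ctx = PitReaderContext("pit-1")
assert ctx.try_inc_ref()      # a search starts using the context
ctx.dec_ref()                 # Delete PIT drops the PIT's own reference
assert not ctx.closed         # the running search still holds a reference
ctx.dec_ref()                 # the search completes
assert ctx.closed             # only now are resources released
assert not ctx.try_inc_ref()  # later searches can no longer use the context
```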
High Level API Interactions
Create Point In Time API
Unlike a scroll, by creating a dedicated PIT, we decouple the context from a single query and make it re-usable across arbitrary search requests by passing the PIT Id. We can achieve this by using the Create Point In Time API.
Components
Why we need to store PIT ID :
Failure Scenarios:
Phase 1
Phase 2
Delete Point In Time API
PITs are automatically closed when the keep_alive has elapsed. However, keeping PITs open has a cost; hence, they should be deleted as soon as they are no longer used in search requests. A PIT can also be deleted, and its resources freed, before its keep_alive elapses by using the Delete Point In Time API.
Request body: (Required, string or array of strings) PIT IDs to be cleared. To clear all PIT IDs, use _all.
Response body:
For each PIT Id requested to be deleted, we return a nested object with following fields
pitId : pit id
successful : whether the PIT was successfully deleted. Both partial and complete failures are treated as failures.
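The rule that partial failures count as failures can be illustrated with a small sketch. The per-shard outcome lists and the helper name `build_delete_response` are assumptions made for this example; only the `pitId` and `successful` fields come from the description above.

```python
from typing import Dict, List


def build_delete_response(shard_outcomes: Dict[str, List[bool]]) -> List[Dict]:
    """Build one entry per PIT.

    A PIT is reported as deleted only if every one of its shard contexts
    was freed; one failed shard makes the whole PIT a failure.
    """
    return [
        {"pitId": pit_id, "successful": all(outcomes)}
        for pit_id, outcomes in shard_outcomes.items()
    ]


# Made-up outcomes: pit-b fails on one shard, pit-c fails on all shards.
response = build_delete_response({
    "pit-a": [True, True, True],
    "pit-b": [True, False, True],
    "pit-c": [False, False],
})
for entry in response:
    print(entry)
```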
Cross cluster behavior
Delete by ID fully supports deleting cross cluster PITs.
Delete All only deletes PITs which are local and also mixed (PITs created from both local and remote clusters)
Fully remote PITs won't be deleted by Delete ALL API.
List all PIT API
This API returns all PIT IDs present in the OpenSearch cluster, which can be used by clients / UI.
Response body:
_nodes - Contains success and failure stats of all the nodes.
Id - PIT id
Creation time - PIT creation time ( same creation time is propagated to all reader contexts during create pit )
Keep Alive - Keep alive of PIT id
Cross cluster behavior
List all retrieves PITs which are local and also mixed (PITs created from both local and remote clusters)
Fully remote PITs won't be retrieved by List ALL PITs API.
Failure scenarios
If the requests to all nodes fail, the request returns 5xx
If the requests to some nodes fail, success/failure info will be part of the response
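The two failure scenarios could be handled on the coordinating side roughly as in the sketch below. The function name, the key names inside `_nodes` and each PIT entry, and the choice of 500 as the concrete 5xx status are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple, Union

NodeResult = Union[List[Dict], Exception]


def aggregate_list_pits(node_results: Dict[str, NodeResult]) -> Tuple[int, Dict]:
    """Merge per-node PIT listings into one response.

    node_results maps a node name to its list of PIT entries, or to the
    exception raised when that node could not be reached.
    """
    pits: Dict[str, Dict] = {}
    failures = []
    for node, result in node_results.items():
        if isinstance(result, Exception):
            failures.append({"node": node, "reason": str(result)})
            continue
        for pit in result:
            # The same PIT has reader contexts on several nodes; report it once.
            pits[pit["id"]] = pit

    total = len(node_results)
    if total > 0 and len(failures) == total:
        return 500, {"error": "all nodes failed", "failures": failures}

    body = {
        "_nodes": {
            "total": total,
            "successful": total - len(failures),
            "failed": len(failures),
        },
        "pits": list(pits.values()),
    }
    return 200, body


pit = {"id": "pit-1", "creation_time": 1658146050064, "keep_alive": 60000}
status, body = aggregate_list_pits({
    "node-a": [pit],
    "node-b": [pit],
    "node-c": RuntimeError("connection refused"),
})
assert status == 200 and body["_nodes"]["failed"] == 1 and len(body["pits"]) == 1
```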
Point In Time Segments API
This API provides information about the disk utilization of a PIT. It returns low-level information about the Lucene segments a PIT is comprised of; similar to the cat segments API.
How do we get the information about the exact segments that are utilised by a PIT? When we open reader contexts for each of the PIT’s shards, we also store the segment info in the PIT reader Context.
Request body: PIT ID
If no PIT ID is specified, segment details of all PITs will be returned.
ENHANCEMENT to the existing node stats API to capture PIT stats
You can check how many PIT contexts are active with the nodes stats API in the search section. The API already shows similar information for Scrolls. We are simply replicating the logic to publish similar stats for PITs.
Using a PIT Id in a search request
[Support for submitting a search query with a Point In Time Id already exists.]
The result from the above request includes an id, which should be passed as the id of the pit parameter of a search request. A search request with the pit parameter must not specify index, routing, or preference, as these parameters are copied from the point in time. The id parameter tells OpenSearch to execute the request using contexts from this point in time. Optionally, the keep_alive parameter can be passed in the PIT body. It tells OpenSearch how long it should extend the retention of the point in time.
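A client-side helper that enforces these rules might look like the sketch below. The helper name `build_pit_search` and the way URL parameters are passed are hypothetical; the `pit`, `id`, and `keep_alive` names follow the description above.

```python
from typing import Dict, Optional


def build_pit_search(query: Dict, pit_id: str,
                     keep_alive: Optional[str] = None,
                     **url_params: str) -> Dict:
    """Build a search body that runs against a Point In Time.

    index, routing, and preference are rejected because they are taken
    from the PIT itself.
    """
    forbidden = {"index", "routing", "preference"} & set(url_params)
    if forbidden:
        raise ValueError(
            "not allowed together with pit: " + ", ".join(sorted(forbidden))
        )
    pit = {"id": pit_id}
    if keep_alive is not None:
        pit["keep_alive"] = keep_alive  # extends the retention of the PIT
    return {"query": query, "pit": pit}


# "example-pit-id" is a placeholder, not a real PIT ID.
body = build_pit_search({"match_all": {}}, "example-pit-id", keep_alive="1m")
print(body)
```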
Running a Point In Time Search After PIT has expired
This will cause the search query to fail with SearchPhaseExecutionException as all shards in the query would fail because PIT contexts wouldn’t be found for any of them.
Pagination using Point In Time Search
Using search_after
When running a Point In Time Search (search queries with PIT Id), you can use the search_after parameter to retrieve the next page of hits using a set of sort values from the previous page. Using search_after requires multiple search requests with the same query and sort values.
The user can first run a search query with a PIT.
The response would have 10000 results. To get the next page of 10000 results, the user can rerun the previous search using the last doc’s sort values as the search_after argument. The search’s query, sort, and pit.id arguments must remain unchanged.
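The paging loop can be sketched against an in-memory snapshot that stands in for the frozen PIT view. The `search` function here is a local simulation, not a call to OpenSearch, and the page size is reduced so the behaviour is easy to follow.

```python
from typing import List, Optional, Tuple

Hit = Tuple[int, str]  # (sort value, document id)

# Frozen snapshot standing in for the PIT view; sort values are unique here.
SNAPSHOT: List[Hit] = [(ts, f"doc-{ts}") for ts in range(1, 26)]


def search(size: int, search_after: Optional[int] = None) -> List[Hit]:
    """Return the next `size` hits in sort order, strictly after `search_after`."""
    hits = sorted(SNAPSHOT)
    if search_after is not None:
        hits = [hit for hit in hits if hit[0] > search_after]
    return hits[:size]


def paginate(size: int) -> List[List[Hit]]:
    """Fetch every page by feeding the last sort value into the next request."""
    pages: List[List[Hit]] = []
    last_sort_value: Optional[int] = None
    while True:
        page = search(size, last_sort_value)
        if not page:
            break
        pages.append(page)
        last_sort_value = page[-1][0]  # sort value of the last hit on the page
    return pages


pages = paginate(10)
print([len(page) for page in pages])  # [10, 10, 5]
```

Because the snapshot never changes between requests, no hit is skipped or repeated across pages, which is the guarantee the PIT provides for the real search.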
Fetching multiple pages in parallel
Using PIT with search_after gives you control over the ordering of results across pages. It requires the user to fetch one page after another, because the last result of the current page is needed to fetch the next page. This caters to many use cases, but some users may not care about the ordering and prefer to leave the pagination to OpenSearch. For example, they may simply want 100k results paginated into 10 pages. They may also need all 10 pages at once rather than sequentially, or the ability to jump from page 3 to page 10 without having to know which document to “search_after”. That is where the second pagination option, slicing, is a solution.
Search slicing
PIT searches can further be improved by slicing them, i.e. split the PIT search in multiple slices which can be consumed independently by your client application.
So, say you have a query which is supposed to return 1,000,000 hits, and you want to PIT search over that result set in batches of 50,000 hits, using a normal PIT query (i.e. without slicing), your client application will have to make the first PIT call and then 20 more synchronous calls (i.e. one after another) to retrieve each batch of 50K hits.
By using slicing, you can parallelize the 20 PIT calls. If your client application is multi-threaded, you can make each PIT call use 5 (e.g.) slices, and thus, you'll end up with 5 slices of ~10K hits that can be consumed by 5 different threads in your application, instead of having a single thread consume 50K hits. You can thus leverage the full computing power of your client application to consume those hits.
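The effect of slicing can be sketched with an in-memory snapshot and a thread pool. The hash-based slice assignment below is an assumption chosen for illustration; how OpenSearch assigns documents to slices is not covered in this document.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor
from typing import List

# Frozen snapshot standing in for the PIT view.
SNAPSHOT: List[str] = [f"doc-{i}" for i in range(1000)]


def slice_of(doc_id: str, max_slices: int) -> int:
    """Assign a document to a slice with a stable hash (illustrative only)."""
    return zlib.crc32(doc_id.encode("utf-8")) % max_slices


def search_slice(slice_id: int, max_slices: int) -> List[str]:
    """Return only the hits that belong to the given slice."""
    return [doc for doc in SNAPSHOT if slice_of(doc, max_slices) == slice_id]


def fetch_all(max_slices: int) -> List[List[str]]:
    """Consume every slice on its own thread."""
    with ThreadPoolExecutor(max_workers=max_slices) as pool:
        return list(
            pool.map(lambda s: search_slice(s, max_slices), range(max_slices))
        )


slices = fetch_all(5)
merged = sorted(doc for part in slices for doc in part)
assert merged == sorted(SNAPSHOT)  # slices are disjoint and cover every hit
```

Each slice is independent of the others, so a client can hand one slice to each worker without any coordination between them.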
Replacing Scroll API
Point In Time Search feature supports deep pagination when used with search_after.
We get the following benefits with PIT, compared to a scroll, as a pagination solution :
Security model
For OpenSearch clusters where the security plugin is enabled, the following section is applicable.
Users will be able to access PIT APIs using the role point_in_time_full_access.
Role:
For Alias and data stream behavior :
Based on this,
Protection and Resiliency
Limitations
Monitoring
Feature Compliance
PIT supports
Enhancements for V2
We can consider storing PIT segment information on disk, i.e. which node contains which shard’s PIT segments. That way we would be able to relocate PIT segments to new nodes when shards relocate, make PITs resilient to OpenSearch process restarts, and not depend on decoding the PIT Id to route requests to the various PIT segments.
APPENDIX
How a Point In Time retains segments
This doc tries to give an insight into why segments are retained by PITs and how they are blocked from being merged away or deleted as part of the live data set.
Keeping segments around that are not needed for live data also means that you need more disk space to keep those segments alive, as those cannot be deleted until the PIT id is deleted. The way this works internally is by using reference counting. As long as there is a component (like a PIT search) holding a reference to the data (for example via an open file handle against an inode) there is no final deletion of that data, even though it is not part of the live dataset anymore. This is also the reason why the PIT id exists. By specifying it as part of the query, you are specifying which state you want to query.
This behaviour is already present with Scrolls implementation and has simply been re-used for Point In Time.
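The file-handle part of this mechanism can be observed directly on a POSIX file system. The sketch below is a generic operating-system demonstration, not OpenSearch code: a file stays readable through an open handle after its directory entry is removed. It assumes Linux or macOS; on Windows, removing an open file typically fails.

```python
import os
import tempfile


def read_after_unlink(data: bytes) -> bytes:
    """Write data, delete the file by path, then read it via the open handle."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "w+b") as handle:
        handle.write(data)
        handle.flush()
        os.remove(path)                  # the directory entry is gone
        assert not os.path.exists(path)
        handle.seek(0)
        return handle.read()             # the inode lives while the handle is open


print(read_after_unlink(b"segment bytes"))
```

The disk space is reclaimed only when the last handle is closed, which is why long-lived PITs increase disk usage.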
How Objects with RefCounted are closed
Callers invoke incRef() when they start using the resource and decRef() to release it. close() is called only when the refCount reaches 0.
What happens when we free/close a PIT Reader Context
Update PIT reader context API
Because of the implementation changes we are doing for ‘List all PIT’ API, we can easily update the PIT context as well now.
Note : Search API can be used to update keep alive of one pit id.
Update use cases :
The update list and update all use cases are ambiguous in how users would use them. For example, for update list, it is unclear whether we would extend the same keep alive for the entire list of PITs or different keep alives for different PITs.
Also, we are currently not sure whether update all solves any user use case.
So we’ll see whether a need for update APIs arises once PIT goes to production, and implement this API if needed.