Create a prototype/design for a hybrid local/fetch-on-demand directory #7331
Comments
I will be working on this.
Overview

Create a prototype/design for a directory that behaves as a local directory when complete files are present on disk, but can fall back to block-based on-demand fetch when requested data is not present. The goal is to support adding an option to remote-backed indices that enables the system to intelligently offload index data before hitting disk watermarks, allowing more frequently used and more-likely-to-be-used data to be stored in faster local storage, while less frequently used data can be removed from local storage since the authoritative copy lives in the remote store. Any time a request is encountered for data not in local storage, that data is re-hydrated into local storage on demand in order to serve the request. A further goal is to provide an implementation that can intelligently manage resources and provide a performant "warm" tier that can serve both reads and writes.

Common Terminologies

WHOLE file: a single Lucene file (see Lucene File Format Basics).

BLOCK file: each Lucene file is mapped to logical blocks of constant size (currently 8 MB). A block file is the part of a Lucene file that has been downloaded to local disk. Say we have a 1 GB Lucene file 0001.cfs in S3 and the search node has downloaded only a few "parts" of this file; the node's local disk will then hold block files such as 0001.cfs._block.

Fig: Lucene file, logical mapping of a Lucene file to blocks, actual block files on disk

FileCache: an in-memory key-value data structure which manages the lifecycle of all files on disk.

Design

CompositeDirectory: a directory that has information about the FSDirectory, RemoteSegmentStoreDirectory and HybridDirectory. This hybrid directory implements all Directory APIs, knowing how to read a block file as well as a non-block (whole) file.
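The 8 MB logical block mapping described above can be sketched as simple arithmetic. This is an illustrative helper only, not the actual OpenSearch implementation; the numbered block suffix is an assumption, since the issue only shows the `._block` naming convention without specifying per-block names.

```java
// Hypothetical sketch of the logical-block mapping; names are illustrative.
public class BlockMapping {
    static final long BLOCK_SIZE = 8L * 1024 * 1024; // 8 MB logical blocks

    // Index of the logical block containing the given byte offset.
    static long blockIndex(long offset) {
        return offset / BLOCK_SIZE;
    }

    // Illustrative on-disk name for one downloaded block of a Lucene file,
    // e.g. "0001.cfs._block.3" for the fourth block of 0001.cfs.
    // (The real naming scheme may differ.)
    static String blockFileName(String luceneFile, long blockIndex) {
        return luceneFile + "._block." + blockIndex;
    }

    public static void main(String[] args) {
        // A 1 GB file spans 128 blocks of 8 MB each, indexed 0..127.
        System.out.println(blockIndex(1024L * 1024 * 1024 - 1)); // 127
        System.out.println(blockFileName("0001.cfs", 3));
    }
}
```

With this mapping, a read at any byte offset can be translated to exactly one block file, so only that block needs to be fetched from the remote store to serve the request.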
HybridDirectory is also responsible for downloading blocks or whole files from the remote store when a read targets a file that is not present in the FileCache. A shard file of a writable warm index can exist in several places, each with a corresponding file format.
Hence, a shard file can be tracked with the help of a FileTracker, which is updated when the file moves from one state to another. The FileTracker can also store metadata about the files it tracks, such as the file size, the most recent IOContext or file operation, etc., to support multiple use cases. Now, let's look at the different states of the FileTracker.

Fig: FSM of the FileTracker, with a description of each state

FileTracker write flow

The file of a writable warm index is added to the FileTracker with the CACHE state and the NON_BLOCK file type only once it is completely uploaded to the remote store. For the transient period between a file being created and being uploaded to the remote store, the file should be read from the FSDirectory and hence is not added to the FileTracker. Open questions: how can we track partial file uploads? What happens on a failed upload to the remote store? On a delete from the remote store?

FileTracker read flow (search and indexing)

We can always download as block files; downloading whole files depending on the operation and/or file type is an optimization that can be done later. When a read request lands on the hybrid directory, it consults the FileTracker for the file's state and type.
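The write/evict lifecycle described above could be sketched as a small state map. This is a minimal sketch, not the prototype's actual FileTracker; only the CACHE and REMOTE_ONLY states and the BLOCK/NON_BLOCK file types mentioned in this issue are modeled, and the class and method names are assumptions.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch of a FileTracker; the real implementation may differ.
public class FileTrackerSketch {
    enum State { CACHE, REMOTE_ONLY }
    enum FileType { BLOCK, NON_BLOCK }

    static final class FileState {
        final State state;
        final FileType type;
        FileState(State state, FileType type) {
            this.state = state;
            this.type = type;
        }
    }

    private final Map<String, FileState> tracked = new ConcurrentHashMap<>();

    // Called only once a file is completely uploaded to the remote store;
    // until then the file is served from the FSDirectory and not tracked.
    void trackUploaded(String name, FileType type) {
        tracked.put(name, new FileState(State.CACHE, type));
    }

    // Called when the FileCache evicts the local copy; the authoritative
    // copy still lives in the remote store.
    void onEvict(String name) {
        tracked.computeIfPresent(name,
            (n, s) -> new FileState(State.REMOTE_ONLY, s.type));
    }

    FileState get(String name) {
        return tracked.get(name);
    }
}
```

A read path would consult `get(name)`: an untracked file is still local-only, a CACHE entry can be served from disk, and a REMOTE_ONLY entry triggers an on-demand fetch.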
On eviction from the FileCache, the FileTracker state is updated to REMOTE_ONLY. The FileTracker can be part of the CompositeDirectory, which contains the FSDirectory, HybridDirectory and RemoteSegmentStoreDirectory. On a read, the call lands on the CompositeDirectory and can be delegated to the HybridDirectory, which initializes an OnDemandNonBlockSearchIndexInput or OnDemandBlockSearchIndexInput based on the FileTracker file type. OnDemandBlockSearchIndexInput is an IndexInput subclass, as in RemoteSearchDirectory (defined in "Implement prototype remote store directory/index input for search"), responsible for serving read requests by checking whether the required files are present in the FileCache and returning them from there, else downloading the file block by block from the remote store and then returning.

Fig: class diagram

CompositeDirectory would expose the following APIs, delegating calls to the FSDirectory/RemoteSegmentStoreDirectory/HybridDirectory:

- deleteFile
- fileLength
- createOutput / createTempOutput
- afterRefresh
- sync
- syncMetaData
- rename
- openInput
- obtainLock
- close
- getPendingDeletions

HybridDirectory can make use of the FileTracker to decide whether to initialize an OnDemandNonBlockSearchIndexInput or an OnDemandBlockSearchIndexInput. HybridDirectory would expose the following APIs:

- listAll
- deleteFile
- fileLength
- createOutput / createTempOutput
- openInput
- close
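The delegation decision described above could be sketched as follows. This is a hedged sketch, not the prototype's code: it returns a label instead of a real Lucene IndexInput so it stays self-contained, and it treats "not yet in the FileTracker" as a LOCAL_ONLY case for simplicity (the issue describes such files as being read from the FSDirectory rather than being tracked).

```java
// Hypothetical sketch of the CompositeDirectory read-path delegation;
// type and method names are placeholders for the real Lucene/OpenSearch classes.
public class CompositeDirectorySketch {
    enum State { LOCAL_ONLY, CACHE, REMOTE_ONLY }
    enum FileType { BLOCK, NON_BLOCK }

    // Decide which read path serves the request, mirroring the delegation
    // described above.
    static String openInput(State state, FileType type) {
        if (state == State.LOCAL_ONLY) {
            // File not yet uploaded to the remote store: read from local disk.
            return "FSDirectory";
        }
        // Tracked files are served via the FileCache, fetching from the
        // remote store on demand when the data is not present locally.
        return type == FileType.BLOCK
            ? "OnDemandBlockSearchIndexInput"      // fetch 8 MB blocks on demand
            : "OnDemandNonBlockSearchIndexInput";  // fetch the whole file on demand
    }
}
```

In the real prototype this decision lives behind the Directory `openInput` API, so callers (search and indexing) never need to know whether the bytes came from local disk or were re-hydrated from the remote store.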
Prototype: main...neetikasinghal:OpenSearch:new-CD
This work is being tracked in #8446
More context here: #6528
Create a prototype/design for a hybrid directory that behaves as a local directory when complete files are present on disk, but can fall back to block-based on-demand fetch when requested data is not present. One of the key goals of this task will be to determine the delta between a read-only variant and a fully writable variant, to make a call as to whether the read-only variant is a worthwhile incremental step. (I think the writable variant will increase the scope by requiring reconciling the separate Store approach used by remote-backed indexes, as well as bring in complexities around notifying replicas of changes.)