
[Draft] High level design proposal for buffer pool in opensearch #18873

@abiesps

Description


Is your feature request related to a problem? Please describe

In continuation of the discussions in #18833 and #18838, this issue presents a high level design proposal. It will be followed by a low level design covering implementation choices (libraries, data structures, and algorithms) and another proposal for integration and tunability in OpenSearch.

Describe the solution you'd like

In this proposal, I'll focus on the high level design of a buffer pool in OpenSearch.

Functional requirements

  1. Should support in-memory writes (writes to application pages) [ToDo]
  2. Should support in-memory reads (reads from application pages)
  3. Should support pluggable page replacement algorithms (evictions and prefetching).
  4. Should support writing dirty pages to disk [ToDo]
  5. Should support dynamic size allocations (different for reads, prefetches, readaheads, and writes)
  6. Should throttle reads/writes in case of memory pressure
  7. Should support zero-copy reads between the application and the storage device.
  8. Must not result in data loss (as long as application level sync/fsync is called) [ToDo]
  9. Should support delete after last read (the Unix file system semantic of keeping data available for reads until the last reader is done, even after delete; this behavior is seen during segment merges) [ToDo]

Non functional requirements

  1. Low Read latency at high concurrency
  2. High write throughput at high concurrency

Out of scope

  1. Design of the IO interface; this will be covered separately.

Core Components

High level IO model with bufferpool

[Diagram: high level IO model with buffer pool]

Memory Provider

The memory provider is going to be responsible for pre-allocating memory outside of the JVM heap (off-heap). As part of the low level details I'll cover possible memory provider implementations (e.g. MemorySegment, sun.misc.Unsafe, Netty, etc.).
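To make this concrete, here is a minimal sketch assuming the `java.lang.foreign` `Arena`/`MemorySegment` API, which is one of the candidate implementations mentioned above; the `MemoryProvider` name, the region-size parameter, and the 4 KiB alignment are illustrative assumptions, not a final API.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;

/**
 * Hypothetical sketch of a memory provider that pre-allocates off-heap memory
 * using the java.lang.foreign API (one candidate implementation).
 */
interface MemoryProvider extends AutoCloseable {
    /** Pre-allocates a contiguous off-heap region of the given size in bytes. */
    MemorySegment allocateRegion(long regionSizeBytes);
}

final class ForeignMemoryProvider implements MemoryProvider {
    // Shared arena so the allocated regions can be accessed from multiple threads.
    private final Arena arena = Arena.ofShared();

    @Override
    public MemorySegment allocateRegion(long regionSizeBytes) {
        // 4 KiB alignment keeps pages aligned with typical OS page boundaries (assumption).
        return arena.allocate(regionSizeBytes, 4096);
    }

    @Override
    public void close() {
        // Frees all regions allocated from this arena.
        arena.close();
    }
}
```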

Page

A page is the unit of memory for all reads and writes. A page will be backed by off-heap memory returned from a memory region of the buffer pool. Each page will have a constant size, which will depend on the operation for which the page is allocated (write, read, prefetch, readahead, etc.). Each page is going to contain a page header and page content. The page header will contain a page identifier (encoded with the memory region index in the buffer pool), a reference count, a pin flag, page slice metadata, and a page level lock. The page level lock can be used to modify page header attributes (for instance the reference count, pin flag, etc.). The page content is going to be used for writing or reading data. Once a page is flushed, its content will become immutable.
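A minimal sketch of what a page could look like under these assumptions; the field names, the 16-bit region index encoding, and the lock granularity are illustrative choices, not decided yet.

```java
import java.lang.foreign.MemorySegment;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

/**
 * Minimal sketch of a Page: a header (identifier, reference count, pin flag, lock)
 * plus an off-heap content slice. The 16-bit region index encoding is an assumption.
 */
final class Page {
    // Page identifier with the memory region index encoded in the upper 16 bits.
    private final long pageId;
    private final AtomicInteger referenceCount = new AtomicInteger();
    private volatile boolean pinned;
    // Page level lock guarding header mutations (pin flag, slice metadata, ...).
    private final ReentrantLock headerLock = new ReentrantLock();
    // Page content backed by a slice of the owning memory region; immutable after flush.
    private final MemorySegment content;

    Page(int regionIndex, long pageIndex, MemorySegment content) {
        this.pageId = ((long) regionIndex << 48) | pageIndex;
        this.content = content;
    }

    int regionIndex() {
        return (int) (pageId >>> 48);
    }

    void pin() {
        headerLock.lock();
        try {
            referenceCount.incrementAndGet();
            pinned = true;
        } finally {
            headerLock.unlock();
        }
    }

    MemorySegment content() {
        return content;
    }
}
```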

BufferPool

The buffer pool is going to provide an interface to allocate and deallocate/free pages. Internally, the BufferPool consists of multiple memory regions. The buffer pool is also going to be responsible for managing memory regions (resizing, fragmentation checks, etc.). I have to evaluate "range tree" versus hash table style core data structures for the buffer pool [ToDo].
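As a rough illustration of the allocation surface (the core data structure is still a [ToDo]), a hypothetical BufferPool interface could look like this; the `PageSizeClass` enum and method names are assumptions.

```java
/**
 * Hypothetical BufferPool interface sketch; PageSizeClass and the method names
 * are illustrative assumptions, not a committed API.
 */
enum PageSizeClass { READ, PREFETCH, READAHEAD, WRITE }

interface BufferPool {
    /** Allocates a free page from a memory region serving the given size class. */
    Page allocate(PageSizeClass sizeClass);

    /** Returns a page to its memory region's free list once its reference count drops to zero. */
    void free(Page page);
}
```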

Memory Regions

Memory regions are contiguous chunks of memory pre-allocated in the buffer pool. Each memory region will maintain its own set of data structures for tracking free pages, claimed pages, and flushed pages. Since we are going to support multiple page sizes, we may have to handle internal fragmentation in the buffer pool. To simplify the handling of internal fragmentation, I am considering the following options.

  1. We can have a dedicated memory region for each page size. This is simpler to implement, but it can result in throttling of reads and writes even when there is enough memory in a different region. To make this simpler we can start with a static size for each memory region (per page size) and dynamically resize the regions based on usage in the background.
  2. The other option is for each memory region to have an affinity towards a page size, and the buffer pool can try to allocate a page from the preferred memory regions for that size. If there is no memory region for the requested size, the buffer pool can try the memory region with the next larger page size affinity. We would still have to handle the fragmentation.

The first option seems simpler to me, so this is likely what I am going to go with for the initial implementation (see the sketch below).
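A minimal sketch of option 1 under these assumptions: one memory region dedicated to a single page size, carved into fixed-size page slices with simple free/claimed tracking. Concurrency details, background resizing, and fragmentation checks are omitted, and the names are illustrative.

```java
import java.lang.foreign.MemorySegment;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

/**
 * Sketch of option 1: a memory region dedicated to a single page size.
 * Real free/claimed tracking would need to be concurrent and resizable.
 */
final class MemoryRegion {
    private final int regionIndex;
    private final int pageSizeBytes;
    private final Deque<Page> freePages = new ArrayDeque<>();
    private final Set<Page> claimedPages = new HashSet<>();

    MemoryRegion(int regionIndex, int pageSizeBytes, MemorySegment backing) {
        this.regionIndex = regionIndex;
        this.pageSizeBytes = pageSizeBytes;
        // Carve the pre-allocated region into fixed-size page slices.
        long pageCount = backing.byteSize() / pageSizeBytes;
        for (long i = 0; i < pageCount; i++) {
            MemorySegment slice = backing.asSlice(i * pageSizeBytes, pageSizeBytes);
            freePages.add(new Page(regionIndex, i, slice));
        }
    }

    synchronized Page claim() {
        Page page = freePages.poll();
        if (page == null) {
            return null; // caller throttles or falls back under memory pressure
        }
        claimedPages.add(page);
        return page;
    }

    synchronized void release(Page page) {
        claimedPages.remove(page);
        freePages.add(page);
    }
}
```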

BufferCache

The BufferCache is going to store the mapping from a file name and file block to the corresponding page in the buffer pool. Only pages that have been successfully fsynced are going to be part of the buffer cache. It's possible that files not yet committed by the index writer could be present in the buffer cache because of the periodic fsyncing of pages in the dirty page tracker. The IndexReader will ensure it does not read "non-committed" files, but the buffer cache will not have these checks.
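A hypothetical sketch of that mapping; the `(fileName, blockIndex)` key shape and method names are assumptions for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical BufferCache sketch: fsynced pages keyed by (file name, file block).
 */
final class BufferCache {
    record FileBlock(String fileName, long blockIndex) {}

    private final Map<FileBlock, Page> cache = new ConcurrentHashMap<>();

    /** Registers a page after it has been successfully fsynced. */
    void put(String fileName, long blockIndex, Page page) {
        cache.put(new FileBlock(fileName, blockIndex), page);
    }

    /** Returns the cached page, or null on a cache miss (the caller issues a read IO). */
    Page get(String fileName, long blockIndex) {
        return cache.get(new FileBlock(fileName, blockIndex));
    }

    /** Drops all blocks of a file, e.g. once the last reader of a deleted file is done. */
    void evictFile(String fileName) {
        cache.keySet().removeIf(key -> key.fileName().equals(fileName));
    }
}
```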

Dirty page tracker

Each memory region will have its own dirty page tracker that is going to keep track of all dirty pages; these pages will be periodically and/or explicitly fsynced to the storage device.
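A minimal sketch of a per-region dirty page tracker, assuming dirty pages are grouped by file so that both periodic and explicit per-file fsync can drain them; the names are illustrative.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

/**
 * Hypothetical per-region dirty page tracker: dirty pages grouped by file.
 */
final class DirtyPageTracker {
    private final Map<String, List<Page>> dirtyPagesByFile = new ConcurrentHashMap<>();

    /** Tracks a page after flush makes it immutable. */
    void track(String fileName, Page page) {
        dirtyPagesByFile.computeIfAbsent(fileName, f -> new CopyOnWriteArrayList<>()).add(page);
    }

    /** Removes and returns all dirty pages of a file so they can be written and moved to the buffer cache. */
    List<Page> drain(String fileName) {
        List<Page> pages = dirtyPagesByFile.remove(fileName);
        return pages == null ? List.of() : pages;
    }
}
```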

Prefetching and readaheads

Prefetch requests are initiated by IndexInput. There will be background prefetcher threads that are going to be responsible for submitting prefetch requests. Prefetchers are also going to respect backpressure from memory (the page replacement component) and the IO interface. All prefetch requests will be sent to the IO interface for further processing.
Readaheads are going to be initiated from reads in IndexInput. Readaheads are also going to honor the memory and IO pressure as communicated by the page replacement component and the IO interface.
Both prefetch and readahead requests are going to be tied to the scope of an OpenSearch operation. (This will be discussed more in the IO interface design.)

Page Replacement

The page replacement component is going to be responsible for managing the lifecycle of pages in the buffer pool as well as handing out quotas to prefetchers and readaheads to communicate memory pressure.
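A rough sketch of how that component might expose eviction and quota signalling; the method names are assumptions, not a committed interface.

```java
/**
 * Hypothetical sketch of the page replacement component's surface: victim
 * selection plus quota signalling used to communicate memory pressure.
 */
interface PageReplacementPolicy {
    /** Picks a victim page to evict when a region has no free pages. */
    Page selectVictim(MemoryRegion region);

    /** Number of pages prefetchers may currently allocate; 0 means fully throttled. */
    int prefetchQuota();

    /** Separate quota for readahead requests. */
    int readaheadQuota();
}
```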

IOInterface

IOInterface is going to be a lower-level abstraction that performs IO between a file identifier (file descriptor/channel) and the pages in the buffer pool. The IO interface is also going to provide the capability of writing pages, reading pages, and syncing pages to the IO device.
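Since the IO interface design is out of scope here and will be covered separately, the following is only an assumed shape to show how pages, files, and sync requests could meet; none of these signatures are final.

```java
import java.nio.channels.FileChannel;
import java.util.List;
import java.util.concurrent.CompletableFuture;

/**
 * Hypothetical IOInterface sketch; asynchronous completion is assumed.
 */
interface IOInterface {
    /** Reads one file block from the channel into the given page. */
    CompletableFuture<Void> readPage(FileChannel channel, long fileOffset, Page page);

    /** Writes a batch of flushed pages belonging to one file. */
    CompletableFuture<Void> writePages(FileChannel channel, List<Page> pages);

    /** Syncs previously written pages of the file to the storage device. */
    CompletableFuture<Void> sync(FileChannel channel);
}
```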

User flows

Read flow

[Diagram: read flow]

Write flow

Each memory region is going to maintain two lists: one for free pages and one for claimed pages. Pages that are currently being written into are going to be owned by IndexOutput until a flush is called on the page.
Flush is going to make the page content immutable, and the page is then going to be tracked in a different data structure, the dirty page tracker, maintained by the buffer pool.
An fsync operation on a file is going to result in moving all pages associated with the file (and some more if IO concurrency and batching allow for it) from the dirty page tracker to the buffer cache and the storage device.
The buffer pool will also periodically fsync the pages tracked in the dirty page tracker. This will make the pages ready/available for reads.

[Diagram: write flow]
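A minimal sketch of the write path under the assumptions above: an IndexOutput-like writer owns a claimed page until flush, at which point the page is handed to the dirty page tracker. The `PageWriter` name is illustrative, and page-full handling and throttling are omitted.

```java
import java.lang.foreign.MemorySegment;

/**
 * Sketch of the write path: claim a page, write into it, hand it over on flush.
 */
final class PageWriter {
    private final String fileName;
    private final MemoryRegion region;
    private final DirtyPageTracker dirtyTracker;
    private Page currentPage;
    private long positionInPage;

    PageWriter(String fileName, MemoryRegion region, DirtyPageTracker dirtyTracker) {
        this.fileName = fileName;
        this.region = region;
        this.dirtyTracker = dirtyTracker;
    }

    void writeBytes(byte[] bytes) {
        if (currentPage == null) {
            currentPage = region.claim(); // owned by the writer until flush; throttling omitted
            positionInPage = 0;
        }
        // Copy into the off-heap page content (page-full handling omitted in this sketch).
        MemorySegment.copy(MemorySegment.ofArray(bytes), 0, currentPage.content(), positionInPage, bytes.length);
        positionInPage += bytes.length;
    }

    /** Makes the page content immutable and transfers ownership to the dirty page tracker. */
    void flush() {
        if (currentPage != null) {
            dirtyTracker.track(fileName, currentPage);
            currentPage = null;
        }
    }
}
```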

Prefetch flow

  1. All IndexInput prefetch requests will be added to a prefetcher queue.
  2. Prefetchers are going to remove prefetch requests from the queue if the queue is not empty.
  3. Prefetchers will have a prefetch quota from the page replacement component and the IO interface.
  4. Each prefetcher thread is going to submit an IO request based on the available quota and prefetch request scope validation (see the sketch below).
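A minimal sketch of the prefetcher loop described in the steps above, assuming the `PageReplacementPolicy` quota interface from the earlier sketch; the `PrefetchRequest` shape and the skip-on-no-quota behavior are illustrative.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/**
 * Sketch of a background prefetcher: drains queued requests and honors the quota
 * handed out by the page replacement component before submitting IO.
 */
final class Prefetcher implements Runnable {
    record PrefetchRequest(String fileName, long blockIndex) {}

    private final BlockingQueue<PrefetchRequest> queue = new LinkedBlockingQueue<>();
    private final PageReplacementPolicy replacementPolicy;

    Prefetcher(PageReplacementPolicy replacementPolicy) {
        this.replacementPolicy = replacementPolicy;
    }

    /** Called from IndexInput to enqueue a prefetch request. */
    void submit(PrefetchRequest request) {
        queue.offer(request);
    }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            try {
                PrefetchRequest request = queue.take();
                // Respect memory pressure: drop (or requeue) when no quota is available.
                if (replacementPolicy.prefetchQuota() <= 0) {
                    continue;
                }
                // Scope validation and submission to the IO interface would happen here.
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }
}
```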

Readahead flow
The readahead flow is going to be similar to the prefetch flow, except there won't be an explicit call for readahead from index readers/IndexInput. The readahead component will also get a different quota for readahead requests from the page replacement component and the IO interface.

Periodic flushing/explicit sync from directory
There will be two distinct operations: flush and fsync. Flush is going to move the ownership of a page allocated for write from IndexOutput to the dirty page tracker. Flush will not guarantee that the page is written onto the storage device. Flush will be called at page granularity. Pages may not be visible for reads after a flush, as they won't be added to the buffer cache.
The buffer pool will also periodically submit write IO for pages present in the dirty page tracker and move them to the buffer cache.

There will be a file level fsync, which is going to flush all the dirty pages present for a file in the dirty page tracker to the storage device. The fsync request will only be sent to the IO interface once all flush operations have been acknowledged; that is, fsync guarantees that all flushed pages for a file have been successfully written to the IO interface before the sync is issued.
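A minimal sketch of that sequencing, reusing the assumed `IOInterface` and `DirtyPageTracker` shapes from the earlier sketches: the sync is only issued after the writes for all flushed pages have been acknowledged.

```java
import java.nio.channels.FileChannel;
import java.util.List;
import java.util.concurrent.CompletableFuture;

/**
 * Sketch of file level fsync sequencing: write all flushed pages, wait for the
 * acknowledgement, then issue the sync.
 */
final class FileSyncer {
    private final IOInterface io;
    private final DirtyPageTracker dirtyTracker;

    FileSyncer(IOInterface io, DirtyPageTracker dirtyTracker) {
        this.io = io;
        this.dirtyTracker = dirtyTracker;
    }

    CompletableFuture<Void> fsync(String fileName, FileChannel channel) {
        List<Page> flushedPages = dirtyTracker.drain(fileName);
        // Submit the write IO for all flushed pages, then sync only after it is acknowledged.
        return io.writePages(channel, flushedPages)
            .thenCompose(ignored -> io.sync(channel))
            .thenRun(() -> {
                // After a successful sync the pages become visible for reads: the flushed
                // pages would be registered in the buffer cache here (block index
                // bookkeeping is omitted in this sketch).
            });
    }
}
```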

Related component

No response

Describe alternatives you've considered

No response

Additional context

No response
