[RFC] Adding Fetch Phase Profiling #18696

@andrevandeven

Description

Is your feature request related to a problem? Please describe

Problem Statement

OpenSearch currently supports detailed profiling for the query and aggregation phases of search, enabling developers to analyze performance by timing the underlying operations. However, a significant gap exists in this feature, as it provides no insight into the fetch phase. This absence makes it difficult for developers to identify and resolve performance bottlenecks occurring during document retrieval.

A clear example of this problem was a significant performance regression in the fetch phase experienced by some users migrating from Elasticsearch 7.9 to OpenSearch 1.0. The root cause was a change to the Lucene codec, which was eventually mitigated in OpenSearch 1.2. Developers were forced to rely on external tools like Java Flight Recorder to diagnose the problem. That profiling revealed that excessive time was being spent on decompression during the fetch phase, but identifying this would have been much simpler and faster if the search profile results had contained native timing metrics for the fetch phase, which would have provided more specific, query-related context.

To provide a complete performance picture, a fetch phase profile must be introduced. Consistent with the existing profilers, it should maintain timing information for the entire phase, alongside a granular breakdown of its key operations.

Describe the solution you'd like

Proposed Solution

The most efficient and consistent approach is to integrate fetch phase profiling directly into the established Profile API. OpenSearch has a mature profiling infrastructure, including abstract base classes like AbstractInternalProfileTree, AbstractProfileBreakdown, and AbstractProfiler, which already handle the core logic of timing, tree-based result construction, and serialization for the query and aggregation phases.

By creating a new profile.fetch package with classes that inherit from this shared infrastructure (see the sketch after the list below), we gain several key advantages:

  • Pros:
    • Rapid Development: Leverages years of development and debugging already invested in the core profiling tools, dramatically reducing implementation time.
    • User Consistency: Delivers a familiar user experience. Users who understand query profiling will immediately understand fetch profiling, as the output format and API interaction will be identical.
    • Low Maintenance Overhead: Avoids introducing a new, bespoke system that would require its own maintenance and debugging cycles. It remains part of a single, unified profiling feature.
    • Guaranteed Integration: Ensures seamless integration with the existing search response structure and APIs.
  • Cons:
    • Minor Constraints: The design must adhere to the existing structure, which may impose minor limitations compared to a completely new design. However, given the framework's proven utility for other complex phases, this is a low-risk consideration.
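
As a rough illustration of the reuse this enables, the sketch below shows a hypothetical FetchTimingType enum modeled on the existing QueryTimingType; a FetchProfileBreakdown could then hand this enum to AbstractProfileBreakdown the same way the query profiler does, inheriting the timer bookkeeping and serialization instead of reimplementing them. The type and constant names are assumptions for illustration, not final.

```java
import java.util.Locale;

// Hypothetical timing types for a fetch breakdown, modeled on QueryTimingType.
// A FetchProfileBreakdown extending AbstractProfileBreakdown<FetchTimingType>
// would inherit the timer, tree, and serialization logic described above.
public enum FetchTimingType {
    CREATE_STORED_FIELDS_VISITOR,
    BUILD_SUB_PHASE_PROCESSORS,
    NEXT_READER,
    LOAD_STORED_FIELDS,
    LOAD_SOURCE;

    @Override
    public String toString() {
        // match the lower-case key style of the existing query breakdown output
        return name().toLowerCase(Locale.ROOT);
    }
}
```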

Design

For this design, we will not rely on the convoluted structure of the AbstractInternalProfileTree used to implement query and aggregation profiling. Instead, we will implement fetch profiling through a new class called FlatFetchProfileTree. This will use a more intuitive tree structure with explicit parent and child nodes rather than an unintuitive stack-based approach to managing parent-child relationships. The fetch profile tree will always contain a single node representing the standard fetch phase. It will also contain a node for each inner hits phase and each top hits aggregation fetch phase, if these phases run. Because these phases often execute multiple fetch passes internally, we will consolidate each one into a single fetch breakdown (one per inner hits search and one per top hits aggregation). The fetch profile tree will always contain a root node that holds the following information (a sketch of this structure follows the diagram note below):

  • Time spent creating the stored fields visitor
  • Time spent building fetch sub-phase processors
  • Time spent switching to the next segment
  • Time spent loading stored fields for a hit
  • Time spent loading the document _source

The tree will also contain a child node for each fetch sub-phase that runs (see appendix for a list of all sub-phases). The child node breakdown will contain the following information:

  • Time spent switching to the next segment
  • Time spent executing a fetch sub-phase
[Diagram: fetch profile tree showing the root fetch node and its sub-phase child nodes]

Note: the above diagram only shows sub-phases for simplicity.
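
To make the proposed structure concrete, the following is a minimal, self-contained sketch of what FlatFetchProfileTree could look like. All class, method, and field names here are illustrative only; the real implementation would reuse the existing Timer and breakdown machinery from the profile package rather than raw maps.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: explicit parent/child nodes instead of the
// stack-based bookkeeping used by AbstractInternalProfileTree.
class FlatFetchProfileTree {

    static class Node {
        final String description;                                   // e.g. "fetch", "highlight", "inner_hits"
        final Map<String, Long> breakdown = new LinkedHashMap<>();  // timing name -> accumulated nanos
        final List<Node> children = new ArrayList<>();              // explicit children

        Node(String description) {
            this.description = description;
        }

        void addTiming(String timing, long nanos) {
            // merge() consolidates repeated runs, e.g. the multiple fetch passes
            // of an inner hits phase collapsing into one breakdown
            breakdown.merge(timing, nanos, Long::sum);
        }
    }

    // Single root node representing the standard fetch phase.
    private final Node root = new Node("fetch");

    Node root() {
        return root;
    }

    // One child node per fetch sub-phase (or inner hits / top hits fetch) that ran.
    Node addChild(String description) {
        Node child = new Node(description);
        root.children.add(child);
        return child;
    }
}
```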

Related component

Search:Performance

Describe alternatives you've considered

Alternative: Build a New Profiling Class Structure from Scratch

Another option is to design and implement a completely new set of profiling classes tailored specifically for the fetch phase, without inheriting from the existing framework.

  • Pros:
    • Total Design Freedom: This approach would not be constrained by any existing abstractions, allowing for a purpose-built structure optimized purely for fetch phase semantics.
  • Cons:
    • Significant Development Cost: It involves "reinventing the wheel" by re-implementing the complex logic for timing hierarchies, result aggregation, and JSON serialization that the core framework already provides.
    • Introduces Inconsistency: The output would likely differ in structure and naming from the query and aggregation profiles, creating a confusing and disjointed user experience.
    • Higher Risk and Maintenance: A net-new system would carry a higher risk of bugs and would create a separate, parallel framework to maintain and update in the future.

Extending the existing profiling framework is the only solution that provides detailed, query-specific, and user-friendly insights without imposing an unreasonable burden on developers or users. It reuses existing, stable components to deliver high value with minimal development cost and maximum consistency.

Alternative: Rely on External JVM Profilers

Developers can currently use external tools like Java Flight Recorder (JFR) or commercial profilers to analyze the JVM during a search request.

  • Pros:
    • Extreme Detail: These tools provide deep, method-level insight into CPU time, memory allocation, and thread states.
  • Cons:
    • High Barrier to Entry: Requires specialized knowledge of Java tooling, JVM internals, and the OpenSearch codebase to interpret the results effectively. This is not a user-friendly solution for the average OpenSearch developer or administrator.
    • Lacks Query Context: A JVM profile is generic. It is difficult to isolate the performance data for a single search request or to understand the timings in the context of the query's logical structure (e.g., per-shard breakdown), which is the primary value of the built-in Profile API.
    • Not a Built-in Solution: It is an external, ad-hoc process, not an integrated, on-demand feature that can be toggled via an API call.

Alternative: Use Distributed Tracing (e.g., Jaeger, Zipkin)

Integrating a distributed tracing solution using OpenTelemetry SDKs could capture the flow of a fetch request from the coordinator node to the data nodes.

  • Pros:
    • Excellent Distributed Visualization (see appendix): This is the best option for visualizing network latency and the high-level request flow across multiple nodes.
  • Cons:
    • Requires Code Instrumentation: While OpenTelemetry provides helpers, it still requires manually adding instrumentation throughout the fetch phase code paths.
    • Insufficient Granularity: A trace typically shows that a data node took X milliseconds to respond but cannot break down why. It cannot provide the detailed, intra-node timings of specific sub-phases (like highlighting vs. script field execution) that are essential for deep analysis and optimization. The proposed solution, in contrast, is designed to provide exactly this level of detail.

Additional context

Appendix

Profiling

The following is an overview of query profiling (currently implemented):

[Diagram: query profiling overview]

The InternalProfileTree stores an individual ProfileBreakdown for each subquery in a tree structure, where each subquery is represented as a child node. Each ProfileBreakdown contains a list of timers that capture the timings of individual operations within each subquery.

The following diagram represents the structure of an InternalProfileTree for a boolean query comprised of a match subquery and a term subquery:

[Diagram: InternalProfileTree for a boolean query with match and term subqueries]

When results are generated, the breakdown’s timers are aggregated into a single “node time” that represents the timing of each individual subquery. For simplicity and space, only the individual timers for the match query are shown.
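
As an example of this aggregation, the timers in a breakdown collapse into the node time roughly as in the sketch below; the timing names are a subset of the real query breakdown keys, and the values are made up for illustration.

```java
import java.util.Map;

// Rough illustration of how per-operation timers sum into a single node time.
final class NodeTimeExample {

    static long nodeTime(Map<String, Long> breakdown) {
        // Sum every timer in the breakdown; the profiler reports this
        // as the subquery's node time.
        return breakdown.values().stream().mapToLong(Long::longValue).sum();
    }

    public static void main(String[] args) {
        long matchNodeTime = nodeTime(Map.of(
            "create_weight", 1_200L,
            "build_scorer", 8_400L,
            "next_doc", 52_000L,
            "score", 31_000L));
        System.out.println("match node time (ns): " + matchNodeTime);
    }
}
```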

Fetch Phase

The following is an overview of the FetchPhase. The shard starts with a list of document ids to load (docIdsToLoad):

[Diagram: FetchPhase overview]

FieldsVisitor - used to read selected values from a document’s stored fields when a search retrieves documents. FieldsVisitor keeps track of which fields must be loaded. These visitors are created during the fetch phase of search by the method createStoredFieldsVisitor in FetchPhase.

LeafReader - Reads data from a segment of an index. During the fetch phase, its primary job is to retrieve the stored contents of a document (like its _source field) after the correct document ID and segment have been located. By sorting the document IDs, the documents are naturally clustered by the segment they live in. This allows the loop to process all the required documents from Segment A, then all from Segment B, and so on. This is done so the system doesn’t have to bear the performance cost of changing the LeafReader and all the processor contexts for every single document. It only incurs that cost when it finishes with one segment and moves to the next.

HitContext - represents all information about a single search hit that is needed while executing fetch sub-phases. Stores the SearchHit, Lucene reader context, document ID within that reader, and a SourceLookup that points to the document’s source. When a hit context is created, the source lookup is tied to the correct segment and document. During the fetch phase, each HitContext is populated with the hit’s basic data and source so that fetch sub-phases can use it. The function prepareNestedHitContext is called only when a query matches a nested object inside a larger document (requires a nested query).
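
Putting these pieces together, the fetch loop looks roughly like the sketch below. The types are simplified stand-ins for the real FetchPhase internals (not OpenSearch code), and the comments mark where the root-node timers proposed in the Design section would start and stop.

```java
import java.util.Arrays;

// Simplified stand-in for the fetch loop; not the actual FetchPhase code.
final class FetchLoopSketch {

    interface SegmentReader {                  // stand-in for a LeafReader / LeafReaderContext
        boolean contains(int docId);
        String loadDocument(int docId);        // stand-in for FieldsVisitor-driven stored field loading
    }

    static void fetch(int[] docIdsToLoad, SegmentReader[] segments) {
        // [timer: create stored fields visitor] decide which stored fields must be loaded
        // [timer: build fetch sub-phase processors] build one processor per enabled sub-phase

        Arrays.sort(docIdsToLoad);             // cluster doc IDs by segment so reader switches are rare
        SegmentReader current = null;

        for (int docId : docIdsToLoad) {
            if (current == null || !current.contains(docId)) {
                // [timer: next segment] switch the LeafReader and notify each sub-phase processor
                current = findSegment(segments, docId);
            }
            // [timer: load stored fields] read the hit's stored fields via the visitor
            // [timer: load _source] materialize the document _source
            String source = current.loadDocument(docId);
            // ...populate a HitContext with the hit and its source, then run each sub-phase processor on it
        }
    }

    private static SegmentReader findSegment(SegmentReader[] segments, int docId) {
        for (SegmentReader segment : segments) {
            if (segment.contains(docId)) {
                return segment;
            }
        }
        throw new IllegalArgumentException("doc " + docId + " not found in any segment");
    }
}
```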

FetchSourcePhase - The default fetch sub-phase. Retrieves the _source field (simply the original JSON document that was indexed) for each matching document. Runs by default for all standard search queries.

Optional Fetch Sub-phases -

  • ExplainPhase - provides a per-hit explanation of how the score was computed. It only runs when the search request specifies "explain": true.
  • FetchDocValuesPhase - retrieves values directly from doc values (a disk-based, columnar data structure in Lucene). Only runs when a docvalue_fields array is specified in the body of the search request.
  • FetchFieldsPhase - Retrieves the values of specific fields that have been explicitly marked as "store": true in the index mapping. The data is retrieved from stored fields as opposed to doc values or source. This phase is only triggered when a fields array is included in the body of the search request and the fields listed in that array are configured with "store": true in the index mapping.
  • FetchVersionPhase - adds the document version from the _version field to each hit. Runs only when the search request asks for the document version.
  • InnerHitsPhase - retrieves “inner hits” which are nested hits returned for a parent document. Triggered when you are searching on documents that have a nested relationship and you explicitly request the matching inner documents in your query.
  • MatchedQueriesPhase - collects which named queries matched each document (and optionally their scores). Identifies which specific clauses of the query matched a particular document. Not activated by default and only runs when query clauses are explicitly named using the _name parameter within the query DSL.
  • ScriptFieldsPhase - calculates per-hit scripted fields. Computes and returns new, custom fields on-the-fly for each search result. These fields are not stored in the original document but are generated dynamically at query time using a script. This phase is triggered whenever a script_fields object is included in the body of the search request.
  • SeqNoPrimaryTermPhase - loads the sequence number and primary term of each hit (useful for optimistic concurrency control). Triggered by "seq_no_primary_term": true.
  • HighlightPhase - generates highlighted snippets for requested text fields. Finds the exact terms within a document’s field that matched the user’s query and presents them as formatted snippets. Triggered when a highlight object is included in the body of the search request.
  • FetchScorePhase - loads the score for each hit when requested.
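
Each of these sub-phases exposes a per-segment hook and a per-hit hook, which correspond directly to the two child-node timings proposed in the Design section. The sketch below shows one way those timings could be captured by wrapping a processor; the Processor interface here is a simplified stand-in for the real FetchSubPhaseProcessor, and all names are illustrative.

```java
// Illustrative wrapper capturing the two proposed per-sub-phase timings.
final class TimedSubPhaseProcessor {

    interface Processor {                        // stand-in, not the real FetchSubPhaseProcessor
        void setNextReader(Object leafReaderContext) throws Exception;
        void process(Object hitContext) throws Exception;
    }

    private final Processor delegate;
    private long nextReaderNanos;                // "time spent switching to the next segment"
    private long processNanos;                   // "time spent executing a fetch sub-phase"

    TimedSubPhaseProcessor(Processor delegate) {
        this.delegate = delegate;
    }

    void setNextReader(Object leafReaderContext) throws Exception {
        long start = System.nanoTime();
        try {
            delegate.setNextReader(leafReaderContext);
        } finally {
            nextReaderNanos += System.nanoTime() - start;
        }
    }

    void process(Object hitContext) throws Exception {
        long start = System.nanoTime();
        try {
            delegate.process(hitContext);
        } finally {
            processNanos += System.nanoTime() - start;
        }
    }

    long nextReaderNanos() {
        return nextReaderNanos;
    }

    long processNanos() {
        return processNanos;
    }
}
```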

SearchHit - represents one document returned from a search request. Implements Writeable and ToXContentObject and stores information about the hit, including the Lucene doc ID and its score, the document ID and any nested identity info, version, sequence number, primary term, source bytes, highlight fields, sort values, matched query info, explanations, and shard details. It is the single record of a search result, containing all relevant metadata, source, and fields needed to represent and process one document in the search response.
