
[RFC] Tracking Search Pipeline Execution #16705

Open
junweid62 opened this issue Nov 22, 2024 · 15 comments
Labels: enhancement (Enhancement or improvement to existing feature or request), RFC (Issues requesting major changes), Search (Search query, autocomplete ...etc)

Comments

@junweid62

Is your feature request related to a problem? Please describe

With the expansion of search pipeline processors, tracking data transformations and understanding data flow through complex processors is becoming challenging. The introduction of ML inference processors, which can manipulate model inputs and outputs, increases the need for a tool to visualize and debug the flow of data across these processors. Such functionality would aid in troubleshooting and optimizing pipeline configurations, and would provide transparency into end-to-end transformations of search requests and responses.

As search pipeline processors grow in complexity, there is an increasing need to:

  1. Track how data flows and transforms through each processor.
  2. Debug data transformations and pinpoint any failures within the pipeline.
  3. View the end-to-end pipeline execution for both the request and response sides of a search.

This capability would also be valuable for frontend plugins like the Flow Framework, helping users configure and test complex ingest and search pipelines.

Describe the solution you'd like

Adding a verbose_pipeline Parameter to the Search Request [Preferred]

Overview

In this approach, the verbose_pipeline parameter is introduced as a query parameter in the search request URL. When used in conjunction with the search_pipeline parameter, it activates a debugging mode, allowing detailed tracking of search pipeline processor execution without requiring a new API or changes to the Explain API.
[Figure: search request flow diagram (searchRequestflow.drawio)]


Pros

  1. Minimal Changes to Existing Workflow:

    • No need for a new API endpoint; the debugging functionality is seamlessly integrated into the existing search request.
  2. Backward Compatibility:

    • The verbose_pipeline parameter is optional and defaults to false. Existing search requests remain unaffected unless explicitly updated to include verbose_pipeline=true.
  3. Alignment with OpenSearch Design:

    • Consistent with the design of existing search features, such as the profile query parameter.

Cons

  1. Performance Impact:

    • Activating verbose mode may slightly increase computational load, since each processor additionally records its input, output, status, and execution time for debugging. By integrating with the existing search backpressure mechanism, the system can dynamically manage resource usage, ensuring stability while allowing detailed debugging during low-load periods.

Example Request

GET /my_index/_search?search_pipeline=my_debug_pipeline&verbose_pipeline=true

Example Response

{
  "took": 15,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 50,
      "relation": "eq"
    },
    "max_score": 1.0,
    "hits": [
      {
        "_index": "my_index",
        "_id": "1",
        "_score": 1.0,
        "_source": { "field": "value" }
      }
    ]
  },
  "processor_result": [
    {
      "processor": "filter_query",
      "status": "success",
      "execution_time": 3,
      "input": { "query": { "match_all": {} } },
      "output": { "query": { "filtered_query": { "match_all": {} } } }
    },
    {
      "processor": "collapse",
      "status": "success",
      "execution_time": 5,
      "input": { "hits": [...] },
      "output": { "collapsed_hits": [...] }
    }
  ]
}

Common Fields for All Processors

Each processor, regardless of type, will include the following common fields:

  • processor: The name or type of the processor (e.g., filter_query, collapse).
  • status: Indicates whether the processor completed successfully (success) or encountered an error (failure).
  • execution_time: The time taken by the processor to execute, in milliseconds.
  • input: The input data provided to the processor. The structure of this field varies depending on the processor type.
  • output: The transformed data output by the processor. The structure of this field varies depending on the processor type.
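
For illustration, the per-processor detail described above could be modeled as a simple value object. The following Java sketch uses hypothetical names (ProcessorExecutionDetail and its fields are illustrative, not actual OpenSearch classes):

import java.util.Map;

// Hypothetical value object for one processor's verbose output.
// Field names mirror the response fields described above.
public final class ProcessorExecutionDetail {
    private final String processorName;        // e.g. "filter_query", "collapse"
    private final String status;               // "success" or "failure"
    private final long executionTimeMillis;    // time taken by the processor
    private final Map<String, Object> input;   // data passed into the processor
    private final Map<String, Object> output;  // data produced by the processor

    public ProcessorExecutionDetail(String processorName, String status, long executionTimeMillis,
                                    Map<String, Object> input, Map<String, Object> output) {
        this.processorName = processorName;
        this.status = status;
        this.executionTimeMillis = executionTimeMillis;
        this.input = input;
        this.output = output;
    }

    public String getProcessorName() { return processorName; }
    public String getStatus() { return status; }
    public long getExecutionTimeMillis() { return executionTimeMillis; }
    public Map<String, Object> getInput() { return input; }
    public Map<String, Object> getOutput() { return output; }
}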

Request Processor Fields

For processors that handle the incoming search request:

  • input: The original search request before processing (e.g., the query, filters, and other parameters).
  • output: The modified search request after this processor has applied its transformations.

Example:

{
  "processor": "filter_query",
  "status": "success",
  "execution_time": 3,
  "input": { "query": { "match_all": {} } },
  "output": { "query": { "filtered_query": { "match_all": {} } } }
}

Search Phase Result Processor Fields

For processors that handle intermediate results during the search phase:

  • input: The set of search hits or results passed into this processor.
  • output: The modified or filtered set of search hits after the processor has completed its operation.
{
  "processor":"normalization-processor"
  "status": "success",
  "execution_time": 5,
  "input": {
    "hits": [
      { "_index": "my_index", "_id": "1", "_score": 1.0, "_source": { "field": "value1" } },
      { "_index": "my_index", "_id": "2", "_score": 0.9, "_source": { "field": "value2" } }
    ]
  },
  "output": {
    "hits": [
      { "_index": "my_index", "_id": "1", "_score": 1.0, "_source": { "field": "value1" } }
    ]
  }
}

Response Processor Fields

For processors that handle the final search response:

  • input: The raw search response from the previous phase or processor.
  • output: The final transformed response to be returned to the client.
{
  "processor": "Rerank",
  "status": "success",
  "execution_time": 4,
  "input": { "hits": [ ... ] },
  "output": { "hits": [ ... ] }
}

Verbose Mode Support Across Search Pipeline Configurations

The verbose mode is designed to seamlessly integrate with all ways of using a search pipeline, ensuring consistent debugging capabilities regardless of the method chosen. Below is an overview of how verbose mode supports different search pipeline configurations:

  1. Default Search Pipeline
PUT /my_index/_settings
{
  "index.search.default_pipeline": "my_pipeline"
}

GET /my_index/_search?verbose_pipeline=true
  2. Specified Search Pipeline by ID
GET /my_index/_search?search_pipeline=my_pipeline&verbose_pipeline=true
  3. Ad-Hoc (Temporary) Search Pipeline
POST /my_index/_search?verbose_pipeline=true
{
  "query": {
    "match": { "text_field": "some search text" }
  },
  "search_pipeline": {
    "request_processors": [
      {
        "filter_query": {
          "query": { "term": { "visibility": "public" } }
        }
      }
    ],
    "response_processors": [
      {
        "collapse": {
          "field": "category"
        }
      }
    ]
  }
}

Related component

Search

Describe alternatives you've considered

No response

Additional context

No response

junweid62 added the enhancement, untriaged, and RFC labels on Nov 22, 2024
github-actions bot added the Search label on Nov 22, 2024
@pyek-bot
Contributor

Hi, thanks for the detailed information! I'm pretty new to the OpenSearch project. I'm trying to understand: when the verbose_pipeline parameter is passed, where is it getting its information from? Is it from a map, a system index, or logs? Trying to understand where this information is persisted, or what the source of it is.

@junweid62
Author

Hi, thanks for the detailed information! I'm pretty new to the OpenSearch project. I'm trying to understand: when the verbose_pipeline parameter is passed, where is it getting its information from? Is it from a map, a system index, or logs? Trying to understand where this information is persisted, or what the source of it is.

Thanks for your question! When the verbose_pipeline parameter is passed, the information is not persisted anywhere—it’s generated dynamically during the request execution. The source of the information is the actual processing flow of the search pipeline in memory. Each processor in the pipeline logs its input, output, status, and execution time as the request flows through.

This information is collected directly from the execution of the processors and returned as part of the response. It is not stored in a system index, map, or logs, which helps keep the feature lightweight and avoids adding unnecessary overhead to the system.

Hope this clarifies! Let me know if you have any follow-up questions!
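
One way to picture the in-memory collection described above is a thin wrapper that times each processor call and appends a detail record to a request-scoped list. This is a hedged sketch reusing the hypothetical ProcessorExecutionDetail class sketched earlier; the wrapper and its method names are illustrative, not the actual implementation:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical request-scoped tracker: nothing is persisted; the list lives
// only for the duration of the search request and would be serialized into
// the "processor_result" section of the response at the end.
final class VerbosePipelineTracker {
    private final List<ProcessorExecutionDetail> details = new ArrayList<>();

    // 'runProcessor' stands in for whatever actually invokes the processor;
    // the maps stand in for the request or response being transformed.
    Map<String, Object> track(String processorName, Map<String, Object> input,
                              UnaryOperator<Map<String, Object>> runProcessor) {
        long start = System.nanoTime();
        try {
            Map<String, Object> output = runProcessor.apply(input);
            details.add(new ProcessorExecutionDetail(processorName, "success",
                (System.nanoTime() - start) / 1_000_000, input, output));
            return output;
        } catch (RuntimeException e) {
            details.add(new ProcessorExecutionDetail(processorName, "failure",
                (System.nanoTime() - start) / 1_000_000, input, null));
            throw e;
        }
    }

    List<ProcessorExecutionDetail> getDetails() { return details; }
}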

@owaiskazi19
Member

By integrating with the existing search backpressure mechanism, the system can dynamically manage resource usage

Since this param is part of the search request, search backpressure would already be integrated.

@owaiskazi19
Member

@msfroh @reta @andrross @dblock can you take a look at this RFC and share your thoughts?

@msfroh
Collaborator

msfroh commented Nov 26, 2024

I think it's a good idea. As search pipelines get more complicated, getting step-by-step logging from each processor will be useful (though the response can get quite large -- kind of like profiler output). I discussed doing something like this with @mingshl.

As a minor correction to the section Search Phase Result Processor Fields, the phase results are just doc IDs and scores (if I recall correctly). Still, being able to see how scores were processed, e.g. by the hybrid query score normalizer, would be pretty nice. I was discussing ways of getting "explain" output from the normalizer with @martin-gaievski, and I think this could solve that problem.

If you need somewhere to store the verbose output, the PipelinedRequest object that flows through the pipelines might be an option. At the end, you could copy the output from that into the final SearchResponse that gets returned to the client.

@owaiskazi19
Member

owaiskazi19 commented Nov 26, 2024

(though the response can get quite large -- kind of like profiler output)

To handle this, do you think we can add a size parameter or limit the response to a small number, say 5, since we just have to see how the processor does its processing? We probably don't need an entire SearchResponse.

If you need somewhere to store the verbose output

I don't think we have to store it. We could just directly read from the ProcessorResultMap and return it in the SearchResponse.
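
As a rough sketch of the size cap floated above (the helper and the cap value are hypothetical, not an agreed design):

import java.util.List;

// Hypothetical helper: cap how many hits are recorded per processor detail,
// so the verbose output stays small even for large responses.
final class VerboseOutputLimiter {
    static final int TRUNCATE_AT = 5; // illustrative cap, not a decided default

    static <T> List<T> capHits(List<T> hits) {
        return hits.size() <= TRUNCATE_AT ? hits : hits.subList(0, TRUNCATE_AT);
    }
}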

@mingshl
Contributor

mingshl commented Nov 26, 2024

Glad to see this RFC is cut! It would be helpful for debugging and tracking search processors. I hope this will also be added to ingest pipelines. Can the ingest pipeline take a similar design? That way, if we decide to do the same on the ingest pipeline, we can keep it consistent.

Wondering if we can reuse PipelineProcessingContext, which can carry a map of context between processors.

@owaiskazi19
Member

owaiskazi19 commented Nov 26, 2024

Can the ingest pipeline take a similar design? That way, if we decide to do the same on the ingest pipeline, we can keep it consistent.

We already have verbose for Ingest Pipelines https://opensearch.org/docs/latest/ingest-pipelines/simulate-ingest/#query-parameters

@junweid62
Author

junweid62 commented Nov 26, 2024

Wondering if we can reuse PipelineProcessingContext, which can carry a map of context between processors.

Thanks for the suggestion! I just checked the code base, and PipelineProcessingContext does seem like a good fit for this purpose. It looks like it can effectively carry debug information between processors. I'll explore this further and see how it aligns with the verbose mode implementation.
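
As a rough illustration of the PipelineProcessingContext idea, a context object carrying an attribute map can pass the collected details between processors. The sketch below uses a simple map-backed stand-in; the class and method names are illustrative, not the real PipelineProcessingContext API:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a request-scoped pipeline context.
final class PipelineContextSketch {
    private final Map<String, Object> attributes = new HashMap<>();

    void setAttribute(String key, Object value) { attributes.put(key, value); }
    Object getAttribute(String key) { return attributes.get(key); }
}

final class VerboseAttributeExample {
    static final String VERBOSE_KEY = "verbose_pipeline_details"; // illustrative key

    @SuppressWarnings("unchecked")
    static void recordDetail(PipelineContextSketch ctx, ProcessorExecutionDetail detail) {
        // Lazily create the list on first use, then append each processor's detail.
        List<ProcessorExecutionDetail> list =
            (List<ProcessorExecutionDetail>) ctx.getAttribute(VERBOSE_KEY);
        if (list == null) {
            list = new ArrayList<>();
            ctx.setAttribute(VERBOSE_KEY, list);
        }
        list.add(detail);
    }
}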

@junweid62
Author

As a minor correction to the section Search Phase Result Processor Fields, the phase results are just doc IDs and scores (if I recall correctly). Still, being able to see how scores were processed, e.g. by the hybrid query score normalizer, would be pretty nice. I was discussing ways of getting "explain" output from the normalizer with @martin-gaievski, and I think this could solve that problem.

Thanks for the feedback! I’ll update the section on Search Phase Result Processor Fields to reflect that the phase results are just doc IDs and scores—thanks for pointing that out.

If you need somewhere to store the verbose output, the PipelinedRequest object that flows through the pipelines might be an option. At the end, you could copy the output from that into the final SearchResponse that gets returned to the client.

I’ll also take a closer look at using the PipelinedRequest object for storing verbose output.

@reta
Collaborator

reta commented Nov 27, 2024

Thanks @junweid62, certainly +1 to the feature. The only concern I have is that we are probably introducing too many knobs on the search side:

  • the search request has profile, which we could use to trace the search request in detail
  • yes, we do have a verbose setting for ingest pipelines (as you pointed out), and introducing it for consistency probably would have made sense, but ingest does not have profile and/or explain (AFAIK; maybe that's what we could bring to the ingest side instead?)

From my perspective, profile should be sufficient; there is no need to introduce a verbose setting on the search side.

@junweid62
Author

junweid62 commented Nov 27, 2024

Thanks @junweid62, certainly +1 to the feature. The only concern I have is that we are probably introducing too many knobs on the search side:

  • the search request has profile, which we could use to trace the search request in detail
  • yes, we do have a verbose setting for ingest pipelines (as you pointed out), and introducing it for consistency probably would have made sense, but ingest does not have profile and/or explain (AFAIK; maybe that's what we could bring to the ingest side instead?)

From my perspective, profile should be sufficient; there is no need to introduce a verbose setting on the search side.

Thanks for the feedback! profile is fundamentally designed to provide timing-related insights, as it focuses on performance debugging. However, verbose_pipeline serves a different purpose that complements profile rather than overlapping with it:

The profile API focuses on timing information and payload metrics (e.g., size/quantity), which are excellent for debugging performance bottlenecks. As highlighted in the OpenSearch documentation:

"The Profile API provides timing information about the execution of individual components of a search request. Using the Profile API, you can debug slow requests and understand how to improve their performance."

It doesn't track logical transformations or interim values between processors. This makes verbose_pipeline the right tool for cases where users need to understand how data evolves across the pipeline, especially with increasingly complex processors like those involving ML inference.

@reta
Collaborator

reta commented Nov 28, 2024

It doesn't track logical transformations or interim values between processors. This makes verbose_pipeline the right tool for cases where users need to understand how data evolves across the pipeline, especially with increasingly complex processors like those involving ML inference.

Thanks @junweid62, got it. I think I misunderstood the scope of it a bit (and from the comments above, there seems to be a mix of intermediate data and time-related insights like execution_time). I think we should expand a bit here:

  • update the profile stats with timing-related insights
  • add the verbose_pipeline (as you suggested) with data-related insights

What do you think? Thanks!

@junweid62
Author

It doesn't track logical transformations or interim values between processors. This makes verbose_pipeline the right tool for cases where users need to understand how data evolves across the pipeline, especially with increasingly complex processors like those involving ML inference.

Thanks @junweid62, got it. I think I misunderstood the scope of it a bit (and from the comments above, there seems to be a mix of intermediate data and time-related insights like execution_time). I think we should expand a bit here:

  • update the profile stats with timing-related insights
  • add the verbose_pipeline (as you suggested) with data-related insights

What do you think? Thanks!

Thanks for the proposal! I see where you're coming from, but I feel like keeping everything together in verbose_pipeline might be a better approach. Here's why:

  1. Easier to read and use: Having both timing and data-related insights in one place makes it much easier for users to analyze the pipeline. They don’t have to jump between sections to connect the dots.
  2. Simplifies implementation: Combining them reduces the amount of logic we need to maintain. Instead of handling timing in one place and data in another, we can streamline it all in one flow.
  3. Future-proofing: If we ever want to add more metrics (like memory usage or errors), having a single structure for processor details makes that way simpler.

I think putting it all under verbose_pipeline would give users a full picture of each processor—what it’s doing and how long it’s taking—all in one spot. What do you think? Happy to chat more if needed!

@reta
Collaborator

reta commented Dec 5, 2024

I think putting it all under verbose_pipeline would give users a full picture of each processor—what it’s doing and how long it’s taking—all in one spot. What do you think? Happy to chat more if needed!

I see your point. I think profile would definitely benefit from timing in any case (it has to be done), but I don't see duplicating the same (or more) timing in the verbose_pipeline output as a problem, considering the arguments you have provided.
