Introduction
This document proposes the low-level API design for the Normalization and Score Combination feature. It is the first of three low-level designs (LLDs) that the Vector Search team will produce for this feature. As a pre-read, please read the high-level design, Score Combination and Normalization for Semantics Search [HLD], and Search Phase Results Processor. The building blocks for the feature and the high-level decisions are covered in those issues.
Background
As discussed in the HLD, we will add a new query clause to OpenSearch that fetches, per shard, the results of different queries whose scores are on different scales (for example, a neural query and a keyword search). Once those results are retrieved at the shard level, the coordinator node will normalize and combine them via Search Phase Results Processors to build the final list of doc IDs on which the fetch phase will run. Refer to the flows below.
Figure: High Level Flow
Figure: Search Phase Results Processor Flow
Scope of the Issue
In this issue we propose solutions to the following questions, which are not discussed in detail in the HLD.
The name of the new query clause, which will hold the queries whose results need to be normalized and combined.
The shape of the _search API request, which will include:
How will users define the technique for normalization?
How will users define the technique and parameters for score combination?
Out of Scope
The following items are out of scope for this design but will be covered in other low-level designs:
How will the data needed for normalization be fetched from the shards?
Will the provided queries run in sequence or in parallel?
How will the results of the different queries be transferred to the coordinator node?
How will pagination be supported on the query clause?
How will the explain API be supported on the query clause?
Solution
We are proposing the following names for the new query clause:
Composite Query
Multi Component Query
Hybrid Query (Recommended)
Ensemble Query
The guiding idea behind the name is that it should let us change how we update the scores in the future. For example, going forward we may not want to normalize the scores but instead use some other technique to bring them to the same scale. All of the above names fit that requirement, but based on my understanding I recommend Hybrid Query as the name of the new query clause.
Below is the proposed API shape for normalization and score combination across different queries. See the Usage section for how customers can use it.
POST <INDEX-NAME>/_search
{
  "query": {
    "hybrid": {
      "queries": [
        {}, // First query
        {}  // Second query
        ..... // Other queries
      ]
    }
  }
}
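As a rough illustration of the request shape, here is a minimal Python sketch that assembles the request body. It follows the `queries` array form used in the usage examples of this issue. The helper name `build_hybrid_query` is hypothetical and is not part of OpenSearch or any client library; the sub-queries are copied from the usage examples.

```python
from typing import Any, Dict, List


def build_hybrid_query(sub_queries: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Wrap sub-queries in the proposed `hybrid` clause (hypothetical helper)."""
    if not sub_queries:
        raise ValueError("a hybrid query needs at least one sub-query")
    # Shallow-copy the list so later changes to the caller's list do not leak in.
    return {"query": {"hybrid": {"queries": list(sub_queries)}}}


# Sub-queries taken from the usage examples in this issue.
neural = {
    "neural": {
        "passage_embedding": {
            "query_text": "Girl with Brown Hair",
            "model_id": "ABCBMODELID",
            "k": 20,
        }
    }
}
keyword = {"match": {"passage_text": "Girl Brown hair"}}

body = build_hybrid_query([neural, keyword])
```

The resulting `body` dictionary can be serialized with `json.dumps` and sent as the body of the `_search` request.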
The Search Pipeline interface that will perform the normalization and score combination:
PUT /_search_processing/pipeline/<PIPELINE-NAME>
{
  "description": "A pipeline that adds a Normalization and Combination processor",
  "phase_injector_processors": [
    {
      "scoring-processor": {
        "normalization": { // Optional
          "technique/method": "<NORMALIZATION-TECHNIQUE>" // Possible values: min-max, L2Norm (default)
        },
        "combination": { // Optional
          "technique/method": "<SCORE-COMBINATION-TECHNIQUE>", // Possible values: harmonic mean, arithmetic mean, sum (default), geometric mean
          "parameters": { // Optional
            // List of all the parameters that the above algorithm may require
            "weights": [0.4, 0.7] // A floating-point array that the algorithm can use
          }
        }
      }
    }
  ]
}
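To make the two normalization techniques named above concrete, here is a minimal Python sketch of min-max and L2 normalization over the scores of a single sub-query. The function names and the handling of edge cases (all scores equal, all scores zero) are assumptions for illustration only; this issue does not specify them.

```python
import math
from typing import List


def min_max_normalize(scores: List[float]) -> List[float]:
    """Rescale scores to [0, 1] using the minimum and maximum score."""
    if not scores:
        return []
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # Assumption: when every score is identical, map them all to 1.0.
        return [1.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def l2_normalize(scores: List[float]) -> List[float]:
    """Divide each score by the Euclidean (L2) norm of the score vector."""
    norm = math.sqrt(sum(s * s for s in scores))
    if norm == 0.0:
        # Assumption: an all-zero score vector stays all zeros.
        return [0.0] * len(scores)
    return [s / norm for s in scores]
```

Each sub-query's scores would be normalized independently, so that, for example, neural and keyword scores end up on comparable scales before they are combined.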
Usage
Below are some examples of how customers can use the normalization and score combination feature in their search requests. These are only a few of the many ways customers can use the query.
Once the index is created, the customer can index the data. How they index the data is entirely up to the customer.
Example Query Usage 1
Create the search pipeline separately and then use it as a request parameter.
PUT /_search_processing/pipeline/normalizationPipeline
{
  "description": "A pipeline that adds a Normalization and Combination processor",
  "phase_injector_processors": [
    {
      "scoring-processor": {
        "normalization": { // Optional
          "technique/method": "<NORMALIZATION-TECHNIQUE>" // Possible values: min-max, L2Norm
        },
        "combination": { // Optional
          "technique/method": "<SCORE-COMBINATION-TECHNIQUE>", // Possible values: harmonic mean, arithmetic mean, geometric mean
          "parameters": { // Optional
            // List of all the parameters that the above algorithm may require
            "weights": [0.4, 0.7] // A floating-point array that the algorithm can use
          }
        }
      }
    }
  ]
}
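The combination techniques and the `weights` parameter in the pipeline above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, each weight is assumed to apply to the sub-query at the same position, the weighted forms divide by the sum of the weights, and a non-positive score is assumed to yield 0.0 for the harmonic and geometric means. None of these choices is specified in this issue.

```python
import math
from typing import List, Optional, Sequence


def _resolve_weights(n: int, weights: Optional[Sequence[float]]) -> List[float]:
    """Default to equal weights; otherwise require one weight per score."""
    if n == 0:
        raise ValueError("at least one score is required")
    if weights is None:
        return [1.0] * n
    if len(weights) != n:
        raise ValueError("number of weights must match number of scores")
    return list(weights)


def arithmetic_mean(scores: Sequence[float], weights: Optional[Sequence[float]] = None) -> float:
    w = _resolve_weights(len(scores), weights)
    return sum(wi * si for wi, si in zip(w, scores)) / sum(w)


def harmonic_mean(scores: Sequence[float], weights: Optional[Sequence[float]] = None) -> float:
    w = _resolve_weights(len(scores), weights)
    if any(s <= 0 for s in scores):
        return 0.0  # Assumption: undefined for non-positive scores, so return 0.0.
    return sum(w) / sum(wi / si for wi, si in zip(w, scores))


def geometric_mean(scores: Sequence[float], weights: Optional[Sequence[float]] = None) -> float:
    w = _resolve_weights(len(scores), weights)
    if any(s <= 0 for s in scores):
        return 0.0  # Assumption: log is undefined for non-positive scores.
    return math.exp(sum(wi * math.log(si) for wi, si in zip(w, scores)) / sum(w))
```

For a document whose normalized scores from the two sub-queries are 0.5 and 1.0, `arithmetic_mean([0.5, 1.0], [0.4, 0.7])` weights the second sub-query more heavily, using the example weights from the pipeline definition.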
Example Query Usage 2
Define the search pipeline in the search request itself, as an ad hoc pipeline. This lets customers test their queries with different pipelines, or use the Search comparison tool to compare the results.
POST flicker-index/_search
{
  "query": {
    "hybrid": {
      "queries": [
        {
          "neural": {
            "passage_embedding": {
              "query_text": "Girl with Brown Hair",
              "model_id": "ABCBMODELID",
              "k": 20,
              "filter": {
                "term": {
                  "status": "published"
                }
              }
            }
          }
        },
        {
          "bool": {
            "must": [
              {
                "match": {
                  "passage_text": "Girl Brown hair"
                }
              }
            ],
            "filter": {
              "term": {
                "status": "published"
              }
            }
          }
        }
      ]
    }
  },
  "search_pipeline": {
    "description": "A pipeline that adds a Normalization and Combination processor",
    "phase_injector_processors": [
      {
        "scoring-processor": {
          "normalization": { // Optional
            "technique/method": "<NORMALIZATION-TECHNIQUE>" // Possible values: min-max, L2Norm
          },
          "combination": { // Optional
            "technique/method": "<SCORE-COMBINATION-TECHNIQUE>", // Possible values: harmonic mean, arithmetic mean, geometric mean
            "parameters": { // Optional
              // List of all the parameters that the above algorithm may require
              "weights": [0.4, 0.7] // A floating-point array that the algorithm can use
            }
          }
        }
      }
    ]
  }
}
Example Query Usage 3
Use the index settings to set a default search pipeline.
PUT /flicker-index/_settings
{
  "index.default_search_pipeline": "normalizationPipeline"
}
PUT /_search_processing/pipeline/normalizationPipeline
{
  "description": "A pipeline that adds a Normalization and Combination processor",
  "phase_injector_processors": [
    {
      "scoring-processor": {
        "normalization": { // Optional
          "technique/method": "<NORMALIZATION-TECHNIQUE>" // Possible values: min-max, L2Norm
        },
        "combination": { // Optional
          "technique/method": "<SCORE-COMBINATION-TECHNIQUE>", // Possible values: harmonic mean, arithmetic mean, geometric mean
          "parameters": { // Optional
            // List of all the parameters that the above algorithm may require
            "weights": [0.4, 0.7] // A floating-point array that the algorithm can use
          }
        }
      }
    }
  ]
}
POST flicker-index/_search
{
  "query": {
    "hybrid": {
      "queries": [
        {
          "neural": {
            "passage_embedding": {
              "query_text": "Girl with Brown Hair",
              "model_id": "ABCBMODELID",
              "k": 20
            }
          }
        },
        {
          "bool": {
            "must": [
              {
                "match": {
                  "passage_text": "Girl Brown hair"
                }
              }
            ],
            "filter": {
              "term": {
                "status": "published"
              }
            }
          }
        }
      ]
    }
  }
}
Pros:
From the customer's standpoint, all their API calls remain the same; they only need to update the body of the request.
From a cohesion standpoint, because this is a search operation it makes sense to include it in the _search API, which provides a unified experience for customers who search via OpenSearch.
Less maintenance and a consistent output format, because the new compound query is integrated with the _search API.
Integration with other search capabilities such as Explain Query, pagination, and _msearch will be possible without reinventing the wheel.
Cons:
From an implementation standpoint, we need to define new concepts in OpenSearch, such as the new query clause, which will require educating customers on how to use it.
Alternatives
Alternatives Considered
Alternative-1: Implement a new REST handler instead of creating a new compound query
The idea here is to create a new REST handler that defines the list of queries whose scores need to be normalized and combined.
Pros:
This gives the team the flexibility to experiment without touching the core capabilities of OpenSearch.
Easier implementation, because the new REST handler is limited to the Neural Search plugin.
Cons:
Duplicate code and interfaces, because we would be reimplementing existing _search API functionality (size, from and to, include source fields, scripting, etc.).
A higher learning curve and more difficult adoption for customers who already use the _search API for other search workloads.
Future Scope
Currently, filters are defined for each of the queries present in the queries array. Ideally we would like to provide a single filters key that defines filters for all the queries, alongside the individual filters. We could borrow something like the filter context from the bool query.
Feedback Required:
Because this is a new query clause, we would like feedback on its name. We want to keep the name generic so that it can be used with any queries that produce scores on different scales.
@navneet1v Does the hybrid query essentially behave like a conjunctive bool query, where the SearchPhaseResultProcessor takes care of normalizing scores across the clauses?
Do we need to do any customization at the collector level to collect both the neural scores and the textual scores during the query phase? (Or is that part of the "out of scope" detail to be covered elsewhere?)
@msfroh Yes, we need customization in the Docs Collector and the QueryPhaseSearcher class. Please refer to this GitHub issue: #193 for the details on how we are doing it.
vamshin changed the title from "[RFC] Normalization and Score Combination Feature API Design LLD" to "[RFC] Improved Hybrid Search relevancy by Normalization and Score Combination Feature API Design LLD" on Sep 5, 2023