[BUG] Hybrid search error with field of type nested on the index #466
Comments
Hi @tiagoshin, I hope this clarifies. |
Hi @navneet1v, I know that a score above 1 doesn't necessarily mean that normalization wasn't applied, because the sum of the weights could be higher than 1, but that is not the case here. Please take a look at the creation of the post-processor for hybrid search: we use a combination based on arithmetic mean, with the weights summing to 1.
Please notice that the results from the hybrid search with a field of type nested on the index are just the sum of the lexical and semantic scores. This is the same behavior we had before release 2.10, without normalization and combination techniques applied. |
@martin-gaievski can you try to reproduce with the steps added? They are pretty detailed. |
I reproduced the issue using the provided steps; it seems the problem is with the |
@martin-gaievski what is the LoE and ETA for the deep dive on root cause? |
Additional context gathered: whenever there is a nested field in the index, it impacts the results, whether or not that field is used. Additionally, we need to ensure users can filter by nested fields. |
@dagneyb can you explain a bit more on this? |
@navneet1v are you looking for more context on my comment or on the overall issue? |
yes |
@navneet1v I think the overall summary provided does a good job of this: when any field in the index mapping properties has type nested, normalization and weighted combination are not applied. Instead, the scores are just summed, the same way OpenSearch did before the hybrid search feature existed. If you have a specific question, let me know and I can reach out to the impacted user directly. |
@navneet1v The context of this comment
is that we need to make sure it's possible to declare an index with nested fields and also to apply filters by them in the search query |
We've pushed a code change that fixes this issue; it's part of the main and 2.x branches. https://ci.opensearch.org/ci/dbc/distribution-build-opensearch/2.12.0/8999/linux/x64/tar/dist/opensearch/opensearch-2.12.0-linux-x64.tar.gz We cannot backport it to 2.11, as that release only accepts critical security fixes. |
@tiagoshin can you use the link provided by @martin-gaievski to test and validate? Feel free to provide feedback. |
@tiagoshin we ran your initial scenario on a 2.12 RC build. The only unknown piece was the model; for our testing we used hybrid search query
Below are the responses for the sub-queries when we run them as independent queries. bm25 query
neural search query
|
Does the re-tagging suggest this didn't make it into |
@martin-gaievski can you answer this question? |
@martin-gaievski Just following up to see if you have an update on this? Nested types in indexes feel extremely common, so this really blocks a lot of Hybrid Search usage. Given it looks like the fix is complete, and how limiting this makes Hybrid Search, any way we can get this patched in soon? |
@jared-rheaply the original problem reported in this issue has been fixed, and the fix is part of 2.12. Please see the corresponding PRs tagged in this issue (#490 and #498) and one more related PR, #524. This was marked as 2.13 due to some internal procedures related to the release. I'm closing this issue now. |
What is the bug?
I identified a bug in the Hybrid search on release 2.10. The same happens on release 2.11.
When any field in the index mapping properties has type nested, normalization and weighted combination are not applied. Instead, the scores are just summed, the same way OpenSearch did before the hybrid search feature existed.
To identify it, I created an unused field in the index mapping properties with type nested and verified the scores in the hybrid search. To compare, I did the same by adding this field with type text and verified the results.
The same behavior happens whether we use the field of type nested or not.
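The comparison described above can be sketched in Python as two index bodies that differ only in the type of the unused "test" field. This is only an illustration: the helper name build_index_body is hypothetical, and the settings and mappings are copied from the reproduction steps in this issue.

```python
import copy
import json

# Index body copied from the reproduction steps in this issue,
# without the "test" field, which is added per variant below.
BASE_INDEX_BODY = {
    "settings": {"index.knn": True, "default_pipeline": "pipeline-test"},
    "mappings": {
        "properties": {
            "id": {"type": "text"},
            "passage_embedding": {
                "type": "knn_vector",
                "dimension": 384,
                "method": {
                    "name": "hnsw",
                    "space_type": "cosinesimil",
                    "engine": "lucene",
                    "parameters": {"ef_construction": 512, "m": 8},
                },
            },
            "name": {"type": "text"},
            "passage_text": {"type": "text"},
        }
    },
}


def build_index_body(test_field_type):
    """Return the index body with the unused "test" field set to the given type."""
    body = copy.deepcopy(BASE_INDEX_BODY)
    body["mappings"]["properties"]["test"] = {"type": test_field_type}
    return body


nested_body = build_index_body("nested")  # variant that shows the bug
text_body = build_index_body("text")      # control variant with the expected scores

print(json.dumps(nested_body["mappings"]["properties"]["test"]))
print(json.dumps(text_body["mappings"]["properties"]["test"]))
```

Either body can be sent as the payload of the PUT {{host}}/index-test request; everything else in the reproduction stays the same.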
How can one reproduce the bug?
Before running these steps, create a model and use its model_id.
PUT {{host}}/_ingest/pipeline/pipeline-test
{
"description": "An NLP ingest pipeline",
"processors": [
{
"text_embedding": {
"model_id": "{{model_id}}",
"field_map": {
"name": "passage_embedding"
}
}
}
]
}
PUT {{host}}/index-test
{
"settings": {
"index.knn": true,
"default_pipeline": "pipeline-test"
},
"mappings": {
"properties": {
"id": {
"type": "text"
},
"passage_embedding": {
"type": "knn_vector",
"dimension": 384,
"method": {
"name": "hnsw",
"space_type": "cosinesimil",
"engine": "lucene",
"parameters": {
"ef_construction": 512,
"m": 8
}
}
},
"name": {
"type": "text"
},
"passage_text": {
"type": "text"
},
"test": {
"type": "nested"
}
}
}
}
PUT {{host}}/index-test/_doc/1
{
"name": "A West Virginia university women 's basketball team , officials , and a small gathering of fans are in a West Virginia arena .",
"id": "4319130149.jpg"
}
PUT {{host}}/index-test/_doc/2
{
"name": "A wild animal races across an uncut field with a minimal amount of trees .",
"id": "1775029934.jpg"
}
PUT {{host}}/index-test/_doc/3
{
"name": "People line the stands which advertise Freemont 's orthopedics , a cowboy rides a light brown bucking bronco .",
"id": "2664027527.jpg"
}
PUT {{host}}/index-test/_doc/4
{
"name": "A man who is riding a wild horse in the rodeo is very near to falling off .",
"id": "4427058951.jpg"
}
PUT {{host}}/index-test/_doc/5
{
"name": "A rodeo cowboy , wearing a cowboy hat , is being thrown off of a wild white horse .",
"id": "2691147709.jpg"
}
PUT {{host}}/_search/pipeline/nlp-search-pipeline
{
"description": "Post processor for hybrid search",
"phase_results_processors": [
{
"normalization-processor": {
"normalization": {
"technique": "min_max"
},
"combination": {
"technique": "arithmetic_mean",
"parameters": {
"weights": [
0.7,
0.3
]
}
}
}
}
]
}
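Since the thread discusses whether weights could explain scores above 1, here is a small Python sanity check of the pipeline definition above. It is only a sketch: the helper name combination_weights is hypothetical, and the check reflects the argument made in this issue, not a validation that OpenSearch itself performs.

```python
import math

# Search pipeline body copied from the request above.
search_pipeline = {
    "description": "Post processor for hybrid search",
    "phase_results_processors": [
        {
            "normalization-processor": {
                "normalization": {"technique": "min_max"},
                "combination": {
                    "technique": "arithmetic_mean",
                    "parameters": {"weights": [0.7, 0.3]},
                },
            }
        }
    ],
}


def combination_weights(pipeline):
    """Extract the combination weights from the first normalization processor."""
    processor = pipeline["phase_results_processors"][0]["normalization-processor"]
    return processor["combination"]["parameters"]["weights"]


weights = combination_weights(search_pipeline)

# One weight per sub-query (match and neural), and the weights sum to 1.
print(len(weights), math.isclose(sum(weights), 1.0))
```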
Querying lexical search
GET {{host}}/index-test/_search
{
"_source": {
"excludes": [
"passage_embedding"
]
},
"query": {
"match": {
"name": {
"query": "wild west"
}
}
}
}
Results:
{
"took": 10,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 4,
"relation": "eq"
},
"max_score": 1.7878418,
"hits": [
{
"_index": "index-test",
"_id": "1",
"_score": 1.7878418,
"_source": {
"name": "A West Virginia university women 's basketball team , officials , and a small gathering of fans are in a West Virginia arena .",
"id": "4319130149.jpg"
}
},
{
"_index": "index-test",
"_id": "2",
"_score": 0.58093566,
"_source": {
"name": "A wild animal races across an uncut field with a minimal amount of trees .",
"id": "1775029934.jpg"
}
},
{
"_index": "index-test",
"_id": "5",
"_score": 0.55228686,
"_source": {
"name": "A rodeo cowboy , wearing a cowboy hat , is being thrown off of a wild white horse .",
"id": "2691147709.jpg"
}
},
{
"_index": "index-test",
"_id": "4",
"_score": 0.53899646,
"_source": {
"name": "A man who is riding a wild horse in the rodeo is very near to falling off .",
"id": "4427058951.jpg"
}
}
]
}
}
Query semantic search
GET {{host}}/index-test/_search
{
"_source": {
"excludes": [
"passage_embedding"
]
},
"query": {
"neural": {
"passage_embedding": {
"query_text": "wild west",
"model_id": "{{model_id}}",
"k": 20
}
}
}
}
Response:
{
"took": 47,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 5,
"relation": "eq"
},
"max_score": 0.65891314,
"hits": [
{
"_index": "index-test",
"_id": "2",
"_score": 0.65891314,
"_source": {
"name": "A wild animal races across an uncut field with a minimal amount of trees .",
"id": "1775029934.jpg"
}
},
{
"_index": "index-test",
"_id": "1",
"_score": 0.6278618,
"_source": {
"name": "A West Virginia university women 's basketball team , officials , and a small gathering of fans are in a West Virginia arena .",
"id": "4319130149.jpg"
}
},
{
"_index": "index-test",
"_id": "5",
"_score": 0.62723345,
"_source": {
"name": "A rodeo cowboy , wearing a cowboy hat , is being thrown off of a wild white horse .",
"id": "2691147709.jpg"
}
},
{
"_index": "index-test",
"_id": "3",
"_score": 0.6229783,
"_source": {
"name": "People line the stands which advertise Freemont 's orthopedics , a cowboy rides a light brown bucking bronco .",
"id": "2664027527.jpg"
}
},
{
"_index": "index-test",
"_id": "4",
"_score": 0.5791679,
"_source": {
"name": "A man who is riding a wild horse in the rodeo is very near to falling off .",
"id": "4427058951.jpg"
}
}
]
}
}
Hybrid search
GET {{host}}/index-test/_search?search_pipeline=nlp-search-pipeline
{
"_source": {
"excludes": [
"passage_embedding"
]
},
"query": {
"hybrid": {
"queries": [
{
"match": {
"name": {
"query": "wild west"
}
}
},
{
"neural": {
"passage_embedding": {
"query_text": "wild west",
"model_id": "{{model_id}}",
"k": 20
}
}
}
]
}
}
}
Response:
{
"took": 60,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 5,
"relation": "eq"
},
"max_score": 2.4157035,
"hits": [
{
"_index": "index-test",
"_id": "1",
"_score": 2.4157035,
"_source": {
"name": "A West Virginia university women 's basketball team , officials , and a small gathering of fans are in a West Virginia arena .",
"id": "4319130149.jpg"
}
},
{
"_index": "index-test",
"_id": "2",
"_score": 1.2398489,
"_source": {
"name": "A wild animal races across an uncut field with a minimal amount of trees .",
"id": "1775029934.jpg"
}
},
{
"_index": "index-test",
"_id": "5",
"_score": 1.1795204,
"_source": {
"name": "A rodeo cowboy , wearing a cowboy hat , is being thrown off of a wild white horse .",
"id": "2691147709.jpg"
}
},
{
"_index": "index-test",
"_id": "4",
"_score": 1.1181643,
"_source": {
"name": "A man who is riding a wild horse in the rodeo is very near to falling off .",
"id": "4427058951.jpg"
}
},
{
"_index": "index-test",
"_id": "3",
"_score": 0.6229783,
"_source": {
"name": "People line the stands which advertise Freemont 's orthopedics , a cowboy rides a light brown bucking bronco .",
"id": "2664027527.jpg"
}
}
]
}
}
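To make the symptom concrete, the following Python sketch checks that each hybrid score above equals the plain sum of the lexical and semantic scores for the same document. All numbers are copied from the three responses in this issue; the variable and function names are illustrative only, and a missing sub-query score is treated as 0.

```python
import math

# Scores copied from the lexical, semantic, and hybrid responses above, keyed by _id.
lexical = {"1": 1.7878418, "2": 0.58093566, "5": 0.55228686, "4": 0.53899646}
semantic = {
    "2": 0.65891314,
    "1": 0.6278618,
    "5": 0.62723345,
    "3": 0.6229783,
    "4": 0.5791679,
}
hybrid = {
    "1": 2.4157035,
    "2": 1.2398489,
    "5": 1.1795204,
    "4": 1.1181643,
    "3": 0.6229783,
}


def plain_sum(doc_id):
    """Sum of the raw sub-query scores; a document missing from a sub-query contributes 0."""
    return lexical.get(doc_id, 0.0) + semantic.get(doc_id, 0.0)


# Documents whose hybrid score differs from the plain sum by more than rounding noise.
mismatches = [
    doc_id
    for doc_id, score in hybrid.items()
    if not math.isclose(plain_sum(doc_id), score, abs_tol=1e-6)
]
print(mismatches)  # prints []
```

Document 3 has no lexical hit, so its hybrid score is just its semantic score, which is consistent with the same pattern.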
What is the expected behavior?
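Note that in the hybrid search response above, scores are higher than 1. With min_max normalization and combination weights that sum to 1, no combined score should exceed 1, so this means that normalization was not applied.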
The expected result is what we get when the "test" field on the index is defined with type "text":
{
"took": 87,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 5,
"relation": "eq"
},
"max_score": 0.88318545,
"hits": [
{
"_index": "index-test",
"_id": "1",
"_score": 0.88318545,
"_source": {
"name": "A West Virginia university women 's basketball team , officials , and a small gathering of fans are in a West Virginia arena .",
"id": "4319130149.jpg"
}
},
{
"_index": "index-test",
"_id": "2",
"_score": 0.32350767,
"_source": {
"name": "A wild animal races across an uncut field with a minimal amount of trees .",
"id": "1775029934.jpg"
}
},
{
"_index": "index-test",
"_id": "5",
"_score": 0.18827114,
"_source": {
"name": "A rodeo cowboy , wearing a cowboy hat , is being thrown off of a wild white horse .",
"id": "2691147709.jpg"
}
},
{
"_index": "index-test",
"_id": "3",
"_score": 0.16481397,
"_source": {
"name": "People line the stands which advertise Freemont 's orthopedics , a cowboy rides a light brown bucking bronco .",
"id": "2664027527.jpg"
}
},
{
"_index": "index-test",
"_id": "4",
"_score": 0.001,
"_source": {
"name": "A man who is riding a wild horse in the rodeo is very near to falling off .",
"id": "4427058951.jpg"
}
}
]
}
}
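The expected scores above can be reproduced from the raw sub-query scores with a short Python sketch of min_max normalization followed by a weighted arithmetic mean with weights 0.7 and 0.3. This is not the plugin's code. Two details are assumptions inferred from the numbers in this issue: a score that normalizes to 0 is floored at 0.001 (see document 4), and a document missing from a sub-query contributes 0 while the weighted sum is still divided by the sum of all weights (see document 3).

```python
import math

MIN_SCORE = 0.001  # assumption: floor for scores that normalize to 0

# Raw scores copied from the lexical and semantic responses in this issue.
lexical = {"1": 1.7878418, "2": 0.58093566, "5": 0.55228686, "4": 0.53899646}
semantic = {
    "2": 0.65891314,
    "1": 0.6278618,
    "5": 0.62723345,
    "3": 0.6229783,
    "4": 0.5791679,
}


def min_max(scores):
    """Rescale one sub-query's scores to [0, 1], flooring zeros at MIN_SCORE."""
    lo, hi = min(scores.values()), max(scores.values())
    normalized = {}
    for doc_id, score in scores.items():
        value = (score - lo) / (hi - lo) if hi > lo else 1.0
        normalized[doc_id] = value if value > 0 else MIN_SCORE
    return normalized


def weighted_mean(per_query, weights):
    """Combine normalized scores per document; a missing score counts as 0."""
    doc_ids = set().union(*per_query)
    total_weight = sum(weights)
    return {
        doc_id: sum(w * q.get(doc_id, 0.0) for q, w in zip(per_query, weights))
        / total_weight
        for doc_id in doc_ids
    }


combined = weighted_mean([min_max(lexical), min_max(semantic)], [0.7, 0.3])
for doc_id, score in sorted(combined.items(), key=lambda kv: -kv[1]):
    print(doc_id, round(score, 6))
```

Under these assumptions the sketch yields the same ranking as the expected response (1, 2, 5, 3, 4), and every combined score stays at or below 1.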
What is your host/environment?
I ran it on Docker on Mac M2