Avoid negative scores returned from multi_match query with cross_fields #13829

Merged on May 31, 2024 (4 commits)

@msfroh msfroh commented May 25, 2024

Description

Under specific circumstances, when using cross_fields scoring on a multi_match query, we can end up with negative scores from the inverse document frequency calculation in the BM25 formula.

Specifically, the IDF is calculated as:

log(1 + (N - n + 0.5) / (n + 0.5))

where N is the number of documents containing the field and n is the number of documents containing the given term in the field. Obviously, n should always be less than or equal to N.
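As a quick sanity check on the formula above, here is a minimal Python sketch (not code from this change; the function and variable names are made up for illustration). The expression inside the log simplifies to (N + 1) / (n + 0.5), so the IDF only goes negative once n exceeds N + 0.5.

```python
import math


def bm25_idf(doc_count: int, doc_freq: int) -> float:
    """IDF as written above: log(1 + (N - n + 0.5) / (n + 0.5)).

    doc_count is N (documents containing the field) and
    doc_freq is n (documents containing the term in that field).
    """
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


# n <= N: the argument of log is greater than 1, so the IDF is positive.
idf_valid = bm25_idf(doc_count=10, doc_freq=10)

# n > N: the argument of log drops below 1, so the IDF is negative.
idf_invalid = bm25_idf(doc_count=10, doc_freq=30)

print(idf_valid > 0, idf_invalid < 0)
```

Even the boundary case n = N stays positive, because the argument of the log is then (N + 1) / (N + 0.5).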

Unfortunately, cross_fields makes up a new value for n and tries to use it across all fields.

This change finds the minimum (nonzero) value of N and uses that as an upper bound for the new value of n.
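To illustrate the effect of the bound, here is a toy Python model. It assumes, purely for illustration, that the blended n is the largest per-field document frequency; the real blending logic is more involved, and every name and number below is hypothetical.

```python
import math


def bm25_idf(doc_count, doc_freq):
    """IDF from the description: log(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def blended_doc_freq(field_stats, clamp):
    """Return one n to share across all fields.

    field_stats maps a field name to (N, n) for that field.
    ASSUMPTION: the blended n is modelled as the max n over the fields.
    With clamp=True it is bounded by the minimum nonzero N.
    """
    blended = max(n for _, n in field_stats.values())
    if clamp:
        nonzero_counts = [big_n for big_n, _ in field_stats.values() if big_n > 0]
        if nonzero_counts:
            blended = min(blended, min(nonzero_counts))
    return blended


# Hypothetical index: "title" is sparse, "body" is present in most documents.
stats = {"title": (10, 2), "body": (100, 80)}

unclamped = blended_doc_freq(stats, clamp=False)
clamped = blended_doc_freq(stats, clamp=True)

# Score the sparse field with the shared n.
idf_before = bm25_idf(stats["title"][0], unclamped)  # n > N, negative
idf_after = bm25_idf(stats["title"][0], clamped)     # n == N, positive

print(unclamped, clamped, idf_before < 0, idf_after > 0)
```

In this toy setup the unclamped n (80) exceeds N for the sparse field (10), which is exactly the case that produces a negative IDF. Clamping to the smallest nonzero N restores n <= N for every field.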

Related Issues

Resolves #7860

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • API changes companion pull request created.
  • Failing checks are inspected and point to the corresponding known issue(s) (See: Troubleshooting Failing Builds)
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)
  • Public documentation issue/PR created

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@msfroh msfroh force-pushed the avoid_negative_blended_scores branch from a50898c to 2353a42 Compare May 25, 2024 02:42

❌ Gradle check result for a50898c: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

msfroh commented May 25, 2024

For some context, I came up with this fix after talking myself through the logic of the (previously) failing test in https://github.com/opensearch-project/OpenSearch/pull/13627/files#r1614240922

reta commented May 31, 2024

@msfroh, would you mind backporting this to 2.x manually? Thank you.

@msfroh msfroh deleted the avoid_negative_blended_scores branch June 5, 2024 01:53
msfroh added a commit to msfroh/OpenSearch that referenced this pull request Jun 5, 2024
…lds` (opensearch-project#13829)

Under specific circumstances, when using `cross_fields` scoring on a
`multi_match` query, we can end up with negative scores from the inverse
document frequency calculation in the BM25 formula.

Specifically, the IDF is calculated as:

```
log(1 + (N - n + 0.5) / (n + 0.5))
```

where `N` is the number of documents containing the field and `n` is the
number of documents containing the given term in the field. Obviously,
`n` should always be less than or equal to `N`.

Unfortunately, `cross_fields` makes up a new value for `n` and tries to
use it across all fields.

This change finds the (nonzero) value of `N` for each field and uses that as an
upper bound for the new value of `n`.

Signed-off-by: Michael Froh <froh@amazon.com>

---------

Signed-off-by: Michael Froh <froh@amazon.com>
(cherry picked from commit fffd101)
msfroh added a commit to msfroh/OpenSearch that referenced this pull request Jun 5, 2024
msfroh commented Jun 5, 2024

Backport PR is ready: #13983

parv0201 pushed a commit to parv0201/OpenSearch that referenced this pull request Jun 10, 2024
kkewwei pushed a commit to kkewwei/OpenSearch that referenced this pull request Jul 24, 2024
wdongyu pushed a commit to wdongyu/OpenSearch that referenced this pull request Aug 22, 2024
Labels: backport 2.x, backport-failed, bug, Search:Relevance
Development

Successfully merging this pull request may close these issues.

[BUG] function score query returned an invalid (negative) score with multi match cross fields query
2 participants