Support "minimum_should_match" in pfocr API #88

Closed
erikyao opened this issue Aug 31, 2022 · 1 comment

erikyao commented Aug 31, 2022

Brief

Requirement discussion:

From @newgene:

requirements on PFOCR API

  • cap of n results
  • sort by # of entities matching, min 2 matches
  • batch across many results
  • assuming non-clashing IDs

Request structure:

# for a single list of input IDs [ID1 ID2]
{
    "query": {
        "multi_match": {
            "query": "ID1 ID2",  # or ["ID1 ID2", "ID3 ID4"] for batch queries
            "type": "best_fields",
            "fields": "associatedWith.mentions.genes.ncbigene",
            "operator": "OR",
            "lenient": True,
            "analyzer": "whitespace",
            "minimum_should_match": 2
        }
    }
}
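
A minimal sketch of how the body above could be assembled from a list of IDs. The helper names `build_pfocr_query` and `batch_query_strings` are hypothetical and not part of the pfocr API; the field name and option values are copied from the request structure above.

```python
from typing import Any, Dict, List

# Field name taken from the request structure above.
GENE_FIELD = "associatedWith.mentions.genes.ncbigene"


def build_pfocr_query(ids: List[str], minimum_should_match: int = 2) -> Dict[str, Any]:
    """Hypothetical helper: build the multi_match body for one list of IDs."""
    return {
        "query": {
            "multi_match": {
                # The whitespace analyzer splits this string back into one term per ID.
                "query": " ".join(ids),
                "type": "best_fields",
                "fields": GENE_FIELD,
                "operator": "OR",
                "lenient": True,
                "analyzer": "whitespace",
                "minimum_should_match": minimum_should_match,
            }
        }
    }


def batch_query_strings(id_lists: List[List[str]]) -> List[str]:
    """Hypothetical helper: one whitespace-joined query string per input ID list."""
    return [" ".join(ids) for ids in id_lists]
```

For example, `build_pfocr_query(["ID1", "ID2"])` yields the single-list body shown above, and `batch_query_strings([["ID1", "ID2"], ["ID3", "ID4"]])` yields the `["ID1 ID2", "ID3 ID4"]` form mentioned for batch queries.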

How to implement

Keys:

  1. Implement a new class PfocrQueryBuilder(ESQueryBuilder) that subclasses biothings.web.query.ESQueryBuilder from biothings.api
  2. Configure PfocrQueryBuilder in config_web/pfocr.py
  3. Allow minimum_should_match parameter (for POST queries only, for now) in config_web/pfocr.py
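
The steps above could look roughly like the sketch below. To stay self-contained it uses a stand-in base class, `BaseQueryBuilder`, with a made-up `match_query` method; this is an assumption for illustration and not the real `biothings.web.query.ESQueryBuilder` interface, whose method names and default query may differ. The point it shows is the intended design: the subclass adds `operator`, `analyzer`, and `minimum_should_match` only when they are passed in, and otherwise returns the inherited default unchanged.

```python
from typing import Any, Dict, List


class BaseQueryBuilder:
    """Stand-in for biothings.web.query.ESQueryBuilder; NOT the real class or its API."""

    def match_query(self, q: str, fields: List[str], **options: Any) -> Dict[str, Any]:
        # Illustrative default only; the real builder's default query may differ.
        return {"query": {"multi_match": {"query": q, "fields": fields, "lenient": True}}}


class PfocrQueryBuilder(BaseQueryBuilder):
    """Sketch: layer optional overrides on top of the inherited default query."""

    # The overriding arguments named in this issue.
    OVERRIDES = ("operator", "analyzer", "minimum_should_match")

    def match_query(self, q: str, fields: List[str], **options: Any) -> Dict[str, Any]:
        body = super().match_query(q, fields)
        multi_match = body["query"]["multi_match"]
        for name in self.OVERRIDES:
            value = options.get(name)
            # Only touch the query when the caller explicitly passed an override.
            if value is not None:
                multi_match[name] = value
        return body
```

Calling `PfocrQueryBuilder().match_query("ID1 ID2", ["associatedWith.mentions.genes.ncbigene"])` with no extra arguments returns the same body as the stand-in base class, while passing `minimum_should_match=2` adds just that key.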

Reference implementation:

Future Work

Q1: Do we need a new, separate handler? Otherwise we have to change the default behavior of the query handler.
A1: We decided to use the original query handler with a customized ESQueryBuilder. The new query builder will keep its default behavior if we don't pass in overriding arguments, i.e. operator, analyzer, and minimum_should_match so far

Q2: Related to Q1, determine if we need a customized structure of response.
A2: TBD

Q3: Does the sorting work as expected?
A3: See the ES explain section below

Q4: How do we boost on the number of matched "should" clauses?
A4: TBD

erikyao commented Sep 1, 2022

How ES explains the scoring

We can see how ES calculates the score of each matched document using explain. E.g.:

GET pfocr_20201204_riylt7vd/_search
{
  "explain": true,
  "query": {
    "multi_match": {
      "query": "5601 5595 10189 10333",
      "fields": "associatedWith.mentions.genes.ncbigene",
      "operator": "OR",
      "lenient": true,
      "analyzer": "whitespace",
      "minimum_should_match": 2
    }
  }
}

The top document matched all 4 terms/IDs, so its _score is the sum of the scores of those 4 terms.

The second document matched only 3 terms/IDs, so its _score is the sum of the contributions of those 3 terms.
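
To check this per-term breakdown without reading the explain output by eye, a small helper can walk the `_explanation` tree of a hit. This is a sketch under assumptions: `term_contributions` is a hypothetical name, the regular expression assumes leaf descriptions of the form `weight(<field>:<term> in <doc>) ...`, and the `sample` explanation below uses made-up numbers, not real output from the pfocr index.

```python
import re
from typing import Any, Dict

# Assumed shape of a per-term node description in the explain output.
_WEIGHT = re.compile(r"weight\((?P<field>[^:]+):(?P<term>\S+) in \d+\)")


def term_contributions(explanation: Dict[str, Any]) -> Dict[str, float]:
    """Collect the score contributed by each matched term in an explanation tree.

    Assumes a single queried field, so each term appears at most once.
    """
    match = _WEIGHT.match(explanation.get("description", ""))
    if match:
        return {match.group("term"): explanation["value"]}
    contributions: Dict[str, float] = {}
    for child in explanation.get("details", []):
        contributions.update(term_contributions(child))
    return contributions


FIELD = "associatedWith.mentions.genes.ncbigene"

# Made-up explanation for a document that matched 3 of the 4 query terms.
sample = {
    "value": 5.0,
    "description": "sum of:",
    "details": [
        {"value": 2.5, "description": f"weight({FIELD}:5601 in 42) [PerFieldSimilarity], result of:", "details": []},
        {"value": 1.5, "description": f"weight({FIELD}:5595 in 42) [PerFieldSimilarity], result of:", "details": []},
        {"value": 1.0, "description": f"weight({FIELD}:10189 in 42) [PerFieldSimilarity], result of:", "details": []},
    ],
}

contributions = term_contributions(sample)
```

With this sample, `contributions` has one entry per matched ID, and `sum(contributions.values())` equals the top-level `value`, which is the relationship described above.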
