Support "minimum_should_match" in `pfocr` API #88

erikyao · 2022-08-31T06:35:28Z

Brief

Requirement discussion:

requirements on PFOCR API

cap of n results

sort by # of entities matching, min 2 matches

batch across many results

assuming non-clashing IDs

Request structure:

# For a single input list [ID1 ID2]
{
    "query": {
        "multi_match": {
            "query": "ID1 ID2",  # or ["ID1 ID2", "ID3 ID4"] for batch queries
            "type": "best_fields",
            "fields": "associatedWith.mentions.genes.ncbigene",
            "operator": "OR",
            "lenient": True,
            "analyzer": "whitespace",
            "minimum_should_match": 2
        }
    }
}
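As a rough illustration of the request structure above, the body for a single input list could be assembled like this. The function name `build_pfocr_query` and the optional `size` argument (standing in for the "cap of n results" requirement) are assumptions made for this sketch, not part of the existing API.

```python
from typing import Any, Dict, Iterable, Optional

# Field name taken from the request structure in this issue.
PFOCR_GENE_FIELD = "associatedWith.mentions.genes.ncbigene"


def build_pfocr_query(
    ids: Iterable[str],
    minimum_should_match: int = 2,
    size: Optional[int] = None,
) -> Dict[str, Any]:
    """Build a multi_match request body for one list of gene IDs.

    Hypothetical helper: it only assembles the dict, it does not talk to ES.
    """
    body: Dict[str, Any] = {
        "query": {
            "multi_match": {
                # The whitespace analyzer splits this string back into IDs.
                "query": " ".join(str(i) for i in ids),
                "type": "best_fields",
                "fields": PFOCR_GENE_FIELD,
                "operator": "OR",
                "lenient": True,
                "analyzer": "whitespace",
                "minimum_should_match": minimum_should_match,
            }
        }
    }
    if size is not None:
        # Assumed mapping of "cap of n results" onto the ES `size` parameter.
        body["size"] = size
    return body
```

Calling `build_pfocr_query(["5601", "5595"], size=10)` would yield a body whose `query` string is `"5601 5595"` and which asks for at most 10 hits.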

How to implement

Keys:

Implement a new class PfocrQueryBuilder(ESQueryBuilder) that subclasses biothings.web.query.ESQueryBuilder from biothings.api
Configure PfocrQueryBuilder in config_web/pfocr.py
Allow the minimum_should_match parameter (for POST queries only, for now) in config_web/pfocr.py
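The idea of a builder that keeps the default behavior unless overriding arguments are passed can be sketched as follows. This is a standalone illustration only: `BaseBuilderStandIn` is a placeholder and not the real `biothings.web.query.ESQueryBuilder`, and the method name `build_string_query` and the placeholder default query are assumptions.

```python
from typing import Any, Dict

PFOCR_GENE_FIELD = "associatedWith.mentions.genes.ncbigene"


class BaseBuilderStandIn:
    """Placeholder for the real base builder (NOT the biothings class)."""

    def build_string_query(self, q: str, options: Dict[str, Any]) -> Dict[str, Any]:
        # Placeholder default; the real default query may look different.
        return {"query": {"query_string": {"query": q}}}


class PfocrQueryBuilderSketch(BaseBuilderStandIn):
    """Switch to multi_match only when an overriding argument is present."""

    OVERRIDES = ("operator", "analyzer", "minimum_should_match")

    def build_string_query(self, q: str, options: Dict[str, Any]) -> Dict[str, Any]:
        overrides = {
            key: options[key]
            for key in self.OVERRIDES
            if options.get(key) is not None
        }
        if not overrides:
            # No overriding arguments: keep the inherited behavior.
            return super().build_string_query(q, options)
        clause = {"query": q, "fields": PFOCR_GENE_FIELD, "lenient": True}
        clause.update(overrides)
        return {"query": {"multi_match": clause}}
```

With an empty `options` dict the sketch returns whatever the base class produces; passing `{"minimum_should_match": 2}` switches to the `multi_match` form.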

Reference implementation:

Resource Discovery API

Future Work

Q1: Do we need a new, separate handler? Otherwise we would have to change the default behavior of the query handler.
A1: We decided to use the original query handler with a customized ESQueryBuilder. The new query builder keeps its default behavior unless we pass in overriding arguments, i.e. operator, analyzer, and minimum_should_match so far.

Q2: Related to Q1, determine if we need a customized structure of response.
A2: TBD

Q3: Does the sorting work as expected?
A3: See the ES explain section below

Q4: How do we boost on the number of matched should clauses?
A4: TBD


erikyao · 2022-09-01T21:20:16Z

How ES `explain` the scoring

We can see how ES calculates the score for each matched document using explain, e.g.:

GET pfocr_20201204_riylt7vd/_search
{
  "explain": true,
  "query": {
    "multi_match": {
      "query": "5601 5595 10189 10333",
      "fields": "associatedWith.mentions.genes.ncbigene",
      "operator": "OR",
      "lenient": true,
      "analyzer": "whitespace",
      "minimum_should_match": 2
    }
  }
}

The top document's _score is the sum of the scores of its 4 matched terms/IDs.

The second document matched only 3 terms/IDs, so its _score is the sum of those 3 terms' contributions.
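To check this on a real response, the per-term contributions can be pulled out of a hit's `_explanation` tree. The sketch below assumes the usual shape of ES explain output (nested nodes with `value`, `description`, and `details`, and per-term nodes whose description starts with `weight(`). The helper name `term_contributions` is hypothetical.

```python
from typing import Any, Dict, List, Tuple


def term_contributions(explanation: Dict[str, Any]) -> List[Tuple[str, float]]:
    """Collect (description, value) for every per-term weight node.

    Walks the explanation tree depth-first and stops descending once a
    `weight(...)` node is found, so each matched term is counted once.
    """
    description = explanation.get("description", "")
    if description.startswith("weight("):
        return [(description, explanation["value"])]
    found: List[Tuple[str, float]] = []
    for child in explanation.get("details", []):
        found.extend(term_contributions(child))
    return found
```

For a hit from the query above, `len(term_contributions(hit["_explanation"]))` should equal the number of matched IDs, and the sum of the values should match the hit's `_score` up to floating-point rounding, provided the top-level node is a plain "sum of:".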

erikyao self-assigned this Aug 31, 2022

erikyao mentioned this issue Aug 31, 2022

First implementation to Issue#88 #89

Merged

erikyao closed this as completed Sep 13, 2022

andrewsu mentioned this issue Sep 16, 2022

Improve scoring in PFOCR API based on specificity #91

Closed
