Improve scoring in PFOCR API based on specificity #91

andrewsu · 2022-09-16T16:11:07Z

In #88, @erikyao implemented a very useful new parameter for minimum_should_match. It lets users submit a list of entities and get back only the documents that match at least a given number of those entities.
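To make the matching behavior concrete, here is a minimal in-memory sketch. It is not the PFOCR API implementation: the function name `filter_by_min_match`, the `documents` mapping, and the example figure ids are all hypothetical and only illustrate the "match at least N of the inputs" idea.

```python
def filter_by_min_match(query, documents, minimum_should_match):
    """Keep documents that contain at least `minimum_should_match` query entities.

    `documents` is a hypothetical mapping of document id -> iterable of entities.
    Returns a dict of document id -> number of matched entities.
    """
    query = set(query)
    hits = {}
    for doc_id, entities in documents.items():
        matched = len(query & set(entities))
        if matched >= minimum_should_match:
            hits[doc_id] = matched
    return hits


# Hypothetical toy documents (not real PFOCR records).
example_docs = {
    "fig_1": {"TP53", "MDM2"},
    "fig_2": {"TP53"},
    "fig_3": {"BAX"},
}
example_hits = filter_by_min_match(["TP53", "MDM2", "ATM"], example_docs, 2)
```

With a threshold of 2, only the document sharing two of the three input genes is kept.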

Based on this comment from @erikyao, scoring is currently based only on the number of matched entities. I can imagine two ways of improving the scoring based on specificity.

Specificity of the query term: Consider the situation where the gene list in my query includes both TP53 (a very commonly-studied gene) and ANKRD37 (an almost completely-uncharacterized gene). Right now, a match to TP53 is scored the same as a match to ANKRD37, but matches to TP53 are much more common. It would be reasonable to weight matches to query terms differently based on how commonly they are found in the PFOCR dataset.
Specificity of the matched pathway: Consider the situation where we have two pathways that match the exact same query terms. Currently, those would be scored the same. But if one pathway has 100 genes in total and the second has just 10, then the second pathway is probably more relevant to the input query set, so it should be scored higher.
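A rough sketch of how the two ideas could combine, under stated assumptions: query-term specificity is approximated with an IDF-style weight (genes found in fewer pathways weigh more), and pathway specificity with a division by the square root of the pathway's gene count. Both formulas, the function names, and the toy data are illustrative choices of mine, not anything the PFOCR API does today.

```python
import math

# Hypothetical toy data: pathway figure id -> set of genes (not real PFOCR content).
PATHWAYS = {
    "fig_a": {"TP53", "ANKRD37", "MDM2"},
    "fig_b": {"TP53", "MDM2", "CDKN1A", "BAX", "ATM", "CHEK2"},
    "fig_c": {"TP53", "BAX"},
    # A large pathway: 2 named genes plus 98 placeholder genes (100 total).
    "fig_d": {"TP53", "ANKRD37"} | {f"GENE{i}" for i in range(98)},
}


def term_weights(pathways):
    """IDF-style weight per gene: genes seen in fewer pathways get larger weights."""
    n_pathways = len(pathways)
    doc_freq = {}
    for genes in pathways.values():
        for gene in genes:
            doc_freq[gene] = doc_freq.get(gene, 0) + 1
    return {gene: math.log(1 + n_pathways / df) for gene, df in doc_freq.items()}


def score(query, genes, weights):
    """Sum the weights of matched genes, then penalize large pathways.

    Dividing by sqrt(pathway size) is an assumed normalization; other
    choices (for example dividing by the size itself) are equally plausible.
    """
    matched = set(query) & genes
    if not matched:
        return 0.0
    return sum(weights[g] for g in matched) / math.sqrt(len(genes))


def rank(query, pathways):
    """Return (pathway id, score) pairs sorted from most to least relevant."""
    weights = term_weights(pathways)
    scored = [(pid, score(query, genes, weights)) for pid, genes in pathways.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


ranking = rank({"TP53", "ANKRD37"}, PATHWAYS)
```

In this toy set, TP53 appears in every pathway and ANKRD37 in only two, so a match to ANKRD37 contributes more. fig_a and fig_d match the same two query genes, but fig_a ranks higher because it is much smaller.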

andrewsu · 2023-04-12T19:19:00Z

closing based on subsequent work by @tokebe that supersedes this.

andrewsu closed this as completed Apr 12, 2023
