Improve scoring in PFOCR API based on specificity #91

Closed
andrewsu opened this issue Sep 16, 2022 · 1 comment

Comments

@andrewsu
Member

In #88, @erikyao implemented a very useful new minimum_should_match parameter. It lets users submit a list of entities and get back only the documents that match at least a specified number of them.

Based on this comment from @erikyao, scoring is currently based only on the number of matched entities. I can imagine two ways of improving the scoring based on specificity.

  • Specificity of the query term: Consider the situation where the gene list in my query includes both TP53 (a very commonly studied gene) and ANKRD37 (an almost completely uncharacterized gene). Right now, a match to TP53 is scored the same as a match to ANKRD37, even though matches to TP53 are much more common. It would be reasonable to weight matches to query terms differently based on how commonly they are found in the PFOCR dataset.
  • Specificity of the matched pathway: Consider the situation where we have two pathways that match the exact same query terms. Currently, those would be scored the same. But if one pathway has 100 genes overall and the second has just 10, then the second pathway is probably more relevant to the input query set, so it should be scored higher.
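
To make the two ideas concrete, here is a minimal sketch, not the PFOCR API's actual scoring. It assumes an IDF-style weight for each gene (rarer in the dataset means a higher weight) and a square-root penalty on pathway size. The figure IDs, gene lists, and the function names `term_weights` and `score` are all hypothetical and exist only for illustration.

```python
import math

# Hypothetical toy data: figure ID -> genes found in that figure.
# These IDs and gene lists are invented for illustration, not taken from PFOCR.
figures = {
    "fig_small": ["TP53", "ANKRD37", "GENE1"],
    "fig_large": ["TP53", "ANKRD37"] + [f"GENE{i}" for i in range(1, 99)],
    "fig_other1": ["TP53", "GENE1"],
    "fig_other2": ["TP53", "GENE2"],
}


def term_weights(figures):
    """IDF-style weight per gene: genes found in fewer figures weigh more."""
    n_figures = len(figures)
    doc_freq = {}
    for genes in figures.values():
        for gene in set(genes):
            doc_freq[gene] = doc_freq.get(gene, 0) + 1
    return {gene: math.log(1 + n_figures / df) for gene, df in doc_freq.items()}


def score(query, genes, weights):
    """Sum the weights of matched genes, then penalize large pathways.

    Dividing by sqrt(pathway size) is one assumed normalization among many.
    """
    gene_set = set(genes)
    matched = set(query) & gene_set
    if not matched:
        return 0.0
    raw = sum(weights.get(gene, 0.0) for gene in matched)
    return raw / math.sqrt(len(gene_set))


weights = term_weights(figures)
query = ["TP53", "ANKRD37"]

# Rank figure IDs from highest to lowest score for this query.
ranked = sorted(
    figures,
    key=lambda fig_id: score(query, figures[fig_id], weights),
    reverse=True,
)
```

In this toy data, ANKRD37 gets a larger weight than TP53 because it appears in fewer figures, and `fig_small` outranks `fig_large` even though both match the same two query genes. The exponent on the pathway size (0.5 here) is a tuning choice: a smaller exponent would let the match count dominate, a larger one would favor small pathways more strongly.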
@andrewsu
Member Author

closing based on subsequent work by @tokebe that supersedes this.
