Document factors leading to search result relevance #597

mejackreed · 2019-08-20T23:35:27Z

Spun out of #532

A prerequisite to improving relevance.

Why do some components appear higher in results than their containing collections? (Related: "Boost collection-level results", #531.)
Are there any suggestions for improving the current indexing and search approach?

labradford · 2019-08-26T18:28:17Z

Here is how the boosting is currently set up in the solrconfig.xml

If we have this Collection and Work scenario and search for the word "fox":

The Work will have a higher score because it has more occurrences in text fields for "fox", and it receives a score of 10 for each occurrence.

A possible solution might be to boost the Collection text to double that of the Work text, which may give the Collection a higher score than the Work.
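
To make the arithmetic concrete, here is a toy version of the simplified scoring described above, where each occurrence just adds the field boost. Solr's real scoring is more involved than this, and the occurrence counts are invented for illustration rather than taken from any actual EAD.

```python
def toy_score(occurrences, boost):
    """Simplified model from the comment above: each match adds `boost`."""
    return occurrences * boost


# Hypothetical counts of "fox" in text fields (not from real data).
work_hits, collection_hits = 3, 2

# (collection score, work score) with equal boosts of 10.
before = (toy_score(collection_hits, 10), toy_score(work_hits, 10))
# Same counts, but the Collection text boost is doubled to 20.
after = (toy_score(collection_hits, 20), toy_score(work_hits, 10))

print("collection vs work, equal boosts:", before)
print("collection vs work, collection boost doubled:", after)
```

Under this model, doubling only flips the order while the Work has fewer than twice as many matches as the Collection, which is why it "may" rather than "will" help.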

Another possible solution is the Query Elevation Component:
https://lucene.apache.org/solr/guide/8_1/the-query-elevation-component.html

This article might be helpful as well:
https://medium.com/@pablocastelnovo/if-they-match-i-want-them-to-be-always-first-boosting-documents-in-apache-solr-with-the-boost-362abd36476c

billdueber · 2019-09-03T18:26:45Z

Do we have concrete examples based on some of the EADs we actually have where search results strike people as weird?

Is the main problem keyword searching, where we want collections to show up first? Or known-item searching and we should have a more specialized collection search?

We're going to be fighting two things here. One, as @labradford noted, is that works just plain might have more matches in them, which we should be able to deal with via a simple boost. If not, I've used a boost function in the past that just plain gives a little extra juice to a specific field value (which would be, what, component_level_isim=1 in this case?). We could similarly use a boost function to give more relevancy to a document the closer it is to the root, too, if that's what we want (e.g., set the boost to 5 - component_level or something).
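
A rough sketch of what those two options could look like as edismax request parameters. The `bq` and `bf` parameters and the `sub()` function are standard Solr; everything else is an assumption for illustration: the field name and the value 1 come from the guess above, and the weight `^5` and the ceiling of 5 are arbitrary. If `component_level_isim` is multivalued (as the `m` suffix suggests), a plain field reference inside a function may not work and would need adjusting.

```python
from urllib.parse import urlencode


def boosted_params(query, use_function=False, max_level=5):
    """Build edismax params that favor documents near the root.

    Field name, weight, and max_level are illustrative assumptions.
    """
    params = {"defType": "edismax", "q": query}
    if use_function:
        # Additive boost function: larger for smaller component levels.
        params["bf"] = f"sub({max_level},component_level_isim)"
    else:
        # Additive boost query: extra weight for one specific field value.
        params["bq"] = "component_level_isim:1^5"
    return params


print("/select?" + urlencode(boosted_params("fox")))
print("/select?" + urlencode(boosted_params("fox", use_function=True)))
```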

A similar effect could be had by doubling up on some fields, e.g. there could be a collection_title field that is only populated for collections. collection_title_whatever could then be used with a higher boost, and since it's only populated for collections it'd give more juice to them. That starts to complicate the indexing configuration, though, which I think we'd like to avoid. An edismax boost function would probably be better.

The second confounding factor we might be fighting is differences in field length: the same number of matches represents a higher percentage of the available tokens in a shorter document, and thus it rises to the top. To the extent that the titles/text for a 'work' are shorter than those of its collection, the work is gonna win. I have this problem all the time in the library catalogs I run, where the crappiest records rise to the top because they're so much shorter.

I don't have a good solution for a broad-based search (e.g., against the full text of the document). An option for single-field queries like we have here is to have a sister field (or fields) with omitNorms set to true to ignore field length and use it for boosting instead, but that might really screw things up.
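
A small numeric sketch of the field-length effect, assuming a BM25-style similarity (the thread doesn't say which similarity this index actually uses). The document lengths and term counts are made up. Setting `b=0` switches off length normalization, which is roughly the effect a norm-free sister field would have.

```python
def bm25_tf(tf, doc_len, avg_len, k1=1.2, b=0.75):
    """Term-frequency part of BM25; b controls length normalization."""
    norm = 1 - b + b * doc_len / avg_len
    return tf * (k1 + 1) / (tf + k1 * norm)


avg_len = 100  # hypothetical average field length in tokens

# Same number of matches, very different field lengths.
short_work = bm25_tf(tf=2, doc_len=10, avg_len=avg_len)
long_collection = bm25_tf(tf=2, doc_len=400, avg_len=avg_len)
print(f"short work: {short_work:.3f}, long collection: {long_collection:.3f}")

# With b=0 the length term drops out, so both documents score the same.
no_norm_work = bm25_tf(tf=2, doc_len=10, avg_len=avg_len, b=0)
no_norm_collection = bm25_tf(tf=2, doc_len=400, avg_len=avg_len, b=0)
print(f"without length norm: {no_norm_work:.3f} vs {no_norm_collection:.3f}")
```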

If the general goal is to have collections show up first during keyword searches, I would go with the boost function as described above.

If the real problem is known-item searching, I recommend we have a collection search type, and then implement multiple suggest handlers, one for each search type, to get rid of the noise. I've done this before with much kicking and screaming by Blacklight (see https://quod.lib.umich.edu/m/middle-english-dictionary/dictionary where the "headword," "headword with alternate spellings," and "Modern English equivalent" all pull from different suggest handlers).
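
As a sketch of the one-handler-per-search-type idea, the client side could be as simple as a lookup from search type to handler path. The handler paths, search type names, and base URL below are hypothetical; the real ones would be whatever gets defined in solrconfig.xml. `suggest` and `suggest.q` are the standard Solr suggester parameters.

```python
from urllib.parse import urlencode

# Hypothetical handler paths, one per search type.
SUGGEST_HANDLERS = {
    "keyword": "/suggest_keyword",
    "collection": "/suggest_collection",
}


def suggest_url(base_url, search_type, text):
    """Pick the suggest handler for a search type; fall back to keyword."""
    handler = SUGGEST_HANDLERS.get(search_type, SUGGEST_HANDLERS["keyword"])
    query = urlencode({"suggest": "true", "suggest.q": text})
    return f"{base_url}{handler}?{query}"


print(suggest_url("http://localhost:8983/solr/core", "collection", "fox"))
```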

anarchivist · 2019-09-05T06:01:22Z

Thanks both. I think there's enough actionable work here after PO discussion; see #723, #724, #725, #726, and #727. Let's split out more discussion there (and on new tickets) as needed.

labradford self-assigned this Aug 22, 2019

anarchivist closed this as completed Sep 5, 2019
