Document factors leading to search result relevance #597

Closed
2 tasks done
mejackreed opened this issue Aug 20, 2019 · 3 comments
@mejackreed
Collaborator

mejackreed commented Aug 20, 2019

Spun out of #532

A prerequisite to improving relevance.

  • Why do some components appear higher in results than their containing collections? (Related: "Boost collection-level results", #531)
  • Are there any suggestions for improving the current indexing and search approach?
labradford self-assigned this Aug 22, 2019
@labradford
Contributor

Here is how the boosting is currently set up in the solrconfig.xml

If we have this Collection and Work scenario and search for the word "fox":

The Work will have a higher score because it has more occurrences in text fields for "fox", and it receives a score of 10 for each occurrence.

One possible solution is to boost the Collection text to double the boost of the Work text, which may give the Collection a higher score than the Work.
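
To see why doubling "may" (rather than "will") be enough, here is a minimal sketch of the simplified model used in this comment, where a document's score is just its boost times its number of matching occurrences. The occurrence counts are hypothetical, and real Solr scoring is not linear in occurrences, so treat this only as an illustration of the arithmetic:

```python
# Simplified model from the comment above: score = boost * matching occurrences.
# ASSUMPTION: real Solr scoring is not this simple; the counts below are made up.

def simple_score(occurrences: int, boost: float) -> float:
    """Score under the simplified 'boost per occurrence' model."""
    return occurrences * boost


WORK_BOOST = 10          # boost per occurrence, as described in the comment
collection_hits = 1      # hypothetical: "fox" appears once in the Collection text
work_hits = 3            # hypothetical: "fox" appears three times in the Work text

for collection_boost in (10, 20, 40):
    c = simple_score(collection_hits, collection_boost)
    w = simple_score(work_hits, WORK_BOOST)
    winner = "Collection" if c > w else "Work" if w > c else "tie"
    print(f"collection boost {collection_boost}: collection={c}, work={w} -> {winner}")
```

Under these assumed counts, doubling the Collection boost still leaves the Work ahead; the Collection only wins once its boost exceeds the Work boost multiplied by the ratio of occurrences.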

The Query Elevation Component might be another option:
https://lucene.apache.org/solr/guide/8_1/the-query-elevation-component.html

This article might also be helpful:
https://medium.com/@pablocastelnovo/if-they-match-i-want-them-to-be-always-first-boosting-documents-in-apache-solr-with-the-boost-362abd36476c

@billdueber
Contributor

billdueber commented Sep 3, 2019

Do we have concrete examples based on some of the EADs we actually have where search results strike people as weird?

Is the main problem keyword searching, where we want collections to show up first? Or is it known-item searching, in which case we should have a more specialized collection search?

We're going to be fighting two things here. One, as @labradford noted, is that works just plain might have more matches in them, which we should be able to deal with via a simple boost. If not, I've used a boost function in the past that just plain gives a little extra juice to a specific field value (which would be, what, component_level_isim=1 in this case?). We could similarly use a boost function to give more relevancy to a document the closer it is to the root, too, if that's what we want (e.g., set the boost to 5 - component_level or something).
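
As a rough sketch of the two ideas above, the following builds edismax request parameters in Python: one variant uses a boost query (`bq`) to favor a specific field value, the other uses an additive boost function (`bf`) of the form 5 - level. The helper names, the default of 5, and the single-valued field `component_level_i` are assumptions for illustration. By the usual dynamic-field naming convention `component_level_isim` is multivalued, so it may not be usable in a function query as-is:

```python
from urllib.parse import urlencode

# ASSUMPTION: a hypothetical single-valued integer field holding the depth of a
# document in the hierarchy (1 = collection). Not taken from the actual schema.
LEVEL_FIELD = "component_level_i"


def boost_query_params(query: str, level: int = 1, weight: int = 5) -> dict:
    """edismax params that add extra score to documents at one specific level."""
    return {
        "defType": "edismax",
        "q": query,
        "bq": f"{LEVEL_FIELD}:{level}^{weight}",   # boost query on a field value
    }


def boost_function_params(query: str, max_level: int = 5) -> dict:
    """edismax params that add more score the closer a document is to the root."""
    return {
        "defType": "edismax",
        "q": query,
        "bf": f"sub({max_level},{LEVEL_FIELD})",   # additive boost: 5 - level
    }


if __name__ == "__main__":
    # Print the query strings one would append to a Solr select URL.
    print(urlencode(boost_query_params("fox")))
    print(urlencode(boost_function_params("fox")))
```

Both `bq` and `bf` are additive in edismax, so the weights would need tuning against the size of the text-match scores; a multiplicative `boost` parameter is another option.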

A similar effect could be had by doubling up on some fields, e.g. there could be a collection_title field that is only populated for collections. collection_title_whatever could then be used with a higher boost, and since it's only populated for collections it'd give more juice to them. That starts to complicate the indexing configuration, though, which I think we'd like to avoid. An edismax boost function would probably be better.
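
For comparison, the "doubled-up field" approach might look roughly like this at indexing time. This is only a sketch: the input keys (`level`, `title`), the field names, and the `qf` weights are all hypothetical, and the project's real indexing pipeline is not shown in this thread:

```python
# ASSUMPTION: documents are plain dicts with hypothetical "level" and "title" keys.

def add_collection_title(doc: dict) -> dict:
    """Return a copy of doc with a title field that only collections populate."""
    out = dict(doc)
    if doc.get("level") == "collection":
        # Only collections get this field, so a high boost on it favors them.
        out["collection_title_tesim"] = doc["title"]
    return out


# Hypothetical edismax qf string: the collection-only field gets the largest boost.
QF = "collection_title_tesim^20 title_tesim^10 text^1"

if __name__ == "__main__":
    print(add_collection_title({"level": "collection", "title": "Fox family papers"}))
    print(add_collection_title({"level": "file", "title": "Letters about a fox"}))
```

This shows the cost mentioned above: every such field adds a rule to the indexing configuration and an entry to `qf`.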

The second confounding factor we might be fighting is differences in field length: a similar number of matches represents a higher percentage of the available tokens in the shorter document, and thus it rises to the top. To the extent that titles/text for a 'work' have less text than its collection, the work is gonna win. I have this problem all the time in the library catalogs I run, where the crappiest records rise to the top because they're so much shorter.

I don't have a good solution for a broad-based search (e.g., against the full text of the document). An option for single-field queries like we have here is to have a sister field (or fields) with omitNorms set to true to ignore field length and use it for boosting instead, but that might really screw things up.
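
The field-length effect, and what ignoring length does to it, can be illustrated with the textbook BM25 term-frequency component. This is a sketch under assumptions: it assumes a BM25-style similarity with the common defaults k1 = 1.2 and b = 0.75, the token counts are invented, and setting b = 0 is used here only as an approximation of what a field with omitNorms would do:

```python
def bm25_tf(freq: float, doc_len: float, avg_len: float,
            k1: float = 1.2, b: float = 0.75) -> float:
    """Textbook BM25 term-frequency component (without the IDF factor)."""
    length_norm = 1 - b + b * (doc_len / avg_len)
    return freq * (k1 + 1) / (freq + k1 * length_norm)


AVG_LEN = 100  # hypothetical average field length in tokens

# Same number of matches (2) in a short 'work' field and a long 'collection' field.
short_doc = bm25_tf(freq=2, doc_len=10, avg_len=AVG_LEN)
long_doc = bm25_tf(freq=2, doc_len=500, avg_len=AVG_LEN)
print(f"with length normalization: short={short_doc:.3f}, long={long_doc:.3f}")

# b=0 removes the length term, approximating a field indexed without norms.
short_no_norm = bm25_tf(freq=2, doc_len=10, avg_len=AVG_LEN, b=0)
long_no_norm = bm25_tf(freq=2, doc_len=500, avg_len=AVG_LEN, b=0)
print(f"ignoring length: short={short_no_norm:.3f}, long={long_no_norm:.3f}")
```

With length normalization the shorter field scores higher for the same number of matches; without it the two score the same, which is the behavior a norm-free sister field would contribute.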

If the general goal is to have collections show up first during keyword searches, I would go with the boost function as described above.

If the real problem is known-item searching, I recommend we have a collection search type, and then implement multiple suggest handlers, one for each search type, to get rid of the noise. I've done this before with much kicking and screaming by Blacklight (see https://quod.lib.umich.edu/m/middle-english-dictionary/dictionary where the "headword," "headword with alternate spellings," and "Modern English equivalent" all pull from different suggest handlers).

@anarchivist
Member

Thanks both. I think there's enough actionable work here after PO discussion; see #723, #724, #725, #726, and #727. Let's split out more discussion there (and on new tickets) as needed.
