Document factors leading to search result relevance #597
Comments
Here is how the boosting is currently set up in the solrconfig.xml: […]

If we have this Collection and Work scenario […] and search for the word "fox", the Work will have a higher score because it has more occurrences of "fox" in its text fields, and it receives a score of 10 for each occurrence.

A possible solution might be to boost the Collection text to double that of the Work text, which may give the Collection a higher score than the Work. Using the Query Elevation Component might be another possible solution.

Also, this article might be helpful as well: […]
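To make the arithmetic behind the "fox" scenario concrete, here is a minimal sketch of a purely additive scoring model in which every occurrence contributes the field boost. This is a deliberate simplification, not how Solr actually scores documents; the occurrence counts, document names, and the `toy_score` and `rank` helpers are all hypothetical. Only the per-occurrence value of 10 comes from the comment above.

```python
def toy_score(occurrences: int, boost: float) -> float:
    """Simplified additive model: each matching occurrence contributes `boost`."""
    return occurrences * boost


def rank(docs, boosts):
    """Return (name, score) pairs sorted from highest to lowest score."""
    scored = {
        name: toy_score(count, boosts[kind])
        for name, (kind, count) in docs.items()
    }
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


# Hypothetical counts: the Work mentions "fox" three times, the Collection once.
docs = {"Collection A": ("collection", 1), "Work B": ("work", 3)}

equal_boost = rank(docs, {"collection": 10, "work": 10})
doubled_boost = rank(docs, {"collection": 20, "work": 10})
quadrupled_boost = rank(docs, {"collection": 40, "work": 10})

print(equal_boost)
print(doubled_boost)
print(quadrupled_boost)
```

Under these assumed counts, doubling the Collection boost narrows the gap but the Work still ranks first; the Collection only wins once its boost exceeds the ratio of occurrence counts. That is consistent with the hedged "may" in the proposal: a fixed multiplier helps only when the Work does not have many more matches than the Collection.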
Do we have concrete examples, based on some of the EADs we actually have, where search results strike people as weird? Is the main problem keyword searching, where we want collections to show up first? Or is it known-item searching, in which case we should have a more specialized collection search?

We're going to be fighting two things here. One, as @labradford noted, is that works just plain might have more matches in them, which we should be able to deal with via a simple boost. If not, I've used a […] A similar effect could be had by doubling up on some fields, e.g. there could be a collection_title field that is only populated for collections.

The second confounding factor we might be fighting is differences in field length: a similar number of matches will represent a higher percentage of the available tokens in the shorter document, and thus it rises to the top. To the extent that titles/text for a 'work' have less text than its collection, the work is gonna win. I have this problem all the time in the library catalogs I run, where the crappiest records rise to the top because they're so much shorter. I don't have a good solution for a broad-based search (e.g., against the full text of the document). An option for single-field queries like we have here is to have a sister-field (or fields) with […]

If the general goal is to have collections show up first during keyword searches, I would go with the boost function as described above. If the real problem is known-item searching, I recommend we have a collection search type, and then implement multiple suggest handlers, one for each search type, to get rid of the noise. I've done this before with much kicking and screaming by Blacklight (see https://quod.lib.umich.edu/m/middle-english-dictionary/dictionary where the "headword," "headword with alternate spellings," and "Modern english equivalent" all pull from different suggest handlers).
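The field-length effect described above can be illustrated with a small sketch. It assumes a classic TF-IDF-style formula in which the term-frequency factor is the square root of the match count and the length norm is one over the square root of the field's token count. The similarity actually configured for this index may differ (for example BM25), and the token counts and the `toy_field_score` helper are hypothetical.

```python
import math


def toy_field_score(term_freq: int, field_length: int,
                    use_length_norm: bool = True) -> float:
    """Toy per-field score: sqrt(tf), optionally scaled by 1/sqrt(field length)."""
    tf = math.sqrt(term_freq)
    norm = 1.0 / math.sqrt(field_length) if use_length_norm else 1.0
    return tf * norm


# Hypothetical documents: a short Work title with one match in 4 tokens,
# and a long Collection description with two matches in 100 tokens.
short_work = toy_field_score(term_freq=1, field_length=4)
long_collection = toy_field_score(term_freq=2, field_length=100)

# The same comparison with length normalization switched off, for contrast.
short_work_no_norm = toy_field_score(1, 4, use_length_norm=False)
long_collection_no_norm = toy_field_score(2, 100, use_length_norm=False)

print(short_work > long_collection)
print(long_collection_no_norm > short_work_no_norm)
```

With normalization on, the shorter document wins despite having fewer matches; with it off, the document with more matches wins. Solr fields accept an `omitNorms` attribute that disables length normalization for a field, which is one way to get the second behavior, though that trades away the benefit of normalization for full-text fields.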
Spun out of #532
A pre-requisite to improving relevance.