Solr search result ordering broken #4938

Closed
matthew-a-dunlap opened this issue Aug 8, 2018 · 9 comments

@matthew-a-dunlap
Contributor

matthew-a-dunlap commented Aug 8, 2018

After upgrading solr, the configuration we had to boost search matches on certain Dataverse fields is broken.

Based upon our configuration, the boosting should be in this order:

dvName^170
dvSubject^160
dvDescription^150
dvAffiliation^140
title^130
subject^120
keyword^110
topicClassValue^100
dsDescriptionValue^90
authorName^80
authorAffiliation^70
publicationCitation^60
producerName^50
fileName^40
fileDescription^30
variableLabel^20
variableName^10
text^1.0

Yet the result is off (note that this screenshot does not contain all the fields boosted):
[Screenshot: screen shot 2018-08-08 at 4 55 33 pm]

There is more information (production examples, etc.) captured in #4836, along with information on Solr highlighting, which has been fixed.
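For readers less familiar with Solr: a boost list like the one above is typically handed to the DisMax/eDisMax query parser as a single space-separated `qf` (query fields) string. The sketch below only assembles that string from the values in this comment. The `BOOSTS` name and the `build_qf` helper are made up for illustration and are not part of Dataverse.

```python
# Boost values copied from the list in this comment (field -> boost).
BOOSTS = {
    "dvName": 170,
    "dvSubject": 160,
    "dvDescription": 150,
    "dvAffiliation": 140,
    "title": 130,
    "subject": 120,
    "keyword": 110,
    "topicClassValue": 100,
    "dsDescriptionValue": 90,
    "authorName": 80,
    "authorAffiliation": 70,
    "publicationCitation": 60,
    "producerName": 50,
    "fileName": 40,
    "fileDescription": 30,
    "variableLabel": 20,
    "variableName": 10,
    "text": 1.0,
}


def build_qf(boosts):
    """Join field^boost pairs into the space-separated form used for qf."""
    return " ".join(f"{field}^{boost}" for field, boost in boosts.items())


qf = build_qf(BOOSTS)
print(qf)
```

Keeping the boosts in one mapping makes it easier to compare the intended order against what Solr actually returns.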

@matthew-a-dunlap
Contributor Author

matthew-a-dunlap commented Sep 20, 2018

After looking at the existing search order stories and other docs (#1928, #2472), we need a new definition of our goals for search ordering. Having those in a "spec" will make it easier to configure Solr for our needs. There is some good info to start from in #1928, but it doesn't cover the breadth of the term space.

For reference, we have these fields boosted in the solrconfig.xml (I have removed the boost values because those are fairly arbitrary and need rework). I am curious whether this order in general matches what we'd want out of Solr. Are there fields in this list that shouldn't be so high? Or ones that should be prioritized? Note that this covers all fields: the ones not specifically listed fall into the catch-all text field.

dvName
dvSubject
dvDescription
dvAffiliation
title
subject
keyword
topicClassValue
dsDescriptionValue
authorName
authorAffiliation
publicationCitation
producerName
fileName
fileDescription
variableLabel
variableName
text

In the end there is going to be some "magic" in how the results are returned, but we can steer that magic with some configuration. We should be able to pin some fields to show up extremely high in the results, but we'll need to be fairly choosy in the design we want to apply.

p.s. I think part of our problem currently is that we added boost values without a real measure of the raw score values the search algorithm returns per field. Also, I think one quick win is to add a tie value that encourages results that match in multiple fields to be scored higher.
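To make the tie idea concrete, here is a small sketch of how a DisMax-style query combines per-field scores: the best field score, plus `tie` times the sum of the other matching fields' scores. The per-field numbers are invented for illustration and do not come from a real Solr response.

```python
def dismax_score(field_scores, tie=0.0):
    """Best field score plus tie * (sum of the remaining field scores)."""
    if not field_scores:
        return 0.0
    best = max(field_scores)
    return best + tie * (sum(field_scores) - best)


# Hypothetical per-field scores for two documents (made-up numbers).
one_field = [4.0]        # matches in a single field
two_fields = [4.0, 3.0]  # matches in two fields

# With tie=0 both documents score the same; only the best field counts.
print(dismax_score(one_field, tie=0.0), dismax_score(two_fields, tie=0.0))

# With a non-zero tie the document matching two fields pulls ahead.
print(dismax_score(one_field, tie=0.1), dismax_score(two_fields, tie=0.1))
```

A tie of 0 is a pure "max" across fields and a tie of 1 is a pure sum, so small values reward multi-field matches without letting them swamp the boosts.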

@mheppler @djbrooke @scolapasta @pdurbin @dlmurphy

@matthew-a-dunlap
Contributor Author

matthew-a-dunlap commented Sep 21, 2018

I realized I made a fundamental mistake in how I was testing Solr results. It looks like the place where we put the boosting in solr-4.6 stopped working when we switched to solr-7.x. So our boosting IS turned off in 4.9.2.

That being said, changes in solr-7.3 mean that the search ordering is wacky with our old boost values. So the discussions were still of value and will inform the new solr configuration.
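One way to catch this kind of silent misconfiguration is to ask Solr to echo the parameters it actually applied (`echoParams=all`) and to include scoring details (`debugQuery=true`). The sketch below is an assumption about how such a check could look, not what was done for this issue. The core URL is hypothetical, and the two dictionaries are hand-written stand-ins for parsed JSON responses, not real Solr output.

```python
from urllib.parse import urlencode


def debug_query_url(base_url, query):
    """Build a /select URL that echoes effective params and explains scores."""
    params = {"q": query, "echoParams": "all", "debugQuery": "true", "wt": "json"}
    return f"{base_url}/select?{urlencode(params)}"


def boosts_active(response):
    """Return True if the echoed parameters include a non-empty qf."""
    params = response.get("responseHeader", {}).get("params", {})
    return bool(params.get("qf"))


# Hypothetical core URL; adjust to the local installation.
print(debug_query_url("http://localhost:8983/solr/collection1", "data"))

# Stand-in responses: one where no qf was picked up, one where it was.
without_boosts = {"responseHeader": {"params": {"q": "data"}}}
with_boosts = {"responseHeader": {"params": {"q": "data", "qf": "dvName^170 text^1.0"}}}
print(boosts_active(without_boosts), boosts_active(with_boosts))
```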

@djbrooke
Contributor

Hey @matthew-a-dunlap, I know we discussed this last week, but as this heads to QA is there a short summary of the expected changes before/after with this PR? Thank you!

@pdurbin
Member

pdurbin commented Sep 24, 2018

Pull request #5080 looks fine to me so I moved it to QA.

The reason the fix works (in theory, since I haven't tested it) is tied in with the long comment I left in d3b721e when I changed the Solr request handler from "/spell" to "/select" in pull request #4520 when we upgraded from Solr 4 to Solr 7. Here's that comment:

Back in 60e640b when I was playing with spelling suggestions from Solr I
changed the request handler from "/select" (the default) to "/spell". We
didn't have time to fully explore the spelling suggestions feature of
Solr during the 4.0 rewrite and the "/spell" request handler seems to be
leading to other bugs, such as not being able to search on the
"identifier" portion of a DOI (i.e. "JNIUOA") from basic search. In
short we are switching to the default request handler for Solr,
something I would have done before tagging 4.0 if I had realized I had
left the "/spell" request handler in there.

In the new pull request, we are putting the boosting under "/select", the out-of-the-box request handler we use in Solr 7, rather than "/spell", the non-default request handler we used in the Solr 4 days. This is probably hard to follow, but I'm happy to explain more in person or write more here.
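As a rough illustration of "putting the boosting under /select": in solrconfig.xml, default query parameters usually live in a `defaults` list inside the request handler element. The Python sketch below generates such a fragment so the shape is visible. The actual contents of the pull request are not shown in this thread, so the eDisMax parser choice, the shortened `qf` string, and the `tie` value are assumptions.

```python
import xml.etree.ElementTree as ET


def select_handler_with_boosts(qf, tie):
    """Build a <requestHandler name="/select"> element with boost defaults."""
    handler = ET.Element(
        "requestHandler", {"name": "/select", "class": "solr.SearchHandler"}
    )
    defaults = ET.SubElement(handler, "lst", {"name": "defaults"})
    # Assumed defaults: parser type, boosted query fields, and a tie breaker.
    for name, value in (("defType", "edismax"), ("qf", qf), ("tie", str(tie))):
        ET.SubElement(defaults, "str", {"name": name}).text = value
    return handler


# Shortened qf and an illustrative tie; not the values from the pull request.
handler = select_handler_with_boosts("dvName^170 title^130 text^1.0", 0.1)
print(ET.tostring(handler, encoding="unicode"))
```

If the same defaults sit under a handler that the application never calls (as with "/spell" here), Solr accepts the config but the boosts never apply.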

@matthew-a-dunlap
Contributor Author

matthew-a-dunlap commented Sep 24, 2018

@djbrooke Summary: I updated the solrconfig.xml to move the boosting to the more normal location, where it'll be picked up by Solr 7. Also, because of changes to the (non-boosted) weights in Solr 7 and the need to prioritize Dataverses even more, I increased the spread of the boosting so Dataverses show at the top for most matches. I also added a tie variable, which means objects that match in multiple fields get more weight.

The upgrade needs have not changed for the next release as upgrading this is the same as upgrading highlighting. We'll need to add the new solrconfig.xml, restart and reindex.

@matthew-a-dunlap
Contributor Author

matthew-a-dunlap commented Sep 27, 2018

This was moved back into develop for two reasons:

  1. The results are still not really where we want them. Specifically:
    • Dataverses are boosted high, but probably too high still as they drown out some results.
    • Searching for multiple words does not rank results that match those words in the same order highly enough. Switching to the old boosting helped somewhat, but we may need to look into other ways of encouraging these results (for example, the slop setting).
    • The order of boosting of fields across dataverses, datasets, and files could use more thought. We've talked about boosting author higher, switching to "title, author, subj, desc", and removing attributes, among other things.
    • Harvested files show up high in results; if we can find a way to prioritize non-harvested ones, it could be helpful.
  2. The new solrconfig.xml needs to be used for docker. This'll be done after the search order work is figured out.

@matthew-a-dunlap
Contributor Author

I've taken another stab at the boosting. It is more in line with the original boosting values, but I added an extra XML option for phrase-based boosting. This means that when a user searches for two words, results containing both of those words get an extra boost, especially if the words are right next to each other.
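The comment does not say which option was added. One mechanism Solr's eDisMax parser offers for this is `pf` (phrase fields), with `ps` (phrase slop) controlling how far apart the words may be and still count as a phrase match. The sketch below shows those parameters on a request; the helper name, field boosts, and slop value are all made up for illustration.

```python
from urllib.parse import urlencode


def phrase_boost_params(query, qf, pf, slop):
    """Query params where pf boosts docs whose fields contain the whole query as a phrase."""
    return {
        "defType": "edismax",
        "q": query,
        "qf": qf,         # per-term field boosts, as before
        "pf": pf,         # extra boost when the terms appear together as a phrase
        "ps": str(slop),  # how many positions apart the terms may be
    }


# Illustrative values only; not the configuration from the pull request.
params = phrase_boost_params(
    query="survey data",
    qf="title^130 text^1.0",
    pf="title^200 text^10",
    slop=2,
)
print(urlencode(params))
```

Phrase boosts only affect ranking, since documents still have to match through `qf` first. That makes them a fairly low-risk knob to experiment with.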

@djbrooke
Contributor

djbrooke commented Oct 2, 2018

Thanks @matthew-a-dunlap, let's discuss today if possible.

@djbrooke
Contributor

djbrooke commented Oct 4, 2018

I moved this over to QA because I'm happy with the results shown to me on a test server. @mheppler is happy with it as well.

@matthew-a-dunlap - if there's anything helpful that you can add in here for QA, please do so. Thanks for your work on this issue.
