Solr search result ordering broken #4938

matthew-a-dunlap · 2018-08-08T21:31:50Z

After upgrading Solr, the configuration we used to boost search matches on certain Dataverse fields is broken.

Based upon our configuration, the boosting should be in this order:

dvName^170
dvSubject^160
dvDescription^150
dvAffiliation^140
title^130
subject^120
keyword^110
topicClassValue^100
dsDescriptionValue^90
authorName^80
authorAffiliation^70
publicationCitation^60
producerName^50
fileName^40
fileDescription^30
variableLabel^20
variableName^10
text^1.0
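For readers unfamiliar with the `field^boost` notation: lists like the one above are normally handed to Solr's DisMax/eDisMax query parser as a single space-separated `qf` (query fields) value. The sketch below is only an illustration of that format. The field names and weights are copied from this issue; the helper name `build_qf` is hypothetical, and the issue does not show how the real solrconfig.xml expresses this list.

```python
# Field boosts exactly as listed in this issue (higher = should rank higher).
FIELD_BOOSTS = {
    "dvName": 170,
    "dvSubject": 160,
    "dvDescription": 150,
    "dvAffiliation": 140,
    "title": 130,
    "subject": 120,
    "keyword": 110,
    "topicClassValue": 100,
    "dsDescriptionValue": 90,
    "authorName": 80,
    "authorAffiliation": 70,
    "publicationCitation": 60,
    "producerName": 50,
    "fileName": 40,
    "fileDescription": 30,
    "variableLabel": 20,
    "variableName": 10,
    "text": 1.0,
}


def build_qf(boosts):
    """Hypothetical helper: join field^boost pairs into one qf string."""
    return " ".join(f"{field}^{boost}" for field, boost in boosts.items())


qf = build_qf(FIELD_BOOSTS)

# Fields sorted by boost, i.e. the ordering the configuration is meant to favor.
ranked = sorted(FIELD_BOOSTS, key=FIELD_BOOSTS.get, reverse=True)

print(qf)
print(ranked[:3])
```

Note that a boost only multiplies a field's raw relevance score, so a higher boost makes a field more likely to win but does not guarantee a strict ordering.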

Yet the result is off (note that this screenshot does not contain all of the boosted fields).

There is more information (production examples, etc.) captured in #4836, along with information on Solr highlighting, which has been fixed.


matthew-a-dunlap · 2018-09-20T18:11:05Z

After looking at the existing search order stories and other docs (#1928, #2472), we need a new definition of our goals for search ordering. Having those in a "spec" will make it easier to configure Solr for our needs. There is some good info to start from in #1928, but it doesn't cover the breadth of the term space.

For reference, we have these fields boosted in the solrconfig.xml (I have removed the boost values because those are fairly arbitrary and need rework). I am curious whether this order in general matches what we'd want out of Solr. Are there fields in this list that shouldn't be so high? Or ones that should be prioritized? Note that this covers all fields; the ones not specifically listed are in the catch-all text field.

dvName
dvSubject
dvDescription
dvAffiliation
title
subject
keyword
topicClassValue
dsDescriptionValue
authorName
authorAffiliation
publicationCitation
producerName
fileName
fileDescription
variableLabel
variableName
text

In the end there is going to be some "magic" in how the results are returned, but we can steer that magic with some configuration. We should be able to pin some fields to show up extremely high in the results, but we'll need to be fairly choosy in the design we want to apply.

P.S. I think part of our current problem is that we added boost values without a real measure of the raw score values the search algorithm returns per field. Also, I think one quick win is to add a tie value that encourages results matching in multiple fields to be scored higher.
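To make the tie idea concrete, here is a minimal sketch of how a DisMax-style `tie` parameter combines per-field scores: the best-matching field's score is taken in full, and every other matching field contributes its score multiplied by `tie`. The per-field scores and the tie value of 0.1 below are invented for illustration; they are not measurements from our index or values from the solrconfig.xml.

```python
def dismax_score(field_scores, tie=0.0):
    """Combine per-field scores: best field + tie * (sum of the other fields)."""
    if not field_scores:
        return 0.0
    best = max(field_scores)
    return best + tie * (sum(field_scores) - best)


# Hypothetical (already boosted) scores for two documents.
one_field_match = [5.0]              # matches in a single field only
three_field_match = [5.0, 3.0, 2.0]  # same best field, plus two weaker matches

# With tie=0 only the best field counts, so the two documents are tied.
no_tie = (dismax_score(one_field_match), dismax_score(three_field_match))

# With a small tie, the multi-field match pulls ahead.
with_tie = (
    dismax_score(one_field_match, tie=0.1),
    dismax_score(three_field_match, tie=0.1),
)

print(no_tie)
print(with_tie)
```

A tie of 0 is pure "best field wins" and a tie of 1 sums all fields, so a small value in between rewards multi-field matches without letting many weak matches outrank one strong one.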

@mheppler @djbrooke @scolapasta @pdurbin @dlmurphy

matthew-a-dunlap · 2018-09-21T00:07:22Z

I realized I made a fundamental mistake in how I was testing solr results. Looks like where we put the boosting in solr-4.6 did not work when we switched to solr-7.x. So our boosting IS turned off in 4.9.2.

That being said, changes in solr-7.3 mean that the search ordering is wacky with our old boost values. So the discussions were still of value and will inform the new solr configuration.

djbrooke · 2018-09-24T15:07:45Z

Hey @matthew-a-dunlap, I know we discussed this last week, but as this heads to QA is there a short summary of the expected changes before/after with this PR? Thank you!

pdurbin · 2018-09-24T15:12:59Z

Pull request #5080 looks fine to me so I moved it to QA.

The reason the fix works (in theory, since I haven't tested it) is tied in with the long comment I left in d3b721e when I changed the Solr request handler from "/spell" to "/select" in pull request #4520 when we upgraded from Solr 4 to Solr 7. Here's that comment:

Back in 60e640b when I was playing with spelling suggestions from Solr I
changed the request handler from "/select" (the default) to "/spell". We
didn't have time to fully explore the spelling suggestions feature of
Solr during the 4.0 rewrite and the "/spell" request handler seems to be
leading to other bugs, such as not being able to search on the
"identifier" portion of a DOI (i.e. "JNIUOA") from basic search. In
short we are switching to the default request handler for Solr,
something I would have done before tagging 4.0 if I had realized I had
left the "/spell" request handler in there.

In the new pull request, we are putting the boosting under "/select", the out-of-the-box request handler we use in Solr 7, rather than "/spell", the non-default request handler we used in the Solr 4 days. This is probably hard to follow, but I'm happy to explain more in person or write more here.

matthew-a-dunlap · 2018-09-24T15:17:37Z

@djbrooke The summary is that I updated the solrconfig.xml to move the boosting to the more normal location where it'll be picked up by Solr 7. Also, due to changes in the (non-boosted) weights in Solr 7 and the need to prioritize Dataverses even more, I increased the spread of the boosting so Dataverses show at the top for most matches. I also added a tie variable, which means objects that match in multiple fields will get more weight.

The upgrade needs have not changed for the next release as upgrading this is the same as upgrading highlighting. We'll need to add the new solrconfig.xml, restart and reindex.

matthew-a-dunlap · 2018-09-27T20:54:49Z

This was moved back into develop for two reasons:

1. The results are still not really where we want them. Specifically:
   - Dataverses are boosted high, but probably still too high, as they drown out some results.
   - Searching for multiple words does not rank results that match those words in the same order highly enough. Switching to the old boosting helped somewhat, but we may need to look into other ways of encouraging these results (for example, the slop setting).
   - The order of boosting of fields across dataverses, datasets, and files could use more thought. We've talked about boosting author higher, switching to "title, author, subj, desc", and removing attributes, among other things.
   - Harvested files show up highly in results; if we can find a way to prioritize non-harvested results, it could be helpful.
2. The new solrconfig.xml needs to be used for docker. This'll be done after the search order work is figured out.

matthew-a-dunlap · 2018-10-01T23:29:51Z

I've taken another stab at the boosting. It is more in line with the original boosting values, but I added an extra XML option for phrase-based boosting. This means that when a user searches for two words, results containing both of those words will get an extra boost, especially if the words are right next to each other.
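As a rough illustration of phrase-based boosting, the eDisMax parser accepts a `pf` (phrase fields) parameter that adds score when the query terms appear close together in a field, and a `ps` (phrase slop) parameter that controls how far apart they may be. The sketch below passes these as request parameters purely for illustration; the actual change lives in solrconfig.xml, and the boost values, slop, tie, and helper name here are assumptions, not the values that were committed.

```python
from urllib.parse import urlencode, parse_qs


def build_search_params(user_query, qf, pf, tie=0.1, slop=2):
    """Hypothetical helper: assemble eDisMax parameters for one search."""
    return {
        "q": user_query,
        "defType": "edismax",  # use the extended DisMax query parser
        "qf": qf,              # per-field boosts for individual term matches
        "pf": pf,              # extra boost when the terms occur as a phrase
        "ps": slop,            # how many positions apart phrase terms may be
        "tie": tie,            # credit for matches outside the best field
    }


params = build_search_params(
    "social science",
    qf="title^130 text^1.0",  # shortened field list, for readability only
    pf="title^260",           # assumed value, chosen only for the example
)
query_string = urlencode(params)

print(query_string)
```

With settings like these, a dataset titled "Social Science Survey" would get both the per-term `qf` score and the additional `pf` phrase score, while one that mentions "science" and "social" far apart would get only the former.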

djbrooke · 2018-10-02T14:15:01Z

Thanks @matthew-a-dunlap, let's discuss today if possible.

djbrooke · 2018-10-04T20:43:29Z

I moved this over to QA because I'm happy with the results shown to me on a test server. @mheppler is happy with it as well.

@matthew-a-dunlap - if there's anything helpful that you can add in here for QA, please do so. Thanks for your work on this issue.

matthew-a-dunlap mentioned this issue Aug 8, 2018

Solr Container Scaling #4762

Closed

djbrooke added the Status: Backlog label Aug 15, 2018

djbrooke assigned matthew-a-dunlap Aug 15, 2018

djbrooke added the ready for estimation label Aug 15, 2018

djbrooke unassigned matthew-a-dunlap Aug 15, 2018

djbrooke removed the ready for estimation label Aug 22, 2018

djbrooke added Status: This/Next Sprint and removed Status: Backlog labels Sep 12, 2018

matthew-a-dunlap self-assigned this Sep 17, 2018

matthew-a-dunlap added Status: Development and removed Status: This/Next Sprint labels Sep 17, 2018

matthew-a-dunlap added a commit that referenced this issue Sep 21, 2018

Move boost from spell to select #4938

7b3fce2

matthew-a-dunlap added a commit that referenced this issue Sep 21, 2018

New boost larger spread #4938

f5f5dac

matthew-a-dunlap added a commit that referenced this issue Sep 21, 2018

Additional boost documentation #4938

4a60546

matthew-a-dunlap added Status: Code Review and removed Status: Development labels Sep 21, 2018

matthew-a-dunlap mentioned this issue Sep 21, 2018

4938 solr search order #5080

Merged

5 tasks

djbrooke unassigned matthew-a-dunlap Sep 24, 2018

pdurbin added Status: QA and removed Status: Code Review labels Sep 24, 2018

kcondon self-assigned this Sep 24, 2018

matthew-a-dunlap added Status: Development and removed Status: QA labels Sep 27, 2018

matthew-a-dunlap unassigned kcondon Sep 27, 2018

djbrooke assigned djbrooke and matthew-a-dunlap Sep 28, 2018

matthew-a-dunlap added a commit that referenced this issue Oct 1, 2018

Reboosted with extra boost for phrase matching #4938

d7eef18

matthew-a-dunlap added a commit that referenced this issue Oct 3, 2018

Boost non-harvested #4938

81fd45f

matthew-a-dunlap added a commit that referenced this issue Oct 3, 2018

More boost tweaks #4938

f6252d5

matthew-a-dunlap added a commit that referenced this issue Oct 3, 2018

Cleanup, update docker solrconfig.xml #4938

f7cf14e

matthew-a-dunlap added Status: Code Review and removed Status: Development labels Oct 3, 2018

matthew-a-dunlap removed their assignment Oct 3, 2018

djbrooke added Status: QA and removed Status: Code Review labels Oct 4, 2018

djbrooke removed their assignment Oct 4, 2018

kcondon self-assigned this Oct 4, 2018

kcondon closed this as completed Oct 4, 2018

kcondon removed the Status: QA label Oct 4, 2018

matthew-a-dunlap mentioned this issue Oct 10, 2018

Basic search yields no results from non-boosted fields #5153

Closed

djbrooke added this to the 4.10 - Additional Data Transfer Options milestone Dec 11, 2018

djbrooke mentioned this issue Feb 10, 2020

6633 update solr 772 #6631

Merged
