-
Notifications
You must be signed in to change notification settings - Fork 7
Improve search results #716
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #716 +/- ##
==========================================
+ Coverage 36.53% 37.74% +1.21%
==========================================
Files 536 536
Lines 23689 24407 +718
Branches 2857 3073 +216
==========================================
+ Hits 8655 9213 +558
- Misses 14175 14279 +104
- Partials 859 915 +56 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
This PR seems to make exactly the trade-offs @axlewin mentions. I've performed a "black-box" analysis of the scope and quality of the changes introduced, and understood the changes made to the queries. Based on this, I think we should merge this PR. Here's a summary.
How this PR achieved this
Evaluating this strategy
Future improvements, thoughts
AdaThe above observations have been about the Physics site. I've performed the same analysis for Ada as well. For that site, we didn't have a set of pre-existing test cases, so I only relied on last year's search terms and the Ada-version of Alex's dataset.
Examining the search terms that produce the largest change, we see that they're either very generic (eg: |
multiMatchSearchesGroupedByTerm.putIfAbsent(term, Sets.newHashSet()); | ||
multiMatchSearchesGroupedByTerm.get(term).add(searchInField.getField()); | ||
|
||
if (!isSearchableContentField) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This changes the Fuzzy
strategy so it generates a single, fuzzy match
query for content fields, and a match
, a wildcard
and a multimatch
query for any other fields. Before this change, using a strategy on any field always resulted in the same query, which I think was clearer. Rather than hard-coding field-specific behavior in an existing strategy, I suggest applying a different strategy to content fields. This could be a new SimpleFuzzy
strategy, or some flag that modifies the behaviour of the Simple
strategy so it generates a fuzzy match
query.
For reference, these are the 4 strategies we currently have:

There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@axlewin , I'm happy to merge this branch, just please tell me if you plan to make any changes in response to this comment
A note about deploying: until the first content push, there will be no |
I agree that the fuzzy strategy ought to behave the same for every field. What do you think of aebd121? This effectively just uses (This approach does introduce one slight behavioural change: searchable content in the QF is now treated the same way as it is in site search. Since your analysis showed the site search was performing better anyway though, I'm not too worried about that.) |
…ng strategy Restores original QF behaviour
we've agreed with @axlewin to use |
…y existing strategy" This reverts commit e6cf256.
Improve search for book questions by section number This commit significantly improves search relevance, particularly for finding book questions by section number (e.g., "A1.2") stored in the subtitle. It introduces high-priority, exact-string matches for id, title, and subtitle and increases their respective search weights. To reduce noise and improve performance, wildcard matching on general page content has been removed in favor of stricter, whole-word matching. This results in a 2-4x performance increase and much more accurate results for targeted queries. The trade-off is that partial-word searches within content (e.g., "supercond" for "superconductivity") will no longer return matches. In a few months, we should review that everything is fine, as part of this ticket: https://trello.com/c/26WQGXnn/5782-review-search-improvements.
Originally merged this to |
Improves the behaviour of the search endpoint, particularly for finding book questions on sci.
Some users use the sitewide search to search for book questions by section number (e.g. "Pre-Uni Maths for Sciences A1.2", or more commonly just "A1" or "A1.2"), as contained in the question's subtitle. This does not currently return the correct question(s). The aim of these changes is primarily to make searches of this type work.
This PR adjusts some weights and reduces how generously we match on words in the content, raising the relative priority of matches on fields such as subtitles. There are now separate high-priority searches for exact whole-string matches on id/title/subtitle, in addition to the existing fuzzy matches on the tokenised search string. There's a trade-off here with slightly worse matching if searching for words/phrases in a page's content.
As part of the above, subtitle is now included in the list of raw fields for elasticsearch. This solves a problem affecting questions from the Pre-Uni Physics book where near-matches on titles are prioritised over exact matches on subtitles (try "Essential Pre-Uni Physics E1.1" before & after running ETL). This works well but changes how content is indexed; since this is only needed for results from this specific book, it may make sense to revert this change.
Switching the QF searches to use fuzzy instead of substring matches would improve matching for book questions on the QF, but since the substring match method was introduced specifically for the QF I haven't changed it.
A side-effect of these changes is that searching by full url now finds the correct content object, although this isn't a common use case.
Searches that rely on stemming (e.g. "Lagrange" to match "Lagrangian points") still don't work, but they don't in the current implementation either.
For Ada, these changes should have minimal impact since Ada's search already works well. Most Ada searches are for topics rather than specific questions, so we should make sure that keyword matching on searchable content still works as well as it did before here. Similarly, non-(book-)question pages on sci should still match reliably.
There is a test script which can be run on some test cases against a local API to assert the above.