Improve search results #716

axlewin · 2025-08-14T12:56:15Z

Improves the behaviour of the search endpoint, particularly for finding book questions on sci.

Some users use the sitewide search to search for book questions by section number (e.g. "Pre-Uni Maths for Sciences A1.2", or more commonly just "A1" or "A1.2"), as contained in the question's subtitle. This does not currently return the correct question(s). The aim of these changes is primarily to make searches of this type work.

This PR adjusts some weights and reduces how generously we match on words in the content, raising the relative priority of matches on fields such as subtitles. There are now separate high-priority searches for exact whole-string matches on id/title/subtitle, in addition to the existing fuzzy matches on the tokenised search string. There's a trade-off here with slightly worse matching if searching for words/phrases in a page's content.

As part of the above, subtitle is now included in the list of raw fields for elasticsearch. This solves a problem affecting questions from the Pre-Uni Physics book where near-matches on titles are prioritised over exact matches on subtitles (try "Essential Pre-Uni Physics E1.1" before & after running ETL). This works well but changes how content is indexed; since this is only needed for results from this specific book, it may make sense to revert this change.

Switching the QF searches to use fuzzy instead of substring matches would improve matching for book questions on the QF, but since the substring match method was introduced specifically for the QF I haven't changed it.

A side-effect of these changes is that searching by full url now finds the correct content object, although this isn't a common use case.

Searches that rely on stemming (e.g. "Lagrange" to match "Lagrangian points") still don't work, but they don't in the current implementation either.

For Ada, these changes should have minimal impact since Ada's search already works well. Most Ada searches are for topics rather than specific questions, so we should make sure that keyword matching on searchable content still works as well as it did before. Similarly, non-(book-)question pages on sci should still match reliably.

There is a test script which can be run on some test cases against a local API to assert the above.

codecov · 2025-08-14T12:59:28Z

Codecov Report

❌ Patch coverage is 94.73684% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 37.74%. Comparing base (442b150) to head (d514d78).
⚠️ Report is 26 commits behind head on master.

Files with missing lines	Patch %	Lines
...am/cl/dtg/segue/dao/content/GitContentManager.java	96.00%	0 Missing and 1 partial ⚠️
.../ac/cam/cl/dtg/segue/etl/ElasticSearchIndexer.java	0.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #716      +/-   ##
==========================================
+ Coverage   36.53%   37.74%   +1.21%     
==========================================
  Files         536      536              
  Lines       23689    24407     +718     
  Branches     2857     3073     +216     
==========================================
+ Hits         8655     9213     +558     
- Misses      14175    14279     +104     
- Partials      859      915      +56


barna-isaac · 2025-10-01T13:24:35Z

This PR seems to make exactly the trade-offs @axlewin mentions. I've performed a "black-box" analysis of the scope and quality of the changes introduced, and understood the changes made to the queries. Based on this, I think we should merge this PR. Here's a summary.

Scope of changes. On Isaac Science, for last year's 30K search terms:
- for site-wide search, the first match has changed 1/3rd of the time, and the median amount of change among
  returned search results is 20%.
- for the question finder, the first match has changed 1/6th of the time, and the median amount of change among
  returned search results is 9%.
On the site-wide search, a very small percentage of searches, like cytok, supercond, orthogonal, vecti and
kirchof, no longer return any results. This is because, on the site-wide search only, wildcard queries are no
longer performed on content.
Test results. For Isaac Science, Meurig's original test set performs a little worse and Alex's new test set (which contains mostly the kind of searches this PR has been looking to improve) performs significantly better. Conceiving of our test cases as labelled examples, we see that
- for site-wide search, Meurig's test suite degrades overall. A degradation over 1 rank is observed in just 25% of
  cases, but there are 2 outliers where it's very significant. On the other hand, Alex's new tests show a marked
  improvement. 25% of cases are improved by over 3 ranks, and there are 7 cases where the improvement is very
  significant.
- for the question finder, Meurig's original test suite yields identical results. There are just two questions
  where the rank degrades very slightly. For Alex's new data set, the rank is improved somewhat. Although an
  improvement over 1 is observed in just 25% of cases, there are 5 outliers where the improvement is very
  significant.
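
The rank comparison above treats each test case as a labelled example: a search term plus the id of the page that should come back. A minimal sketch of that measurement follows. The names `RankDelta`, `rankOf` and `delta`, the miss rank of 11 and the result ids are all assumptions made for illustration; this is not the script used for the analysis.

```java
import java.util.List;

final class RankDelta {
    /** 1-based rank of expectedId in results, or missRank when it is absent. */
    static int rankOf(List<String> results, String expectedId, int missRank) {
        int index = results.indexOf(expectedId);
        return index < 0 ? missRank : index + 1;
    }

    /** Positive values mean the expected page moved up after the change. */
    static int delta(List<String> before, List<String> after, String expectedId, int missRank) {
        return rankOf(before, expectedId, missRank) - rankOf(after, expectedId, missRank);
    }

    public static void main(String[] args) {
        // Hypothetical result ids, for illustration only.
        List<String> before = List.of("page_a", "page_b", "page_c");
        List<String> after = List.of("page_c", "page_a", "page_b");
        // The expected page moves from rank 3 to rank 1, so this prints 2.
        System.out.println(delta(before, after, "page_c", 11));
    }
}
```

Collecting one such delta per labelled example gives the distribution from which figures like "25% of cases" and the outlier counts can be read off.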
Looking at the specific changes:
- we see that performance degrades for general queries like sin, sine or Block, but the new matches are as relevant as what we've seen before
- we see that performance improves the most for searches based on a reference to a book section. For searches like Essential Pre-Uni Physics L8.5, Essential Pre-Uni Physics B2.3 or Stpe into Phsyics: Curent and Cicruit Pratcice 7, the old search didn't return any results, and the new one does.
Performance: Site-wide search is now 2 to 4x faster. This is very likely due to the fact that site-wide search no longer performs wildcard queries on the content.

How this PR achieved this

for the site-wide search:
- id is now matched using the default strategy (a fuzzy and a non-fuzzy match query), and its weight has changed from 2.0 to 10.0 and 8.0.
- weights for title, subtitle, summary, tags have changed from 2.0 to 5.0
- weights for prioritizedSearchableContent, searchableContent have remained at 2.0. We previously performed both match and wildCard searches on these fields, but we now use match only (only matches entire words, allowing for a few characters of distance)
- exact matches for id, title and subtitle have been introduced with a weight of 10.
for the question finder:
- weights for title, subtitle, summary, tags have changed from 10 to 5. Weights for prioritizedSearchableContent, searchableContent have changed from 10 and 5 to 1.
- exact matches for id, title and subtitle have been introduced with a weight of 10.
to support exact matches on the "subtitle" field, a raw version of that field is now made available (indexed)

Evaluating this strategy

As the above analysis has shown, abandoning wildcard content matches for the site-wide search means that it is now more difficult to search for content. Because we still present word-based matches, only word-part searches are now impossible. A user who remembers a sentence fragment from some content can still find the question using that. The only use case I can think of that is now gone is where a user was trying to make up for the absence of stemming, e.g. by searching "Lagrang" to get matches for both "Lagrange" and "Lagrangian". Still, this only affects the site-wide search. The question finder still performs wildcard matches.
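
To make the difference between word-part and whole-word matching concrete, here is a rough sketch. It is not the Elasticsearch implementation: plain Levenshtein distance stands in for Elasticsearch's fuzziness, the limit of 2 edits is an assumption, and the class and method names (`ContentMatchSketch`, `wildcardMatch`, `fuzzyWordMatch`) are made up for illustration.

```java
import java.util.Arrays;

final class ContentMatchSketch {
    /** Classic Levenshtein edit distance between two strings. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            int[] curr = new int[b.length() + 1];
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            prev = curr;
        }
        return prev[b.length()];
    }

    /** Word-part (wildcard-style) match: the term may appear inside any word. */
    static boolean wildcardMatch(String term, String content) {
        String needle = term.toLowerCase();
        return Arrays.stream(content.toLowerCase().split("\\W+"))
                .anyMatch(word -> word.contains(needle));
    }

    /** Whole-word match: some word must be within maxEdits of the term. */
    static boolean fuzzyWordMatch(String term, String content, int maxEdits) {
        String needle = term.toLowerCase();
        return Arrays.stream(content.toLowerCase().split("\\W+"))
                .anyMatch(word -> editDistance(needle, word) <= maxEdits);
    }

    public static void main(String[] args) {
        String content = "Superconductivity in metals";
        System.out.println(wildcardMatch("supercond", content));
        System.out.println(fuzzyWordMatch("supercond", content, 2));
    }
}
```

Under these assumptions "supercond" is found by the word-part match but not by the whole-word match, because it is many edits away from "superconductivity". How close a partial word has to be in practice depends on the fuzziness actually configured in the query.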
Although the PR has significantly improved matches for book questions using section number, users will need to be very exact. This is because part of the boost these receive comes from exact matches on subtitle. Unfortunately, even casing differences in the subtitle break this (so Essential Pre-Uni Physics L8.5 is found with a rank of 3, but Essential Pre-Uni physics L8.5 is not found at all).
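
The casing problem can be shown with a plain string comparison. The sketch below is only an illustration of why an exact match on an un-analysed field is case-sensitive, plus one hypothetical way around it (normalising both sides before comparing). The names `ExactSubtitleMatchSketch`, `exactMatch` and `normalisedMatch` are assumptions, and nothing here is taken from the PR's code.

```java
import java.util.Locale;

final class ExactSubtitleMatchSketch {
    /** Exact whole-string comparison, as on an un-analysed (raw) field. */
    static boolean exactMatch(String rawSubtitle, String query) {
        return rawSubtitle.equals(query);
    }

    /** Hypothetical alternative: trim and lowercase both sides before comparing. */
    static boolean normalisedMatch(String rawSubtitle, String query) {
        return normalise(rawSubtitle).equals(normalise(query));
    }

    private static String normalise(String s) {
        return s.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String subtitle = "Essential Pre-Uni Physics L8.5";
        String query = "Essential Pre-Uni physics L8.5";
        System.out.println(exactMatch(subtitle, query));      // prints false
        System.out.println(normalisedMatch(subtitle, query)); // prints true
    }
}
```

Whether a normalised comparison is worth adding to the index is a separate decision, since it would change how the raw field is indexed and would need another ETL run.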

Future improvements, thoughts

It feels like there's a limit to the improvements we can achieve by adjusting the relative weights of the question fields: some users will be expecting to find a question by title, some by a piece of content they remember, and some by its section in the book. Users would be able to perform better searches if we let them specify the fields they'd like to include.
Questions can be found through both the site-wide search and the question finder. I think this is fine, but it's a little confusing that even just for questions, the site-wide search works differently from the question finder. Looking at the results above, it seems a user who's trying to find a question by book section is better off using the site wide search (the test set containing book section names performs better on the site wide search) -- this is counter-intuitive, and I think most users would expect the question finder to be the better tool for finding questions. I think it'd be better if the site-wide search and question finder used the same logic for finding questions.
I didn't find the test case runner's output too useful (it just told me that a lot of cases have failed both before and after the changes), but the fact that the test cases can be used as labelled examples for the search has been a huge help. @axlewin has introduced new labelled examples for this PR, which I thought was very helpful. I hope we keep expanding this dataset.

Ada

The above observations have been about the Physics site. I've performed the same analysis for Ada as well. For that site, we didn't have a set of pre-existing test cases, so I only relied on last year's search terms and the Ada-version of Alex's dataset.

- the scope of changes is similar to what we've observed for Physics, but a few percentage points lower. This is likely explained by the fact that Ada makes use of the subtitle field a lot less frequently (there are only ~40 distinct values for subtitle in the entire ada content repo)
- rank changes in Alex's test suite are significantly lower, as they didn't have any "book section" style searches to add

Examining the search terms that produce the largest change, we see that they're either very generic (eg: yx, con, ye), rely on wildcard matches in the id (webtech), or wildcard matches in the content (asynchro, monoalp).

barna-isaac · 2025-10-01T08:47:01Z

src/main/java/uk/ac/cam/cl/dtg/segue/search/IsaacSearchInstructionBuilder.java

-                        multiMatchSearchesGroupedByTerm.putIfAbsent(term, Sets.newHashSet());
-                        multiMatchSearchesGroupedByTerm.get(term).add(searchInField.getField());
-
+                        if (!isSearchableContentField) {


This changes the Fuzzy strategy so it generates a single, fuzzy match query for content fields, and a match, a wildcard and a multimatch query for any other fields. Before this change, using a strategy on any field always resulted in the same query, which I think was clearer. Rather than hard-coding field-specific behavior in an existing strategy, I suggest applying a different strategy to content fields. This could be a new SimpleFuzzy strategy, or some flag that modifies the behaviour of the Simple strategy so it generates a fuzzy match query.
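
As a rough illustration of that suggestion, the sketch below keeps each strategy field-agnostic and moves the content-field decision to the caller. Everything in it is hypothetical: the `Strategy` enum values, the `queryKinds` and `strategyFor` helpers and the string labels for query kinds are invented for this example and do not reflect the actual `IsaacSearchInstructionBuilder` API. Only the field names and the query kinds described above come from this PR.

```java
import java.util.List;

final class StrategySketch {
    /** Hypothetical strategies; SIMPLE_FUZZY is the new one proposed in the comment. */
    enum Strategy { FUZZY, SIMPLE_FUZZY }

    /** Each strategy always yields the same kinds of query, whatever the field. */
    static List<String> queryKinds(Strategy strategy) {
        switch (strategy) {
            case FUZZY:
                return List.of("match", "wildcard", "multiMatch");
            case SIMPLE_FUZZY:
                return List.of("fuzzyMatch");
            default:
                throw new IllegalArgumentException("Unknown strategy: " + strategy);
        }
    }

    /** The caller, not the strategy, decides that content fields get the simpler strategy. */
    static Strategy strategyFor(String field) {
        boolean isSearchableContentField =
                field.equals("searchableContent") || field.equals("prioritizedSearchableContent");
        return isSearchableContentField ? Strategy.SIMPLE_FUZZY : Strategy.FUZZY;
    }

    public static void main(String[] args) {
        for (String field : List.of("title", "searchableContent")) {
            System.out.println(field + " -> " + queryKinds(strategyFor(field)));
        }
    }
}
```

The point of the split is that reading `queryKinds` alone tells you what a strategy does, without needing to know which field it is applied to.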

For reference, these are the 4 strategies we currently have:

@axlewin, I'm happy to merge this branch, just please tell me if you plan to make any changes in response to this comment.

barna-isaac · 2025-10-01T16:33:06Z

A note about deploying: until the first content push, there will be no subtitle.raw field. Queries will still succeed, and find any matches based on other fields, but we should initiate ETL (for Ada and Physics both) so this field gets added.

axlewin · 2025-10-02T08:49:12Z

I agree that the fuzzy strategy ought to behave the same for every field. What do you think of aebd121? This effectively just uses isSearchableContentField as its own strategy, rather than needing a Strategy.SEARCHABLE_CONTENT which we'd only use in these cases anyway. I can see it might be neater to add this to the Strategy enum though, happy to change it if you prefer.

(This approach does introduce one slight behavioural change: searchable content in the QF is now treated the same way as it is in site search. Since your analysis showed the site search was performing better anyway though, I'm not too worried about that.)

…ng strategy Restores original QF behaviour

barna-isaac · 2025-10-02T09:39:52Z

We've agreed with @axlewin to use searchableContentField as its own strategy but keep the option to override it, and to leave things unchanged on the QuestionFinder (so we don't need to re-evaluate performance).

…y existing strategy" This reverts commit e6cf256.

barna-isaac · 2025-10-03T09:43:29Z

After reviewing the changes, we've decided to backport the content field changes to the question finder as well.
Here's how this changes the scope:

And here's how the test set performs with the content field related changes backported:

Performance is similar on the test sets. We decided to backport these changes because this way the Question Finder and Site Wide Search at least use the same strategy for content-related searches.

Improve search for book questions by section number

This commit significantly improves search relevance, particularly for finding book questions by section number (e.g., "A1.2") stored in the subtitle. It introduces high-priority, exact-string matches for id, title, and subtitle and increases their respective search weights. To reduce noise and improve performance, wildcard matching on general page content has been removed in favor of stricter, whole-word matching. This results in a 2-4x performance increase and much more accurate results for targeted queries. The trade-off is that partial-word searches within content (e.g., "supercond" for "superconductivity") will no longer return matches.

In a few months, we should review that everything is fine, as part of this ticket: https://trello.com/c/26WQGXnn/5782-review-search-improvements.

barna-isaac · 2025-10-03T10:25:27Z

Originally merged this to master by accident, but it's now merged on main. (And @jsharkey13 has force-pushed master, so that merge is now gone.)

axlewin added 3 commits August 13, 2025 12:15

Don't use wildcard instructions for searchable content

45904fb

Require stricter matching for searching by id

25c0314

Restore slight boost to searchable content

e210166

axlewin added 6 commits August 15, 2025 17:26

Prioritise exact matches on id/title/subtitle

b810ce6

Prioritise exact matches for question finder search

93ce11e

Add null check for QF search string

06180ad

Change access modifier

2b3da1c

Fix indentation

39780ad

Fix indentation

2891e46

axlewin marked this pull request as ready for review August 19, 2025 15:34

barna-isaac approved these changes Oct 1, 2025


Separate searchable content logic out from fuzzy strategy

aebd121

Allow isSearchableContent field strategy to be overridden by existi…

e6cf256

…ng strategy Restores original QF behaviour

Revert "Allow isSearchableContent field strategy to be overridden b…

d514d78

…y existing strategy" This reverts commit e6cf256.

barna-isaac merged commit 0996aab into master Oct 3, 2025
5 checks passed

barna-isaac deleted the hotfix/improve-search-results branch October 3, 2025 10:08

barna-isaac restored the hotfix/improve-search-results branch October 3, 2025 10:10

barna-isaac mentioned this pull request Oct 3, 2025

Revert "Improve search results" #726

Merged

barna-isaac deleted the hotfix/improve-search-results branch October 3, 2025 10:20
