@axlewin axlewin commented Aug 14, 2025

Improves the behaviour of the search endpoint, particularly for finding book questions on sci.

Some users use the sitewide search to search for book questions by section number (e.g. "Pre-Uni Maths for Sciences A1.2", or more commonly just "A1" or "A1.2"), as contained in the question's subtitle. This does not currently return the correct question(s). The aim of these changes is primarily to make searches of this type work.

This PR adjusts some weights and reduces how generously we match on words in the content, raising the relative priority of matches on fields such as subtitles. There are now separate high-priority searches for exact whole-string matches on id/title/subtitle, in addition to the existing fuzzy matches on the tokenised search string. There's a trade-off here with slightly worse matching if searching for words/phrases in a page's content.
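As a rough illustration of the two kinds of clause described above, the sketch below assembles an Elasticsearch bool query body as a JSON string: a high-boost exact `term` clause on an unanalysed raw subfield, alongside a lower-boost fuzzy `match` clause on the analysed field. The class and method names are hypothetical, the `.raw` subfield name and the boosts of 10 and 5 are borrowed from elsewhere in this thread, and this is not the code in the PR.

```java
final class ExactPlusFuzzyQuerySketch {

    // Minimal escaping (backslash and double quote only) so the search string
    // can sit inside a JSON string literal. Control characters are not handled.
    static String jsonEscape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Builds a bool/should query: either clause may match, and the scores of
    // the clauses that do match are added together.
    static String buildQuery(String field, String searchString) {
        String q = jsonEscape(searchString);
        // Exact whole-string match against the unanalysed ".raw" subfield (assumed name).
        String exact = "{\"term\":{\"" + field + ".raw\":{\"value\":\"" + q + "\",\"boost\":10.0}}}";
        // Fuzzy match against the analysed field; Elasticsearch tokenises the search string.
        String fuzzy = "{\"match\":{\"" + field + "\":{\"query\":\"" + q
                + "\",\"fuzziness\":\"AUTO\",\"boost\":5.0}}}";
        return "{\"query\":{\"bool\":{\"should\":[" + exact + "," + fuzzy + "]}}}";
    }

    public static void main(String[] args) {
        System.out.println(buildQuery("subtitle", "Essential Pre-Uni Physics E1.1"));
    }
}
```

Because the `term` clause compares the whole string against the stored value, it only fires when the user types the subtitle exactly, which is what lets it carry a much higher boost than the fuzzy clause.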

As part of the above, subtitle is now included in the list of raw fields for Elasticsearch. This solves a problem affecting questions from the Pre-Uni Physics book where near-matches on titles are prioritised over exact matches on subtitles (try "Essential Pre-Uni Physics E1.1" before & after running ETL). This works well but changes how content is indexed; since this is only needed for results from this specific book, it may make sense to revert this change.

Switching the QF searches to use fuzzy instead of substring matches would improve matching for book questions on the QF, but since the substring match method was introduced specifically for the QF I haven't changed it.

A side-effect of these changes is that searching by full url now finds the correct content object, although this isn't a common use case.

Searches that rely on stemming (e.g. "Lagrange" to match "Lagrangian points") still don't work, but they don't in the current implementation either.

For Ada, these changes should have minimal impact since Ada's search already works well. Most Ada searches are for topics rather than specific questions, so we should make sure that keyword matching on searchable content still works as well as it did before. Similarly, non-(book-)question pages on sci should still match reliably.

There is a test script which can be run on some test cases against a local API to assert the above.

codecov bot commented Aug 14, 2025

Codecov Report

❌ Patch coverage is 94.73684% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 37.74%. Comparing base (442b150) to head (d514d78).
⚠️ Report is 26 commits behind head on master.

Files with missing lines | Patch % | Lines
...am/cl/dtg/segue/dao/content/GitContentManager.java | 96.00% | 0 Missing and 1 partial ⚠️
.../ac/cam/cl/dtg/segue/etl/ElasticSearchIndexer.java | 0.00% | 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #716      +/-   ##
==========================================
+ Coverage   36.53%   37.74%   +1.21%     
==========================================
  Files         536      536              
  Lines       23689    24407     +718     
  Branches     2857     3073     +216     
==========================================
+ Hits         8655     9213     +558     
- Misses      14175    14279     +104     
- Partials      859      915      +56     


@axlewin axlewin marked this pull request as ready for review August 19, 2025 15:34
barna-isaac commented Oct 1, 2025

This PR seems to make exactly the trade-offs @axlewin mentions. I've performed a "black-box" analysis of the scope and quality of the changes introduced, and understood the changes made to the queries. Based on this, I think we should merge this PR. Here's a summary.

  • Scope of changes. On Isaac Science, for last year's 30K search terms:

    • for site-wide search, the first match has changed 1/3rd of the time, and the median amount of change among
      returned search results is 20%.
    • for the question finder, the first match has changed 1/6th of the time, and the median amount of change among
      returned search results is 9%.

    On the site-wide search, a very small percentage of searches, such as cytok, supercond, orthogonal, vecti and
    kirchof, no longer return any results. This is because, on the site-wide search only, wildcard queries are no
    longer performed on content.

    Screenshot 2025-10-01 at 13 40 05
  • Test results. For Isaac Science, Meurig's original test set performs a little worse and Alex's new test set (which contains mostly the kind of searches this PR has been looking to improve) performs significantly better. Conceiving of our test cases as labelled examples, we see that

    • for site-wide search, Meurig's test suite degrades overall. A degradation of more than 1 rank is observed in just 25% of
      cases, but there are 2 outliers where it's very significant. On the other hand, Alex's new tests show a marked
      improvement: 25% of cases are improved by more than 3 ranks, and there are 7 cases where the improvement is very
      significant.
    • for the question finder, Meurig's original test suite yields identical results. There are just two questions
      where the rank degrades very slightly. For Alex's new data set, the rank is improved somewhat. Although an
      improvement over 1 is observed in just 25% of cases, there are 5 outliers where the improvement is very
      significant.
    Screenshot 2025-10-01 at 13 52 52

    Looking at the specific changes:

    • we see that performance degrades for general queries like sin, sine or Block, but the new matches are as relevant as the ones we saw before
    Screenshot 2025-10-01 at 13 57 41
    • we see that performance improves the most for searches based on a reference to a book section. For searches like Essential Pre-Uni Physics L8.5, Essential Pre-Uni Physics B2.3 or Stpe into Phsyics: Curent and Cicruit Pratcice 7, the old search didn't return any results, and the new one does.
    Screenshot 2025-10-01 at 13 58 05
  • Performance: Site-wide search is now 2 to 4x faster. This is very likely due to the fact that site-wide search no longer performs wildcard queries on the content.
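To see why partial-word terms such as supercond stop matching once wildcard queries on content are dropped, the sketch below compares a substring (wildcard-style) check with a whole-word check that tolerates a small edit distance. It is plain Java rather than Elasticsearch, the class and method names are hypothetical, and the maximum distance of 2 is an assumption modelled on Elasticsearch's `AUTO` fuzziness for terms longer than five characters.

```java
import java.util.Arrays;
import java.util.Locale;

final class WildcardVsFuzzySketch {

    // Classic dynamic-programming Levenshtein edit distance.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int substitution = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                curr[j] = Math.min(substitution, Math.min(prev[j] + 1, curr[j - 1] + 1));
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Wildcard-style match: the term may appear anywhere inside the content.
    static boolean wildcardMatch(String content, String term) {
        return content.toLowerCase(Locale.ROOT).contains(term.toLowerCase(Locale.ROOT));
    }

    // Whole-word match: some word must be within maxDistance edits of the term.
    static boolean fuzzyWordMatch(String content, String term, int maxDistance) {
        String needle = term.toLowerCase(Locale.ROOT);
        return Arrays.stream(content.toLowerCase(Locale.ROOT).split("\\W+"))
                .anyMatch(word -> editDistance(word, needle) <= maxDistance);
    }

    public static void main(String[] args) {
        String content = "Superconductivity occurs below a critical temperature";
        System.out.println(wildcardMatch(content, "supercond"));      // substring is present
        System.out.println(fuzzyWordMatch(content, "supercond", 2));  // too far from any whole word
        System.out.println(fuzzyWordMatch(content, "temperatur", 2)); // one edit from "temperature"
    }
}
```

The whole-word check avoids scanning inside every word of the content, which is consistent with the speed-up reported above, at the cost of losing prefix-style searches.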

How this PR achieved this

  • for the site wide search:

    • id is now matched using the default strategy (a fuzzy and a non-fuzzy match query), and its weight has changed from 2.0 to 10.0 and 8.0.
    • weights for title, subtitle, summary, tags have changed from 2.0 to 5.0
    • weights for prioritizedSearchableContent, searchableContent have remained at 2.0. We previously performed both match and wildCard searches on these fields, but we now use match only (this matches whole words only, allowing for a few characters of edit distance)
    • exact matches for id, title and subtitle have been introduced with a weight of 10.
  • for the Question finder

    • weights for title, subtitle, summary, tags have changed from 10 to 5. Weights for prioritizedSearchableContent, searchableContent have changed from 10 and 5 to 1.
    • exact matches for id, title and subtitle have been introduced with a weight of 10.
  • to support exact matches on the "subtitle" field, a raw version of that field is now made available (indexed)
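The effect of the site-wide weight changes listed above can be illustrated with a toy scoring model. The sketch below is a deliberate simplification: it multiplies a made-up per-field relevance by the field weight and sums the results, whereas Elasticsearch's real scoring is more involved. The class name, method names and relevance values are hypothetical; only the weights (2.0 for both subtitle and content before, 5.0 and 2.0 after) come from the list.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class FieldWeightSketch {

    // Sums relevance * weight over the fields a document matched on.
    static double score(Map<String, Double> fieldRelevance, Map<String, Double> weights) {
        double total = 0.0;
        for (Map.Entry<String, Double> e : fieldRelevance.entrySet()) {
            total += e.getValue() * weights.getOrDefault(e.getKey(), 0.0);
        }
        return total;
    }

    // Builds a weight table for the two fields used in this illustration.
    static Map<String, Double> weights(double subtitle, double content) {
        Map<String, Double> w = new LinkedHashMap<>();
        w.put("subtitle", subtitle);
        w.put("searchableContent", content);
        return w;
    }

    public static void main(String[] args) {
        Map<String, Double> oldWeights = weights(2.0, 2.0);
        Map<String, Double> newWeights = weights(5.0, 2.0);

        // Hypothetical relevance values: one page matches moderately on its
        // subtitle, another matches strongly on its content.
        Map<String, Double> subtitleHit = Map.of("subtitle", 0.6);
        Map<String, Double> contentHit = Map.of("searchableContent", 1.0);

        System.out.println("old weights: subtitle hit " + score(subtitleHit, oldWeights)
                + ", content hit " + score(contentHit, oldWeights));
        System.out.println("new weights: subtitle hit " + score(subtitleHit, newWeights)
                + ", content hit " + score(contentHit, newWeights));
    }
}
```

Under these assumed numbers the content match ranks first with the old weights and the subtitle match ranks first with the new ones, which is the shift in relative priority the PR is aiming for.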

Evaluating this strategy

  • As the above analysis has shown, abandoning wildcard content matches for site-wide search means that it is now more difficult to search for content. Because we still support word-based matches, only word-part searches are now impossible. A user who remembers a sentence fragment from some content can still find the question using that. The only use case I can think of that is now gone is where a user was trying to make up for the absence of stemming, e.g. by searching "Lagrang" to get matches for both "Lagrange" and "Lagrangian". Still, this only affects the site-wide search. The question finder still performs wildcard matches.
  • Although the PR has significantly improved matches for book questions using section number, users will need to be very exact. This is because part of the boost these receive comes from exact matches on subtitle. Unfortunately, even casing differences in the subtitle break this (so Essential Pre-Uni Physics L8.5 is found with a rank of 3, but Essential Pre-Uni physics L8.5 is not found at all).
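The casing problem described above is what one would expect from an exact, unanalysed comparison. The sketch below shows that behaviour in plain Java and, purely as an assumption about one possible direction, what lower-casing both sides before comparing would do. The class and method names are hypothetical, and nothing here reflects how the PR or Elasticsearch implements the match.

```java
import java.util.Locale;

final class ExactSubtitleMatchSketch {

    // Exact whole-string comparison: any difference, including case, is a miss.
    static boolean exactMatch(String indexedSubtitle, String searchString) {
        return indexedSubtitle.equals(searchString);
    }

    // Hypothetical variant: normalise case and surrounding whitespace on both sides first.
    static boolean caseInsensitiveMatch(String indexedSubtitle, String searchString) {
        return normalise(indexedSubtitle).equals(normalise(searchString));
    }

    private static String normalise(String s) {
        return s.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String subtitle = "Essential Pre-Uni Physics L8.5";
        System.out.println(exactMatch(subtitle, "Essential Pre-Uni Physics L8.5"));
        System.out.println(exactMatch(subtitle, "Essential Pre-Uni physics L8.5"));
        System.out.println(caseInsensitiveMatch(subtitle, "Essential Pre-Uni physics L8.5"));
    }
}
```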

Future improvements, thoughts

  • It feels like there's a limit to the improvements we can achieve by adjusting the relative weights of the question fields: some users will be expecting to find a question by title, some by a piece of content they remember, and some by its section in the book. Users would be able to perform better searches if we let them specify the fields they'd like to include.
  • Questions can be found through both the site-wide search and the question finder. I think this is fine, but it's a little confusing that even just for questions, the site-wide search works differently from the question finder. Looking at the results above, it seems a user who's trying to find a question by book section is better off using the site wide search (the test set containing book section names performs better on the site wide search) -- this is counter-intuitive, and I think most users would expect the question finder to be the better tool for finding questions. I think it'd be better if the site-wide search and question finder used the same logic for finding questions.
  • I didn't find the test case runner's output too useful (it just told me that a lot of cases have failed both before and after the changes), but the fact that the test cases can be used as labelled examples for the search has been a huge help. @axlewin has introduced new labelled examples for this PR, which I thought was very helpful. I hope we keep expanding this dataset.
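Treating test cases as labelled examples, as described above, amounts to recording the rank at which the expected page appears and comparing it before and after a change. The sketch below shows one way to compute that. The class, method names and page ids are hypothetical, the handling of missing results is an assumption, and this is not the test runner in this PR.

```java
import java.util.List;

final class RankChangeSketch {

    // 1-based rank of the expected id in the results, or -1 if it is not returned.
    static int rankOf(String expectedId, List<String> resultIds) {
        int index = resultIds.indexOf(expectedId);
        return index < 0 ? -1 : index + 1;
    }

    // Positive values mean the expected result moved up. A missing result is
    // treated as ranking just below the end of its list (an assumption), so
    // that appearing at all counts as an improvement.
    static int improvement(String expectedId, List<String> before, List<String> after) {
        int oldRank = rankOf(expectedId, before);
        int newRank = rankOf(expectedId, after);
        int oldEffective = oldRank < 0 ? before.size() + 1 : oldRank;
        int newEffective = newRank < 0 ? after.size() + 1 : newRank;
        return oldEffective - newEffective;
    }

    public static void main(String[] args) {
        List<String> before = List.of("page_a", "page_b", "page_c", "book_q_l8_5");
        List<String> after = List.of("book_q_l8_5", "page_a", "page_b", "page_c");
        System.out.println(improvement("book_q_l8_5", before, after)); // moved from rank 4 to rank 1
        System.out.println(improvement("page_a", before, after));      // moved from rank 1 to rank 2
    }
}
```

Collecting these per-case differences gives the distribution from which the quartiles and outliers quoted in the analysis can be read.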

Ada

The above observations have been about the Physics site. I've performed the same analysis for Ada as well. For that site, we didn't have a set of pre-existing test cases, so I only relied on last year's search terms and the Ada-version of Alex's dataset.

  • the scope of changes is similar to what we've observed for Physics, but a few percentage points lower. This is likely explained by the fact that Ada makes use of the subtitle field a lot less frequently (there are only ~40 distinct values for subtitle in the entire Ada content repo)
    Screenshot 2025-10-01 at 16 56 08

  • rank changes in Alex's test suite are significantly lower, as they didn't have any "book section" style searches to add
    Screenshot 2025-10-01 at 17 02 07

Examining the search terms that produce the largest change, we see that they're either very generic (eg: yx, con, ye), rely on wildcard matches in the id (webtech), or wildcard matches in the content (asynchro, monoalp).

multiMatchSearchesGroupedByTerm.putIfAbsent(term, Sets.newHashSet());
multiMatchSearchesGroupedByTerm.get(term).add(searchInField.getField());

if (!isSearchableContentField) {

This changes the Fuzzy strategy so it generates a single, fuzzy match query for content fields, and a match, a wildcard and a multimatch query for any other fields. Before this change, using a strategy on any field always resulted in the same query, which I think was clearer. Rather than hard-coding field-specific behavior in an existing strategy, I suggest applying a different strategy to content fields. This could be a new SimpleFuzzy strategy, or some flag that modifies the behaviour of the Simple strategy so it generates a fuzzy match query.

For reference, these are the 4 strategies we currently have:

Image
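One way to read the suggestion above is that each strategy should expand to a fixed set of query types regardless of field, with content fields simply assigned a different strategy. The sketch below illustrates that shape. The enum constants, the `SIMPLE_FUZZY` name, the query-type labels and the field-to-strategy mapping are all hypothetical and are not taken from the codebase.

```java
import java.util.List;
import java.util.Set;

final class FieldStrategySketch {

    // Each strategy always expands to the same query types, whatever field it is applied to.
    enum Strategy {
        FUZZY(List.of("match", "wildcard", "multi_match")),
        SIMPLE_FUZZY(List.of("fuzzy_match")); // hypothetical new strategy

        private final List<String> queryTypes;

        Strategy(List<String> queryTypes) {
            this.queryTypes = queryTypes;
        }

        List<String> queryTypes() {
            return queryTypes;
        }
    }

    // Content field names as mentioned in this thread.
    private static final Set<String> CONTENT_FIELDS =
            Set.of("searchableContent", "prioritizedSearchableContent");

    // The caller picks the strategy per field; the strategy itself has no field-specific branches.
    static Strategy strategyFor(String field) {
        return CONTENT_FIELDS.contains(field) ? Strategy.SIMPLE_FUZZY : Strategy.FUZZY;
    }

    public static void main(String[] args) {
        for (String field : List.of("title", "subtitle", "searchableContent")) {
            Strategy strategy = strategyFor(field);
            System.out.println(field + " -> " + strategy + " " + strategy.queryTypes());
        }
    }
}
```

Keeping the field check in the caller means a strategy name alone tells a reader which queries will be generated, which is the clarity the comment is asking for.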


@axlewin, I'm happy to merge this branch; just please tell me if you plan to make any changes in response to this comment.


barna-isaac commented Oct 1, 2025

A note about deploying: until the first content push, there will be no subtitle.raw field. Queries will still succeed, and find any matches based on other fields, but we should initiate ETL (for Ada and Physics both) so this field gets added.


axlewin commented Oct 2, 2025

I agree that the fuzzy strategy ought to behave the same for every field. What do you think of aebd121? This effectively just uses isSearchableContentField as its own strategy, rather than needing a Strategy.SEARCHABLE_CONTENT which we'd only use in these cases anyway. I can see it might be neater to add this to the Strategy enum though, happy to change it if you prefer.

(This approach does introduce one slight behavioural change: searchable content in the QF is now treated the same way as it is in site search. Since your analysis showed the site search was performing better anyway though, I'm not too worried about that.)

…ng strategy

Restores original QF behaviour
@barna-isaac

We've agreed with @axlewin to use searchableContentField as its own strategy but keep the option to override it, and to leave things unchanged on the QuestionFinder (so we don't need to re-evaluate performance).

@barna-isaac

After reviewing the changes, we've decided to backport the content field changes to the question finder as well.
Here's how this changes the scope:
Screenshot 2025-10-03 at 10 41 21

And here's how the test set performs with the content field related changes backported:
Screenshot 2025-10-03 at 10 41 59

Performance is similar on the test sets. We decided to backport these changes because, this way, the Question Finder and Site Wide Search at least use the same strategy for content-related searches.

@barna-isaac barna-isaac merged commit 0996aab into master Oct 3, 2025
5 checks passed
@barna-isaac barna-isaac deleted the hotfix/improve-search-results branch October 3, 2025 10:08
@barna-isaac barna-isaac restored the hotfix/improve-search-results branch October 3, 2025 10:10
barna-isaac added a commit that referenced this pull request Oct 3, 2025
Improve search for book questions by section number

This commit significantly improves search relevance, particularly for finding book questions by section number (e.g., "A1.2") stored in the subtitle.

It introduces high-priority, exact-string matches for id, title, and subtitle and increases their respective search weights. To reduce noise and improve performance, wildcard matching on general page content has been removed in favor of stricter, whole-word matching.

This results in a 2-4x performance increase and much more accurate results for targeted queries. The trade-off is that partial-word searches within content (e.g., "supercond" for "superconductivity") will no longer return matches.

In a few months, we should review that everything is fine, as part of this ticket: https://trello.com/c/26WQGXnn/5782-review-search-improvements.
@barna-isaac barna-isaac deleted the hotfix/improve-search-results branch October 3, 2025 10:20
@barna-isaac

Originally merged this to master by accident, but it's now merged on main. (And @jsharkey13 has force-pushed master so that merge is now gone.)
