Reworking search: tokenization, handling of quoted literal search, and postgres fuzziness #2351

jecorn · 2023-04-18T21:49:08Z

What type of PR is this?

feature
bug (one tiny one)

What this PR does / why we need it:

The current search is based on exact phrases entered into the search box. This behaviour is similar to quoting a search string in a search engine (though diacritics are removed and matches are case insensitive). For example, searching for beans pinto does not find pinto beans, egg over easy does not find eggs over easy, etc. This is convenient to code, but leads to unexpected behaviour for users, where "easy" search hits are missed.
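To make the old behaviour concrete, here is a minimal sketch of exact-phrase matching on normalized text. The helper names `normalize` and `phrase_match` are hypothetical, and stripping diacritics with the standard-library `unicodedata` module is an assumption for illustration; the project's actual normalization code is not shown in this PR description.

```python
import unicodedata


def normalize(text: str) -> str:
    """Strip diacritics and lowercase (illustrative, not the project's code)."""
    # NFKD splits accented characters into a base character plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks, keeping only the base characters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


def phrase_match(query: str, text: str) -> bool:
    """Old-style search: the whole query must appear as one contiguous phrase."""
    return normalize(query) in normalize(text)


print(normalize("Crème Brûlée"))                   # creme brulee
print(phrase_match("pinto beans", "Pinto Beans"))  # True
print(phrase_match("beans pinto", "Pinto Beans"))  # False: word order matters
```

The last call shows the "easy" hit that exact-phrase search misses.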

This PR adds four improvements to search:

Search is now tokenized on whitespace by default, so that search strings are separated and hits can be found from matches independent of intervening words and word order. For example, pinto beet burgers matches both pinto burgers and beet pinto burgers.
Special characters except quotes (' and ") are removed from the search string and replaced by whitespace, which is then used for tokenization. For example, slow-cooker carrots is tokenized to slow, cooker and carrots.
Tokenized search (including special character removal) can be overridden using quotes (either ' or "). Quoted searches are searched for literally and can be mixed and matched with non-quoted searches. For example, "slow-cooker" beans casserole 'with paprika' is tokenized to a search for slow-cooker, beans, casserole and with paprika.
On a postgres backend, fuzzy searching is now the default, using word similarity trigrams with GIN indexing of the recipe.name_normalized, recipe.description_normalized, recipe_ingredient.note_normalized and recipe_ingredient.original_text_normalized columns. Trigram searching avoids the need to define a language for the database, as would otherwise be needed for full-text search stemming and stop word removal. Fuzziness can therefore work on mixed-language databases. The fuzziness is calibrated to try to return hits if a search term has one or two mismatches (depending on word length) while keeping false positives low. Fuzzy search orders hits by trigram similarity to the recipe name. Fuzzy search is incompatible with quoted literal search, so adding quoted strings to a search automatically falls back to token search.
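The tokenization rules above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `tokenize_search` and the two regular expressions are hypothetical names, and the treatment of unbalanced quotes or apostrophes inside words is a simplification.

```python
import re

# A balanced double-quoted or single-quoted run (hypothetical pattern).
QUOTED = re.compile(r'"([^"]+)"|\'([^\']+)\'')
# Anything that is neither a word character nor whitespace.
SPECIAL = re.compile(r"[^\w\s]")


def tokenize_search(query: str) -> tuple[list[str], list[str]]:
    """Split a search string into (tokens, literals).

    Quoted runs are kept verbatim as literals. Everything else has special
    characters replaced by whitespace and is then split on whitespace.
    """
    literals = [m.group(1) or m.group(2) for m in QUOTED.finditer(query)]
    # Blank out the quoted runs so they are not tokenized a second time.
    remainder = QUOTED.sub(" ", query)
    tokens = SPECIAL.sub(" ", remainder).split()
    return tokens, literals


print(tokenize_search("slow-cooker carrots"))
# (['slow', 'cooker', 'carrots'], [])
print(tokenize_search("\"slow-cooker\" beans casserole 'with paprika'"))
# (['beans', 'casserole'], ['slow-cooker', 'with paprika'])
```

Returning literals separately makes it easy for a caller to notice that a quoted string is present and fall back from fuzzy search to token search, as described above. A query containing an apostrophe (for example grandma's) would need extra care that this sketch does not attempt.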

Which issue(s) this PR fixes:

(fixes an un-numbered bug in RecipeIngredientModel.note_normalized triggers)
Fixes #2325
Addresses #2335

Special notes for your reviewer:

I tried implementing both trigram and full text searching in postgres. Needing to define a language for the database for full-text search indexing represented a problem because text search depends on defining an index per language in each column. What if a database contains words in multiple languages? This is often the case for international recipes, and we don't know the languages used inside a recipe ahead of time or even at recipe entry/import time.

The main benefit of full-text search over trigrams is performance. But for a recipe database, we are unlikely to run into the million-row situation that is a problem for trigrams. I found that trigrams with GIN indexing are performant even on a database of 6,000 real recipes, and they have the huge benefit of being language-independent.
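For readers unfamiliar with trigrams, here is a rough pure-Python sketch of why they are language-independent: the comparison only looks at overlapping three-character windows, with no stemming or stop word list. The function names are hypothetical, the padding convention (two leading spaces, one trailing space per word) follows the pg_trgm documentation as I understand it, and the score is the simple whole-string set similarity. The PR itself relies on postgres word similarity, which compares against the best-matching part of the target text and is not reimplemented here.

```python
import re


def trigrams(text: str) -> set[str]:
    """Collect the three-character windows of every word in the text."""
    grams: set[str] = set()
    for word in re.findall(r"\w+", text.lower()):
        padded = f"  {word} "  # assumed padding: two spaces before, one after
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams of both strings."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


print(round(similarity("bean", "beans"), 3))  # 0.571
print(similarity("mean", "bean"))             # 0.25

# Order candidate names by similarity to the search term, best match first.
names = ["mean greens", "bean soup", "beans on toast"]
print(sorted(names, key=lambda n: similarity("bean", n), reverse=True))
```

Whether a given score counts as a hit depends on the threshold chosen in the database, which this sketch leaves out.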

(The number of commits in this PR reflects my development style, which spans multiple machines in multiple physical locations, not the complexity of the final code)

Testing

I created/changed several recipe search tests in test_recipe_repository.py:

Refactored the test strings for recipes. They are no longer randomly generated on each test run. If they are randomly generated, there is a small chance that two supposedly orthogonal test strings will be close enough to be cross-matched by fuzzy search and false-fail the test. So now they are based on foods (though somewhat silly ones).
Test literal search: quoted strings bypass tokenised search and only match exact phrases in the search text
Test special character removal from non-literal searches: non-quoted strings have special characters removed and are properly tokenised between those special characters
Test token separation: non-quoted strings are separated into tokens based on whitespace and good search results are returned independent of token order
Test fuzzy search: postgres only, typos and small word differences (e.g. plural or singular) return good search results with title-match prioritized

PR passes all automated tests run from scratch (both previous and added above) using both an sqlite and postgres backend.

In addition to the automated tests, I tested tokenized, quoted literal, and fuzzy search on both sqlite and postgres backend using a 6,000 row database of real recipes. Search is fast in all contexts. Fuzzy search returns false positives (as expected, e.g. mean is a false positive for bean), but also rescues searches that would otherwise fail (e.g. bean is a true positive for beans). Ordering by trigram similarity to the recipe name brings the most relevant hits to the top of the list.

Release Notes

Search is now by default independent of word order and matches despite intervening words. @jecorn 
Searches can be quoted to find literal matches & quoted search can be mixed with non-quoted search. @jecorn 
On postgres only, search is now fuzzy by default. This can be overridden by using quoted search. @jecorn

michael-genson · 2023-04-19T03:39:36Z

The backend tests on the GitHub workflow run a few extra tests, namely ruff. Looks like some linting tests are failing.
You can run make backend-all, which runs all the linting tests too (or just run make backend-lint for ruff).

jecorn · 2023-04-19T04:16:49Z

Thanks! Saves a lot of time being able to run the extra tests locally. Stay tuned.

jecorn · 2023-04-19T09:24:28Z

While running the backend-all tests, I started now seeing that the test_get_scheduled_webhooks_filter_query test fails. It's not finding anything at all (result list is empty). But this test also fails on a clean checkout of the mealie-next head with all of my changes removed. So I'm inclined to ignore it and move on, since the failure seems unrelated to my diffs.

Should be ready for the GitHub auto-checks.

michael-genson · 2023-04-19T15:58:07Z

While running the backend-all tests, I started now seeing that the test_get_scheduled_webhooks_filter_query test fails

Looks like it worked here ¯\_(ツ)_/¯ Might just be flaky

If this PR is ready, feel free to mark it ready for review. It might be a bit before hay-kot is able to get to it, I know he's got a lot going on

jecorn · 2023-04-19T21:12:42Z

Thanks, will do!

By the way, while doing all of the postgres development, I noticed that the tests are pretty stateful, in that they don't clean up after themselves. This sometimes leads to weird collisions between tests if they are run without intermediate manual cleaning (e.g. multiple duplicate entries in the database, migration data hanging around that can cross-match between tests, etc). The make backend-clean works great for sqlite, but postgres is a pain. I was thinking of writing some kind of nuking helper function that could be run after each test that leaves data behind. But then again, the diversity of the tests means there would probably need to be many different nuking schemes. Painful.

jecorn · 2023-04-30T15:53:36Z

Is it bad form to add my user name to the release notes? I saw a note about doing that somewhere in the docs, and noticed it in previous release notes, which is why I did so. But also just saw that recent pull requests do not have a user name in the release notes.

jecorn · 2023-05-13T10:51:05Z

Hi @hay-kot. This search PR hasn’t changed in 3 weeks, passes all tests, and improves both SQLite and Postgres search. Is there more info you need from me? Or just let me know if life stuff is too hectic and I’ll give it a rest.

hay-kot

Implementation looks good. There's a few small things. Looks like the bulk of changes only change Postgres installs (apart from the bug fix)

Could you also add a note in the documentation so this feature is easily discoverable? Maybe an entry in the FAQ and some mention of it in the getting started guide, mentioning enhanced search with Postgres installation.

mealie/db/models/recipe/ingredient.py

mealie/repos/repository_recipes.py

jecorn · 2023-05-14T04:08:13Z

Thanks for taking a look! For clarity, tokenized search applies to sqlite (trigram search is already basically tokenized). Quoted search applies to both postgres and sqlite. And quoting parts of a postgres backend search falls back to tokenized search (you can't mix trigrams with exact substring matches).

For documentation, I added something about fuzzy search in the FAQ. And Postgres-specificity is mentioned in the Features section and Postgres container part of my docs commit (already pulled)

jecorn · 2023-05-16T11:00:36Z

@hay-kot Not to confuse things with multiple PRs from me, but I think this search PR is ready to go again whenever you have a moment to take a look.


jecorn marked this pull request as ready for review April 19, 2023 11:24

jecorn marked this pull request as draft April 19, 2023 11:34

jecorn marked this pull request as ready for review April 19, 2023 11:36

jecorn marked this pull request as draft April 19, 2023 11:38

jecorn marked this pull request as ready for review April 19, 2023 21:08

hay-kot force-pushed the postgres-fuzz branch from 3196943 to a12f1b9 May 13, 2023 18:40

hay-kot requested changes May 13, 2023


hay-kot approved these changes May 28, 2023


jecorn added 13 commits May 28, 2023 10:31

Creating postgres migration script and starting to set up to detect database

2df3731

non-working placeholders for postgres pg_tgrm

72e70f5

First draft of some indexes

a8b7ea8

non-working commit of postgres indexing

80d1b5b

Further non-working edits to db-centric fuzzy search

3fe9da0

update alembic for extensions

04f9896

More non-working setup

54c1cac

Move db type check to init_db

d4c7256

fix typo in db name check

853c43f

Add sqlite token search and postgres full text search

86c6fca

reorder search to hit exact matches faster

1378be7

Add settings and docs for POSTGRES_LANGUAGE (full text search)

4ddaa8f

Use user-specified POSTGRES_LANGUAGE in search

22256af

jecorn added 23 commits May 28, 2023 10:31

fix fuzzy search typo

98f6a28

Remove full text search and instead order by trigram match

e64082f

cleaner adding of indices, remove fulltext

3346811

Cleanup old import of getting app settings

41dda03

Fix typo in index

53e7fa0

Fix some alembic fuzzy typos

236e166

Remove diagnostic printing from alembic migration

adce758

Fix mixed up commutator for trigram operator and relax criteria

021bb2b

forgot to remove query debug

0ee9d55

sort only on name

8f1d1da

token and fuzzy search tests

81947f4

Refactor recipe search test to avoid rare random string cross-matches.

ba250c4

Add ability to quote parts of search for exact match

605b2de

Remove internal punctuation, unless it's quoted for literal search

54aad08

Add tests for special character removal and literal search

6cfe202

Remove the outer double quotes from searches, but leave internal single quotes alone.

dfc602b

Update tests to avoid intra-test name collisions

37b2ba6

Fixing leftovers highlighted by lint

2a4a6a1

cleanup linting and mypy errors

e7d7da9

Fix test cross-matching on dirty db (leftovers from bulk import)

0eccb76

forgot to cleanup something when debugging mypy errors

7492a7f

re-order pg_trgm loading in postgres

2095a47

address comments

dc2d56c

hay-kot force-pushed the postgres-fuzz branch from 5a2c708 to dc2d56c May 28, 2023 17:31

hay-kot approved these changes May 28, 2023


hay-kot merged commit 7e0d29a into mealie-recipes:mealie-next May 28, 2023

jecorn deleted the postgres-fuzz branch May 29, 2023 03:53

michael-genson mentioned this pull request Jun 1, 2023

[Nightly] - OCR Editor Fails to Copy Text to Selected Field in Instructions #2394

Closed

5 tasks

michael-genson mentioned this pull request Jul 29, 2023

Feature: Generalize Search to Other Models #2472

Merged
