This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Unified search query syntax using the full-text search capabilities of the underlying DB #11635

Merged
merged 59 commits into from
Oct 25, 2022
Changes from 28 commits
Commits
d49af47
use websearch_to_tsquery
novocaine Dec 22, 2021
fea2848
Support fallback using plainto_tsquery
novocaine Dec 22, 2021
ccb0d6c
cleanup
novocaine Dec 23, 2021
409afd6
add tests for search_rooms
novocaine Dec 23, 2021
ac002cb
deflake
novocaine Dec 23, 2021
bbbb651
improve docstring
novocaine Dec 23, 2021
02defb2
pass tsquery_func to _find_highlights_in_postgres
novocaine Dec 23, 2021
2eef5ed
Add tests for sqlite
novocaine Dec 23, 2021
28dc642
Don't preprocess sqllite queries
novocaine Dec 23, 2021
8ac3ea1
Also test the size of "results", as its generated by a different quer…
novocaine Dec 23, 2021
5198cb2
fix comment
novocaine Dec 23, 2021
b3c7e0d
Use plainto_tsquery instead of crafting the query ourselves
novocaine Dec 23, 2021
10f181a
black
novocaine Dec 23, 2021
905f572
Merge branch 'develop' of github.com:matrix-org/synapse into use-webs…
novocaine Dec 23, 2021
05a0cdd
Add feature file
novocaine Dec 23, 2021
b0417e9
Use common cases for all tests
novocaine Dec 23, 2021
04c394b
Merge branch 'develop' of github.com:novocaine/synapse into use-webse…
novocaine Jan 10, 2022
4cfb506
Merge branch 'develop' into use-websearch_to_tsquery-for-fts
novocaine May 26, 2022
d30a211
fix for removal of get_datastore()
novocaine May 26, 2022
68cae26
isort
novocaine May 26, 2022
5dbb2c0
add migration
novocaine May 26, 2022
64dd357
comment
novocaine May 26, 2022
20ab98e
black
novocaine May 26, 2022
e5aa916
isort
novocaine May 26, 2022
7941d8a
flake8
novocaine May 26, 2022
f7362f1
import List for python 3.7
novocaine May 26, 2022
f1769f9
create a background job, and don't do anything if the tokenizer is al…
novocaine May 26, 2022
de07c83
When creating a new db, create it with tokenize=porter in the first p…
novocaine May 26, 2022
1170e07
black
novocaine May 27, 2022
b987203
Revert change to 25/fts.py
novocaine May 27, 2022
6ef1ef8
fix json import
novocaine May 27, 2022
63c4270
slightly neater quote formatting
novocaine May 27, 2022
0ddf6b1
fix missing space
novocaine May 30, 2022
9d77bc4
Add a parser to produce uniform results on all DBs
novocaine May 30, 2022
b8b2e28
address flake8
novocaine May 30, 2022
8506b21
give mypy a hint
novocaine May 30, 2022
e279be5
Fix phrase handling of "word"
novocaine May 30, 2022
6a7cf49
Add comment
novocaine May 30, 2022
92ad70a
Merge branch 'develop' into use-websearch_to_tsquery-for-fts
novocaine May 31, 2022
a5f298b
document negation via -
novocaine May 31, 2022
d6ed19e
Merge remote-tracking branch 'origin/develop' into use-websearch_to_t…
clokep Oct 17, 2022
c544409
Move the database schema to the updated directory.
clokep Oct 17, 2022
be742f1
Use a deque.
clokep Oct 17, 2022
52700b0
Add basic tests for _tokenize_query.
clokep Oct 17, 2022
5d9d183
Handle edge-cases.
clokep Oct 17, 2022
219321b
temp
clokep Oct 18, 2022
9166035
Fix phrase handling.
clokep Oct 18, 2022
a7fd4f6
Handle not with a space after.
clokep Oct 18, 2022
6751f5f
Use the int version number to check if the feature is supported.
clokep Oct 18, 2022
6e6ebd9
Lint
clokep Oct 18, 2022
3272819
Simplify schema delta.
clokep Oct 18, 2022
4993e1c
Add docstring.
clokep Oct 18, 2022
9b7bb08
Remove support for parens.
clokep Oct 24, 2022
13f5306
Fix edge cases with double quotes.
clokep Oct 24, 2022
62f6f18
Lint.
clokep Oct 24, 2022
abc56ad
Merge remote-tracking branch 'origin/develop' into use-websearch_to_t…
clokep Oct 24, 2022
b3755ae
Remove backwards compat code for Postgres < 11 since (almost) EOL.
clokep Oct 25, 2022
5e5fc8d
Simplify phrase handling.
clokep Oct 25, 2022
223580a
Fix tests on postgres 10.
clokep Oct 25, 2022
1 change: 1 addition & 0 deletions changelog.d/11635.feature
@@ -0,0 +1 @@
Allow use of Postgres and SQLite full-text search operators in search queries.
76 changes: 45 additions & 31 deletions synapse/storage/databases/main/search.py
@@ -27,7 +27,7 @@
LoggingTransaction,
)
from synapse.storage.databases.main.events_worker import EventRedactBehaviour
from synapse.storage.engines import BaseDatabaseEngine, PostgresEngine, Sqlite3Engine
from synapse.storage.engines import PostgresEngine, Sqlite3Engine
from synapse.types import JsonDict

if TYPE_CHECKING:
@@ -431,8 +431,6 @@ async def search_msgs(
"""
clauses = []

search_query = _parse_query(self.database_engine, search_term)

args: List[Any] = []

# Make sure we don't explode because the person is in too many rooms.
@@ -454,20 +452,25 @@ async def search_msgs(
count_clauses = clauses

if isinstance(self.database_engine, PostgresEngine):
search_query, tsquery_func = _parse_query_for_pgsql(
search_term, self.database_engine
)
sql = (
"SELECT ts_rank_cd(vector, to_tsquery('english', ?)) AS rank,"
f"SELECT ts_rank_cd(vector, {tsquery_func}('english', ?)) AS rank,"
" room_id, event_id"
" FROM event_search"
" WHERE vector @@ to_tsquery('english', ?)"
f" WHERE vector @@ {tsquery_func}('english', ?)"
Member:
Can we move these to multiline strings while we're here 😇

Member:
I was planning to do a follow-up PR to update the entire module to multi-line strings. Would that be acceptable?

Member:
Sure!

)
args = [search_query, search_query] + args

count_sql = (
"SELECT room_id, count(*) as count FROM event_search"
" WHERE vector @@ to_tsquery('english', ?)"
f" WHERE vector @@ {tsquery_func}('english', ?)"
)
count_args = [search_query] + count_args
elif isinstance(self.database_engine, Sqlite3Engine):
search_query = _parse_query_for_sqlite(search_term)

sql = (
"SELECT rank(matchinfo(event_search)) as rank, room_id, event_id"
" FROM event_search"
@@ -479,7 +482,7 @@ async def search_msgs(
"SELECT room_id, count(*) as count FROM event_search"
" WHERE value MATCH ?"
)
count_args = [search_term] + count_args
Contributor Author (novocaine, Dec 23, 2021):
I think it was a bug to pass search_term here rather than search_query?

count_args = [search_query] + count_args
else:
# This should be unreachable.
raise Exception("Unrecognized database engine")
@@ -511,7 +514,9 @@ async def search_msgs(

highlights = None
if isinstance(self.database_engine, PostgresEngine):
highlights = await self._find_highlights_in_postgres(search_query, events)
highlights = await self._find_highlights_in_postgres(
search_query, events, tsquery_func
)

count_sql += " GROUP BY room_id"

@@ -520,7 +525,6 @@ async def search_msgs(
)

count = sum(row["count"] for row in count_results if row["room_id"] in room_ids)

return {
"results": [
{"event": event_map[r["event_id"]], "rank": r["rank"]}
@@ -552,9 +556,6 @@ async def search_rooms(
Each match as a dictionary.
"""
clauses = []

search_query = _parse_query(self.database_engine, search_term)

args: List[Any] = []

# Make sure we don't explode because the person is in too many rooms.
@@ -592,20 +593,24 @@ async def search_rooms(
args.extend([origin_server_ts, origin_server_ts, stream])

if isinstance(self.database_engine, PostgresEngine):
search_query, tsquery_func = _parse_query_for_pgsql(
search_term, self.database_engine
)
sql = (
"SELECT ts_rank_cd(vector, to_tsquery('english', ?)) as rank,"
f"SELECT ts_rank_cd(vector, {tsquery_func}('english', ?)) as rank,"
" origin_server_ts, stream_ordering, room_id, event_id"
" FROM event_search"
" WHERE vector @@ to_tsquery('english', ?) AND "
f" WHERE vector @@ {tsquery_func}('english', ?) AND "
)
args = [search_query, search_query] + args

count_sql = (
"SELECT room_id, count(*) as count FROM event_search"
" WHERE vector @@ to_tsquery('english', ?) AND "
f" WHERE vector @@ {tsquery_func}('english', ?) AND "
)
count_args = [search_query] + count_args
elif isinstance(self.database_engine, Sqlite3Engine):

# We use CROSS JOIN here to ensure we use the right indexes.
# https://sqlite.org/optoverview.html#crossjoin
#
@@ -624,13 +629,14 @@ async def search_rooms(
" CROSS JOIN events USING (event_id)"
" WHERE "
)
search_query = _parse_query_for_sqlite(search_term)
args = [search_query] + args

count_sql = (
"SELECT room_id, count(*) as count FROM event_search"
" WHERE value MATCH ? AND "
)
count_args = [search_term] + count_args
count_args = [search_query] + count_args
else:
# This should be unreachable.
raise Exception("Unrecognized database engine")
@@ -670,7 +676,9 @@ async def search_rooms(

highlights = None
if isinstance(self.database_engine, PostgresEngine):
highlights = await self._find_highlights_in_postgres(search_query, events)
highlights = await self._find_highlights_in_postgres(
search_query, events, tsquery_func
)

count_sql += " GROUP BY room_id"

@@ -696,7 +704,7 @@ async def search_rooms(
}

async def _find_highlights_in_postgres(
self, search_query: str, events: List[EventBase]
self, search_query: str, events: List[EventBase], tsquery_func: str
) -> Set[str]:
"""Given a list of events and a search term, return a list of words
that match from the content of the event.
@@ -707,6 +715,7 @@ async def _find_highlights_in_postgres(
Args:
search_query
events: A list of events
tsquery_func: The tsquery_* function to use when making queries

Returns:
A set of strings.
@@ -739,7 +748,7 @@ def f(txn: LoggingTransaction) -> Set[str]:
while stop_sel in value:
stop_sel += ">"

query = "SELECT ts_headline(?, to_tsquery('english', ?), %s)" % (
query = f"SELECT ts_headline(?, {tsquery_func}('english', ?), %s)" % (
_to_postgres_options(
{
"StartSel": start_sel,
@@ -770,20 +779,25 @@ def _to_postgres_options(options_dict: JsonDict) -> str:
return "'%s'" % (",".join("%s=%s" % (k, v) for k, v in options_dict.items()),)


def _parse_query(database_engine: BaseDatabaseEngine, search_term: str) -> str:
def _parse_query_for_sqlite(search_term: str) -> str:
"""Takes a plain unicode string from the user and converts it into a form
that can be passed to database.
We use this so that we can add prefix matching, which isn't something
that is supported by default.
that can be passed to SQLite's matchinfo().
Contributor Author:
This is currently a no-op, but we will probably want to add stuff here to handle certain cases, e.g. #3024

"""
return search_term

# Pull out the individual words, discarding any non-word characters.
results = re.findall(r"([\w\-]+)", search_term, re.UNICODE)

if isinstance(database_engine, PostgresEngine):
return " & ".join(result + ":*" for result in results)
elif isinstance(database_engine, Sqlite3Engine):
return " & ".join(result + "*" for result in results)
def _parse_query_for_pgsql(search_term: str, engine: PostgresEngine) -> Tuple[str, str]:
"""Selects a tsquery_* func to use and transforms the search_term into syntax appropriate for it.

Args:
search_term: A user supplied search query.
engine: The database engine.

Returns:
A tuple of (parsed search_term, tsquery func to use).
"""

if engine.supports_websearch_to_tsquery:
return search_term, "websearch_to_tsquery"
else:
# This should be unreachable.
raise Exception("Unrecognized database engine")
return search_term, "plainto_tsquery"
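
As a rough illustration of the pattern this hunk introduces, the sketch below interpolates only the tsquery function name into the SQL text and leaves the user's search term as a bound parameter. The helper `build_search_sql` and the `ALLOWED_TSQUERY_FUNCS` allowlist are hypothetical names introduced here, not part of Synapse; the allowlist check is an added assumption rather than something the PR does.

```python
from typing import List, Tuple

# Hypothetical allowlist: the only function names this sketch will interpolate.
ALLOWED_TSQUERY_FUNCS = ("websearch_to_tsquery", "plainto_tsquery")


def build_search_sql(search_term: str, tsquery_func: str) -> Tuple[str, List[str]]:
    """Build the ranking query and its bound arguments (illustrative only)."""
    if tsquery_func not in ALLOWED_TSQUERY_FUNCS:
        raise ValueError(f"unexpected tsquery function: {tsquery_func!r}")

    # The function name is formatted into the SQL text; the search term is
    # never formatted in and is passed as a parameter for each "?" placeholder.
    sql = (
        f"SELECT ts_rank_cd(vector, {tsquery_func}('english', ?)) AS rank,"
        " room_id, event_id"
        " FROM event_search"
        f" WHERE vector @@ {tsquery_func}('english', ?)"
    )
    return sql, [search_term, search_term]
```

Because the term stays a parameter, operators such as quotes or a leading `-` reach the database untouched and are interpreted by whichever tsquery function was selected.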
4 changes: 4 additions & 0 deletions synapse/storage/engines/postgres.py
@@ -175,6 +175,10 @@ def supports_returning(self) -> bool:
"""Do we support the `RETURNING` clause in insert/update/delete?"""
return True

@property
def supports_websearch_to_tsquery(self) -> bool:
return int(self.server_version.split(".")[0]) >= 11

def is_deadlock(self, error: Exception) -> bool:
import psycopg2.extensions
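
A minimal standalone sketch of the version check added above, assuming `server_version` is a string such as `"10.23"` or `"14.5 (Debian 14.5-1.pgdg110+1)"`. The free function here is a hypothetical stand-in for the property, written only to show why the major version is compared as an integer.

```python
def supports_websearch_to_tsquery(server_version: str) -> bool:
    """Return True for Postgres 11 or later (hypothetical standalone helper)."""
    # Everything before the first "." is treated as the major version.
    major = int(server_version.split(".")[0])
    return major >= 11


# Comparing version strings directly would be misleading: "9.6" sorts after
# "11" lexicographically, so the check is done on integers instead.
string_comparison_is_wrong = "9.6" >= "11"
```

Under these assumptions, `"10.23"` is rejected and `"11.0"` is accepted, whereas a plain string comparison would wrongly accept `"9.6"`.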

2 changes: 1 addition & 1 deletion synapse/storage/schema/main/delta/25/fts.py
@@ -37,7 +37,7 @@

SQLITE_TABLE = (
"CREATE VIRTUAL TABLE event_search"
" USING fts4 ( event_id, room_id, sender, key, value )"
" USING fts4 (tokenize=porter, event_id, room_id, sender, key, value )"
)
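
To show what `tokenize=porter` changes in practice, here is a small sketch using Python's built-in `sqlite3` module against an in-memory database. It assumes the local SQLite build includes FTS4; the helper name `search_with_tokenizer` and the sample row are made up for illustration.

```python
import sqlite3
from typing import List


def search_with_tokenizer(tokenizer: str, term: str) -> List[str]:
    """Return event_ids whose value matches `term` under the given FTS4 tokenizer."""
    conn = sqlite3.connect(":memory:")
    try:
        # Column layout follows the diff. The tokenizer name is formatted into
        # the DDL, so only pass trusted constants such as "porter" or "simple".
        conn.execute(
            "CREATE VIRTUAL TABLE event_search"
            f" USING fts4 (tokenize={tokenizer}, event_id, room_id, sender, key, value)"
        )
        conn.execute(
            "INSERT INTO event_search (event_id, room_id, sender, key, value)"
            " VALUES (?, ?, ?, ?, ?)",
            (
                "$ev1",
                "!room:example.org",
                "@alice:example.org",
                "content.body",
                "the dog was running",
            ),
        )
        rows = conn.execute(
            "SELECT event_id FROM event_search WHERE value MATCH ?", (term,)
        ).fetchall()
        return [row[0] for row in rows]
    finally:
        conn.close()
```

With the porter tokenizer, "running" and "runs" are both stemmed to "run", so a search for `runs` finds the sample row; with the `simple` tokenizer the same search finds nothing.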


@@ -0,0 +1,55 @@
# Copyright 2021 The Matrix.org Foundation C.I.C.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import json

from synapse.storage.engines import Sqlite3Engine


def run_create(cur, database_engine, *args, **kwargs):
    # Upgrade the event_search table to use the porter tokenizer if it isn't already
    if isinstance(database_engine, Sqlite3Engine):
        cur.execute("SELECT sql FROM sqlite_master WHERE name='event_search'")
        sql = cur.fetchone()
        if sql is None:
            raise Exception("The event_search table doesn't exist")
        if "tokenize=porter" not in sql[0]:
            cur.execute("DROP TABLE event_search")
            cur.execute("""CREATE VIRTUAL TABLE event_search
                           USING fts4 (tokenize=porter, event_id, room_id, sender, key, value )""")

            # Run a background job to re-populate the event_search table.
            cur.execute("SELECT MIN(stream_ordering) FROM events")
            rows = cur.fetchall()
            min_stream_id = rows[0][0]

            cur.execute("SELECT MAX(stream_ordering) FROM events")
            rows = cur.fetchall()
            max_stream_id = rows[0][0]

            if min_stream_id is not None and max_stream_id is not None:
                progress = {
                    "target_min_stream_id_inclusive": min_stream_id,
                    "max_stream_id_exclusive": max_stream_id + 1,
                    "rows_inserted": 0,
                }
                progress_json = json.dumps(progress)

                sql = (
                    "INSERT into background_updates (update_name, progress_json)"
                    " VALUES (?, ?)"
                )

                cur.execute(sql, ("event_search", progress_json))


def run_upgrade(cur, database_engine, *args, **kwargs):
    pass
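
The migration only rebuilds the table when the stored DDL lacks `tokenize=porter`. A minimal sketch of that check in isolation, using an in-memory database and assuming an SQLite build with FTS4, might look like the following. `needs_porter_upgrade` and `make_db` are hypothetical helpers, not Synapse functions.

```python
import sqlite3


def needs_porter_upgrade(conn: sqlite3.Connection) -> bool:
    """Report whether event_search was created without the porter tokenizer."""
    # sqlite_master keeps the original CREATE statement text for the table.
    row = conn.execute(
        "SELECT sql FROM sqlite_master WHERE name='event_search'"
    ).fetchone()
    if row is None:
        raise RuntimeError("The event_search table doesn't exist")
    return "tokenize=porter" not in row[0]


def make_db(create_sql: str) -> sqlite3.Connection:
    """Create an in-memory database containing only the given table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(create_sql)
    return conn
```

Running the check a second time after the table has been recreated returns `False`, which is what makes the delta safe to apply to databases that were already created with the porter tokenizer.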