[Q&A] Speed issues... #1628
-
**What happens?**

I'm testing different speed settings, and it seems like pg_search is slower compared to other solutions. Wondering if I'm doing something wrong. I have a sample table of a bit over 830,000 game reviews, and a lot of the reviews share common words (like "game", "nice", "play"). The test searches only a single text field, not concatenating any fields, so the indexes are also only on that single field.

Since the GIN query uses `to_tsvector`, pg_search and GIN will return different row counts, simply because words like "game" and "games" are treated as the same word in tsvector/GIN, making more matches occur. So the row counts being a bit different is also an expected result.

To make the search a bit more complex and return a lot of results, this is what I'm doing (queries below). Results: please note this is all done on the exact same hardware, and the three PostgreSQL queries run against the same table & DB. Should pg_search actually be slower for full-text searching than everything else?

One gotcha that I discovered that isn't really documented (or I didn't see any warnings about) is that for the bm25 index you really NEED the …

**To Reproduce**

Basically looking for every row that matches the conditions below.

Standard search is:

```sql
select distinct review_id from scrape_reviews where
review_text ilike '%nice%' and review_text ilike '%game%' or
review_text ilike '%nice%' and review_text ilike '%play%' or
review_text ilike '%great%';
```

GIN search is using this:

```sql
select distinct review_id from reviews where to_tsvector('english', review_text) @@
to_tsquery('english', '((nice & (game | play)) | great)');
```

pg_search is using this:

```sql
SELECT distinct review_id
FROM search_idx.search(
  '(review_text:nice AND (review_text:game OR review_text:play)) OR (review_text:great)'
);
```

**OS:** Linux, x64
**ParadeDB Version:** 0.9.3
**Are you using ParadeDB Docker, Helm, or the extension(s) standalone?** ParadeDB pg_search Extension
**Full Name:** Nathanael Anderson
**Affiliation:** SideQuest
**Did you include all relevant data sets for reproducing the issue?** N/A - The reproduction does not require a data set
**Did you include the code required to reproduce the issue?**
**Did you include all relevant configurations (e.g., CPU architecture, PostgreSQL version, Linux distribution) to reproduce the issue?**
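As a quick illustration of the stemming behavior mentioned in the question (this is standard PostgreSQL behavior with the `english` text-search configuration, not anything specific to this dataset), "games" and "game" normalize to the same lexeme:

```sql
-- Both inputs reduce to the lexeme 'game', which is why the
-- tsvector/GIN query matches more rows than the ILIKE version:
SELECT to_tsvector('english', 'games');  -- 'game':1
SELECT to_tsquery('english', 'games');   -- 'game'
```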
Replies: 2 comments 11 replies
-
Hey @NathanaelA, you'll find that search results are much faster if you pass `limit_rows` and `offset_rows` to narrow down the number of results returned from the index at one time.
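A sketch of what that might look like with the query from the question — the `limit_rows` and `offset_rows` parameter names are as mentioned above, but the exact signature may differ by pg_search version:

```sql
SELECT review_id
FROM search_idx.search(
  '(review_text:nice AND (review_text:game OR review_text:play)) OR (review_text:great)',
  limit_rows  => 10,
  offset_rows => 0
);
```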
-
@NathanaelA I've got another PR up that I'm putting the finishing touches on that I'd expect to drop your timings for pg_search, with a small row limit, into the millisecond range. The query syntax will be different, but as an example, with an entirely different dataset:

```
-- akin to what you're doing now
[v16.2][419762] reddit=# select count(*) from idxreddit.search('body:(beer wine cheese)', stable_sort => false);
 count
--------
 179871
(1 row)

Time: 436.495 ms

-- with a limit... marginally faster but not much b/c we now have to internally sort
[v16.2][419762] reddit=# select count(*) from idxreddit.search('body:(beer wine cheese)', limit_rows => 10);
 count
-------
    10
(1 row)

Time: 332.796 ms

-- here's 10 (random) matching rows in 2ms
[v16.2][419762] reddit=# select count(*) from (select id from reddit where id @@@ 'body:(beer wine cheese)' limit 10);
 count
-------
    10
(1 row)

Time: 2.010 ms
```

We'll have this released, I hope, later this week. As our Postgres planner integrations improve, these sorts of drastic performance improvements will become commonplace. And @philippemnoel is right: at scale, pg_search is going to be far superior to Postgres' GIN/GiST indexes, as we're able to find the matching tuples much faster. For smaller data volumes, I don't think it's surprising that one might be a bit faster than the other. There's a lot of tradeoffs when it comes to performance, as you know.
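Tangentially, for the plain-GIN side of the comparison: recomputing `to_tsvector('english', review_text)` at query time can be avoided by storing it in a generated column and indexing that. This is standard PostgreSQL (12+), and the column and index names below are made up for illustration:

```sql
-- Precompute the tsvector once per row instead of per query:
ALTER TABLE reviews
  ADD COLUMN review_tsv tsvector
  GENERATED ALWAYS AS (to_tsvector('english', review_text)) STORED;

CREATE INDEX reviews_review_tsv_idx ON reviews USING GIN (review_tsv);

-- The GIN query from the question then becomes:
-- SELECT DISTINCT review_id FROM reviews
-- WHERE review_tsv @@ to_tsquery('english', '((nice & (game | play)) | great)');
```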