[Q&A] Speed issues... #1628
-
**What happens?**

I'm testing different speed settings, and it seems like pg_search is slower compared to other solutions. Wondering if I'm doing something wrong. I have a sample table of a bit over 830,000 game reviews, and a lot of the reviews share common words (like "game", "nice", "play"). The test searches only a single text field, not concatenating any fields, so the indexes are also only on that single field.

Since the GIN query uses `to_tsvector`, pg_search and GIN will return different row counts, simply because words like "game" and "games" are treated as the same word in tsvector/GIN, making more matches occur. So the row counts being a bit different is also an expected result.

To make the search a bit more complex and return a lot of results, this is what I'm doing (queries below). Results: please note this is all done on the exact same hardware, and the three PostgreSQL queries run against the same table & DB. Should pg_search actually be slower for full-text searching than everything else?

One gotcha that I discovered that isn't really documented (or I didn't see any warnings about) is that for the bm25 index you really NEED the …

**To Reproduce**

Basically looking for every row that matches the conditions below.

Standard search is:

```sql
select distinct review_id from scrape_reviews where
review_text ilike '%nice%' and review_text ilike '%game%' or
review_text ilike '%nice%' and review_text ilike '%play%' or
review_text ilike '%great%';
```

GIN search is using this:

```sql
select distinct review_id from reviews where to_tsvector('english', review_text) @@
to_tsquery('english', '((nice & (game | play)) | great)');
```

pg_search is using this:

```sql
SELECT distinct review_id
FROM search_idx.search(
  '(review_text:nice AND (review_text:game OR review_text:play)) OR (review_text:great)'
);
```

**OS:** Linux, x64
**ParadeDB Version:** 0.9.3
**Are you using ParadeDB Docker, Helm, or the extension(s) standalone?** ParadeDB pg_search Extension
**Full Name:** Nathanael Anderson
**Affiliation:** SideQuest
**Did you include all relevant data sets for reproducing the issue?** N/A - The reproduction does not require a data set
**Did you include the code required to reproduce the issue?**
**Did you include all relevant configurations (e.g., CPU architecture, PostgreSQL version, Linux distribution) to reproduce the issue?**
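As a quick illustration of the stemming behavior mentioned in the question (this is standard PostgreSQL behavior with the `english` text-search configuration, not anything specific to this dataset), "games" and "game" normalize to the same lexeme:

```sql
-- Both inputs reduce to the lexeme 'game', which is why the
-- tsvector/GIN query matches more rows than the ILIKE version:
SELECT to_tsvector('english', 'games');  -- 'game':1
SELECT to_tsquery('english', 'games');   -- 'game'
```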
Replies: 2 comments 11 replies
-
Hey @NathanaelA, you'll find that search results are much faster if you pass `limit_rows` and `offset_rows` to narrow down the number of results returned from the index at one time.
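A sketch of what that might look like with the query from the question — the `limit_rows` and `offset_rows` parameter names are as mentioned above, but the exact signature may differ by pg_search version:

```sql
SELECT review_id
FROM search_idx.search(
  '(review_text:nice AND (review_text:game OR review_text:play)) OR (review_text:great)',
  limit_rows  => 10,
  offset_rows => 0
);
```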
-
@NathanaelA I've got another PR up that I'm putting the finishing touches on that I'd expect to drop your timings for pg_search, with a small row limit, into the millisecond range. The query syntax will be different, but as an example, with an entirely different dataset:

```
-- akin to what you're doing now
[v16.2][419762] reddit=# select count(*) from idxreddit.search('body:(beer wine cheese)', stable_sort => false);
 count
--------
 179871
(1 row)

Time: 436.495 ms

-- with a limit... marginally faster but not much b/c we now have to internally sort
[v16.2][419762] reddit=# select count(*) from idxreddit.search('body:(beer wine cheese)', limit_rows => 10);
 count
-------
    10
(1 row)

Time: 332.796 ms

-- here's 10 (random) matching rows in 2ms
[v16.2][419762] reddit=# select count(*) from (select id from reddit where id @@@ 'body:(beer wine cheese)' limit 10);
 count
-------
    10
(1 row)

Time: 2.010 ms
```

We'll have this released, I hope, later this week. As our Postgres planner integrations improve, these sorts of drastic performance improvements will become commonplace. And @philippemnoel is right: at scale, pg_search is going to be far superior to Postgres' GIN/GiST indexes, as we're able to find the matching tuples much faster. For smaller data volumes, I don't think it's surprising that one might be a bit faster than the other. There's a lot of tradeoffs when it comes to performance, as you know.
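Tangentially, for the plain-GIN side of the comparison: recomputing `to_tsvector('english', review_text)` at query time can be avoided by storing it in a generated column and indexing that. This is standard PostgreSQL (12+), and the column and index names below are made up for illustration:

```sql
-- Precompute the tsvector once per row instead of per query:
ALTER TABLE reviews
  ADD COLUMN review_tsv tsvector
  GENERATED ALWAYS AS (to_tsvector('english', review_text)) STORED;

CREATE INDEX reviews_review_tsv_idx ON reviews USING GIN (review_tsv);

-- The GIN query from the question then becomes:
-- SELECT DISTINCT review_id FROM reviews
-- WHERE review_tsv @@ to_tsquery('english', '((nice & (game | play)) | great)');
```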