Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Giving the tsvector and tsquery stuff a shot. #1019

Merged
merged 1 commit into from
Apr 19, 2024
Merged

Giving the tsvector and tsquery stuff a shot. #1019

merged 1 commit into from
Apr 19, 2024

Conversation

jwoertink
Copy link
Member

Ref #353

This adds in a new match method along with to_tsquery and to_tsvector to do a bit of full text searching. The implementation technically generates valid sql, but is this "good enough" for a start? 🤷‍♂️

Before I merge this in, I want to get some feedback and see if this approach will be ok for built-in, or if this will require a lot more planning.

describe "match" do
it "uses @@" do
name.to_tsvector.match("jeb").to_sql.should eq ["SELECT #{QueryMe::COLUMN_SQL} FROM users WHERE TO_TSVECTOR(users.name) @@ TO_TSQUERY($1)", "jeb"]
end
Copy link
Contributor

@robacarp robacarp Mar 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you're onto something with this. It's a nice syntax, and if I'm reading it right I think it means this'll allow me to specify a column type as a tsvec too, which seems necessary to get this to provide reasonably fast searching.

Postgres provides a few variations on to_tsvector that do things a little differently when it comes to operators and escaping. I ended up using websearch_to_tsquery, and that may be what the intent here is -- it provides for correctly using boolean, unary, negation, and quotation operators as you might expect a web search to do. Maybe there's room to implement all of this function field or maybe Avram should just pick one? I'm not really sure.

The other thing that seems pretty important is the ts_rank stuff which allows you to actually sort results by relevance. ts_rank + websearch_to_tsquery got me a pretty usable search interface quickly.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could add websearch_to_tsquery easily in the same manner here. I'll have to look at the ranking stuff, that does seem pretty important. The extra options are where I got a little concerned. I may be able to add those in, but I wasn't sure how often they're really used. Like, if it's a very small % of people that need the extra options like to_tsquery('english', 'whatever') then maybe they just fall back to raw SQL again.

The other thing here is I'm making a huge assumption that the right side of @@ will always be cast to tsquery although postgres does allow a few variations. It's just that we don't have a way to map to a Crystal side TsQuery object, so I have to make some assumption on it. My main hope is that this is good enough for the 80% until someone comes along and ports pg_search to crystal 😄

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that the parametric index (ts_vector('english'...)) is out of reach for this stage. In order to accomplish that I'd probably reach for something more search-domain specific than the general purpose Avram query language.

I've posted this elsewhere, but I'll include it here for posterity.

In a migration and model, I setup a ts_vector column like this:

# in the migration:
    execute <<-SQL
      ALTER TABLE channels
      ADD COLUMN search_vector tsvector
      GENERATED ALWAYS AS (
        to_tsvector('english', coalesce(title, '') || ' ' || coalesce(summary, ''))
      ) STORED;
    SQL

# and in the model:
  table :channels do
    column search_vector : String?
  end

Then my search query is this:

    sql = <<-SQL
      WITH search_results AS (
        SELECT
          channels.id AS channel_id,
          ts_rank_cd(channels.search_vector, websearch_to_tsquery($1)) AS rank
        FROM channels
        WHERE channels.search_vector @@ websearch_to_tsquery($1)
        ORDER BY rank DESC
        LIMIT 25
      )

      SELECT
        channels.*
      FROM search_results
      JOIN channels ON search_results.channel_id = channels.id
    SQL

    AppDatabase.query_all(sql, query_input, as: PoDB::Channel)

The ts_rank_cd stuff is "under the hood," I think. I wonder if it would be possible to have a syntax like this: column.search(query) which:

  • is smart enough to know if the column type is a ts_vector or should be wrapped in a to_tsvector call -- or perhaps issues a warning if it's not?
  • wraps query in a websearch_to_tsquery
  • runs the match operator
  • sets a default order by key

Is that the right thing to do when someone writes that code? Is that even achievable? Is that too much magic?


As an aside, I had to construct this as a CTE because Avram doesn't know what to do with the rank column, but I need to SELECT it in order to have postgres calculate the rank in order to sort the results. If Avram could discard the column in the response, it would be unnecessary.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oo nice. That helps seeing the example.

@jwoertink jwoertink merged commit db6ef66 into main Apr 19, 2024
9 checks passed
@jwoertink jwoertink deleted the issues/353_a branch April 19, 2024 16:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants