Add support for the FTS5 trigram tokenizer #1655

Open. Wants to merge 3 commits into base: development
2 changes: 2 additions & 0 deletions Documentation/FTS5Tokenizers.md
@@ -39,6 +39,8 @@ All SQLite [built-in tokenizers](https://www.sqlite.org/fts5.html#tokenizers) to

- The [porter](https://www.sqlite.org/fts5.html#porter_tokenizer) tokenizer turns English words into their root: "database engine" gives the "databas" and "engin" tokens. The query "database engines" will match, because it produces the same tokens.

- The [trigram](https://sqlite.org/fts5.html#the_trigram_tokenizer) tokenizer treats each contiguous sequence of three characters as a token to allow general substring matching. "Sequence" gives "seq", "equ", "que", "uen", "enc" and "nce". The queries "SEQUENCE", "SEQUEN", "QUENC" and "QUE" all match as they decompose into a subset of the same trigrams.

However, built-in tokenizers don't match "first" with "1st", because they produce the different "first" and "1st" tokens.

Nor do they match "Grossmann" with "Großmann", because they produce the different "grossmann" and "großmann" tokens.
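
As a rough illustration of the trigram decomposition described above, the sketch below splits a string into overlapping three-character windows and checks that every trigram of a query also appears in the content. It is a simplification written for this discussion: `trigrams(of:)` and `mightMatch(content:query:)` are hypothetical helpers, lowercasing stands in for SQLite's case folding, and real FTS5 matching also takes token positions into account.

```swift
/// Splits a string into overlapping three-character tokens, lowercased.
/// Strings shorter than three characters produce no trigram.
func trigrams(of string: String) -> [String] {
    let characters = Array(string.lowercased())
    guard characters.count >= 3 else { return [] }
    return (0...(characters.count - 3)).map { start in
        String(characters[start..<(start + 3)])
    }
}

/// Returns true when every trigram of the query is also a trigram of the content.
/// (Hypothetical approximation: it ignores the order and position of trigrams.)
func mightMatch(content: String, query: String) -> Bool {
    let queryTrigrams = trigrams(of: query)
    guard !queryTrigrams.isEmpty else { return false }
    return Set(queryTrigrams).isSubset(of: Set(trigrams(of: content)))
}

print(trigrams(of: "Sequence"))  // ["seq", "equ", "que", "uen", "enc", "nce"]
print(mightMatch(content: "Sequence", query: "QUENC"))  // true
print(mightMatch(content: "Sequence", query: "quest"))  // false
```

The last call fails because "quest" yields the trigram "ues", which "Sequence" does not contain.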
41 changes: 31 additions & 10 deletions Documentation/FullTextSearch.md
@@ -386,7 +386,7 @@ See [SQLite documentation](https://www.sqlite.org/fts5.html) for more informatio

**A tokenizer defines what "matching" means.** Depending on the tokenizer you choose, full-text searches won't return the same results.

SQLite ships with three built-in FTS5 tokenizers: `ascii`, `porter` and `unicode61` that use different algorithms to match queries with indexed content.
SQLite ships with four built-in FTS5 tokenizers: `ascii`, `porter`, `unicode61` and `trigram` that use different algorithms to match queries with indexed content.

```swift
try db.create(virtualTable: "book", using: FTS5()) { t in
@@ -395,20 +395,23 @@ try db.create(virtualTable: "book", using: FTS5()) { t in
    t.tokenizer = .unicode61(...)
    t.tokenizer = .ascii
    t.tokenizer = .porter(...)
    t.tokenizer = .trigram(...)
}
```

See below some examples of matches:

| content | query | ascii | unicode61 | porter on ascii | porter on unicode61 |
| ----------- | ---------- | :----: | :-------: | :-------------: | :-----------------: |
| Foo | Foo | X | X | X | X |
| Foo | FOO | X | X | X | X |
| Jérôme | Jérôme | X ¹ | X ¹ | X ¹ | X ¹ |
| Jérôme | JÉRÔME | | X ¹ | | X ¹ |
| Jérôme | Jerome | | X ¹ | | X ¹ |
| Database | Databases | | | X | X |
| Frustration | Frustrated | | | X | X |
| content | query | ascii | unicode61 | porter on ascii | porter on unicode61 | trigram |
| ----------- | ---------- | :----: | :-------: | :-------------: | :-----------------: | :-----: |
| Foo | Foo | X | X | X | X | X |
| Foo | FOO | X | X | X | X | X |
| Jérôme | Jérôme | X ¹ | X ¹ | X ¹ | X ¹ | X ¹ |
| Jérôme | JÉRÔME | | X ¹ | | X ¹ | X ¹ |
| Jérôme | Jerome | | X ¹ | | X ¹ | X ¹ |
| Database | Databases | | | X | X | |
| Frustration | Frustrated | | | X | X | |
| Sequence | quenc | | | | | X |


¹ Don't miss [Unicode Full-Text Gotchas](#unicode-full-text-gotchas)

@@ -455,6 +458,24 @@ See below some examples of matches:

It strips diacritics from latin script characters if it wraps unicode61, and does not if it wraps ascii (see the example above).

- **trigram**

```swift
try db.create(virtualTable: "book", using: FTS5()) { t in
    t.tokenizer = .trigram()
    t.tokenizer = .trigram(matching: .caseInsensitiveRemovingDiacritics)
    t.tokenizer = .trigram(matching: .caseSensitive)
}
```

The "trigram" tokenizer is case-insensitive for unicode characters by default. It matches "Jérôme" with "JÉRÔME".

Diacritics stripping can be enabled so it matches "jérôme" with "jerome". Case-sensitive matching can also be enabled but is mutually exclusive with diacritics stripping.

Unlike the other tokenizers, it provides general substring matching, matching "Sequence" with "que" by splitting character sequences into overlapping 3 character tokens (trigrams).

It can also act as an index for GLOB and LIKE queries depending on the configuration.
Jnosh marked this conversation as resolved.

See [SQLite tokenizers](https://www.sqlite.org/fts5.html#tokenizers) for more information, and [custom FTS5 tokenizers](FTS5Tokenizers.md) in order to add your own tokenizers.
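
To make the three trigram matching modes described above more concrete, here is a small sketch that imitates them with Foundation string folding. It does not call SQLite or GRDB: the `Matching` enum, `normalized(_:matching:)` and `matches(content:query:matching:)` are hypothetical stand-ins, and Foundation's folding rules are only assumed to approximate what the tokenizer does.

```swift
import Foundation

/// Hypothetical stand-in for the three trigram matching modes.
enum Matching {
    case caseInsensitive
    case caseInsensitiveRemovingDiacritics
    case caseSensitive
}

/// Normalizes text the way each mode is assumed to compare characters.
func normalized(_ text: String, matching: Matching) -> String {
    switch matching {
    case .caseInsensitive:
        return text.folding(options: [.caseInsensitive], locale: nil)
    case .caseInsensitiveRemovingDiacritics:
        return text.folding(options: [.caseInsensitive, .diacriticInsensitive], locale: nil)
    case .caseSensitive:
        return text
    }
}

/// Substring test after normalization. Queries shorter than three
/// characters never match, because they produce no trigram.
func matches(content: String, query: String, matching: Matching) -> Bool {
    guard query.count >= 3 else { return false }
    return normalized(content, matching: matching)
        .contains(normalized(query, matching: matching))
}

// Case-sensitive mode distinguishes "que" from "QUE".
print(matches(content: "Sequence", query: "que", matching: .caseSensitive))  // true
print(matches(content: "Sequence", query: "QUE", matching: .caseSensitive))  // false
```

Under the same assumptions, "Jerome" is expected to match "Jérôme" only in the `.caseInsensitiveRemovingDiacritics` mode.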


42 changes: 42 additions & 0 deletions GRDB/FTS/FTS5.swift
@@ -74,6 +74,48 @@ public struct FTS5 {
#endif
}

#if GRDBCUSTOMSQLITE || GRDBCIPHER
/// Options for trigram tokenizer character matching. Matches the raw
/// "case_sensitive" and "remove_diacritics" tokenizer arguments.
///
/// Related SQLite documentation: <https://sqlite.org/fts5.html#the_trigram_tokenizer>
public enum TrigramTokenizerMatching: Sendable {
    /// Case insensitive matching without removing diacritics. This
    /// option matches the raw "case_sensitive=0 remove_diacritics=0"
    /// tokenizer argument.
    case caseInsensitive
    /// Case insensitive matching that removes diacritics before
    /// matching. This option matches the raw
    /// "case_sensitive=0 remove_diacritics=1" tokenizer argument.
    case caseInsensitiveRemovingDiacritics
    /// Case sensitive matching. Diacritics are not removed when
    /// performing case sensitive matching. This option matches the raw
    /// "case_sensitive=1 remove_diacritics=0" tokenizer argument.
    case caseSensitive
}
Owner

I have read the SQLite Documentation which says:

remove_diacritics [...] It may only be set to 1 if the case_sensitive options is set to 0 - setting both options to 1 is an error.

Still, I would suggest exposing two distinct options for case_sensitive and remove_diacritics.

The reasons for this suggestion are:

  1. As time has passed, I have learned that the Swift version of infrequently used SQLite APIs should be as close to the original as possible. It should have the full SQLite smell. This spares the user from translating the SQLite knowledge that was not always easily acquired into a foreign Swift API.

  2. SQLite evolves and fixes bugs. Today both options only support 0 and 1, but this may change. It is good to let the user extend the list of options, even if the GRDB API lags behind:

    t.tokenizer = .trigram(
        caseSensitive: .init(rawValue: 2),
        diacritics: .init(rawValue: 3))

    This kind of SQLite evolution has happened for Unicode61: remove_diacritics=2 was introduced once a bug was discovered in remove_diacritics=1.

Some GRDB APIs, such as FTS5.Diacritics used by Unicode61, do not support such extensibility. This is not an example to follow.


Maybe we could aim at:

// Public usage
t.tokenizer = .trigram(
    caseSensitive: false,
    diacritics: .remove)

API:

public struct FTS5TokenizerDescriptor {
    /// The "trigram" tokenizer.
    ///
    /// For example:
    ///
    /// ```swift
    /// try db.create(virtualTable: "book", using: FTS5()) { t in
    ///     t.tokenizer = .trigram()
    /// }
    /// ```
    ///
    /// Related SQLite documentation: <https://sqlite.org/fts5.html#the_trigram_tokenizer>
    ///
    /// - parameters:
    ///     - caseSensitive: Unless specified, performs a case insensitive matching.
    ///     - diacritics: Unless specified, diacritics are not removed before matching.
    public static func trigram(
        caseSensitive: FTS5.TrigramCaseSensitiveOption? = nil,
        diacritics: FTS5.TrigramDiacriticsOption? = nil
    ) -> FTS5TokenizerDescriptor { ... }
}

extension FTS5 {
    /// Case sensitivity options for the Trigram FTS5 tokenizer.
    /// Matches the raw "case_sensitive" tokenizer argument.
    ///
    /// Related SQLite documentation: <https://www.sqlite.org/fts5.html#the_trigram_tokenizer>
    public struct TrigramCaseSensitiveOption: RawRepresentable {
        public var rawValue: Int
        
        public init(rawValue: Int) {
            self.rawValue = rawValue
        }
    }
    
    /// Diacritics options for the Trigram FTS5 tokenizer.
    /// Matches the raw "remove_diacritics" tokenizer argument.
    ///
    /// Related SQLite documentation: <https://www.sqlite.org/fts5.html#the_trigram_tokenizer>
    public struct TrigramDiacriticsOption: RawRepresentable {
        public var rawValue: Int
        
        public init(rawValue: Int) {
            self.rawValue = rawValue
        }
        
        /// Do not remove diacritics. This option matches the raw
        /// "remove_diacritics=0" trigram tokenizer argument.
        public static let keep = Self(rawValue: 0)
        
        /// Remove diacritics. This option matches the raw
        /// "remove_diacritics=1" trigram tokenizer argument.
        public static let remove = Self(rawValue: 1)
    }
}

// Declared at file scope because Swift does not allow nested extensions.
extension FTS5.TrigramCaseSensitiveOption: ExpressibleByBooleanLiteral {
    /// When true, matches the "case_sensitive=1" trigram tokenizer argument.
    /// When false, it is "case_sensitive=0".
    public init(booleanLiteral value: Bool) {
        self = value ? .init(rawValue: 1) : .init(rawValue: 0)
    }
}

The FTS5.TrigramDiacriticsOption values (.keep and .remove), which require SQLite 3.45, would be unavailable on Apple platforms. The FTS5.TrigramDiacriticsOption type itself can remain fully available.

Owner

@groue groue Oct 26, 2024

Re-reading myself with a critical eye, I stick to my initial arguments about the "SQLite smell", but I really start to wonder if TrigramCaseSensitiveOption and TrigramDiacriticsOption are that useful.

t.tokenizer = .trigram()
t.tokenizer = .trigram(caseSensitive: 1)
t.tokenizer = .trigram(caseSensitive: 0, removeDiacritics: 1)

public struct FTS5TokenizerDescriptor {
    public static func trigram(
        caseSensitive: Int? = nil,
        removeDiacritics: Int? = nil
    ) -> FTS5TokenizerDescriptor { ... }
}

This would pass my review like a breeze :-)

Yes, this would make the Swift FTS5 APIs not very consistent. Keeping the old existing APIs is very important, because all code that users can put in a migration must basically compile forever (modifying a migration is a cardinal sin). This should not prevent new APIs from being shaped by the experience that was acquired over the years.

I leave it up to your good taste 😄
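
One possible shape for the body of the `Int?`-based function floated above, as a sketch only: `trigramComponents(caseSensitive:removeDiacritics:)` is a hypothetical free function that returns the raw tokenizer components instead of an `FTS5TokenizerDescriptor`. It assumes that an omitted argument should emit nothing, so that SQLite applies its own default, and that values are passed through without validation so that SQLite itself rejects invalid combinations.

```swift
/// Builds the raw components of a trigram tokenizer definition.
/// A nil argument emits nothing, leaving the default to SQLite.
/// (Hypothetical helper; no validation is performed on the values.)
func trigramComponents(caseSensitive: Int? = nil, removeDiacritics: Int? = nil) -> [String] {
    var components = ["trigram"]
    if let value = caseSensitive {
        components += ["case_sensitive", String(value)]
    }
    if let value = removeDiacritics {
        components += ["remove_diacritics", String(value)]
    }
    return components
}

print(trigramComponents())                  // ["trigram"]
print(trigramComponents(caseSensitive: 1))  // ["trigram", "case_sensitive", "1"]
print(trigramComponents(caseSensitive: 0, removeDiacritics: 1))
// ["trigram", "case_sensitive", "0", "remove_diacritics", "1"]
```

Because nothing is validated, a call such as `trigramComponents(caseSensitive: 1, removeDiacritics: 1)` would build components that SQLite is documented to reject.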

Collaborator Author

Thank you for the detailed feedback!

When using the raw Int? arguments we obviously can't attach any availability information. Do you think it would make sense to note the SQLite >= 3.45 requirement in the documentation for the removeDiacritics parameter? Or better to just leave that to the SQLite documentation entirely since we can't enforce it anyway?

(In theory we could also have two trigram() methods, one taking only the caseSensitive parameter and a second one taking both parameters with more restrictive availability. But I don't think that makes sense since that approach won't work if SQLite adds more options for the existing parameters so probably better not to try to enforce availability for the parameters at all than try to partially do it.)

Owner

But I don't think that makes sense since that approach won't work if SQLite adds more options for the existing parameters so probably better not to try to enforce availability for the parameters at all than try to partially do it.

Ah. I see. Yes it wouldn't be great if people could write t.trigram(caseSensitive: 3) in an app that targets iOS 17. OK, I have been a poor guide here.

So are we back to TrigramCaseSensitiveOption and TrigramDiacriticsOption, with built-in values subject to availability checks, along with raw initializers for people who know what they are doing?

Collaborator Author

I think that makes sense: it makes it possible to add availability checks while allowing the user to use raw values to bypass restrictions if needed.

A couple of minor questions:

  • You've used optional arguments to configure the options when creating the tokenizer as opposed to using a non-optional default value matching the SQLite default. Do I understand correctly that the idea here is for GRDB to respect the SQLite default, even if that default were to change in the future?
    public static func trigram(
        caseSensitive: FTS5.TrigramCaseSensitiveOption? = nil,
        diacritics: FTS5.TrigramDiacriticsOption? = nil
    ) -> FTS5TokenizerDescriptor { ... }
  • Do we make TrigramDiacriticsOption itself unavailable for non-custom SQLite or only the remove option? I don't think there is a use case for manually using a raw value on OSs where it isn't supported anyway?

  • Thoughts on adding Equatable, Hashable and BitwiseCopyable conformances for TrigramCaseSensitiveOption and TrigramDiacriticsOption; I don't imagine these would see much use but probably no downside to adding them either?
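
A self-contained sketch of the direction discussed above (preset values that could carry availability annotations, plus raw initializers as an escape hatch) might look as follows. The type names are shortened, hypothetical versions of the ones proposed in this thread, nothing here is GRDB API, and the `Hashable` conformances only illustrate that the compiler can synthesize them.

```swift
/// Hypothetical case-sensitivity option wrapping the raw "case_sensitive" value.
struct CaseSensitiveOption: RawRepresentable, Hashable, ExpressibleByBooleanLiteral {
    var rawValue: Int

    init(rawValue: Int) {
        self.rawValue = rawValue
    }

    /// true maps to case_sensitive=1, false to case_sensitive=0.
    init(booleanLiteral value: Bool) {
        self.init(rawValue: value ? 1 : 0)
    }
}

/// Hypothetical diacritics option wrapping the raw "remove_diacritics" value.
struct DiacriticsOption: RawRepresentable, Hashable {
    var rawValue: Int

    init(rawValue: Int) {
        self.rawValue = rawValue
    }

    // Presets like these are where availability annotations could be attached.
    static let keep = DiacriticsOption(rawValue: 0)
    static let remove = DiacriticsOption(rawValue: 1)
}

let viaLiteral: CaseSensitiveOption = true
let viaPreset = DiacriticsOption.remove
let viaRawValue = DiacriticsOption(rawValue: 2)  // a value with no preset

print(viaLiteral.rawValue, viaPreset.rawValue, viaRawValue.rawValue)  // 1 1 2
```

The raw initializer stays usable even if a preset is marked unavailable, which is the escape hatch mentioned in the thread.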

#else
/// Options for trigram tokenizer character matching. Matches the raw
/// "case_sensitive" and "remove_diacritics" tokenizer arguments.
///
/// Related SQLite documentation: <https://sqlite.org/fts5.html#the_trigram_tokenizer>
@available(iOS 15, macOS 12, tvOS 15, watchOS 8, *) // SQLite 3.35.0+ (3.34 actually)
public enum TrigramTokenizerMatching: Sendable {
    /// Case insensitive matching without removing diacritics. This
    /// option matches the raw "case_sensitive=0 remove_diacritics=0"
    /// tokenizer argument.
    case caseInsensitive
    /// Case insensitive matching that removes diacritics before
    /// matching. This option matches the raw
    /// "case_sensitive=0 remove_diacritics=1" tokenizer argument.
    @available(*, unavailable, message: "Requires a future OS release that includes SQLite >=3.45")
    case caseInsensitiveRemovingDiacritics
    /// Case sensitive matching. Diacritics are not removed when
    /// performing case sensitive matching. This option matches the raw
    /// "case_sensitive=1 remove_diacritics=0" tokenizer argument.
    case caseSensitive
}
#endif

/// Creates an FTS5 module.
///
/// For example:
4 changes: 2 additions & 2 deletions GRDB/FTS/FTS5Tokenizer.swift
@@ -148,11 +148,11 @@ extension FTS5Tokenizer {
private func tokenize(_ string: String, for tokenization: FTS5Tokenization)
throws -> [(token: String, flags: FTS5TokenFlags)]
{
try ContiguousArray(string.utf8).withUnsafeBufferPointer { buffer -> [(String, FTS5TokenFlags)] in
try string.utf8CString.withUnsafeBufferPointer { buffer -> [(String, FTS5TokenFlags)] in
Owner

❤️

guard let addr = buffer.baseAddress else {
return []
}
let pText = UnsafeMutableRawPointer(mutating: addr).assumingMemoryBound(to: CChar.self)
let pText = addr
let nText = CInt(buffer.count)

var context = TokenizeContext()
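
One detail worth keeping in mind with this kind of change, shown here as a standalone sketch rather than as a statement about the tokenizer code: `String.utf8CString` includes the terminating NUL byte, so its buffer holds one more element than `ContiguousArray(string.utf8)`. Whether the extra byte matters depends on how the callee interprets the length it receives.

```swift
let string = "Sequence"

// UTF-8 code units only, without a terminator.
let utf8Count = ContiguousArray(string.utf8).count

// C string representation: the same bytes followed by a NUL terminator.
let cStringCount = string.utf8CString.withUnsafeBufferPointer { buffer in
    buffer.count
}

print(utf8Count, cStringCount)  // 8 9
```

If a callee expects the byte length without the terminator, `buffer.count - 1` would be the value to pass.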
65 changes: 65 additions & 0 deletions GRDB/FTS/FTS5TokenizerDescriptor.swift
@@ -210,5 +210,70 @@ public struct FTS5TokenizerDescriptor: Sendable {
}
return FTS5TokenizerDescriptor(components: components)
}

#if GRDBCUSTOMSQLITE || GRDBCIPHER
/// The "trigram" tokenizer.
///
/// For example:
///
/// ```swift
/// try db.create(virtualTable: "book", using: FTS5()) { t in
///     t.tokenizer = .trigram()
/// }
/// ```
///
/// Related SQLite documentation: <https://sqlite.org/fts5.html#the_trigram_tokenizer>
///
/// - parameters:
///     - matching: By default SQLite will perform case insensitive
///       matching and not remove diacritics before matching.
public static func trigram(
    matching: FTS5.TrigramTokenizerMatching = .caseInsensitive
) -> FTS5TokenizerDescriptor {
    var components = ["trigram"]
    switch matching {
    case .caseInsensitive:
        break
    case .caseInsensitiveRemovingDiacritics:
        components.append(contentsOf: ["remove_diacritics", "1"])
    case .caseSensitive:
        components.append(contentsOf: ["case_sensitive", "1"])
    }

    return FTS5TokenizerDescriptor(components: components)
}
#else
/// The "trigram" tokenizer.
///
/// For example:
///
/// ```swift
/// try db.create(virtualTable: "book", using: FTS5()) { t in
///     t.tokenizer = .trigram()
/// }
/// ```
///
/// Related SQLite documentation: <https://sqlite.org/fts5.html#the_trigram_tokenizer>
///
/// - parameters:
///     - matching: By default SQLite will perform case insensitive
///       matching and not remove diacritics before matching.
@available(iOS 15, macOS 12, tvOS 15, watchOS 8, *) // SQLite 3.35.0+ (3.34 actually)
public static func trigram(
    matching: FTS5.TrigramTokenizerMatching = .caseInsensitive
) -> FTS5TokenizerDescriptor {
    var components = ["trigram"]
    switch matching {
    case .caseInsensitive:
        break
    case .caseInsensitiveRemovingDiacritics:
        components.append(contentsOf: ["remove_diacritics", "1"])
    case .caseSensitive:
        components.append(contentsOf: ["case_sensitive", "1"])
    }

    return FTS5TokenizerDescriptor(components: components)
}
#endif
}
#endif
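
The `components` arrays built in this file end up in the `tokenize` argument of `CREATE VIRTUAL TABLE`. The sketch below shows one encoding that reproduces the SQL strings asserted in `FTS5TableBuilderTests`: each component is quoted as an SQL string literal, the results are joined with spaces, and the whole is quoted again. Both `sqlQuoted(_:)` and `tokenizeArgument(_:)` are hypothetical helpers, and the encoding is an assumption inferred from those test strings, not a copy of GRDB's implementation.

```swift
import Foundation

/// Quotes a string as an SQL string literal, doubling embedded single quotes.
func sqlQuoted(_ string: String) -> String {
    "'" + string.replacingOccurrences(of: "'", with: "''") + "'"
}

/// Assumed encoding: quote each component, join with spaces, quote the result again.
func tokenizeArgument(_ components: [String]) -> String {
    let inner = components.map(sqlQuoted).joined(separator: " ")
    return "tokenize=" + sqlQuoted(inner)
}

print(tokenizeArgument(["trigram"]))
// tokenize='''trigram'''
print(tokenizeArgument(["trigram", "case_sensitive", "1"]))
// tokenize='''trigram'' ''case_sensitive'' ''1'''
```

The double quoting explains the runs of three single quotes: the outer literal doubles every quote character of the inner, already quoted, components.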
82 changes: 82 additions & 0 deletions Tests/GRDBTests/FTS5TableBuilderTests.swift
@@ -166,7 +166,89 @@ class FTS5TableBuilderTests: GRDBTestCase {
assertDidExecute(sql: "CREATE VIRTUAL TABLE \"documents\" USING fts5(content, tokenize='''unicode61'' ''tokenchars'' ''-.''')")
}
}

func testTrigramTokenizer() throws {
    #if GRDBCUSTOMSQLITE || GRDBCIPHER
    guard sqlite3_libversion_number() >= 3034000 else {
        throw XCTSkip("FTS5 trigram tokenizer is not available")
    }
    #else
    guard #available(iOS 15, macOS 12, tvOS 15, watchOS 8, *) else {
        throw XCTSkip("FTS5 trigram tokenizer is not available")
    }
    #endif

    let dbQueue = try makeDatabaseQueue()
    try dbQueue.inDatabase { db in
        try db.create(virtualTable: "documents", using: FTS5()) { t in
            t.tokenizer = .trigram()
            t.column("content")
        }
        assertDidExecute(sql: "CREATE VIRTUAL TABLE \"documents\" USING fts5(content, tokenize='''trigram''')")
    }
}

func testTrigramTokenizerCaseInsensitive() throws {
    #if GRDBCUSTOMSQLITE || GRDBCIPHER
    guard sqlite3_libversion_number() >= 3034000 else {
        throw XCTSkip("FTS5 trigram tokenizer is not available")
    }
    #else
    guard #available(iOS 15, macOS 12, tvOS 15, watchOS 8, *) else {
        throw XCTSkip("FTS5 trigram tokenizer is not available")
    }
    #endif

    let dbQueue = try makeDatabaseQueue()
    try dbQueue.inDatabase { db in
        try db.create(virtualTable: "documents", using: FTS5()) { t in
            t.tokenizer = .trigram(matching: .caseInsensitive)
            t.column("content")
        }
        assertDidExecute(sql: "CREATE VIRTUAL TABLE \"documents\" USING fts5(content, tokenize='''trigram''')")
    }
}

func testTrigramTokenizerCaseSensitive() throws {
    #if GRDBCUSTOMSQLITE || GRDBCIPHER
    guard sqlite3_libversion_number() >= 3034000 else {
        throw XCTSkip("FTS5 trigram tokenizer is not available")
    }
    #else
    guard #available(iOS 15, macOS 12, tvOS 15, watchOS 8, *) else {
        throw XCTSkip("FTS5 trigram tokenizer is not available")
    }
    #endif

    let dbQueue = try makeDatabaseQueue()
    try dbQueue.inDatabase { db in
        try db.create(virtualTable: "documents", using: FTS5()) { t in
            t.tokenizer = .trigram(matching: .caseSensitive)
            t.column("content")
        }
        assertDidExecute(sql: "CREATE VIRTUAL TABLE \"documents\" USING fts5(content, tokenize='''trigram'' ''case_sensitive'' ''1''')")
    }
}

func testTrigramTokenizerCaseInsensitiveRemovingDiacritics() throws {
    #if GRDBCUSTOMSQLITE || GRDBCIPHER
    guard sqlite3_libversion_number() >= 3045000 else {
        throw XCTSkip("FTS5 trigram tokenizer remove_diacritics is not available")
    }

    let dbQueue = try makeDatabaseQueue()
    try dbQueue.inDatabase { db in
        try db.create(virtualTable: "documents", using: FTS5()) { t in
            t.tokenizer = .trigram(matching: .caseInsensitiveRemovingDiacritics)
            t.column("content")
        }
        assertDidExecute(sql: "CREATE VIRTUAL TABLE \"documents\" USING fts5(content, tokenize='''trigram'' ''remove_diacritics'' ''1''')")
    }
    #else
    throw XCTSkip("FTS5 trigram tokenizer remove_diacritics is not available")
    #endif
}

func testColumns() throws {
    let dbQueue = try makeDatabaseQueue()
    try dbQueue.inDatabase { db in