-
-
Notifications
You must be signed in to change notification settings - Fork 722
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for the FTS5 trigram tokenizer #1655
base: development
Are you sure you want to change the base?
Conversation
Hello @Jnosh, thank you very much for this pull request. I intended to review this week-end, but I could not. Please stay tuned. |
Did some minor testing with this since it aligned with testing my own stuff. I defined an FTS5 virtual table with trigram tokenizer using new code here, loaded 105,000 word manuscript, each paragraph its own row, and ran some |
🥳 Thank you @Jnosh!
👍 I like it. In my review I just added a link where some details were missing.
Yes, I have a feedback ;-) See the review below.
OK.
Thanks for modernizing this old code 👍 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much! Please find my comments below.
GRDB/FTS/FTS5.swift
Outdated
public enum TrigramTokenizerMatching: Sendable { | ||
/// Case insensitive matching without removing diacritics. This | ||
/// option matches the raw "case_sensitive=0 remove_diacritics=0" | ||
/// tokenizer argument. | ||
case caseInsensitive | ||
/// Case insensitive matching that removes diacritics before | ||
/// matching. This option matches the raw | ||
/// "case_sensitive=0 remove_diacritics=1" tokenizer argument. | ||
case caseInsensitiveRemovingDiacritics | ||
/// Case sensitive matching. Diacritics are not removed when | ||
/// performing case sensitive matching. This option matches the raw | ||
/// "case_sensitive=1 remove_diacritics=0" tokenizer argument. | ||
case caseSensitive | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have read the SQLite Documentation which says:
remove_diacritics
[...] It may only be set to 1 if thecase_sensitive
options is set to 0 - setting both options to 1 is an error.
Still, I would suggest exposing two distinct options for case_sensitive
and remove_diacritics
.
The reasons for this suggestion are:
-
As time has passed, I have learned that the Swift version of infrequently used SQLite APIs should be as close to the original as possible. It should have the full SQLite smell. This spares the user from translating the SQLite knowledge that was not always easily acquired into a foreign Swift API.
-
SQLite evolves and fixes bugs. Today both options only support 0 and 1, but this may change. It is good to let the user extend the list of options, even if GRDB api lags behind:
t.tokenizer = trigram( caseSensitive: .init(rawValue: 2), diacritics: .init(rawValue: 3))
This kind of SQLite evolution has happened for Unicode61:
remove_diacritics=2
was introduced once a bug was discovered inremove_diacritics=1
.
Some GRDB APIs, such as FTS5.Diacritics
used by Unicode61, do not support such extensibility. This is not an example to follow.
Maybe we could aim at:
// Public usage
t.tokenizer = trigram(
caseSensitive: true,
diacritics: .remove)
API:
public struct FTS5TokenizerDescriptor {
/// The "trigram" tokenizer.
///
/// For example:
///
/// ```swift
/// try db.create(virtualTable: "book", using: FTS5()) { t in
/// t.tokenizer = .trigram()
/// }
/// ```
///
/// Related SQLite documentation: <https://sqlite.org/fts5.html#the_trigram_tokenizer>
///
/// - parameters:
/// - caseSensitive: Unless specified, performs a case insensitive matching.
/// - diacritics: Unless specified, diacritics are not removed before matching.
public static func trigram(
caseSensitive: FTS5.TrigramCaseSensitiveOption? = nil,
diacritics: FTS5.TrigramDiacriticsOption? = nil,
) -> FTS5TokenizerDescriptor { ... }
}
extension FTS5 {
/// Case sensitivity options for the Trigram FTS5 tokenizer.
/// Matches the raw "case_sensitive" tokenizer argument.
///
/// Related SQLite documentation: <https://www.sqlite.org/fts5.html#the_trigram_tokenizer>
public struct TrigramCaseSensitiveOption: RawRepresentable {
var rawValue: Int
}
extension TrigramCaseSensitiveOption: ExpressibleByBooleanLiteral {
/// When true, matches the "case_sensitive=1" trigram tokenizer argument.
/// When false, it is "case_sensitive=0".
public init(booleanLiteral value: Bool) {
self = value ? .init(rawValue: 1) : .init(rawValue: 0)
}
}
/// Diacritics options for the Trigram FTS5 tokenizer.
/// Matches the raw "remove_diacritics" tokenizer argument.
///
/// Related SQLite documentation: <https://www.sqlite.org/fts5.html#the_trigram_tokenizer>
public struct TrigramDiacriticsOption: RawRepresentable {
var rawValue: Int
/// Do not remove diacritics. This option matches the raw
/// "remove_diacritics=0" trigram tokenizer argument.
public static let keep = Self(rawValue: 0)
/// Remove diacritics. This option matches the raw
/// "remove_diacritics=1" trigram tokenizer argument.
public static let remove = Self(rawValue: 1)
}
}
The FTS5.TrigramDiacriticsOption
values (.keep
and .remove
), which require SQLite 3.45, would be unavailable on Apple platforms. FTS5.TrigramDiacriticsOption
can remain fully available.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Re-reading myself with a critic eye, I stick to my initial arguments about the "SQLite smell", but I really start to wonder if TrigramCaseSensitiveOption
and TrigramDiacriticsOption
are that useful.
t.tokenizer = trigram()
t.tokenizer = trigram(caseSensitive: 1)
t.tokenizer = trigram(caseSensitive: 0, removeDiacritics: 1)
public struct FTS5TokenizerDescriptor {
public static func trigram(
caseSensitive: Int? = nil,
removeDiacritics: Int? = nil
) -> FTS5TokenizerDescriptor { ... }
}
This would pass my review like a breeze :-)
Yes, this would make the Swift FTS5 apis not very consistent. Keeping the old existing apis is very important, because all code that users can put in a migration must basically compile forever (modifying a migration is a cardinal sin). This should not prevent new apis from being shaped by the experience that was acquired over the years.
I leave it up to your good taste 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the detailed feedback!
When using the raw Int?
arguments we obviously can't attach any availability information. Do you think it would make sense to note the SQLite >= 3.45 requirement in the documentation for the removeDiacritics
parameter? Or better to just leave that to the SQLite documentation entirely since we can't enforce it anyway?
(In theory we could also have two trigram()
methods as well, one only taking the caseSensitive
parameter and a second one with both parameters with more restrictive availability. But I don't think that makes sense since that approach won't work if SQLite adds more options for the existing parameters so probably better not to try to enforce availability for the parameters at all than try to partially do it.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But I don't think that makes sense since that approach won't work if SQLite adds more options for the existing parameters so probably better not to try to enforce availability for the parameters at all than try to partially do it.
Ah. I see. Yes it wouldn't be great if people could write t.trigram(caseSensitive: 3)
in an app that targets iOS 17. OK, I have been a poor guide here.
So are we back to TrigramCaseSensitiveOption
and TrigramDiacriticsOption
, with built-in values subject to availability checkings, along with raw initializers for people who know what they are doing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that makes sense, that makes it possible to add availability checks while allowing the user to use raw values to bypass restrictions if needed.
A couple of minor questions:
- You've used optional arguments to configure the options when creating the tokenizer as opposed to using a non-optional default value matching the SQLite default. Do I understand correctly that the idea here is for GRDB to respect the SQLite default, even if that default were to change in the future?
public static func trigram(
caseSensitive: FTS5.TrigramCaseSensitiveOption? = nil,
diacritics: FTS5.TrigramDiacriticsOption? = nil,
) -> FTS5TokenizerDescriptor { ... }
-
Do we make
TrigramDiacriticsOption
itself unavailable for non-custom SQLite or only theremove
option? I don't think there is a use case for manually using a raw value on OSs where it isn't supported anyway? -
Thoughts on adding
Equatable
,Hashable
andBitwiseCopyable
conformances forTrigramCaseSensitiveOption
andTrigramDiacriticsOption
; I don't imagine these would see much use but probably no downside to adding them either?
@@ -148,11 +148,11 @@ extension FTS5Tokenizer { | |||
private func tokenize(_ string: String, for tokenization: FTS5Tokenization) | |||
throws -> [(token: String, flags: FTS5TokenFlags)] | |||
{ | |||
try ContiguousArray(string.utf8).withUnsafeBufferPointer { buffer -> [(String, FTS5TokenFlags)] in | |||
try string.utf8CString.withUnsafeBufferPointer { buffer -> [(String, FTS5TokenFlags)] in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️
Co-authored-by: Gwendal Roué <gr@pierlis.com>
…igramDiacriticsOption
@Jnosh I'm very slow to reply because I'm quite busy these days. Please be assured I did not forget this PR. |
No worries! |
I had added support for the newish FTS5 trigram tokenizer in my own project at some point. Upstreaming it here after some polish.
I kept the documentation basic, the trigram tokenizer is a bit different than the regular ones and has some caveats but probably best to leave it to the SQLite documentation to lists the details?
The tokenizer (currently) has two boolean arguments but only 3 combinations are valid so I implemented the options as a single enum in Swift. I hope that is OK? Feedback welcome, especially on the enum and case naming.
I tried to add
GRDBCUSTOMSQLITE
,GRDBCIPHER
conditionals and availability annotations as needed. I would welcome if you had a look if they look sensible.One of the tokenizer options (
remove_diacritics
, e.g.TrigramTokenizerMatching.caseInsensitiveRemovingDiacritics
) is available from SQLite3.45
(according to an SQLite forum post, it's not mentioned in the release notes).That is newer than what Apple currently ships so this option can't currently be used with the system provided SQLite. I was wondering if you have a preferred way of handling that.
For now I added an
@available(*, unavailable)
annotation with a message indicating that a future OS release is required.Alternatively the stdlib approach with an
@available(macOS 9999, ...)
annotation could be used but I feel a custom message is maybe a bit more self-explanatory.Finally we could just comment that option out (for the non-custom build) until Apple ships a release with SQLite
>3.45
.I made a minor adjustment in
FTS5Tokenizer.swift
to create a null-terminated C string when invoking a tokenizer from GRDB to work around a bug with the trigram tokenizer where it doesn't always handle non-null terminated strings correctly.The string contents were also copied before so this shouldn't perform worse.
Pull Request Checklist
development
branch.make smokeTest
terminal command runs without failure.