aux function to score matches higher the earlier in the column they occur
aux function to give different weighting to different columns
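Both of these boil down to scaling a base score. A minimal sketch of the combination, assuming the caller already has the first match offset per column; every name here is illustrative, not apsw API:

```python
def weighted_score(base_score, first_offset, column, column_weights,
                   position_scale=0.1):
    # Boost matches that occur earlier in the column: offset 0 keeps the
    # full score, later offsets decay towards zero (decay rate arbitrary)
    position_boost = 1.0 / (1.0 + position_scale * first_offset)
    # Then apply a per-column weight (eg title counts more than body)
    return base_score * position_boost * column_weights.get(column, 1.0)
```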
bm25 in python to show how to do it
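For reference, a self-contained Okapi BM25 scorer in plain Python - this is the textbook formula, not apsw's implementation, and the parameter names are made up:

```python
import math

def bm25_score(query_terms, doc_tf, doc_len, avg_doc_len, num_docs,
               doc_freq, k1=1.2, b=0.75):
    """Okapi BM25 for one document.

    doc_tf: term -> occurrences in this document
    doc_freq: term -> number of documents containing the term
    """
    score = 0.0
    for term in query_terms:
        n = doc_freq.get(term, 0)
        # IDF with the usual +0.5 smoothing so it never goes negative
        idf = math.log(1 + (num_docs - n + 0.5) / (n + 0.5))
        tf = doc_tf.get(term, 0)
        # Term frequency saturation, normalised by document length
        score += idf * (tf * (k1 + 1)) / (
            tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return score
```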
apsw.fts query builder? Useful for query expansion etc
shell dot command .ftsq
JSON tokenizer
Emoji names synonyms (not doing because they can be multiple words)
Ngram tokenizer to use units of grapheme clusters not codepoints
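A sketch of the idea, leaving the segmentation abstract - `graphemes` stands in for any UAX #29 aware splitter (hypothetical here):

```python
def grapheme_ngrams(text, n, graphemes):
    # Split into grapheme clusters first so an ngram never cuts a
    # combining sequence, flag, or emoji ZWJ sequence in half
    clusters = list(graphemes(text))
    for i in range(len(clusters) - n + 1):
        yield "".join(clusters[i:i + n])
```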
Tokenizer filter that allows injecting tokens directly (in TOKENIZE_QUERY mode recognise in-band signalling and directly return the tokens, otherwise pass upstream; check in fts5table if token injection is supported)
Check .pyi file has tokenize constants
Wrap auxiliary function Fts5ExtensionApi
Implement own aux function
Rename various things in apsw.fts to be better
like shlex.split but using sqlite quoting rules; useful for shell and other contexts
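A sketch of such a splitter under SQLite's quoting rules - a doubled quote character escapes itself inside '...', "..." and `...`; [bracket] quoting and error handling are omitted for brevity:

```python
def sqlite_split(text):
    """Like shlex.split, but honouring SQLite quoting (a sketch)."""
    parts, i, n = [], 0, len(text)
    while i < n:
        while i < n and text[i].isspace():
            i += 1
        if i >= n:
            break
        if text[i] in "'\"`":
            quote, i, chunk = text[i], i + 1, []
            while i < n:
                if text[i] == quote:
                    if i + 1 < n and text[i + 1] == quote:
                        chunk.append(quote)  # doubled quote = literal
                        i += 2
                        continue
                    i += 1
                    break
                chunk.append(text[i])
                i += 1
            parts.append("".join(chunk))
        else:
            start = i
            while i < n and not text[i].isspace():
                i += 1
            parts.append(text[start:i])
    return parts
```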
key terms
IDF in C if Python too slow
grapheme aware highlight
sentence aware snippet
emoji
subsequence matching (eg the input characters are matched in order in a document, with any number of characters allowed between each input character)
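The whole technique is one regex, since each input character just needs to appear in order:

```python
import re

def subsequence_pattern(query):
    # Interleave the escaped input characters with non-greedy gaps
    return re.compile(".*?".join(re.escape(c) for c in query), re.DOTALL)

# "pltn" matches "playstation" because p, l, t, n appear in that order
assert subsequence_pattern("pltn").search("playstation")
```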
doc -m apsw.fts
consider -m apsw.ftstool for all the import tools
allow str or bytes everywhere utf8 is a parameter
Query expansion like in whoosh
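whoosh expands queries from terms found in top-ranked documents; the sketch below shows the simpler synonym-dictionary flavour, producing an FTS5 query string (the synonyms mapping is assumed, not part of apsw):

```python
def expand_query(terms, synonyms):
    # Each term becomes an OR group of itself plus its synonyms
    groups = []
    for term in terms:
        alternatives = [term] + synonyms.get(term, [])
        groups.append("(" + " OR ".join(f'"{a}"' for a in alternatives) + ")")
    return " AND ".join(groups)

# expand_query(["play"], {"play": ["playstation"]})
# -> '("play" OR "playstation")'
```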
"Did you mean?" replacements for mis-typed words like in whoosh
Autocomplete example using ngram and subsequence
Better ngram than builtin?
Performance profile the html tokenizer - .ftsq search zeromalloc on the SQLite doc takes one second to show results, which is way too slow. RESULT: snippet calls the tokenizer twice (once to score sentences, once to highlight). Most time is spent in the stdlib html parser parse_start/end_tag and goahead methods and all the regular expression work they do. Our code is less than 5% of execution time.
Update example code
Check tokendata works (embedded null in the token) and perhaps advise its use? play station as two tokens could be playstation as one token, or vice versa
Type stubs need overload generation to fixup different returns based on parameters for at least tokenizer call
Figure out if, and how, to handle dates. eg a tokenizer in doc mode can colocate various levels of precision, while in query mode it can turn yesterday or last year into tokens matching doc mode
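A sketch of the doc/query symmetry under that scheme - the token format is invented for illustration:

```python
import datetime

def date_tokens(d: datetime.date):
    # Tokens at several precisions; in document mode these would be
    # colocated at the same position so a coarser query still matches
    return [f"{d.year:04d}",
            f"{d.year:04d}-{d.month:02d}",
            f"{d.year:04d}-{d.month:02d}-{d.day:02d}"]

def query_tokens(phrase, today):
    # Map relative phrases into the same token space
    if phrase == "yesterday":
        return date_tokens(today - datetime.timedelta(days=1))[-1:]
    if phrase == "last year":
        return [f"{today.year - 1:04d}"]
    return []
```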
Check for ::TODO::s
Change category mask in _unicodedb to use 64 bit so we get one bit per category
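There are 30 Unicode general categories, so one bit each fits comfortably in 64 bits. A sketch of the layout (the ordering is arbitrary, not what _unicodedb actually uses):

```python
# One bit per general category so masks can be combined with | and tested with &
CATEGORIES = ["Lu", "Ll", "Lt", "Lm", "Lo", "Mn", "Mc", "Me",
              "Nd", "Nl", "No", "Pc", "Pd", "Ps", "Pe", "Pi", "Pf", "Po",
              "Sm", "Sc", "Sk", "So", "Zs", "Zl", "Zp",
              "Cc", "Cf", "Cs", "Co", "Cn"]
CATEGORY_BIT = {cat: 1 << i for i, cat in enumerate(CATEGORIES)}  # 30 bits used
```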
Update the Changes doc about the new out of scope exception
Possible to highlight .ftsq matches in the shell - using snippet with colour codes fails because they get quoted. Perhaps a private use char to mark the begin and end of each highlight that the output modes understand?
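A sketch of that idea, using two Private Use Area codepoints chosen arbitrarily:

```python
# Hypothetical in-band markers the output modes would understand
HI_START, HI_END = "\ue000", "\ue001"

def ansi_highlights(text):
    # Swap the markers for ANSI colour codes at display time, after any
    # quoting/escaping has already happened, so the codes survive intact
    return text.replace(HI_START, "\x1b[1;31m").replace(HI_END, "\x1b[0m")
```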
ShowResourceUsage should also get sqlite3_db_status fields and show changes
Tokens should be bytes not str - no, they should be str
content table as view
Check xapian doc for features and examples
Check and update typing. Generator should be the return annotation on yielding functions
Consider adding codepoint names to apsw.unicode - need an effective "compression" mechanism
fts5-locale branch
GIL release in all the places?
Move ftstest.py into tests.py
Remove makefile ftscoverage rule
Remove "utf8" parameter from all encode and decode calls as it is the default
line snippet function using width 60 lines, showing N lines before and after each matching phrase
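A sketch using textwrap, with a match predicate per wrapped line standing in for real phrase offsets:

```python
import textwrap

def line_snippet(text, is_match, width=60, context=1):
    # Wrap to fixed-width lines, then keep each matching line plus
    # `context` lines either side of it
    lines = textwrap.wrap(text, width)
    keep = set()
    for i, line in enumerate(lines):
        if is_match(line):
            keep.update(range(max(0, i - context),
                              min(len(lines), i + context + 1)))
    return [lines[i] for i in sorted(keep)]
```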
Work with all the CPython versions
Update example output in doc of running tests since so many more are run now
Make all this possible. Especially useful for ranking functions and synonyms.
By far the biggest difficulty is dealing with utf8 byte offsets in the tokenizer instead of codepoints.
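One workable approach is a precomputed offset table, so tokenizers can think in codepoints and translate at the edges:

```python
def codepoint_to_byte_offsets(text):
    # offsets[i] is the UTF-8 byte offset of codepoint i, letting a
    # codepoint-based tokenizer report the byte offsets FTS5 expects
    offsets = [0]
    for ch in text:
        offsets.append(offsets[-1] + len(ch.encode("utf-8")))
    return offsets
```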
tokenize='html stoken unicode61 tokenchars _'