aux function to score matches higher the earlier in the column they occur
aux function to give different weighting to different columns
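Both of these boil down to scaling a base score. A minimal sketch of the combination, assuming the caller already has the first match offset per column; every name here is illustrative, not apsw API:

```python
def weighted_score(base_score, first_offset, column, column_weights,
                   position_scale=0.1):
    # Boost matches that occur earlier in the column: offset 0 keeps the
    # full score, later offsets decay towards zero (decay rate arbitrary)
    position_boost = 1.0 / (1.0 + position_scale * first_offset)
    # Then apply a per-column weight (eg title counts more than body)
    return base_score * position_boost * column_weights.get(column, 1.0)
```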
bm25 in python to show how to do it
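For reference, a self-contained Okapi BM25 scorer in plain Python - this is the textbook formula, not apsw's implementation, and the parameter names are made up:

```python
import math

def bm25_score(query_terms, doc_tf, doc_len, avg_doc_len, num_docs,
               doc_freq, k1=1.2, b=0.75):
    """Okapi BM25 for one document.

    doc_tf: term -> occurrences in this document
    doc_freq: term -> number of documents containing the term
    """
    score = 0.0
    for term in query_terms:
        n = doc_freq.get(term, 0)
        # IDF with the usual +0.5 smoothing so it never goes negative
        idf = math.log(1 + (num_docs - n + 0.5) / (n + 0.5))
        tf = doc_tf.get(term, 0)
        # Term frequency saturation, normalised by document length
        score += idf * (tf * (k1 + 1)) / (
            tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return score
```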
apsw.fts query builder? Useful for query expansion etc
shell dot command .ftsq
JSON tokenizer
Emoji names synonyms (not doing because they can be multiple words)
Ngram tokenizer to use units of grapheme clusters not codepoints
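A sketch of the idea, leaving the segmentation abstract - `graphemes` stands in for any UAX #29 aware splitter (hypothetical here):

```python
def grapheme_ngrams(text, n, graphemes):
    # Split into grapheme clusters first so an ngram never cuts a
    # combining sequence, flag, or emoji ZWJ sequence in half
    clusters = list(graphemes(text))
    for i in range(len(clusters) - n + 1):
        yield "".join(clusters[i:i + n])
```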
Tokenizer filter that allows injecting tokens directly (in TOKENIZE_QUERY mode recognise in-band signalling and directly return the tokens, otherwise pass upstream; check in fts5table if token injection is supported)
Check .pyi file has tokenize constants
Wrap auxiliary function Fts5ExtensionApi
Implement own aux function
Rename various things in apsw.fts to be better
like shlex.split but using sqlite quoting rules; useful for shell and other contexts
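A sketch of such a splitter under SQLite's quoting rules - a doubled quote character escapes itself inside '...', "..." and `...`; [bracket] quoting and error handling are omitted for brevity:

```python
def sqlite_split(text):
    """Like shlex.split, but honouring SQLite quoting (a sketch)."""
    parts, i, n = [], 0, len(text)
    while i < n:
        while i < n and text[i].isspace():
            i += 1
        if i >= n:
            break
        if text[i] in "'\"`":
            quote, i, chunk = text[i], i + 1, []
            while i < n:
                if text[i] == quote:
                    if i + 1 < n and text[i + 1] == quote:
                        chunk.append(quote)  # doubled quote = literal
                        i += 2
                        continue
                    i += 1
                    break
                chunk.append(text[i])
                i += 1
            parts.append("".join(chunk))
        else:
            start = i
            while i < n and not text[i].isspace():
                i += 1
            parts.append(text[start:i])
    return parts
```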
key terms
IDF in C if Python too slow
grapheme aware highlight
sentence aware snippet
emoji
subsequence matching (eg the input characters are matched in order in a document, with any number of characters allowed between each input character)
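The whole technique is one regex, since each input character just needs to appear in order:

```python
import re

def subsequence_pattern(query):
    # Interleave the escaped input characters with non-greedy gaps
    return re.compile(".*?".join(re.escape(c) for c in query), re.DOTALL)

# "pltn" matches "playstation" because p, l, t, n appear in that order
assert subsequence_pattern("pltn").search("playstation")
```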
doc -m apsw.fts
consider -m apsw.ftstool for all the import tools
allow str or bytes everywhere utf8 is a parameter
Query expansion like in whoosh
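whoosh expands queries from terms found in top-ranked documents; the sketch below shows the simpler synonym-dictionary flavour, producing an FTS5 query string (the synonyms mapping is assumed, not part of apsw):

```python
def expand_query(terms, synonyms):
    # Each term becomes an OR group of itself plus its synonyms
    groups = []
    for term in terms:
        alternatives = [term] + synonyms.get(term, [])
        groups.append("(" + " OR ".join(f'"{a}"' for a in alternatives) + ")")
    return " AND ".join(groups)

# expand_query(["play"], {"play": ["playstation"]})
# -> '("play" OR "playstation")'
```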
"Did you mean?" replacements for mis-typed words like in whoosh
Autocomplete example using ngram and subsequence
Better ngram than builtin?
Performance profile the html tokenizer - .ftsq search zeromalloc on the SQLite doc takes one second to show results, which is way too slow. RESULT: snippet calls the tokenizer twice (once to score sentences, once to highlight). Most time is spent in the stdlib html parser parse_start/end_tag and goahead methods and all the regular expression work they do. Our code is less than 5% of execution time.
Update example code
Check tokendata works (embedded null in the token) and perhaps advise its use? play station as two tokens could be playstation as one token, or vice versa
Type stubs need overload generation to fixup different returns based on parameters for at least tokenizer call
Figure out if, and how, to handle dates. eg a tokenizer in doc mode can colocate various levels of precision, while in query mode it can turn yesterday or last year into tokens matching doc mode
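A sketch of the doc/query symmetry under that scheme - the token format is invented for illustration:

```python
import datetime

def date_tokens(d: datetime.date):
    # Tokens at several precisions; in document mode these would be
    # colocated at the same position so a coarser query still matches
    return [f"{d.year:04d}",
            f"{d.year:04d}-{d.month:02d}",
            f"{d.year:04d}-{d.month:02d}-{d.day:02d}"]

def query_tokens(phrase, today):
    # Map relative phrases into the same token space
    if phrase == "yesterday":
        return date_tokens(today - datetime.timedelta(days=1))[-1:]
    if phrase == "last year":
        return [f"{today.year - 1:04d}"]
    return []
```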
Check for ::TODO::s
Change category mask in _unicodedb to use 64 bit so we get one bit per category
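There are 30 Unicode general categories, so one bit each fits comfortably in 64 bits. A sketch of the layout (the ordering is arbitrary, not what _unicodedb actually uses):

```python
# One bit per general category so masks can be combined with | and tested with &
CATEGORIES = ["Lu", "Ll", "Lt", "Lm", "Lo", "Mn", "Mc", "Me",
              "Nd", "Nl", "No", "Pc", "Pd", "Ps", "Pe", "Pi", "Pf", "Po",
              "Sm", "Sc", "Sk", "So", "Zs", "Zl", "Zp",
              "Cc", "Cf", "Cs", "Co", "Cn"]
CATEGORY_BIT = {cat: 1 << i for i, cat in enumerate(CATEGORIES)}  # 30 bits used
```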
Update the Changes doc about the new out of scope exception
Possible to highlight .ftsq matches in the shell - using snippet with colour codes fails because they get quoted. Perhaps a private use char to mark the begin and end of each highlight that the output modes understand?
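A sketch of that idea, using two Private Use Area codepoints chosen arbitrarily:

```python
# Hypothetical in-band markers the output modes would understand
HI_START, HI_END = "\ue000", "\ue001"

def ansi_highlights(text):
    # Swap the markers for ANSI colour codes at display time, after any
    # quoting/escaping has already happened, so the codes survive intact
    return text.replace(HI_START, "\x1b[1;31m").replace(HI_END, "\x1b[0m")
```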
ShowResourceUsage should also get sqlite3_db_status fields and show changes
Tokens should be bytes not str - no, they should be str
content table as view
Check xapian doc for features and examples
Check and update typing. Generator should be the return annotation on yielding functions
Consider adding codepoint names to apsw.unicode - need an effective "compression" mechanism
fts5-locale branch
GIL release in all the places?
Move ftstest.py into tests.py
Remove makefile ftscoverage rule
Remove "utf8" parameter from all encode and decode calls as it is the default
line snippet function using width 60 lines, showing N lines before and after each matching phrase
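A sketch using textwrap, with a match predicate per wrapped line standing in for real phrase offsets:

```python
import textwrap

def line_snippet(text, is_match, width=60, context=1):
    # Wrap to fixed-width lines, then keep each matching line plus
    # `context` lines either side of it
    lines = textwrap.wrap(text, width)
    keep = set()
    for i, line in enumerate(lines):
        if is_match(line):
            keep.update(range(max(0, i - context),
                              min(len(lines), i + context + 1)))
    return [lines[i] for i in sorted(keep)]
```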
Work with all the CPython versions
Update example output in doc of running tests since so many more are run now
Make all this possible. Especially useful for ranking functions and synonyms.
By far the biggest difficulty is dealing with utf8 byte offsets in the tokenizer instead of codepoints.
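One workable approach is a precomputed offset table, so tokenizers can think in codepoints and translate at the edges:

```python
def codepoint_to_byte_offsets(text):
    # offsets[i] is the UTF-8 byte offset of codepoint i, letting a
    # codepoint-based tokenizer report the byte offsets FTS5 expects
    offsets = [0]
    for ch in text:
        offsets.append(offsets[-1] + len(ch.encode("utf-8")))
    return offsets
```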
tokenize='html stoken unicode61 tokenchars _'