A small full text indexing and search tool focusing on speed and space. Initial tests seem to indicate that the database takes about twice as much space as the files it indexes.
Microfts implements a trigram GIN (generalized inverted index), relying on LMDB for storage, an open source, embedded, NOSQL, key-value store library (so it’s linked into microfts, not an external service). It uses AskAlexSharov’s fork of bmatsuo’s lmdb-go package to connect to it.
Microfts is MIT licensed, (c) 2020 Bill Burdick. All rights reserved.
Note that building may generate warning messages from lmdb-go’s compilation of the LMDB C code.
go build -o microfts
./microfts create /tmp/bubba
This adds /tmp/tst to the database in /tmp/bubba
rm -rf /tmp/bubba
./microfts create /tmp/bubba
cat > /tmp/tst <<here
one
two three
four
four five
one two three
one three two
here
./microfts input -file /tmp/bubba /tmp/tst
./microfts info /tmp/bubba
./microfts search /tmp/bubba "one two"
./microfts delete /tmp/bubba /tmp/tst
./microfts compact /tmp/bubba
./microfts grams "this is a test"
./microfts grams -gx "this is a test"
./microfts search -candidates -grams /tmp/bubba thi tes est
- misc error
- file is missing
- file has changed
- file is unreadable
- no entry for file in database
- database missing
Usage: microfts info -groups DB print information about each group in the database, whether it is missing or changed whether it is an org-mode entry microfts info [-chunks] DB GROUP print info for a GROUP -chunks also prints the chunks in GROUP if it has a corresponding file microfts info [-grams] DB print info for database displays any groups which do not exist as files displays any groups which refer to files that have changed -grams displays distribution information about the trigram index microfts create [-s GRAMSIZE] DB create DATABASE if it does not exist microfts chunk [-nx | -data D | -dx] -d DELIM DB GROUP GRAMS microfts chunk [-nx | -data D | -dx] -gx DB GROUP GRAMS ADD a chunk to GROUP with GRAMS. -d means use DELIM to split GRAMS. -gx means GRAMS is hex encoded with two bytes for each gram using base 37. microfts grams [-gx] CHUNK output grams for CHUNK microfts input [-nx | -dx | -org] DB FILE... For each FILE, create a group with its name and add a CHUNK for each chunk of input. Chunk data is the line number, offset, and length for each chunk (starting at 1). -org means chunks are org elements, otherwise chunks are lines microfts delete [-nx] DB GROUP delete GROUP, its chunks, and tag entries. NOTE: THIS DOES NOT RECLAIM SPACE! USE COMPACT FOR THAT microfts compact DB Reclaim space for deleted groups microfts search [-n | -partial | -f | - limit N | -filter REGEXP | -u] DB TEXT query with TEXT for objects -f force search to skip changed and missing files instead of exiting -filter makes search only return chunks that match the REGEXP REGEXP syntax is here: https://golang.org/pkg/regexp/syntax/ microfts search -candidates [-grams | -gx | -gd | -n | -f | -limit N | -dx | -u] DB TERM1 ... dispay all candidates with the grams for TERMS without filtering -grams indicates TERMS are grams, otherwise extract grams from TERMS -gx: grams are in hex, -gd: grams are in decimal, otherwise they are 3-char strings microfts data [-nx | -dx] DB GROUP get data for each doc in GROUP microfts update [-t] DB reinput files that have changed delete files that have been removed -t means do a test run, printing what would have happened microfts empty DB GROUP... Create empty GROUPs, ignoring existing ones microfts is targeted for groups of small documents, like lines in a file. -candidates return docs with grams for search -chunks info DB GROUP: display all of a group's chunks -comp string compression type to use when creating a database -d string delimiter for unicode tags (default ",") -data string data to define for object -dx use hex instead of unicode for object data -end-format string search: Go format string for the end of a group Arg to printf is the FILE The default value is "" if -sexp is provided and -end-format is not, the default is "\n" Not used with search -fuzzy -sort -f search: skip changed and missing files instead of exiting -file search: display files rather than chunks -filter string search: filter results that match REGEXP -format string search: Go format string for each result Args to printf are FILE POSITION LINE OFFSET PERCENTAGE CHUNK FILE (string) is the name of the file POSITION (int) is the 1-based character position of the chunk in the file LINE (int) is the 1-based line of the chunk in the file OFFSET (int) is the 0-based offset of the first match in the chunk PERCENTAGE (float) is the percentage of a fuzzy match Note that you can place [ARGNUM] after the % to pick a particular arg to format The default format is %s:%[2]s:%[5]s\n -sexp sets format to (:filename "%s" :line %[3]d :offset %[4]d :text "%[6]s" :percent %[5]f) Note that this will cause all matches to be on one (potentially large) line of output (default "%[6]s:%[2]d:%[5]s\n") -fuzzy float search: specify a percentage fuzzy match -gd use decimal instead of unicode for grams -grams get: specify tags for intead of text info: print gram coverage search: specify grams instead of search terms -groups info: display information for each group -gx use hex instead of unicode for grams -limit int search: limit the number of results (default 9223372036854775807) -n only print line numbers for search -org index org-mode chunks instead of lines -partial search: allow partial matches in search -prof profile cpu -s int gram size -sep print candidates on separate lines -sexp search: output matches as an s-expression ((FILE (POS LINE OFFSET chunk) ... ) ... ) POS is the 1-based character position of the chunk in the file LINE is the 1-based line of the chunk in the file OFFSET is the 0-based offset of the first match in the chunk -sort search -fuzzy: sort all matches This ignores start-format and end-format because it sorts all matches, regardless of which file they come from. -start-format string search: Go format string for the start of a group Arg to printf is the FILE The default value is "" Not used with search -fuzzy -sort -t update: do a test run, printing what would have happened -u search: update the database before searching -v verbose
Only alphanumeric characters are represented faithfully in grams, other characters are considered whitespace and display as ‘.’. This makes a base-37 triple (0-9 and A-Z), which just fits into 2 bytes. Which is a big deal, spacewise. Grams for starts of words begin with two whitespaces and ends of words end with one whitespace. There are no grams that end with two whitespaces.
The index consists of grams for chunks that belong to groups. Groups have names and the default is to use file names as group names.
Microfts supports using file names as groups and splitting files into chunks either by line or by org-mode element, with the chunk data being a triple of line, offset, chunk-length. Searching finds candidate chunks by intersecting gram entries and then consults the files named by the groups for the actual content.
If this is not sufficient, the command also supports custom usage: you can add chunks to a group, specifying data and grams. Searching can return candidate chunks for a set of grams.
7 bits | 0 - 127 | 0xxxxxxx |
12 bits | 128 - 4095 | 1000xxxx X |
20 bits | 4096 - 1048575 | 1001xxxx X X |
28 bits | 1048576 - 268435455 | 1010xxxx X X X |
36 bits | 268435456 - 68719476735 | 1011xxxx X X X X |
44 bits | 68719476736 - 17592186044415 | 1100xxxx X X X X X |
52 bits | 17592186044416 - 4503599627370495 | 1101xxxx X X X X X X |
60 bits | 4503599627370496 - 1152921504606846975 | 1110xxxx X X X X X X X |
64 bits | 1152921504606846976 - 18446744073709551615 | 1111---- X X X X X X X X |
GRAM is a 2-byte value
OID LIST |
---|
9 lists of oids: [9][]byte.
Note – this is probably too ornate and a simple byte array and a count might have the same performance and space.
# 1-byte OIDS |
# 2-byte OIDS |
# 3-byte OIDS |
# 4-byte OIDS |
# 5-byte OIDS |
# 6-byte OIDS |
# 7-byte OIDS |
# 8-byte OIDS |
# 9-byte OIDS |
OIDS |
---|
next unused oid |
next unused gid |
free oids |
free gids |
---|
OIDS are compressed integers
GID |
data (e.g. line number) |
gram count |
---|
GIDS are compressed integers
NAME |
oid count |
last changed timestamp |
validity (valid = 0, deleted = 1) |
org flag (whether -org was used) |
---|