Add fuzzy-matching for autocomplete #601

tconbeer · 2024-07-11T20:55:03Z

ESPECIALLY for member autocomplete (fuzzy-matching table and column names, etc.).

Dotrar · 2024-08-14T04:03:18Z

Hey @tconbeer - this program is awesome! thanks for making it!

I'm really keen to have fuzzy-matching, i often don't know what column names are in my database, so i rely on fuzzy matching (in tools like dbeaver) to get them for me.

Would it be as simple as changing the line here for what constitutes a matched completion option?
https://github.com/tconbeer/harlequin/blob/main/src/harlequin/autocomplete/completers.py#L52

I'd imagine, for fuzzy matching, everything would have a "score" which you could threshold.. or maybe, you'd just sort all options by the score (so that the list will be everything but the most likely options are near the top)

This could be quite slow in straight python... do you have any rules / guidelines for adding in C libs into the application? FZF comes to mind.

This might turn out to be a simple change, that I can have a look at doing if you would like.. I'm not sure what the difference between word and member completion classes are though 😅

tconbeer · 2024-08-14T14:48:45Z

Yes, you're on the right track there. To answer your questions:

The WordCompleter is called when you start typing any word character in the editor. It matches any keyword, db object, etc. The MemberCompleter is called when you type . (and any subsequent word characters); it returns matches that are members of a certain parent context (e.g., column names if the parent is a table). Both would likely benefit from fuzzy-matching, possibly with some different config. (There is also a PathCompleter but that's probably out of scope).
I agree that speed is important here. I'm fine adding c-extensions if they distribute wheels for most platforms. (e.g., I briefly looked at using a Trie for this, and marisa-trie was a candidate, since they distribute wheels for everything). If they don't have wheels, I'd want to add them as an optional dependency to make installing Harlequin as easy as possible for most people. RapidFuzz looks very promising. That said, I wouldn't rule out a pure-python approach; what stands today was my naive implementation that I assumed would be too slow but wasn't. A lot of string operations, regex, etc., are highly optimized by the stdlib already.
Textual already ships with a regex-based fuzzy matcher (source) that powers the command palette. I don't know if this is fast enough or provides the features we want but it could be a good starting point. If it's not fast enough, one basic optimization would be to replace their regex pattern with something that bounds the number of missing characters:

        self._query_regex = compile(
            ".{0,5}?".join(f"({escape(character)})" for character in query),
            flags=0 if case_sensitive else IGNORECASE,
        )

jspaezp · 2024-09-11T00:33:26Z

Hey there!

I crunched some of the numbers (extracted the values from WordCompleter.completions)
a local duckdb I had at hand (1412 options). Made a simpler version of the current implementation (that does not handle completion -> label conversion, and uses strings as inputs, instead of completions).

And made 2 simple options for the fuzzy find, one that is pretty much the implementation you posted (2 characters instead of 5 as max intermediates) and one that also adds .* at the begining and end of the expression [I didn't add a score to them for now...].

So for "sele", the expression would be:

unbound -> ^.*(s).{0,2}?(e).{0,2}?(l).{0,2}?(e).*$
bound -> ^.{0,2}?(s).{0,2}?(e).{0,2}?(l).{0,2}?(e).*$

For the word "select":
-> reference: "" takes 130 us (microseconds), "s" takes 80, anything else takes ("se" ..) ~ 76 ...
-> non-bound regex takes 230 us, 260 and 290 respectively.
-> binding the regex takes 231, 163 and 150 respectively.

if lets say ... I want to find the column "europe_sales" and I query "sale"
-> Reference would not find it (but take ~80us)
-> non-bound would take 240 us
-> bound not find it either (and would take 127 us)

sooo ... I think none of these cases is slow beyond being usable but there might be tradeoffs depending on what you want out of them.

LMK if you want me to write the PR and if you have a preference on the implementation!

[I also thought about potential 'hybrid' options ... like using the regex only if there are no suggestions on an exact match ... which might be a happy medium where you pay 100us for mis-typing the first character]

functions code used for the test:

def deduple_str(completions: list[str]):
    uniques: set[str] = set()
    uniques_add = uniques.add
    deduped = [m for m in completions if not (m in uniques or uniques_add(m))]
    return deduped


def refenrence_impl_str(prefix: str, completions: tuple[str, ...]):
    match_val = prefix.lower()

    exact_matches = [c for c in completions if c == match_val]
    matches = [c for c in completions if c.startswith(match_val)]
    return deduple_str(exact_matches + matches)


def alt_impl_str_1(prefix: str, completions: tuple[str, ...]):
    match_val = prefix.lower()
    regex_base = ".{0,2}?".join(f"({re.escape(c)})" for c in match_val)
    regex = "^.*" + regex_base + ".*$" ######## <------------- This is the diff line
    match_regex = re.compile(regex, re.IGNORECASE)

    exact_matches = [c for c in completions if c == match_val]
    matches = [c for c in completions if match_regex.match(c)]
    return deduple_str(exact_matches + matches)


def alt_impl_str_2(prefix: str, completions: tuple[str, ...]):
    match_val = prefix.lower()
    regex_base = ".{0,2}?".join(f"({re.escape(c)})" for c in match_val)
    regex = "^.{0,2}?" + regex_base + ".*$"  ######## <------------- This is the diff line
    match_regex = re.compile(regex, re.IGNORECASE)

    exact_matches = [c for c in completions if c == match_val]
    matches = [c for c in completions if match_regex.match(c)]
    return deduple_str(exact_matches + matches)

tconbeer added the enhancement New feature or request label Jul 11, 2024

jspaezp linked a pull request Oct 23, 2024 that will close this issue

feat: fuzzy matching completions #671

Open

8 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add fuzzy-matching for autocomplete #601

Add fuzzy-matching for autocomplete #601

tconbeer commented Jul 11, 2024

Dotrar commented Aug 14, 2024

tconbeer commented Aug 14, 2024 •

edited

Loading

jspaezp commented Sep 11, 2024

Add fuzzy-matching for autocomplete #601

Add fuzzy-matching for autocomplete #601

Comments

tconbeer commented Jul 11, 2024

Dotrar commented Aug 14, 2024

tconbeer commented Aug 14, 2024 • edited Loading

jspaezp commented Sep 11, 2024

tconbeer commented Aug 14, 2024 •

edited

Loading