Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add fuzzy-matching for autocomplete #601

Open
tconbeer opened this issue Jul 11, 2024 · 3 comments · May be fixed by #671
Open

Add fuzzy-matching for autocomplete #601

tconbeer opened this issue Jul 11, 2024 · 3 comments · May be fixed by #671
Labels
enhancement New feature or request

Comments

@tconbeer
Copy link
Owner

ESPECIALLY for member autocomplete (fuzzy-matching table and column names, etc.).

@tconbeer tconbeer added the enhancement New feature or request label Jul 11, 2024
@Dotrar
Copy link

Dotrar commented Aug 14, 2024

Hey @tconbeer - this program is awesome! thanks for making it!

I'm really keen to have fuzzy-matching, i often don't know what column names are in my database, so i rely on fuzzy matching (in tools like dbeaver) to get them for me.

Would it be as simple as changing the line here for what constitutes a matched completion option?
https://github.com/tconbeer/harlequin/blob/main/src/harlequin/autocomplete/completers.py#L52

I'd imagine, for fuzzy matching, everything would have a "score" which you could threshold.. or maybe, you'd just sort all options by the score (so that the list will be everything but the most likely options are near the top)

This could be quite slow in straight python... do you have any rules / guidelines for adding in C libs into the application? FZF comes to mind.

This might turn out to be a simple change, that I can have a look at doing if you would like.. I'm not sure what the difference between word and member completion classes are though 😅

@tconbeer
Copy link
Owner Author

tconbeer commented Aug 14, 2024

Yes, you're on the right track there. To answer your questions:

  1. The WordCompleter is called when you start typing any word character in the editor. It matches any keyword, db object, etc. The MemberCompleter is called when you type . (and any subsequent word characters); it returns matches that are members of a certain parent context (e.g., column names if the parent is a table). Both would likely benefit from fuzzy-matching, possibly with some different config. (There is also a PathCompleter but that's probably out of scope).
  2. I agree that speed is important here. I'm fine adding c-extensions if they distribute wheels for most platforms. (e.g., I briefly looked at using a Trie for this, and marisa-trie was a candidate, since they distribute wheels for everything). If they don't have wheels, I'd want to add them as an optional dependency to make installing Harlequin as easy as possible for most people. RapidFuzz looks very promising. That said, I wouldn't rule out a pure-python approach; what stands today was my naive implementation that I assumed would be too slow but wasn't. A lot of string operations, regex, etc., are highly optimized by the stdlib already.
  3. Textual already ships with a regex-based fuzzy matcher (source) that powers the command palette. I don't know if this is fast enough or provides the features we want but it could be a good starting point. If it's not fast enough, one basic optimization would be to replace their regex pattern with something that bounds the number of missing characters:
        self._query_regex = compile(
            ".{0,5}?".join(f"({escape(character)})" for character in query),
            flags=0 if case_sensitive else IGNORECASE,
        )

@jspaezp
Copy link

jspaezp commented Sep 11, 2024

Hey there!

I crunched some of the numbers (extracted the values from WordCompleter.completions)
a local duckdb I had at hand (1412 options). Made a simpler version of the current implementation (that does not handle completion -> label conversion, and uses strings as inputs, instead of completions).

And made 2 simple options for the fuzzy find, one that is pretty much the implementation you posted (2 characters instead of 5 as max intermediates) and one that also adds .* at the begining and end of the expression [I didn't add a score to them for now...].

So for "sele", the expression would be:

  • unbound -> ^.*(s).{0,2}?(e).{0,2}?(l).{0,2}?(e).*$
  • bound -> ^.{0,2}?(s).{0,2}?(e).{0,2}?(l).{0,2}?(e).*$

For the word "select":
-> reference: "" takes 130 us (microseconds), "s" takes 80, anything else takes ("se" ..) ~ 76 ...
-> non-bound regex takes 230 us, 260 and 290 respectively.
-> binding the regex takes 231, 163 and 150 respectively.

if lets say ... I want to find the column "europe_sales" and I query "sale"
-> Reference would not find it (but take ~80us)
-> non-bound would take 240 us
-> bound not find it either (and would take 127 us)

sooo ... I think none of these cases is slow beyond being usable but there might be tradeoffs depending on what you want out of them.

LMK if you want me to write the PR and if you have a preference on the implementation!

[I also thought about potential 'hybrid' options ... like using the regex only if there are no suggestions on an exact match ... which might be a happy medium where you pay 100us for mis-typing the first character]

functions code used for the test:

def deduple_str(completions: list[str]):
    uniques: set[str] = set()
    uniques_add = uniques.add
    deduped = [m for m in completions if not (m in uniques or uniques_add(m))]
    return deduped


def refenrence_impl_str(prefix: str, completions: tuple[str, ...]):
    match_val = prefix.lower()

    exact_matches = [c for c in completions if c == match_val]
    matches = [c for c in completions if c.startswith(match_val)]
    return deduple_str(exact_matches + matches)


def alt_impl_str_1(prefix: str, completions: tuple[str, ...]):
    match_val = prefix.lower()
    regex_base = ".{0,2}?".join(f"({re.escape(c)})" for c in match_val)
    regex = "^.*" + regex_base + ".*$" ######## <------------- This is the diff line
    match_regex = re.compile(regex, re.IGNORECASE)

    exact_matches = [c for c in completions if c == match_val]
    matches = [c for c in completions if match_regex.match(c)]
    return deduple_str(exact_matches + matches)


def alt_impl_str_2(prefix: str, completions: tuple[str, ...]):
    match_val = prefix.lower()
    regex_base = ".{0,2}?".join(f"({re.escape(c)})" for c in match_val)
    regex = "^.{0,2}?" + regex_base + ".*$"  ######## <------------- This is the diff line
    match_regex = re.compile(regex, re.IGNORECASE)

    exact_matches = [c for c in completions if c == match_val]
    matches = [c for c in completions if match_regex.match(c)]
    return deduple_str(exact_matches + matches)

@jspaezp jspaezp linked a pull request Oct 23, 2024 that will close this issue
8 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants