-
-
Notifications
You must be signed in to change notification settings - Fork 389
Speed up fuzzy search #2639
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Speed up fuzzy search #2639
Conversation
Thanks @Bodigrim (and sorry to have dragged you into another project :) ) I have a benchmark in the Sigma codebase for this completions fuzzy search, but I will need a few days to find the time. Probably over the weekend. |
I've probably broken something, but it's still surprising that 9.0.1 Ubuntu and 9.2.1 Mac pass, while 9.2.1 Ubuntu fails. |
So helpful! Cannot wait for GHC 9.4 with stack traces for |
@pepeiborra this is finally ready for review. |
Benchmarked using the completions experiment of the ghcide bench suite on the Sigma codebase. The results show that the new code allocates much less and is up to 2X as fast. Nice job! BEFORE
AFTER
|
go !totalScore !currScore !currPOff !currSOff | ||
-- If pattern has been matched in full | ||
| currPOff >= pTotal | ||
= Just totalScore | ||
-- If there is not enough left to match the rest of the pattern, equivalent to | ||
-- (sOff + sLen - currSOff) < (pOff + pLen - currPOff) | ||
| currSOff > currPOff + sDelta | ||
= Nothing | ||
-- This is slightly broken for non-ASCII: | ||
-- 1. If code units, consisting a single pattern code point, are found as parts | ||
-- of different code points, it counts as a match. Unless you use a ton of emojis | ||
-- as identifiers, such false positives should not be be a big deal, | ||
-- and anyways HLS does not currently support such use cases, because it uses | ||
-- code point and UTF-16 code unit positions interchangeably. | ||
-- 2. Case conversions is not applied to non-ASCII code points, because one has | ||
-- to call T.toLower (not T.map toLower), reallocating the string in full, which | ||
-- is too much of performance penalty for fuzzy search. Again, anyway HLS does not | ||
-- attempt to do justice to Unicode: proper Unicode text matching requires | ||
-- `unicode-transforms` and friends. | ||
-- Altogether we sacrifice correctness for the sake of performance, which | ||
-- is a right trade-off for fuzzy search. | ||
| pByte <- TA.unsafeIndex pArr currPOff | ||
, sByte <- TA.unsafeIndex sArr currSOff | ||
-- First byte (currPOff == pOff) should match exactly, otherwise - up to case. | ||
, pByte == sByte || (currPOff /= pOff && pByte == toLowerAscii sByte) | ||
= let curr = currScore * 2 + 1 in | ||
go (totalScore + curr) curr (currPOff + 1) (currSOff + 1) | ||
| otherwise |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do we know that the new implementation produces the same scores? I would like to see a unit test here, or a property test.
It's ok for the test suite to depend on the fuzzy
package.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, even the old implementation was not fully matching fuzzy
package because of a different approach to case matching. What's the best place to add unit tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ghcide test suite.
@pepeiborra just a gentle reminder about my questions above. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have pushed a test suite
My intuition is that this approach should be faster, but I do not have data to back such claim. @pepeiborra any chance to test it with a huge list of completions?