Function trigramDistance() added for string similarity search #4466
I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en
For changelog. Remove if this is non-significant change.
Category (leave one):
Short description (up to few sentences):
Added trigram distance, a string similarity metric similar to the q-gram metrics in the R language. The function is named trigramDistance().
Detailed description (optional):
OLD: We compute all trigrams of the needle and the haystack, take the symmetric difference of the two sets, and normalize by dividing by the sum of lengths. The speed is not that great, although this metric is known to be one of the fastest.
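For illustration, the set-based variant described above might look roughly like the following. This is only a sketch, not the code from this PR: `trigramSet` and `trigramSetDistance` are hypothetical names, trigrams are taken as raw 3-byte substrings, and "sum of lengths" is assumed to mean the combined size of the two trigram sets, which keeps the result in [0, 1].

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>

// Collect the distinct 3-byte substrings of s (hypothetical helper).
static std::set<std::string> trigramSet(const std::string & s)
{
    std::set<std::string> result;
    for (std::size_t i = 0; i + 3 <= s.size(); ++i)
        result.insert(s.substr(i, 3));
    return result;
}

// Size of the symmetric difference of the two trigram sets,
// divided by the combined size of both sets (an assumed normalization).
double trigramSetDistance(const std::string & needle, const std::string & haystack)
{
    const std::set<std::string> a = trigramSet(needle);
    const std::set<std::string> b = trigramSet(haystack);
    if (a.empty() && b.empty())
        return 0.0;  // no trigrams on either side: treat as identical

    std::size_t common = 0;
    for (const std::string & t : a)
        common += b.count(t);
    assert(common <= a.size() && common <= b.size());

    // |A symdiff B| = |A| + |B| - 2 * |A intersect B|
    const std::size_t sym_diff = a.size() + b.size() - 2 * common;
    return static_cast<double>(sym_diff) / static_cast<double>(a.size() + b.size());
}
```

Under these assumptions, `trigramSetDistance("abcd", "abce")` shares one trigram out of two on each side and evaluates to 0.5.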
NEW:
Distance function implementation.
We compute all trigrams of the left string and count them in a map indexed by a 16-bit hash of each trigram.
Then we compute all trigrams of the right string and calculate the trigram distance on the fly by adding to and subtracting from the map. Afterwards, the map is restored to the state it had after processing the left string. If the right string is large (more than 2**15 bytes), the strings are considered not similar at all and we return 1.
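The counting scheme described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation in this PR: the class name `TrigramCounter`, the polynomial `hashTrigram` placeholder, and the normalization by the total number of trigrams on both sides are all hypothetical choices made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Placeholder 16-bit hash of three bytes (an assumption, not the PR's hash).
static std::uint16_t hashTrigram(unsigned char a, unsigned char b, unsigned char c)
{
    const std::uint32_t h = (std::uint32_t(a) * 31u + b) * 31u + c;
    return static_cast<std::uint16_t>(h);  // keep the low 16 bits
}

class TrigramCounter
{
public:
    // Count the trigrams of the left string once, indexed by their hash.
    explicit TrigramCounter(const std::string & left) : counts(1u << 16, 0)
    {
        for (std::size_t i = 0; i + 3 <= left.size(); ++i, ++left_total)
            ++counts[hashTrigram(left[i], left[i + 1], left[i + 2])];
    }

    double distance(const std::string & right)
    {
        if (right.size() > (std::size_t(1) << 15))
            return 1.0;  // right string too large: treated as not similar

        std::size_t diff = left_total;  // sum of |counts[h]| so far
        std::size_t right_total = 0;
        for (std::size_t i = 0; i + 3 <= right.size(); ++i, ++right_total)
        {
            const std::uint16_t h = hashTrigram(right[i], right[i + 1], right[i + 2]);
            if (counts[h] > 0)
            {
                assert(diff > 0);
                --diff;  // matched a trigram still unmatched on the left
            }
            else
                ++diff;  // extra trigram that the left side cannot cover
            --counts[h];
        }

        // Restore the map to its state after the left string was processed.
        for (std::size_t i = 0; i + 3 <= right.size(); ++i)
            ++counts[hashTrigram(right[i], right[i + 1], right[i + 2])];

        const std::size_t total = left_total + right_total;
        return total == 0 ? 0.0 : static_cast<double>(diff) / static_cast<double>(total);
    }

private:
    std::vector<std::int32_t> counts;  // signed, since counts may dip below zero
    std::size_t left_total = 0;
};
```

Because the map is restored after each call, one `TrigramCounter` built from the left string can be reused against many right strings, for example `TrigramCounter c("abcd"); c.distance("abce");`. Hash collisions can make two different trigrams look equal, so the result is an approximation of the exact multiset distance.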