Function trigramDistance() added for string similarity search #4466

Merged
merged 8 commits into from
Feb 25, 2019


danlark1
Contributor

@danlark1 danlark1 commented Feb 21, 2019

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

For changelog. Remove if this is non-significant change.

Category (leave one):

  • New Feature

Short description (up to few sentences):
Added a trigram distance function, trigramDistance(). It is similar to the q-gram metrics in the R language.

Detailed description (optional):
OLD: We compute all the trigrams of the needle and the haystack, calculate the symmetric difference between these two sets, and normalize by dividing by the sum of lengths. The speed is not great, although this metric is known to be one of the fastest.
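
For illustration, the set-based approach described above could be sketched in Python as follows. This is not the code from the PR: the function names are hypothetical, it works on Python characters rather than bytes, and it assumes that "sum of lengths" means the combined size of the two trigram sets.

```python
def trigrams(s: str) -> set:
    """Return the set of all 3-character substrings of s."""
    # For strings shorter than 3 characters the range is empty.
    return {s[i:i + 3] for i in range(len(s) - 2)}


def trigram_set_distance(needle: str, haystack: str) -> float:
    """Symmetric difference of the trigram sets, normalized to [0, 1]."""
    a = trigrams(needle)
    b = trigrams(haystack)
    total = len(a) + len(b)  # assumption: normalize by the sum of set sizes
    if total == 0:
        # Assumption: two strings without any trigrams count as identical.
        return 0.0
    return len(a ^ b) / total
```

With this normalization, identical strings give 0 and strings that share no trigram give 1.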

NEW:
Distance function implementation.
We compute all the trigrams of the left string and count them in a map indexed by their 16-bit hash.
Then we compute all the trigrams of the right string and calculate the trigram distance on the fly by adding to and subtracting from the hashmap. Afterwards we restore the map to the state it had after processing the left string. If the right string is large (more than 2**15 bytes), the strings are considered not similar at all and we return 1.
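
A rough Python sketch of the counting scheme described above, under stated assumptions: the names are hypothetical, CRC32 truncated to 16 bits stands in for whatever hash the actual implementation uses, and the result is normalized by the total number of trigrams in both strings. It is meant to show the idea, not to mirror the C++ code in the PR.

```python
import zlib

MAP_SIZE = 1 << 16        # one counter per possible 16-bit hash value
MAX_RIGHT_SIZE = 1 << 15  # longer right strings are treated as dissimilar


def hash16(trigram: bytes) -> int:
    # Assumption: CRC32 truncated to 16 bits, chosen only for illustration.
    return zlib.crc32(trigram) & 0xFFFF


def count_left(left: bytes):
    """Count the trigrams of the left string by hash; return (counters, trigram count)."""
    counts = [0] * MAP_SIZE
    n = max(len(left) - 2, 0)
    for i in range(n):
        counts[hash16(left[i:i + 3])] += 1
    return counts, n


def distance_to(counts, left_trigrams: int, right: bytes) -> float:
    """Distance between the pre-counted left string and right, leaving counts unchanged."""
    if len(right) > MAX_RIGHT_SIZE:
        return 1.0
    hashes = [hash16(right[i:i + 3]) for i in range(len(right) - 2)]
    diff = left_trigrams  # start with every left trigram unmatched
    for h in hashes:
        # A positive counter means a left occurrence is still unmatched.
        diff += -1 if counts[h] > 0 else 1
        counts[h] -= 1
    for h in hashes:
        counts[h] += 1  # restore the map to its state after the left string
    total = left_trigrams + len(hashes)
    return diff / total if total else 0.0
```

Because the map is restored after each call, the counters built once from the left string can be reused for many right strings:

```python
counts, n = count_left(b"clickhouse")
for candidate in (b"clickhouse", b"clickhose", b"postgres"):
    print(candidate, distance_to(counts, n, candidate))
```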

@danlark1 danlark1 changed the title Function trigramDistance added for string similarity search Function distance() added for string similarity search Feb 22, 2019
@alexey-milovidov
Member

Comments are in Telegram chat.

@danlark1 danlark1 changed the title Function distance() added for string similarity search Function trigramDistance() added for string similarity search Feb 22, 2019
@alexey-milovidov alexey-milovidov merged commit bffe514 into ClickHouse:master Feb 25, 2019