LSH (Locality Sensitive Hashing) is primarily used to find near-duplicates in a large set of documents. It can work with Hamming distance, the Jaccard coefficient, edit distance, or another notion of distance.
You can read the following tutorials if you want to understand more about it:
Although LSH is better suited to finding duplicated documents than semantically similar ones, this approach makes an effort to use LSH to calculate semantic similarity among texts. To do so, the algorithm extracts each text's main tokens using TF-IDF (or you can pre-calculate them and pass them as a parameter). This approach also uses MinHash (which estimates Jaccard similarity) as the similarity function.
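The token-extraction step can be sketched as follows. This is a minimal, self-contained illustration of picking each document's highest-scoring TF-IDF tokens; `top_tfidf_tokens` is a hypothetical helper, not part of this library's API, and a real implementation would use a proper tokenizer.

```python
import math
from collections import Counter

def top_tfidf_tokens(docs, k=3):
    """Return the k highest-scoring TF-IDF tokens for each document.

    Hypothetical sketch: tokenization is plain whitespace splitting,
    and TF-IDF is the classic tf * log(N / df) weighting.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each token appears.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    result = []
    for toks in tokenized:
        tf = Counter(toks)
        scores = {t: (tf[t] / len(toks)) * math.log(n / df[t]) for t in tf}
        result.append(sorted(scores, key=scores.get, reverse=True)[:k])
    return result
```

The tokens kept for each document then form the set that is fed to MinHash.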
The overall aim is to reduce the number of comparisons needed to find similar items. LSH uses hash collisions to capture similarities between objects: similar documents have a high probability of being hashed to the same value. In fact, the probability that two MinHash values collide is exactly the Jaccard similarity of the two sets.
See this tutorial to learn how to use this LSH implementation!
Run the following to install the dependencies:
python3 -m pip install -r requirements.txt