tinfante/CosineSimilarity

Cosine Similarity

SimpleCosineSimilarity

A very simple implementation of document similarity in Python using a vector space model with TF-IDF weights and Cosine Similarity.
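
As a rough illustration of the approach described above, here is a minimal sketch of TF-IDF weighting and cosine similarity in plain Python. It is not the repository's actual code: the function names (`tokenize`, `tfidf_vectors`, `cosine_sim`), the whitespace tokenizer, the unsmoothed `log(N / df)` IDF formula, and the sample documents are all assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive tokenizer (an assumption): lowercase, then split on whitespace.
    return text.lower().split()


def tfidf_vectors(documents):
    # Build one sparse TF-IDF vector (dict of term -> weight) per document.
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Relative term frequency times unsmoothed inverse document frequency.
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_sim(a, b):
    # Cosine of the angle between two sparse vectors stored as dicts.
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # An all-zero vector is treated as similar to nothing.
    return dot / (norm_a * norm_b)


# Hypothetical sample documents, used only for illustration.
documents = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "stock markets fell sharply today",
]
vectors = tfidf_vectors(documents)
print(cosine_sim(vectors[0], vectors[1]))  # the two cat sentences share terms
print(cosine_sim(vectors[0], vectors[2]))  # no shared terms
```

Storing each vector as a dict keeps only the non-zero weights, so the dot product only visits terms that actually occur in the first document. Note that with this unsmoothed IDF, a term that appears in every document gets a weight of zero.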

This could be vastly improved by using NumPy arrays, NLP libraries (e.g. NLTK, spaCy) for tokenization and perhaps lemmatization or stemming, an inverted index for fast term lookups at query time, additive smoothing, etc. It could also be much shorter (e.g. using scikit-learn's TfidfVectorizer), but the goal was to have a very simple, easily understood Python implementation written from scratch. For building a search engine in Python, there's the excellent Whoosh library, which does all this and more.

SklearnCosineSimilarity

An example showing how easy it is to do the same using scikit-learn's TfidfVectorizer class and cosine_similarity function. Again, this could be improved by adding stemming or lemmatization, better stopword filtering, n-grams, etc., but the idea is to keep it simple and show that it can be done in fewer than 10 lines of code.
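
For reference, a sketch of what such a scikit-learn version might look like is given below. It is not taken from the repository: the sample documents and the choice of `stop_words="english"` are assumptions, and scikit-learn must be installed for it to run.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical sample documents, used only for illustration.
documents = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "stock markets fell sharply today",
]

# Fit the vocabulary and IDF weights, then build a sparse TF-IDF matrix
# with one row per document.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf_matrix = vectorizer.fit_transform(documents)

# Pairwise cosine similarities between all rows: entry [i, j] compares
# document i with document j.
similarities = cosine_similarity(tfidf_matrix)
print(similarities.round(2))
```

To compare a new query against the fitted documents, pass it through `vectorizer.transform([query])` (not `fit_transform`) so that it is mapped onto the same vocabulary, then call `cosine_similarity` on the query row and `tfidf_matrix`.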
