Parser that converts PDF, DOCX, TXT and HTML documents to XML
Finite state machine for a regular expression; implementation of two string comparison algorithms and analysis of their behavior on a dendrogram produced by hierarchical clustering of the words in an article
Building a stop-word dictionary and thematic dictionaries using TF-IDF and a contrast method
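As an illustration of the TF-IDF part, here is a minimal sketch, not the project's actual code. The function name `tf_idf`, the toy documents, and the choice of relative term frequency with an unsmoothed logarithmic IDF are all assumptions made for the example. The contrast method is not shown.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: tf-idf weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical toy corpus, already tokenized.
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the stock market fell".split(),
]
weights = tf_idf(docs)

# A term that occurs in every document has IDF 0, so its weight is 0:
# such terms are natural stop-word candidates.
stop_word_candidates = {t for w in weights for t, v in w.items() if v == 0.0}
print(stop_word_candidates)
```

Terms with a high weight in the documents of one topic and a low weight elsewhere are the kind of candidates a thematic dictionary would keep.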
Co-occurrence matrix, PPMI and LSA matrices, cosine similarity and scalar product similarity
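The chain from co-occurrence counts to PPMI weights to similarity scores can be sketched as follows. This is a hedged example under stated assumptions: the function names, the symmetric window of one token and the toy sentences are hypothetical, and the LSA step (a truncated SVD of the PPMI matrix) is left out to keep the sketch free of dependencies.

```python
import math
from collections import defaultdict

def cooccurrence(sentences, window=1):
    """Count symmetric co-occurrences within +/- `window` tokens."""
    counts = defaultdict(lambda: defaultdict(float))
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][tokens[j]] += 1.0
    return counts

def ppmi(counts):
    """PPMI(w, c) = max(0, log2(P(w, c) / (P(w) * P(c))))."""
    total = sum(sum(row.values()) for row in counts.values())
    # The window is symmetric, so row sums double as column sums.
    sums = {w: sum(row.values()) for w, row in counts.items()}
    return {
        w: {c: max(0.0, math.log2(n * total / (sums[w] * sums[c])))
            for c, n in row.items()}
        for w, row in counts.items()
    }

def dot(u, v):
    """Scalar product of two sparse vectors stored as dicts."""
    return sum(x * v.get(k, 0.0) for k, x in u.items())

def cosine(u, v):
    """Scalar product normalized by the vector lengths."""
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0

# Hypothetical toy corpus.
sentences = [
    "cat drinks milk".split(),
    "dog drinks water".split(),
    "cat chases dog".split(),
]
vectors = ppmi(cooccurrence(sentences))
sim_cat_dog = cosine(vectors["cat"], vectors["dog"])
sim_cat_milk = cosine(vectors["cat"], vectors["milk"])
print(sim_cat_dog > sim_cat_milk)  # True
```

In this corpus "cat" and "dog" share exactly the same contexts, so their cosine similarity is the maximum possible, while the raw scalar product additionally grows with the magnitude of the PPMI weights.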
Naive Bayes spam filtering
Test results:
Precision (spam): 0.9371069182389937; Recall (spam): 0.9802631578947368; F-score (spam): 0.9581993569131834
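As a quick consistency check on these numbers, the F-score can be recomputed from the reported precision and recall. The sketch below assumes the balanced F1 score (the harmonic mean of the two); the helper name `f_score` and its `beta` parameter are illustrative, not taken from the project.

```python
def f_score(precision, recall, beta=1.0):
    """F-beta score; beta=1 gives the harmonic mean of precision and recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Values reported for the spam class.
precision = 0.9371069182389937
recall = 0.9802631578947368

f1 = f_score(precision, recall)
print(round(f1, 4))  # 0.9582
```

The recomputed value agrees with the reported F-score, which suggests the reported figure is indeed the balanced F1.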
Vectorization of words, a list of the top n most likely words for each topic and a list of the top n most likely topics for each document, all based on a PLSA model (own implementation), plus Word2Vec for word similarity
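For readers unfamiliar with PLSA, the following is a minimal sketch of how the two top-n lists could be obtained with the EM algorithm. It is not the project's own implementation: the function names `plsa` and `top_n`, the random initialization, the fixed number of iterations and the toy count matrix are all assumptions made for the example.

```python
import random

def normalize(row):
    """Scale a list of non-negative numbers so that it sums to 1."""
    s = sum(row)
    return [x / s for x in row] if s else [1.0 / len(row)] * len(row)

def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit PLSA by EM on a document-term count matrix (list of lists).

    Returns (p_w_z, p_z_d) with p_w_z[z][w] = P(w|z), p_z_d[d][z] = P(z|d).
    """
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    p_w_z = [normalize([rng.random() for _ in range(n_words)])
             for _ in range(n_topics)]
    p_z_d = [normalize([rng.random() for _ in range(n_topics)])
             for _ in range(n_docs)]
    for _ in range(n_iter):
        new_w_z = [[0.0] * n_words for _ in range(n_topics)]
        new_z_d = [[0.0] * n_topics for _ in range(n_docs)]
        for d in range(n_docs):
            for w in range(n_words):
                n = counts[d][w]
                if n == 0:
                    continue
                # E-step: P(z|d,w) is proportional to P(w|z) * P(z|d).
                post = normalize([p_w_z[z][w] * p_z_d[d][z]
                                  for z in range(n_topics)])
                # Accumulate expected counts n(d,w) * P(z|d,w) for the M-step.
                for z in range(n_topics):
                    new_w_z[z][w] += n * post[z]
                    new_z_d[d][z] += n * post[z]
        # M-step: renormalize the expected counts into probabilities.
        p_w_z = [normalize(row) for row in new_w_z]
        p_z_d = [normalize(row) for row in new_z_d]
    return p_w_z, p_z_d

def top_n(probs, labels, n):
    """Labels of the n largest probabilities, most likely first."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [labels[i] for i in order[:n]]

# Hypothetical vocabulary and document-term counts.
vocab = ["ball", "goal", "team", "vote", "law", "party"]
counts = [
    [4, 3, 2, 0, 0, 0],
    [3, 4, 3, 0, 0, 0],
    [0, 0, 0, 4, 2, 3],
    [0, 0, 0, 3, 4, 2],
]
p_w_z, p_z_d = plsa(counts, n_topics=2)
top_words = [top_n(row, vocab, 3) for row in p_w_z]          # per topic
top_topics = [top_n(row, ["topic 0", "topic 1"], 1) for row in p_z_d]  # per document
```

The columns of `p_w_z` can also serve as word vectors: each word is represented by its probabilities under the topics. EM only finds a local optimum, so results depend on the seed.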