Tools
The following is a list of the tools currently included in prima, instructions on how to use them, and links for further explanation. Please note that all tools must be run from inside the directory created by init_project.sh.
bm25.sh takes the number of documents to return (defaults to 10) and a set of space-separated keywords in quotes as input, and ranks the documents in the collection by relevance according to the BM25 model. For example, running
$ bm25.sh "prima query bm25"
will rank the documents within the source/ folder and write the result (a list of 10 documents and their scores, ranked by score) to processed/bm25/bm25.csv, which holds all queries and scores ever successfully calculated. Running
$ bm25.sh 4 "prima query bm25"
will write to bm25.csv only the 4 highest-scoring documents.
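The README does not describe the script's internals, so as a hedged illustration of the BM25 ranking it performs, here is a minimal Python sketch; the whitespace tokenisation, the parameter values k1 = 1.5 and b = 0.75, and the in-memory corpus layout are assumptions for illustration only.

import math
from collections import Counter

def bm25_rank(corpus, query, k1=1.5, b=0.75, top_n=10):
    """Rank documents in corpus (dict: name -> token list) against query tokens."""
    N = len(corpus)
    avgdl = sum(len(toks) for toks in corpus.values()) / N
    # document frequency of each query term
    df = {t: sum(1 for toks in corpus.values() if t in toks) for t in query}
    scores = {}
    for name, toks in corpus.items():
        tf = Counter(toks)
        score = 0.0
        for t in query:
            if df[t] == 0:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            score += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(toks) / avgdl))
        scores[name] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

For example, bm25_rank(docs, "prima query bm25".split(), top_n=4) returns the 4 best-scoring (name, score) pairs, mirroring the second command above.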
k_means_clusterer.sh takes either nothing (defaults to 3 clusters) or the number of clusters k as input, and groups the documents in the corpus into k clusters according to the k-means algorithm. For example, running
$ k_means_clusterer.sh
will classify the documents within the source/ folder into 3 clusters, while running
$ k_means_clusterer.sh 10
will classify the documents within the source/ folder into 10 clusters. The clusters and their associated documents are saved in processed/k_means/k_means.csv as comma-separated lists of documents. Vectors are weighted using lnc.ltc according to SMART notation.
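As a hedged sketch of the lnc document weighting (logarithmic tf, no idf, cosine normalisation) and the clustering step described above, the snippet below uses scikit-learn's KMeans; the tokenisation and the choice of library are assumptions, not a description of the script's actual internals.

import math
from collections import Counter
from sklearn.cluster import KMeans

def lnc_vector(tokens, vocab):
    """lnc weighting: (1 + log tf) term weights, no idf, cosine-normalised."""
    tf = Counter(tokens)
    weights = [1.0 + math.log(tf[t]) if tf[t] > 0 else 0.0 for t in vocab]
    norm = math.sqrt(sum(w * w for w in weights)) or 1.0
    return [w / norm for w in weights]

def cluster_documents(doc_tokens, k=3):
    """doc_tokens: dict mapping document name -> token list. Returns name -> cluster label."""
    vocab = sorted({t for toks in doc_tokens.values() for t in toks})
    names = list(doc_tokens)
    vectors = [lnc_vector(doc_tokens[n], vocab) for n in names]
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(vectors)
    return dict(zip(names, labels))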
lda.sh takes either nothing (defaults to 100 dimensions) or the number of dimensions k as input, and builds a low-rank approximation of the term-document matrix of size k according to latent Dirichlet allocation (LDA). This matrix is then saved in your collection as processed/lda/lda.csv. For example, running
$ lda.sh
will reduce the term-document matrix c into a 100-by-100 matrix and save it in lda.csv, while running
$ lda.sh 200
will reduce the term-document matrix c into a 200-by-200 matrix.
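The README does not state which library lda.sh builds on; as one plausible sketch of fitting an LDA model with k components over a bag-of-words matrix, using scikit-learn (an assumption for illustration):

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def lda_topics(documents, k=100):
    """documents: list of raw text strings. Returns (doc-topic, topic-term) matrices."""
    counts = CountVectorizer().fit_transform(documents)      # documents x terms
    lda = LatentDirichletAllocation(n_components=k, random_state=0)
    doc_topic = lda.fit_transform(counts)                     # documents x k
    return doc_topic, lda.components_                         # k x terms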
lsi.sh takes either nothing (defaults to 100 dimensions) or the number of dimensions k as input, and builds a low-rank approximation of the term-document matrix of size k according to latent semantic indexing (LSI). This matrix is then saved in your collection as processed/lsi/lsi.csv. For example, running
$ lsi.sh
will reduce the term-document matrix c into a 100-by-100 matrix and save it in lsi.csv, while running
$ lsi.sh 200
will reduce the term-document matrix c into a 200-by-200 matrix.
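Similarly, a hedged sketch of an LSI-style rank-k reduction via truncated SVD on a tf-idf matrix (the use of scikit-learn and the tf-idf weighting are assumptions, not taken from the script):

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

def lsi_embedding(documents, k=100):
    """documents: list of raw text strings. Returns a documents x k matrix of LSI coordinates."""
    tfidf = TfidfVectorizer().fit_transform(documents)   # documents x terms
    svd = TruncatedSVD(n_components=k, random_state=0)
    return svd.fit_transform(tfidf)                       # documents x k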
min_hash.sh and min_hash_sim.sh compute the MinHash values of all documents in the source/ directory and output the documents most similar to a query document, respectively. MinHash values are saved in processed/min_hash/min_hash.csv as a table whose columns are labelled by document names and whose rows correspond to the different hash functions used. Query documents and a list of their closest near-duplicate documents are saved in processed/min_hash/min_hash_sim.csv. min_hash.sh takes no command-line arguments. min_hash_sim.sh takes either one or two command-line arguments: the path to the document for which near-duplicates are to be found (required) and either the number of documents to list in the output or the minimum score bound (the default is 10 documents). For example, running
$ min_hash.sh
$ min_hash_sim.sh source/folder/document 20
will save the MinHash table in min_hash.csv and write to min_hash_sim.csv, below the line source/folder/document, a list of 20 documents together with a number between 0 and 1 for each, corresponding to the fraction of hash functions that hashed to the same values as the query. Since min_hash.sh has already been run, running
$ min_hash_sim.sh source/folder/document 0.75
will write to the same location, but instead of 20 documents it will list only the documents that scored above 0.75. Finally, running
$ min_hash_sim.sh source/folder/document
will save a list of 10 documents and their respective scores.
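As a hedged sketch of the MinHash idea behind these two tools (the seeded MD5 hash family and the signature length below are illustrative assumptions): each document's token set is hashed with several independent hash functions, only the minimum value per function is kept, and similarity is estimated as the fraction of matching minima.

import hashlib

def minhash_signature(tokens, num_hashes=100):
    """Return the MinHash signature of a token set: one minimum hash value per seed."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int(hashlib.md5(f"{seed}:{t}".encode()).hexdigest(), 16)
            for t in set(tokens)
        ))
    return signature

def minhash_similarity(sig_a, sig_b):
    """Estimated Jaccard similarity: the fraction of hash functions with equal minima."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

A score of 0.75 in min_hash_sim.csv thus means that 75% of the hash functions produced the same minimum for the query document and the candidate.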
Taking no input, tfidf.sh calculates the term frequency (tf), document frequency (df), and tf-idf of all the documents in the collection. These values are then saved in the directory processed/tfidf as df.csv (holding terms and their document frequencies), tf.csv (term-document pairs and their term frequencies), and tfidf.csv (term-document pairs and their tf-idf values). Run it with
$ tfidf.sh
This also creates a file, processed/tfidf/tfidf.json, which holds a JSON list of the calculated values.
The idf is calculated as log(N/df), where N is the number of documents in the corpus (the corpus here being the whole collection, so N is the total number of documents read).
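A minimal sketch of this computation, assuming raw term counts for tf and the log(N/df) idf stated above (the tokenisation into token lists is an assumption):

import math
from collections import Counter

def tfidf(doc_tokens):
    """doc_tokens: dict mapping document name -> token list. Returns (tf, df, tfidf) maps."""
    N = len(doc_tokens)
    tf = {(name, t): c for name, toks in doc_tokens.items() for t, c in Counter(toks).items()}
    df = Counter(t for toks in doc_tokens.values() for t in set(toks))
    weights = {(name, t): c * math.log(N / df[t]) for (name, t), c in tf.items()}
    return tf, df, weights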
Taking no input, word_count.sh counts all words in the readable files (.txt and .pdf) and folders in the input directory. For example, running
$ word_count.sh
will create a file called word_count.csv in the directory processed/word_count with the word counts of all folders in source/ and all documents within those folders, in the form:
source, 10
source/folder1, 4
source/folder1/file1.txt, 2
source/folder1/file2.pdf, 2
source/folder2, 6
source/folder2/file1.txt, 6
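As a hedged sketch of how per-file counts can be rolled up into the per-folder totals shown above (plain-text files only; the real tool also reads .pdf files, and the directory walk below is an assumption, not the script's actual logic):

import os

def word_counts(root="source"):
    """Map each folder and .txt file under root to its word count."""
    counts = {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".txt"):
                continue  # the real tool also handles .pdf files
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="ignore") as f:
                n = len(f.read().split())
            counts[path] = n
            # propagate the file's count to every ancestor folder up to and including root
            folder = dirpath
            while True:
                counts[folder] = counts.get(folder, 0) + n
                if os.path.abspath(folder) == os.path.abspath(root):
                    break
                folder = os.path.dirname(folder)
    return counts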