Use Cases
File system
Use Case: Set up the file system used by prima tools for input and output, and pull two collections from the ia to compute metrics on.
Precondition: prima has been successfully installed.
Postcondition: The appropriate directories and files are all in place for the user to use the tools below.
Basic Flow:
- To create a workspace, the user runs
  $ init_workspace.sh
  $ cd workspace
- To create the first collection, the user runs
  $ init_project.sh projectname1
  $ cd projectname1
  $ init_collection.sh
- To fetch the first collection from the archive, the user runs
  $ fetch_collection.sh collectionname1
- To move back out to the workspace and create the second collection, the user runs
  $ cd ..
  $ init_project.sh projectname2
  $ cd projectname2
  $ init_collection.sh
- To fetch the second collection from the archive, the user runs
  $ fetch_collection.sh collectionname2
- The user can then move into the appropriate project and use the tools below:
  $ cd projectname
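A rough sketch of the directory layout this flow is assumed to produce (the source/ and processed/ directory names come from the preconditions below; everything else follows the commands above):

workspace/
├── projectname1/
│   ├── source/      <- item files fetched from the archive
│   └── processed/   <- output written by the tools below
└── projectname2/
    ├── source/
    └── processed/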
Okapi BM25
Use Case: Return a list of 10 documents (default) that best match a set of keywords (prima, query, and bm25).
Precondition: Inside the source/ directory there is at least one subdirectory containing item files from the ia. User is currently in the workspace/projectname directory.
Postcondition: Inside the processed/bm25 directory there is a file, bm25.csv, containing (at the bottom) the query terms entered and a list of documents with their scores.
Basic Flow:
- User runs
  $ bm25.sh "prima query bm25"
- User navigates into the processed/bm25 directory to find bm25.csv.
Use Case: Return a list of 50 documents that best match a set of keywords (prima, query, and bm25).
Precondition: Inside the source/ directory there is at least one subdirectory containing item files from the ia. User is currently in the workspace/projectname directory.
Postcondition: Inside the processed/bm25 directory there is a file, bm25.csv, containing (at the bottom) the query terms entered and a list of documents with their scores.
Basic Flow:
- User runs
  $ bm25.sh 50 "prima query bm25"
- User navigates into the processed/bm25 directory to find bm25.csv.
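For reference, a minimal Python sketch of the standard Okapi BM25 score that bm25.sh is assumed to compute for each document; the tuning constants k1 and b below are conventional values, not parameters documented here:

import math
from collections import Counter

def bm25_score(query_terms, doc_tokens, doc_freqs, num_docs, avg_doc_len, k1=1.5, b=0.75):
    """Okapi BM25 score of one document for the given query terms."""
    tf = Counter(doc_tokens)
    score = 0.0
    for term in query_terms:
        df = doc_freqs.get(term, 0)                       # number of documents containing the term
        idf = math.log((num_docs - df + 0.5) / (df + 0.5) + 1)
        f = tf[term]                                      # term frequency in this document
        norm = f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc_tokens) / avg_doc_len))
        score += idf * norm
    return score

Ranking then amounts to scoring every document in the collection and keeping the 10 (or 50) highest scores, which is what bm25.csv records.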
K-means
Use Case: Cluster similar documents into 3 groups (default) and view the document clusters.
Precondition: Inside the source/ directory there is at least one subdirectory containing item files from the ia. User is currently in the workspace/projectname directory.
Postcondition: Inside the processed/k_means directory there is a file, k_means.csv, with documents listed in k rows, each row corresponding to one cluster.
Basic Flow:
- User runs
  $ k_means_clusterer.sh
- User navigates to the processed/k_means directory to find k_means.csv.
Use Case: Cluster similar documents into 7 groups and view the document clusters.
Precondition: Inside the source/ directory there is at least one subdirectory containing item files from the ia. User is currently in the workspace/projectname directory.
Postcondition: Inside the processed/k_means directory there is a file, k_means.csv, with documents listed in k rows, each row corresponding to one cluster.
Basic Flow:
- User runs
  $ k_means_clusterer.sh 7
- User navigates to the processed/k_means directory to find k_means.csv.
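A minimal sketch of the clustering step k_means_clusterer.sh is assumed to perform, here using scikit-learn over tf-idf vectors; the document representation and the *.txt layout under source/ are assumptions:

import glob, os
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Read every item file under source/ (the *.txt layout is an assumption).
paths = sorted(glob.glob("source/*/*.txt"))
docs = [open(p, encoding="utf-8", errors="ignore").read() for p in paths]

X = TfidfVectorizer().fit_transform(docs)                  # documents as tf-idf vectors
labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)    # k = 3 by default; 7 in the second use case

# Group documents by cluster, one row per cluster as in k_means.csv.
clusters = {}
for path, label in zip(paths, labels):
    clusters.setdefault(label, []).append(os.path.basename(path))
for label, members in sorted(clusters.items()):
    print(label, members)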
Latent Dirichlet allocation
Use Case: Reduce the term-document matrix to 100 dimensions (default) using Latent Dirichlet allocation.
Precondition: Inside the source/ directory there is at least one subdirectory containing item files from the ia. User is currently in the workspace/projectname directory.
Postcondition: Inside the processed/lda directory there is a file, lda.csv, containing the new matrix.
Basic Flow:
- User runs
  $ lda.sh
- User navigates to the processed/lda directory to find lda.csv.
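A minimal sketch of the reduction, here with scikit-learn's LatentDirichletAllocation over a term-document count matrix; the representation, library, and *.txt layout are assumptions:

import glob
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Read every item file under source/ (the *.txt layout is an assumption).
docs = [open(p, encoding="utf-8", errors="ignore").read()
        for p in sorted(glob.glob("source/*/*.txt"))]

counts = CountVectorizer().fit_transform(docs)       # term-document count matrix
lda = LatentDirichletAllocation(n_components=100)    # 100 topics, matching the default above
doc_topic = lda.fit_transform(counts)                # one row per document, 100 topic weights (as in lda.csv)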
Latent semantic indexing
Use Case: Reduce the term-document matrix to 200 dimensions using Latent semantic indexing.
Precondition: Inside the source/ directory there is at least one subdirectory containing item files from the ia. User is currently in the workspace/projectname directory.
Postcondition: Inside the processed/lsi directory there is a file, lsi.csv, containing the new matrix.
Basic Flow:
- User runs
  $ lsi.sh 200
- User navigates to the processed/lsi directory to find lsi.csv.
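A minimal sketch of the same reduction via truncated SVD, the standard construction behind latent semantic indexing, here with scikit-learn; the tf-idf representation and *.txt layout are assumptions:

import glob
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Read every item file under source/ (the *.txt layout is an assumption).
docs = [open(p, encoding="utf-8", errors="ignore").read()
        for p in sorted(glob.glob("source/*/*.txt"))]

X = TfidfVectorizer().fit_transform(docs)   # term-document matrix
svd = TruncatedSVD(n_components=200)        # keep 200 latent dimensions; must be smaller than the vocabulary size
reduced = svd.fit_transform(X)              # one row per document, 200 values (as in lsi.csv)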
Min Hash
Use Case: Find the 10 documents in the collection that are most similar to a specific document, using 3-shingles and 200 hash functions (the defaults).
Precondition: Inside the source/ directory there is at least one subdirectory containing item files from the ia. User is currently in the workspace/projectname directory.
Postcondition: Inside the processed/min_hash directory there is a file min_hash_sim.csv which holds, for each query document, the list of the 10 most similar documents to the query document and their scores.
Basic Flow:
- User runs the following to set up the min_hash table in min_hash.csv:
  $ min_hash.sh
- User runs
  $ min_hash_sim.sh source/folder/doc1.txt
- User navigates to min_hash_sim.csv in the processed/min_hash directory, scrolls down to their query row ("doc1.txt") with the correct timestamp, and reviews the list of similar documents.
Use Case: Find the 20 documents in the collection that are most similar to a specific document, using 4-shingles and 250 hash functions.
Precondition: Inside the source/ directory there is at least one subdirectory containing item files from the ia. User is currently in the workspace/projectname directory.
Postcondition: Inside the processed/min_hash directory there is a file min_hash_sim.csv which holds, for each query document, the list of the 20 most similar documents to the query document and their scores.
Basic Flow:
- User goes into projectname/processed/, removes shingles.db, then moves back to the projectname directory.
- User runs the following to set up the min_hash table in min_hash.csv:
  $ min_hash.sh 4 250
- User runs
  $ min_hash_sim.sh source/folder/doc2.txt 20
- User navigates to min_hash_sim.csv in the processed/min_hash directory, scrolls down to their query row ("doc2.txt") with the correct timestamp, and reviews the list of similar documents.
Use Case: Find all documents in the collection that score at least .75 in min_hash similarity to a specific document.
Precondition: Inside the source/ directory there is at least one subdirectory containing item files from the ia. User has also already run min_hash.sh with the desired number of shingles and hash functions (steps 1 and 2 in the previous use case). User is currently in the workspace/projectname directory.
Postcondition: Inside the processed/min_hash directory there is a file min_hash_sim.csv which holds, for each query, the list of documents in the collection that scored at least .75 and their scores.
Basic Flow:
- User runs
  $ min_hash_sim.sh source/folder/doc1.txt .75
- User navigates to min_hash_sim.csv in the processed/min_hash directory, scrolls down to their query row ("doc1.txt") with the correct timestamp, and reviews the list of similar documents.
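For reference, a minimal pure-Python sketch of the standard MinHash construction these tools are assumed to follow: k-word shingles, a fixed number of seeded hash functions, and a signature-agreement estimate of Jaccard similarity. prima's exact shingling and hashing choices are assumptions here.

import hashlib

def shingles(text, k=3):
    """Set of k-word shingles (k = 3 by default, 4 in the second use case)."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash_signature(shingle_set, num_hashes=200):
    """For each of num_hashes seeded hash functions, keep the minimum hash over the shingles."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingle_set
        ))
    return sig

def estimated_similarity(sig_a, sig_b):
    """Fraction of matching signature positions approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

min_hash_sim.sh is then assumed to compare the query document's signature against every other document's, keeping either the top 10/20 matches or everything scoring at least the .75 threshold.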
tf-idf
Use Case: Create a file containing the tf-idf of all term-document pairs in the collection.
Precondition: Inside the source/ directory there is at least one subdirectory containing item files from the ia. User is currently in the workspace/projectname directory.
Postcondition: Inside the processed/tfidf directory there are three files, idf.csv, df.csv, and tf.csv, holding the appropriate values.
Basic Flow:
- User runs
  $ tfidf.sh
- User navigates into the processed/tfidf directory to find tfidf.csv.
Use Case: Create a .json file containing the document frequency of all term-document pairs in the collection.
Precondition: Inside the source/ directory there is at least one subdirectory containing item files from the ia. User is currently in the workspace/projectname directory.
Postcondition: Inside the processed/tfidf directory there are three files, idf.csv, df.csv, and tf.csv, holding the appropriate values, as well as a fourth file, tfidf.json, containing the values.
Basic Flow:
- User runs
  $ tfidf.sh .json
- User navigates into the processed/tfidf directory to find df.csv and tfidf.json.
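A minimal sketch of the values the tf, df, idf, and tf-idf output files are assumed to hold, using whitespace tokenization and a plain log idf (both assumptions):

import math
from collections import Counter

docs = {"doc1.txt": "prima query bm25 query", "doc2.txt": "prima clustering"}  # placeholder collection

tf = {name: Counter(text.split()) for name, text in docs.items()}     # term frequency per document
df = Counter(term for counts in tf.values() for term in counts)       # document frequency per term
idf = {term: math.log(len(docs) / df[term]) for term in df}           # inverse document frequency

# tf-idf for every term-document pair, as recorded in the output files.
tfidf = {(name, term): count * idf[term]
         for name, counts in tf.items()
         for term, count in counts.items()}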
Word Count
Use Case: Count all words in a file.
Precondition: Inside the source/ directory there is at least one subdirectory containing item files from the ia. User is currently in the workspace/projectname directory.
Postcondition: Inside the processed/word_count directory there is now a file, word_count.csv, containing a list of files and directories with their word counts, including the desired file.
Basic Flow:
- User runs
  $ word_count.sh
- User navigates into the processed/word_count directory to find word_count.csv.
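A minimal sketch of the per-file counts word_count.csv is assumed to hold; treating a word as a whitespace-separated token, and the source/*/* layout, are assumptions:

import csv, glob, os

rows = []
for path in glob.glob("source/*/*"):                           # every item file under source/
    if os.path.isfile(path):
        with open(path, encoding="utf-8", errors="ignore") as f:
            rows.append((path, len(f.read().split())))         # word count = whitespace-separated tokens

with open("word_count.csv", "w", newline="") as out:
    csv.writer(out).writerows(rows)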