Use Cases
File system
Use Case: Set up the file system used by prima tools for input and output, and pull two collections from the ia to compute metrics on.
Precondition: prima has been successfully installed.
Postcondition: The appropriate directories and files are all in place for the user to use the tools below.
Basic Flow:
- To create a workspace, the user runs
  $ init_workspace.sh
  $ cd workspace
- To create the first collection, the user runs
  $ init_project.sh projectname1
  $ cd projectname1
  $ init_collection.sh
- To fetch the first collection from the archive, the user runs
  $ fetch_collection.sh collectionname1
- To move back out to the workspace and create the second collection, the user runs
  $ cd ..
  $ init_project.sh projectname2
  $ cd projectname2
  $ init_collection.sh
- To fetch the second collection from the archive, the user runs
  $ fetch_collection.sh collectionname2
- The user can then move into the appropriate project and use the tools below:
  $ cd projectname
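A rough sketch of the directory layout this flow is assumed to produce (the source/ and processed/ directory names come from the preconditions below; everything else follows the commands above):

workspace/
├── projectname1/
│   ├── source/      <- item files fetched from the archive
│   └── processed/   <- output written by the tools below
└── projectname2/
    ├── source/
    └── processed/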
Okapi BM25
Use Case: Return a list of 10 documents (default) that best match a set of keywords (prima, query, and bm25).
Precondition: Inside the source/ directory there is at least one subdirectory containing item files from the ia. User is currently in the workspace/projectname directory.
Postcondition: Inside the processed/bm25 directory there is a file, bm25.csv, containing (at the bottom) the query terms entered and a list of documents with their scores.
Basic Flow:
- User runs
  $ bm25.sh "prima query bm25"
- User navigates into the processed/bm25 directory to find bm25.csv.
Use Case: Return a list of 50 documents that best match a set of keywords (prima, query, and bm25).
Precondition: Inside the source/ directory there is at least one subdirectory containing item files from the ia. User is currently in the workspace/projectname directory.
Postcondition: Inside the processed/bm25 directory there is a file, bm25.csv, containing (at the bottom) the query terms entered and a list of documents with their scores.
Basic Flow:
- User runs
  $ bm25.sh 50 "prima query bm25"
- User navigates into the processed/bm25 directory to find bm25.csv.
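For reference, a minimal Python sketch of the standard Okapi BM25 score that bm25.sh is assumed to compute for each document; the tuning constants k1 and b below are conventional values, not parameters documented here:

import math
from collections import Counter

def bm25_score(query_terms, doc_tokens, doc_freqs, num_docs, avg_doc_len, k1=1.5, b=0.75):
    """Okapi BM25 score of one document for the given query terms."""
    tf = Counter(doc_tokens)
    score = 0.0
    for term in query_terms:
        df = doc_freqs.get(term, 0)                       # number of documents containing the term
        idf = math.log((num_docs - df + 0.5) / (df + 0.5) + 1)
        f = tf[term]                                      # term frequency in this document
        norm = f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc_tokens) / avg_doc_len))
        score += idf * norm
    return score

Ranking then amounts to scoring every document in the collection and keeping the 10 (or 50) highest scores, which is what bm25.csv records.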
K-means
Use Case: Cluster similar documents into 3 groups (default) and view the document clusters.
Precondition: Inside the source/ directory there is at least one subdirectory containing item files from the ia. User is currently in the workspace/projectname directory.
Postcondition: Inside the processed/k_means directory there is a file, k_means.csv, with documents listed in k rows, each row corresponding to one cluster.
Basic Flow:
- User runs
  $ k_means_clusterer.sh
- User navigates to the processed/k_means directory to find k_means.csv.
Use Case: Cluster similar documents into 7 groups and view the document clusters.
Precondition: Inside the source/ directory there is at least one subdirectory containing item files from the ia. User is currently in the workspace/projectname directory.
Postcondition: Inside the processed/k_means directory there is a file, k_means.csv, with documents listed in k rows, each row corresponding to one cluster.
Basic Flow:
- User runs
  $ k_means_clusterer.sh 7
- User navigates to the processed/k_means directory to find k_means.csv.
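A minimal sketch of the clustering step k_means_clusterer.sh is assumed to perform, here using scikit-learn over tf-idf vectors; the document representation and the *.txt layout under source/ are assumptions:

import glob, os
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Read every item file under source/ (the *.txt layout is an assumption).
paths = sorted(glob.glob("source/*/*.txt"))
docs = [open(p, encoding="utf-8", errors="ignore").read() for p in paths]

X = TfidfVectorizer().fit_transform(docs)                  # documents as tf-idf vectors
labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)    # k = 3 by default; 7 in the second use case

# Group documents by cluster, one row per cluster as in k_means.csv.
clusters = {}
for path, label in zip(paths, labels):
    clusters.setdefault(label, []).append(os.path.basename(path))
for label, members in sorted(clusters.items()):
    print(label, members)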
Latent Dirichlet allocation
Use Case: Reduce the term-document matrix to 100 dimensions (default) using Latent Dirichlet allocation.
Precondition: Inside the source/ directory there is at least one subdirectory containing item files from the ia. User is currently in the workspace/projectname directory.
Postcondition: Inside the processed/lda directory there is a file, lda.csv, containing the new matrix.
Basic Flow:
- User runs
  $ lda.sh
- User navigates to the processed/lda directory to find lda.csv.
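A minimal sketch of the reduction, here with scikit-learn's LatentDirichletAllocation over a term-document count matrix; the representation, library, and *.txt layout are assumptions:

import glob
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Read every item file under source/ (the *.txt layout is an assumption).
docs = [open(p, encoding="utf-8", errors="ignore").read()
        for p in sorted(glob.glob("source/*/*.txt"))]

counts = CountVectorizer().fit_transform(docs)       # term-document count matrix
lda = LatentDirichletAllocation(n_components=100)    # 100 topics, matching the default above
doc_topic = lda.fit_transform(counts)                # one row per document, 100 topic weights (as in lda.csv)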
Latent semantic indexing
Use Case: Reduce the term-document matrix to 200 dimensions using Latent semantic indexing.
Precondition: Inside the source/ directory there is at least one subdirectory containing item files from the ia. User is currently in the workspace/projectname directory.
Postcondition: Inside the processed/lsi directory there is a file, lsi.csv, containing the new matrix.
Basic Flow:
- User runs
  $ lsi.sh 200
- User navigates to the processed/lsi directory to find lsi.csv.
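A minimal sketch of the same reduction via truncated SVD, the standard construction behind latent semantic indexing, here with scikit-learn; the tf-idf representation and *.txt layout are assumptions:

import glob
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Read every item file under source/ (the *.txt layout is an assumption).
docs = [open(p, encoding="utf-8", errors="ignore").read()
        for p in sorted(glob.glob("source/*/*.txt"))]

X = TfidfVectorizer().fit_transform(docs)   # term-document matrix
svd = TruncatedSVD(n_components=200)        # keep 200 latent dimensions; must be smaller than the vocabulary size
reduced = svd.fit_transform(X)              # one row per document, 200 values (as in lsi.csv)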
Min Hash
Use Case: Find the 10 documents in the collection that are most similar to a specific document, using 3-shingles and 200 hash functions (the defaults).
Precondition: Inside the source/ directory there is at least one subdirectory containing item files from the ia. User is currently in the workspace/projectname directory.
Postcondition: Inside the processed/min_hash directory there is a file min_hash_sim.csv which holds, for each query document, the list of the 10 most similar documents to the query document and their scores.
Basic Flow:
- User runs the following to set up the min_hash table in min_hash.csv:
  $ min_hash.sh
- User runs
  $ min_hash_sim.sh source/folder/doc1.txt
- User navigates to min_hash_sim.csv in the processed/min_hash directory, scrolls down to their query row ("doc1.txt") with the correct timestamp, and reviews the list of similar documents.
Use Case: Find the 20 documents in the collection that are most similar to a specific document, using 4-shingles and 250 hash functions.
Precondition: Inside the source/ directory there is at least one subdirectory containing item files from the ia. User is currently in the workspace/projectname directory.
Postcondition: Inside the processed/min_hash directory there is a file min_hash_sim.csv which holds, for each query document, the list of the 20 most similar documents to the query document and their scores.
Basic Flow:
- User goes into projectname/processed/, removes shingles.db, then moves back to the projectname directory.
- User runs the following to set up the min_hash table in min_hash.csv:
  $ min_hash.sh 4 250
- User runs
  $ min_hash_sim.sh source/folder/doc2.txt 20
- User navigates to min_hash_sim.csv in the processed/min_hash directory, scrolls down to their query row ("doc2.txt") with the correct timestamp, and reviews the list of similar documents.
Use Case: Find all documents in the collection that score at least .75 in min_hash similarity to a specific document.
Precondition: Inside the source/ directory there is at least one subdirectory containing item files from the ia. User has also already run min_hash.sh with the desired number of shingles and hash functions (steps 1 and 2 in the previous use case). User is currently in the workspace/projectname directory.
Postcondition: Inside the processed/min_hash directory there is a file min_hash_sim.csv which holds, for each query, the list of documents in the collection that scored at least .75 and their scores.
Basic Flow:
- User runs
  $ min_hash_sim.sh source/folder/doc1.txt .75
- User navigates to min_hash_sim.csv in the processed/min_hash directory, scrolls down to their query row ("doc1.txt") with the correct timestamp, and reviews the list of similar documents.
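For reference, a minimal pure-Python sketch of the standard MinHash construction these tools are assumed to follow: k-word shingles, a fixed number of seeded hash functions, and a signature-agreement estimate of Jaccard similarity. prima's exact shingling and hashing choices are assumptions here.

import hashlib

def shingles(text, k=3):
    """Set of k-word shingles (k = 3 by default, 4 in the second use case)."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash_signature(shingle_set, num_hashes=200):
    """For each of num_hashes seeded hash functions, keep the minimum hash over the shingles."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingle_set
        ))
    return sig

def estimated_similarity(sig_a, sig_b):
    """Fraction of matching signature positions approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

min_hash_sim.sh is then assumed to compare the query document's signature against every other document's, keeping either the top 10/20 matches or everything scoring at least the .75 threshold.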
tf-idf
Use Case: Create a file containing the tf-idf of all term-document pairs in the collection.
Precondition: Inside the source/ directory there is at least one subdirectory containing item files from the ia. User is currently in the workspace/projectname directory.
Postcondition: Inside the processed/tfidf directory there are three files, idf.csv, df.csv, and tf.csv, holding the appropriate values.
Basic Flow:
- User runs
  $ tfidf.sh
- User navigates into the processed/tfidf directory to find tfidf.csv.
Use Case: Create a .json file containing the document frequency of all term-document pairs in the collection.
Precondition: Inside the source/ directory there is at least one subdirectory containing item files from the ia. User is currently in the workspace/projectname directory.
Postcondition: Inside the processed/tfidf directory there are three files, idf.csv, df.csv, and tf.csv, holding the appropriate values, as well as a fourth file, tfidf.json, containing the values.
Basic Flow:
- User runs
  $ tfidf.sh .json
- User navigates into the processed/tfidf directory to find df.csv and tfidf.json.
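A minimal sketch of the values the tf, df, idf, and tf-idf output files are assumed to hold, using whitespace tokenization and a plain log idf (both assumptions):

import math
from collections import Counter

docs = {"doc1.txt": "prima query bm25 query", "doc2.txt": "prima clustering"}  # placeholder collection

tf = {name: Counter(text.split()) for name, text in docs.items()}     # term frequency per document
df = Counter(term for counts in tf.values() for term in counts)       # document frequency per term
idf = {term: math.log(len(docs) / df[term]) for term in df}           # inverse document frequency

# tf-idf for every term-document pair, as recorded in the output files.
tfidf = {(name, term): count * idf[term]
         for name, counts in tf.items()
         for term, count in counts.items()}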
Word Count
Use Case: Count all words in a file.
Precondition: Inside the source/ directory there is at least one subdirectory containing item files from the ia. User is currently in the workspace/projectname directory.
Postcondition: Inside the processed/word_count directory there is now a file, word_count.csv, containing a list of files and directories with their word counts, including the desired file.
Basic Flow:
- User runs
  $ word_count.sh
- User navigates into the processed/word_count directory to find word_count.csv.
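A minimal sketch of the per-file counts word_count.csv is assumed to hold; treating a word as a whitespace-separated token, and the source/*/* layout, are assumptions:

import csv, glob, os

rows = []
for path in glob.glob("source/*/*"):                           # every item file under source/
    if os.path.isfile(path):
        with open(path, encoding="utf-8", errors="ignore") as f:
            rows.append((path, len(f.read().split())))         # word count = whitespace-separated tokens

with open("word_count.csv", "w", newline="") as out:
    csv.writer(out).writerows(rows)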