
Use Cases

File system
Use Case: Set up the file system used by the prima tools for input and output, and pull two collections from the Internet Archive (IA) to get metrics on.
Precondition: prima has been successfully installed.
Postcondition: The appropriate directories and files are all in place for the user to use the tools below.
Basic Flow

  1. To create a workspace, user runs

    $ init_workspace.sh
    $ cd workspace
    
  2. To create the first collection, user runs

    $ init_project.sh projectname1
    $ cd projectname1
    $ init_collection.sh
    
  3. To fetch the first collection from the archive, user runs

    $ fetch_collection.sh collectionname1
    
  4. To move back out to the workspace and create the second collection, user runs

    $ cd ..
    $ init_project.sh projectname2
    $ cd projectname2
    $ init_collection.sh
    
  5. To fetch the second collection from the archive, user runs

    $ fetch_collection.sh collectionname2
    
  6. User can then run the following to move into the appropriate project and use the tools below.

    $ cd projectname
    
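The resulting layout should look roughly like the sketch below. The subdirectory names are taken from the preconditions and postconditions in the use cases that follow; whether init_collection.sh creates processed/ up front or the tools create it on demand is an assumption.

    workspace/
    ├── projectname1/
    │   ├── source/      (item files fetched from the IA)
    │   └── processed/   (tool output: bm25/, k_means/, lda/, lsi/, min_hash/, tfidf/, word_count/)
    └── projectname2/
        ├── source/
        └── processed/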

Okapi BM25
Use Case: Return a list of 10 documents (default) that best match a set of keywords (prima, query, and bm25).
Precondition: Inside the source/ directory there is at least one subdirectory containing item files from the IA. User is currently in the workspace/projectname directory.
Postcondition: Inside the processed/bm25 directory there is a file, bm25.csv, containing (at the bottom) the query terms entered and a list of documents with their scores.
Basic Flow:

  1. User runs

    $ bm25.sh "prima query bm25" 
    
  2. User navigates into the processed/bm25 directory to find bm25.csv.
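Since each run is appended at the bottom of bm25.csv, the latest results can also be checked from the project directory without opening the file:

    $ tail processed/bm25/bm25.csv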

Use Case: Return a list of 50 documents that best match a set of keywords (prima, query, and bm25).
Precondition: Inside the source/ directory there is at least one subdirectory containing item files from the IA. User is currently in the workspace/projectname directory.
Postcondition: Inside the processed/bm25 directory there is a file, bm25.csv, containing (at the bottom) the query terms entered and a list of documents with their scores.
Basic Flow:

  1. User runs

    $ bm25.sh 50 "prima query bm25" 
    
  2. User navigates into the processed/bm25 directory to find bm25.csv.
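For reference, the scores come from the Okapi BM25 weighting. The textbook form is below; the parameter values (k_1, b) that prima uses are not documented here, so treat this as the standard formulation rather than prima's exact code:

    \mathrm{score}(D, Q) = \sum_{t \in Q} \mathrm{idf}(t) \cdot \frac{\mathrm{tf}(t, D)\,(k_1 + 1)}{\mathrm{tf}(t, D) + k_1\,(1 - b + b\,|D| / \mathrm{avgdl})}

where tf(t, D) is the count of term t in document D, |D| is the length of D, and avgdl is the average document length in the collection.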

K-means
Use Case: Cluster similar documents into 3 groups (default) and view the document clusters.
Precondition: Inside the source/ directory there is at least one subdirectory containing item files from the IA. User is currently in the workspace/projectname directory.
Postcondition: Inside the processed/k_means directory there is a file, k_means.csv, with documents listed in k rows, each row corresponding to one cluster.
Basic Flow:

  1. User runs

    $ k_means_clusterer.sh
    
  2. User navigates to the processed/k_means directory to find k_means.csv.

Use Case: Cluster similar documents into 7 groups and view the document clusters.
Precondition: Inside the source/ directory there is at least one subdirectory containing item files from the IA. User is currently in the workspace/projectname directory.
Postcondition: Inside the processed/k_means directory there is a file, k_means.csv, with documents listed in k rows, each row corresponding to one cluster.
Basic Flow:

  1. User runs

    $ k_means_clusterer.sh 7
    
  2. User navigates to the processed/k_means directory to find k_means.csv.
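A quick sanity check that the output has one row per cluster (7 here); this assumes k_means.csv has no header row:

    $ wc -l processed/k_means/k_means.csv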

Latent Dirichlet allocation
Use Case: Reduce the term-document matrix to 100 dimensions (default) using Latent Dirichlet allocation.
Precondition: Inside the source/ directory there is at least one subdirectory containing item files from the IA. User is currently in the workspace/projectname directory.
Postcondition: Inside the processed/lda directory there is a file, lda.csv, containing the new matrix.
Basic Flow:

  1. User runs

    $ lda.sh
    
  2. User navigates to the processed/lda directory to find lda.csv.
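The other tools take their parameters as positional arguments (bm25.sh 50, k_means_clusterer.sh 7, lsi.sh 200), so a non-default number of dimensions can presumably be passed the same way. This is an assumption from the pattern, not a documented interface, so verify against the script before relying on it:

    $ lda.sh 50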

Latent semantic indexing
Use Case: Reduce the term-document matrix to 200 dimensions using Latent semantic indexing.
Precondition: Inside the source/ directory there is at least one subdirectory containing item files from the IA. User is currently in the workspace/projectname directory.
Postcondition: Inside the processed/lsi directory there is a file, lsi.csv, containing the new matrix.
Basic Flow:

  1. User runs

    $ lsi.sh 200
    
  2. User navigates to the processed/lsi directory to find lsi.csv.
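To confirm the reduction, count the columns in the first row of lsi.csv. This assumes a plain comma-separated file with one document per row, so the expected output is 200 (an identifier column, if present, would add one):

    $ head -1 processed/lsi/lsi.csv | awk -F',' '{print NF}'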

Min Hash
Use Case: Find the 10 documents in the collection that are the most similar to a specific document, according to 3-shingles with 200 hash functions (the defaults).
Precondition: Inside the source/ directory there is at least one subdirectory containing item files from the IA. User is currently in the workspace/projectname directory.
Postcondition: Inside the processed/min_hash directory there is a file, min_hash_sim.csv, which holds, for each query document, the list of the 10 most similar documents to the query document and their scores.
Basic Flow:

  1. User runs the following to set up the min_hash table in min_hash.csv.

    $ min_hash.sh
    
  2. User runs

    $ min_hash_sim.sh source/folder/doc1.txt
    
  3. User navigates to the processed/min_hash directory to find min_hash_sim.csv, scrolls down to their query row ("doc1.txt") with the correct timestamp, and reviews the list of similar documents.
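Rather than scrolling, the rows for a given query document can be pulled out directly, assuming the document name appears verbatim in its row:

    $ grep -F "doc1.txt" processed/min_hash/min_hash_sim.csv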

Use Case: Find the 20 documents in the collection that are the most similar to a specific document, according to 4-shingles with 250 hash functions.
Precondition: Inside the source/ directory there is at least one subdirectory containing item files from the IA. User is currently in the workspace/projectname directory.
Postcondition: Inside the processed/min_hash directory there is a file, min_hash_sim.csv, which holds, for each query document, the list of the 20 most similar documents to the query document and their scores.
Basic Flow:

  1. User removes the old shingle table, shingles.db, from projectname/processed/ so that it is rebuilt with the new parameters. From the projectname directory user runs
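
    $ rm processed/shingles.db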

  2. User runs the following to set up the min_hash table in min_hash.csv.

    $ min_hash.sh 4 250
    
  3. User runs

    $ min_hash_sim.sh source/folder/doc2.txt 20
    
  4. User navigates to the processed/min_hash directory to find min_hash_sim.csv, scrolls down to their query row ("doc2.txt") with the correct timestamp, and reviews the list of similar documents.

Use Case: Find all documents in the collection that score at least .75 in min_hash similarity to a specific document.
Precondition: Inside the source/ directory there is at least one subdirectory containing item files from the IA. User has also already run min_hash.sh with the desired number of shingles and hash functions (steps 1 and 2 in the previous use case). User is currently in the workspace/projectname directory.
Postcondition: Inside the processed/min_hash directory there is a file, min_hash_sim.csv, which holds, for each query, the list of documents in the collection that scored at least .75 and their scores.
Basic Flow:

  1. User runs

    $ min_hash_sim.sh source/folder/doc1.txt .75
    
  2. User navigates to the processed/min_hash directory to find min_hash_sim.csv, scrolls down to their query row ("doc1.txt") with the correct timestamp, and reviews the list of similar documents.
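The scores estimate the Jaccard similarity between the shingle sets of the two documents. With n hash functions, the standard MinHash estimate is the fraction of hash functions on which the two documents' minimum hash values agree (that prima's scores follow this textbook definition is an assumption):

    \hat{J}(A, B) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\left[ \min h_i(A) = \min h_i(B) \right]

so a threshold of .75 selects documents whose shingle sets have an estimated Jaccard similarity of at least .75 with the query document's.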

tf-idf
Use Case: Create a file containing the tf-idf of all term-document pairs in the collection.
Precondition: Inside the source/ directory there is at least one subdirectory containing item files from the IA. User is currently in the workspace/projectname directory.
Postcondition: Inside the processed/tfidf directory there are four files, tfidf.csv, idf.csv, df.csv, and tf.csv, holding the appropriate values.
Basic Flow:

  1. User runs

    $ tfidf.sh
    
  2. User navigates into the processed/tfidf directory to find tfidf.csv.
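For reference, the textbook relationship between the values in these files, where N is the number of documents in the collection and df(t) is the number of documents containing term t (whether prima applies smoothing or a particular log base is not documented here):

    \mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t), \qquad \mathrm{idf}(t) = \log \frac{N}{\mathrm{df}(t)}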

Use Case: Create a .json file containing the tf-idf values of all term-document pairs in the collection.
Precondition: Inside the source/ directory there is at least one subdirectory containing item files from the IA. User is currently in the workspace/projectname directory.
Postcondition: Inside the processed/tfidf directory there are three files, idf.csv, df.csv, and tf.csv, holding the appropriate values, as well as a fourth file, tfidf.json, containing the values in JSON form.
Basic Flow:

  1. User runs

    $ tfidf.sh .json
    
  2. User navigates into the processed/tfidf directory to find df.csv and tfidf.json.
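tfidf.json can be sanity-checked from the shell; python -m json.tool fails loudly on malformed JSON and pretty-prints it otherwise:

    $ python -m json.tool processed/tfidf/tfidf.json | head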

Word Count
Use Case: Count all words in a file.
Precondition: Inside the source/ directory there is at least one subdirectory containing item files from the IA. User is currently in the workspace/projectname directory.
Postcondition: Inside the processed/word_count directory there is now a file, word_count.csv, containing a list of files and directories with their word counts, including the desired file.
Basic Flow:

  1. User runs

    $ word_count.sh
    
  2. User navigates into the processed/word_count directory to find word_count.csv.
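Since word_count.csv covers every file and directory, the count for one specific file can be pulled out with grep, assuming file paths appear verbatim in their rows (doc1.txt here is the hypothetical file name used in the Min Hash examples):

    $ grep -F "doc1.txt" processed/word_count/word_count.csv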
