Screen a set of documents for given terms.
Given a set of PDF files encoded with text metadata and a completed terms input table, check_for_terms.py searches each PDF for the submitted terms and their variations. The script outputs a matrix, in CSV format, identifying which documents contain which of the submitted terms.
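As a rough illustration of the output described above, the sketch below builds a document-by-term matrix and writes it as CSV. It is not taken from check_for_terms.py: the function names (`build_matrix`, `write_matrix`, `results_filename`), the orientation (one row per document, one column per Canonical term), and the True/False cell values are all assumptions made for the example.

```python
import csv
import io
from datetime import date


def build_matrix(found, canonicals):
    """Build rows for a document-by-term matrix.

    `found` maps a document name to the set of canonical terms found in it.
    Layout (documents as rows, terms as columns) is an assumption.
    """
    rows = [["document"] + list(canonicals)]
    for doc in sorted(found):
        rows.append([doc] + [term in found[doc] for term in canonicals])
    return rows


def write_matrix(rows, handle):
    """Write the matrix rows to an open text handle as CSV."""
    csv.writer(handle).writerows(rows)


def results_filename(today=None):
    """Return a name of the form results_YYYY-MM-DD.csv."""
    today = today or date.today()
    return f"results_{today.isoformat()}.csv"


if __name__ == "__main__":
    # Hypothetical search results for two documents.
    found = {"report.pdf": {"GDPR"}, "memo.pdf": set()}
    buffer = io.StringIO()
    write_matrix(build_matrix(found, ["GDPR", "HIPAA"]), buffer)
    print(results_filename())
    print(buffer.getvalue())
```

Writing to a handle rather than a fixed path keeps the sketch easy to try without touching the file system.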
- Identify a directory containing all PDF files to search. The script will search subdirectories.
- Update check_for_terms.py, line 21, to reflect the path for the above directory.
- Following the format of terms_template.csv, list all terms and variations to find. Name this file terms.csv and place the file in the base directory. Column descriptions:
  - Canonical - The shared name to report for a term or set of term variations. Results are reported once per unique Canonical.
  - Variation - A specific form of the term that the tool should match. A match on any Variation flags its Canonical as present.
  - Exclude - True removes the term from the text before searching; False searches for the term.
  - MatchPunctuation - True matches punctuation exactly; False ignores punctuation.
  - MatchCase - True matches case/capitalization exactly; False ignores case/capitalization.
- Run check_for_terms.py.
- Find the results in the provided directory, in a file named results_YYYY-MM-DD.csv.
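The column flags described in the steps above could be interpreted roughly as in the following sketch. This is one plausible reading, not the script's actual logic: `canonicals_present` and `_normalise` are hypothetical names, each row is assumed to be a dict with booleans already parsed, matching is plain substring matching with no word boundaries, and Exclude rows here honour only MatchCase.

```python
import re
import string

# Translation table that deletes ASCII punctuation characters.
_PUNCT = str.maketrans("", "", string.punctuation)


def _normalise(text, match_punctuation, match_case):
    """Strip punctuation and/or lowercase, depending on the row's flags."""
    if not match_punctuation:
        text = text.translate(_PUNCT)
    if not match_case:
        text = text.lower()
    return text


def canonicals_present(text, term_rows):
    """Return the set of Canonical names whose variations occur in `text`."""
    # Pass 1: remove every Exclude=True variation from the text.
    for row in term_rows:
        if row["Exclude"]:
            flags = 0 if row["MatchCase"] else re.IGNORECASE
            text = re.sub(re.escape(row["Variation"]), " ", text, flags=flags)

    # Pass 2: search for the remaining variations in the cleaned text.
    present = set()
    for row in term_rows:
        if row["Exclude"]:
            continue
        haystack = _normalise(text, row["MatchPunctuation"], row["MatchCase"])
        needle = _normalise(row["Variation"], row["MatchPunctuation"], row["MatchCase"])
        if needle and needle in haystack:
            present.add(row["Canonical"])
    return present


if __name__ == "__main__":
    rows = [
        {"Canonical": "Salvation Army", "Variation": "Salvation Army",
         "Exclude": True, "MatchPunctuation": True, "MatchCase": True},
        {"Canonical": "US", "Variation": "US",
         "Exclude": False, "MatchPunctuation": False, "MatchCase": True},
        {"Canonical": "Army", "Variation": "army",
         "Exclude": False, "MatchPunctuation": True, "MatchCase": False},
    ]
    text = "Donations to the Salvation Army rose in the U.S. last year."
    print(canonicals_present(text, rows))
```

In this example the excluded phrase is removed first, so "army" no longer matches, while "US" still matches "U.S." because punctuation is ignored for that row.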
Planned improvements:
- Add error handling for missing or unexpected values in the terms input table.
- Add a prompt for the filepath/filename.
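One possible shape for the error handling on the terms input table is sketched below. Nothing here exists in the script yet: `load_terms` is a hypothetical helper, and the rules it enforces (all five columns required, Canonical and Variation non-empty, flags limited to True/False in any letter case) are assumptions about what counts as a valid table.

```python
import csv
import io

REQUIRED = ["Canonical", "Variation", "Exclude", "MatchPunctuation", "MatchCase"]
BOOL_COLUMNS = REQUIRED[2:]


def load_terms(handle):
    """Read and validate a terms table from an open text handle.

    Raises ValueError with a descriptive message on the first problem found.
    """
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"terms table is missing columns: {', '.join(missing)}")

    rows = []
    for row_no, row in enumerate(reader, start=1):
        # Short rows yield None for absent cells, hence the `or ""`.
        for col in ("Canonical", "Variation"):
            if not (row[col] or "").strip():
                raise ValueError(f"data row {row_no}: {col} is empty")
        for col in BOOL_COLUMNS:
            value = (row[col] or "").strip().lower()
            if value not in ("true", "false"):
                raise ValueError(
                    f"data row {row_no}: {col} must be True or False, got {row[col]!r}"
                )
            row[col] = value == "true"
        rows.append(row)
    return rows


if __name__ == "__main__":
    sample = io.StringIO(
        "Canonical,Variation,Exclude,MatchPunctuation,MatchCase\n"
        "Army,army,False,True,False\n"
    )
    print(load_terms(sample))
```

Failing on the first bad row keeps the message specific; collecting all problems before raising would be a reasonable alternative for large tables.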