pdf_scraper is a tool that finds all PDF files in a folder and all of its sub-folders, extracts the text from each PDF file, removes punctuation and stop words from the text, and counts how often each remaining word occurs. The most common keywords are stored for each PDF file together with its file path. Similar PDF files can then be identified by comparing the keywords between files.
$ ./pdf_scraper <path>
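The steps described above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual implementation: the function names, the tiny stop-word list, and the use of Jaccard similarity for comparing keyword sets are all assumptions made for the example. Text extraction is left out because it depends on a PDF library, which the description does not name.

```python
import string
from collections import Counter
from pathlib import Path

# Assumption: a tiny illustrative stop-word list; a real run would use a larger one.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "is", "to", "for"}


def find_pdfs(root):
    """Return all PDF paths under root, including sub-folders (hypothetical helper)."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def top_keywords(text, n=5):
    """Strip punctuation and stop words, then return the n most frequent words."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    words = [w for w in cleaned.split() if w not in STOP_WORDS]
    return [word for word, _ in Counter(words).most_common(n)]


def keyword_overlap(a, b):
    """Jaccard similarity of two keyword lists (assumed metric), from 0.0 to 1.0."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0
```

Given the extracted text of two files, calling top_keywords on each and passing the results to keyword_overlap yields a score where higher values suggest more similar documents.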
pdf_scraper is not yet available on PyPI. Once it has been published, it can be installed with:
$ python -m pip install pdf_scraper