Extracts the most common key words from a pdf file and finds similar pdf files in the same folder or sub-folder.

aojanzen/pdf_scraper

pdf_scraper

pdf_scraper is a tool that finds all pdf files in a folder and all of its sub-folders, extracts the text from each pdf file, removes punctuation and stop-words from the text, and counts how often each remaining word occurs. The most common keywords are stored for each pdf file together with its file path. Finally, similar pdf files can be identified by comparing the keywords between files.

$ ./pdf_scraper <path>
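
The steps described above can be sketched in a few lines of Python. This is only an illustration, not the code in this repository: the function names, the tiny stop-word list, and the use of Jaccard similarity to compare keyword sets are all assumptions. Text extraction is left out because it depends on the PDF library used, so `top_keywords` expects text that has already been extracted.

```python
import string
from collections import Counter
from pathlib import Path

# Hypothetical, deliberately small stop-word list; a real run would use a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "in", "to", "is", "for", "on", "with"}

# Translation table that turns every ASCII punctuation character into a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def find_pdfs(root):
    """Return all .pdf files under root, including sub-folders, in sorted order."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def top_keywords(text, n=10):
    """Strip punctuation and stop-words, then return the n most common words."""
    cleaned = text.lower().translate(_PUNCT_TO_SPACE)
    words = [w for w in cleaned.split() if w not in STOP_WORDS]
    return [word for word, _count in Counter(words).most_common(n)]


def keyword_similarity(keywords_a, keywords_b):
    """Jaccard similarity of two keyword lists (an assumed metric, not from the README)."""
    a, b = set(keywords_a), set(keywords_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

With these helpers, each path returned by `find_pdfs` would be mapped to the result of `top_keywords` on its extracted text, and pairs of files with a high `keyword_similarity` would be reported as similar.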

Installing pdf_scraper

pdf_scraper is not yet available on PyPI, but I will do my best to publish it. Once it is available, it should be installable with:

$ python -m pip install pdf_scraper
