GitHub - lanhamt/pypdfsearch: search aggregated pdfs efficiently

#pyPDFSearch -- v.CHARLIE

##Overview

pyPDFSearch aggregates PDF search across a webpage. Instead of opening each PDF linked on a page one at a time, pyPDFSearch lets you download and search all of them concurrently.
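As a rough illustration of the concurrent search idea, the sketch below searches several documents at once with one thread per document. It is not taken from `pyPDFSearch.py`: the names `search_all` and `search_one` are hypothetical, and plain strings stand in for the text of downloaded PDFs.

```python
import threading


def search_all(documents, term, search_one):
    """Search every document in its own thread; return the names that match.

    documents  -- dict mapping a name (for example a PDF filename) to its text
    search_one -- hypothetical callable(text, term) -> bool
    """
    hits = []
    lock = threading.Lock()

    def worker(name, text):
        if search_one(text, term):
            with lock:  # the hits list is shared between threads
                hits.append(name)

    threads = [
        threading.Thread(target=worker, args=(name, text))
        for name, text in documents.items()
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()  # wait until every search has finished
    return sorted(hits)


if __name__ == "__main__":
    docs = {"a.pdf": "threads and locks", "b.pdf": "nothing relevant"}
    print(search_all(docs, "locks", lambda text, term: term in text))
```

Sorting the result keeps the output stable, since threads may finish in any order.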

##Installation

You can run pyPDFSearch with Python 3 by invoking the following at the command line:

$ python3 pyPDFSearch.py

Note that pyPDFSearch uses the following modules: requests, re, sys, os, hashlib, _thread, threading, urllib, and PyPDF2. Of these, only requests and PyPDF2 are third-party packages; the rest ship with Python 3's standard library. Dependencies can be installed with the following (install using pip3, due to version requirements for the requests and PyPDF2 modules):

$ pip3 install -r requirements.txt

If you encounter difficulties installing dependencies, you may need administrator access; in that case, prefix the above command with sudo.
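To check which of the modules listed above are importable before running the program, a small script along these lines may help. It is a sketch, not part of pyPDFSearch; the helper name `missing_modules` is hypothetical.

```python
import importlib.util

# Module names as listed in this README.
REQUIRED = [
    "requests", "re", "sys", "os", "hashlib",
    "_thread", "threading", "urllib", "PyPDF2",
]


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are available.")
```

`find_spec` only locates a module without importing it, so the check itself has no side effects.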

##Technical

pyPDFSearch downloads online PDFs to your local machine and searches them there, storing them in a directory it creates inside the working directory where it is installed. When you are finished searching, pyPDFSearch automatically deletes the local directory it created to store your files.

However, if you run into an unexpected runtime error, the directory may not be deleted. In that case, the next time you use pyPDFSearch, the program will recognize that the directory already exists and ask whether you would like to overwrite it. If you decline, you will not be able to run pyPDFSearch, since it needs a local directory.
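One common way to make this kind of cleanup robust against runtime errors is a context manager with a `finally` clause. The sketch below is an illustration under that assumption, not the approach `pyPDFSearch.py` actually takes; `scratch_directory` is a hypothetical name.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def scratch_directory(path):
    """Create `path`, yield it, and remove it afterwards, even on error."""
    os.makedirs(path)  # raises FileExistsError if a stale directory remains
    try:
        yield path
    finally:
        # Runs whether the body finished normally or raised an exception.
        shutil.rmtree(path, ignore_errors=True)


if __name__ == "__main__":
    base = tempfile.mkdtemp()
    target = os.path.join(base, "pdfs")
    try:
        with scratch_directory(target):
            raise RuntimeError("simulated failure while searching")
    except RuntimeError:
        pass
    print("directory still exists:", os.path.exists(target))
    shutil.rmtree(base)
```

The `FileExistsError` raised by `os.makedirs` is the point where a program could ask the user whether to overwrite a directory left behind by an earlier run.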

##Limitations

Because PDF documents are generated in nonuniform ways, pyPDFSearch is limited to PDFs that are well-formed according to the PDF standard. For non-standard PDFs, the PyPDF2 module that pyPDFSearch relies on cannot extract text from the file, so search will not find anything in it. The search still iterates over every file (each with empty contents) and eventually returns a valid message that the term was not found.
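The behavior described above can be pictured with the following sketch, in which a file whose text could not be extracted is represented by an empty string. The function name `search_texts` and the message wording are hypothetical and are not the program's actual output.

```python
def search_texts(texts, term):
    """Report which files contain `term`, ignoring case.

    texts -- dict mapping a filename to its extracted text; an empty
             string stands for a PDF whose text could not be extracted.
    """
    needle = term.lower()
    matches = [name for name, text in texts.items() if needle in text.lower()]
    if not matches:
        return f"'{term}' was not found in any of {len(texts)} file(s)."
    return f"'{term}' found in: {', '.join(sorted(matches))}"


if __name__ == "__main__":
    # Every file yielded empty text, so the search completes with no match.
    print(search_texts({"a.pdf": "", "b.pdf": ""}, "gradient"))
```

Every file is still visited; the empty strings simply never match, which is why the search ends with a not-found message rather than an error.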

###Contact

tlanham@cs.stanford.edu

