PDF Keyword Extractor Web

A web-based application to extract keywords and summaries from PDF documents, with OCR support for scanned PDFs. The app uses Flask for the web interface, PyMuPDF for PDF text extraction, and Tesseract OCR for scanned PDFs. Keyword extraction is available via Frequency and TF-IDF methods, and automatic text summarization is provided.

Features

- Extract text from PDFs (supports OCR for scanned PDFs)
- Keyword extraction using Frequency or TF-IDF methods
- Automatic text summarization
- Custom stopwords support (via data/stopwords.txt)
- User-friendly web interface

Installation

1. Install System Dependencies

Tesseract OCR and Poppler are required for OCR and PDF-to-image conversion.
- Windows:
  - Tesseract
  - Poppler
  - Ensure Tesseract is installed in C:\Program Files\Tesseract-OCR\tesseract.exe or C:\Program Files (x86)\Tesseract-OCR\tesseract.exe, or update the path in your system environment variables.
- macOS:
  - brew install tesseract poppler
- Linux:
  - sudo apt install tesseract-ocr poppler-utils
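
Before starting the app, it can help to confirm that both tools are reachable from Python. The following is a minimal sketch, not part of the repository; it assumes the executables are named `tesseract` and `pdftoppm` (one of the Poppler utilities that pdf2image calls), and the function name `find_missing_tools` is hypothetical.

```python
import shutil

# Assumed executable names: "tesseract" is the Tesseract CLI and
# "pdftoppm" is a Poppler utility used by pdf2image.
REQUIRED_TOOLS = ("tesseract", "pdftoppm")


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the names of tools that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("Tesseract and Poppler were found on PATH.")
```

If a tool is reported missing on Windows, adding its install folder to the PATH environment variable is usually enough.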

2. Clone the Repository

git clone https://github.com/rajdeep13-coder/PDF_Keyword_extractor.git
cd PDF_Keyword_extractor

3. Install Python Dependencies

It is recommended to use a virtual environment.

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt

Usage

Start the Application:
```
python app.py
```
The app runs at http://127.0.0.1:5000/ by default.

Upload a PDF:
- Go to the web interface.
- Upload a PDF file.
- Choose the keyword extraction method (Frequency or TF-IDF).
- The app will display extracted keywords and a summary.

OCR Support:
- If the PDF is scanned or contains images, OCR is used automatically when text extraction fails or returns too little text.
- Ensure Tesseract and Poppler are installed and accessible.
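
The OCR fallback described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in extractor/pdf_reader.py: the function names and the `min_chars` threshold of 50 are hypothetical, and the app's real cutoff may differ. The third-party imports are deferred so that the fallback logic can be exercised without the OCR stack installed.

```python
def extract_embedded_text(pdf_path):
    """Read the embedded text layer with PyMuPDF (imported lazily)."""
    import fitz  # PyMuPDF

    with fitz.open(pdf_path) as doc:
        return "\n".join(page.get_text() for page in doc)


def ocr_pdf(pdf_path):
    """Render each page to an image with pdf2image, then OCR it with pytesseract."""
    from pdf2image import convert_from_path  # needs Poppler on PATH
    import pytesseract  # needs Tesseract on PATH

    pages = convert_from_path(pdf_path)
    return "\n".join(pytesseract.image_to_string(image) for image in pages)


def read_pdf(pdf_path, min_chars=50, extract=extract_embedded_text, ocr=ocr_pdf):
    """Return the embedded text, falling back to OCR when it is too short.

    min_chars is a hypothetical threshold, not the value used by the app.
    """
    text = extract(pdf_path)
    if len(text.strip()) >= min_chars:
        return text
    return ocr(pdf_path)
```

Passing `extract` and `ocr` as parameters is a design choice for this sketch only: it keeps the decision rule separate from the libraries that do the heavy lifting.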

Project Structure

PDF_Keyword_extractor/
├── app.py                  # Main Flask application
├── extractor/
│   ├── keyword_finder.py   # Keyword extraction and summarization logic
│   └── pdf_reader.py       # PDF text extraction with OCR support
├── data/
│   └── stopwords.txt       # Custom stopwords (optional)
├── static/
│   └── styles.css          # CSS styles
├── templates/
│   └── index.html          # Web interface template
├── requirements.txt        # Python dependencies
└── README.md               # Project documentation

Configuration

Secret Key: The Flask app reads its secret key from the SECRET_KEY environment variable and falls back to the placeholder 'your_secret_key_here' if the variable is unset. Set SECRET_KEY to a random value before deploying.
Stopwords: You can customize stopwords by editing data/stopwords.txt.
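
To illustrate how a custom stopword list might feed into the Frequency method, here is a small sketch. It assumes data/stopwords.txt holds one word per line, which the README does not state, and the function names are hypothetical rather than taken from extractor/keyword_finder.py.

```python
import re
from collections import Counter
from pathlib import Path


def parse_stopwords(text):
    """Parse stopwords, assuming one word per line; blank lines are skipped."""
    return {line.strip().lower() for line in text.splitlines() if line.strip()}


def load_stopwords(path="data/stopwords.txt"):
    """Load stopwords from a file; a missing file yields an empty set."""
    file = Path(path)
    if not file.is_file():
        return set()
    return parse_stopwords(file.read_text(encoding="utf-8"))


def frequency_keywords(text, stopwords=frozenset(), top_n=10):
    """Return the top_n most frequent words that are not stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(word for word in words if word not in stopwords)
    return [word for word, _ in counts.most_common(top_n)]
```

For example, `frequency_keywords(document_text, load_stopwords())` would rank words while ignoring everything listed in the stopwords file.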

Dependencies

- Flask
- nltk
- scikit-learn
- gensim
- scipy
- PyMuPDF
- Pillow
- gunicorn (for production deployment)
- pdf2image
- pytesseract

Notes

- For best OCR results, ensure Tesseract and Poppler are correctly installed and configured in your system PATH.
- The app will automatically use OCR if the extracted text is too short or if explicitly requested in the code.
- Summarization uses a TF-IDF-based approach for sentence ranking.
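
As a rough picture of TF-IDF sentence ranking, the sketch below treats each sentence as a document, scores it by the average TF-IDF weight of its words, and keeps the top-scoring sentences in their original order. It is a hand-rolled approximation using only the standard library; the app itself may rely on scikit-learn or a different scoring formula, and the function names and the naive sentence splitter are assumptions.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter: break after ., ! or ? when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def summarize(text, max_sentences=2):
    """Pick the highest-scoring sentences and return them in original order."""
    sentences = split_sentences(text)
    tokenized = [re.findall(r"[a-z]+", s.lower()) for s in sentences]
    n = len(sentences)

    # Document frequency: the number of sentences each word appears in.
    df = Counter(word for tokens in tokenized for word in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Sum of tf * smoothed idf, normalised by sentence length.
        total = sum(
            count * (math.log((1 + n) / (1 + df[word])) + 1)
            for word, count in tf.items()
        )
        scores.append(total / len(tokens) if tokens else 0.0)

    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:max_sentences]
    return " ".join(sentences[i] for i in sorted(top))
```

Normalising by sentence length is one possible choice; without it, long sentences would tend to win simply because they contain more words.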

License

MIT License
