PDF Question Answering System

This Python script implements a question answering system that processes PDF documents and answers questions based on their content. The system utilizes natural language processing techniques and pre-trained models to extract relevant information and generate answers.

Overview

The system consists of several key components:

PDF text extraction
Text indexing and embedding
Semantic search for relevant context
Question answering based on retrieved context

Input

The main function main() takes two inputs:

pdf_path (str): The file path to the PDF document to be processed.
question (str): The question to be answered based on the PDF content.

Output

The system returns a string containing the answer to the given question based on the content of the PDF.
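
The path from these inputs to the returned string can be sketched as a chain of the four stages listed in the Overview. This is a minimal sketch, not the script's actual code: `run_pipeline` and its stage parameters are hypothetical names, and the stages are passed in as callables only so the example runs without the real libraries.

```python
from typing import Callable, Sequence


def run_pipeline(
    pdf_path: str,
    question: str,
    extract: Callable[[str], str],
    build: Callable[[str], object],
    retrieve: Callable[[object, str], Sequence[str]],
    answer: Callable[[str, str], str],
) -> str:
    """Chain the four stages: extract -> index -> retrieve -> answer."""
    text = extract(pdf_path)              # PDF text extraction
    index = build(text)                   # text indexing and embedding
    passages = retrieve(index, question)  # semantic search
    context = " ".join(passages)          # merge retrieved sentences into one context
    return answer(question, context)      # question answering


# Toy stand-ins (assumptions), used only to show the data flow end to end.
demo_answer = run_pipeline(
    "example.pdf",
    "What is FAISS?",
    extract=lambda path: "FAISS is a similarity search library. PyMuPDF reads PDFs.",
    build=lambda text: text.split(". "),
    retrieve=lambda index, q: [s for s in index if "FAISS" in s],
    answer=lambda q, ctx: ctx,
)
print(demo_answer)
```

In the real script, each stand-in would be replaced by the corresponding function described under Key Components.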

Key Components

PDF Text Extraction

The pdf_to_text() function uses the PyMuPDF library to convert a PDF file into plain text.
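
A minimal sketch of what such a function might look like, assuming it simply concatenates the plain text of every page. The `open_pdf` parameter is a hypothetical addition that lets the sketch be exercised without PyMuPDF installed; by default it falls back to `fitz.open`.

```python
from typing import Callable, Optional


def pdf_to_text(pdf_path: str, open_pdf: Optional[Callable] = None) -> str:
    """Return the text of every page, joined by newlines."""
    if open_pdf is None:
        import fitz  # PyMuPDF; imported lazily so the sketch loads without it

        open_pdf = fitz.open
    with open_pdf(pdf_path) as doc:
        # Iterating a PyMuPDF document yields pages; get_text() returns plain text.
        return "\n".join(page.get_text() for page in doc)
```

Scanned PDFs that contain only images yield little or no text this way, since no OCR is involved in this sketch.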

Text Indexing and Embedding

The build_index() function creates a FAISS index from the extracted text using a sentence transformer model (default: 'all-MiniLM-L6-v2'). This index allows for efficient semantic search.
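
To make the indexing step concrete, the sketch below mimics what an exact (flat) L2 index does: it stores one vector per sentence and, at query time, returns the nearest vectors by squared Euclidean distance. It is a pure-Python stand-in, not FAISS itself, and its `search` signature is simplified. `FlatL2Index` and `toy_embed` are hypothetical names; `toy_embed` is a placeholder for the sentence transformer model, and the regex sentence splitter is an assumption about how the text is chunked.

```python
import re
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


class FlatL2Index:
    """Exact nearest-neighbour search over stored vectors (stand-in for a flat FAISS index)."""

    def __init__(self) -> None:
        self.vectors: List[Vector] = []

    def add(self, vectors: Sequence[Vector]) -> None:
        self.vectors.extend(vectors)

    def search(self, query: Vector, k: int) -> List[Tuple[float, int]]:
        """Return up to k (squared distance, position) pairs, nearest first."""
        scored = [
            (sum((a - b) ** 2 for a, b in zip(query, vec)), pos)
            for pos, vec in enumerate(self.vectors)
        ]
        return sorted(scored)[:k]


def build_index(text: str, embed: Callable[[str], Vector]) -> Tuple[FlatL2Index, List[str]]:
    """Split text into sentences, embed each one, and store the vectors."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    index = FlatL2Index()
    index.add([embed(s) for s in sentences])
    return index, sentences


def toy_embed(sentence: str) -> Vector:
    """Hypothetical embedding: counts of two keywords (a real model returns dense vectors)."""
    words = sentence.lower().split()
    return [float(words.count("index")), float(words.count("pdf"))]


index, sentences = build_index("The index stores vectors. A pdf holds pages.", toy_embed)
```

Returning the sentence list alongside the index matters because the index only knows vector positions; the sentences are needed to map search results back to text.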

Semantic Search

The retrieve_relevant_text() function performs a semantic search to find the most relevant sentences from the PDF content based on the input question.
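
The lookup can be sketched as follows, assuming the index exposes a FAISS-style `search(queries, k)` method that returns a pair of (distances, ids) with one row per query. The parameter names, the `embed` callable, and the default `top_k=3` are assumptions for illustration; the script's actual signature may differ.

```python
from typing import Callable, List, Sequence


def retrieve_relevant_text(
    question: str,
    index,
    sentences: Sequence[str],
    embed: Callable[[List[str]], object],
    top_k: int = 3,
) -> List[str]:
    """Return the top_k sentences whose embeddings are closest to the question."""
    query = embed([question])  # batch containing a single query vector
    # FAISS-style search: (distances, ids), each with one row per query.
    _distances, ids = index.search(query, top_k)
    # FAISS pads the id row with -1 when fewer than top_k vectors are stored.
    return [sentences[i] for i in ids[0] if i != -1]
```

With the real libraries, `embed` would typically be the `encode` method of the sentence transformer model, so that the question and the indexed sentences share one embedding space.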

Question Answering

The answer_question() function uses a pre-trained extractive question-answering model (default: 'roberta-base-squad2') to select the answer span from the retrieved context.
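
A minimal sketch of this step, assuming the model is wrapped in a Transformers question-answering pipeline. Models fine-tuned on SQuAD are extractive readers: they pick a span out of the context rather than writing free text. The `qa` parameter is a hypothetical addition so the function can be tried with a stub; by default the pipeline is built from `model_name`.

```python
from typing import Callable, Optional


def answer_question(
    question: str,
    context: str,
    qa: Optional[Callable[..., dict]] = None,
    model_name: str = "roberta-base-squad2",
) -> str:
    """Extract an answer span from context using an extractive QA model."""
    if qa is None:
        from transformers import pipeline  # imported lazily; loading the model is slow

        qa = pipeline("question-answering", model=model_name, tokenizer=model_name)
    result = qa(question=question, context=context)
    # The pipeline returns a dict with "answer", "score", "start" and "end".
    return result["answer"]
```

The "score" field of the result can be used to reject low-confidence answers, which is useful when the retrieved context does not actually contain the answer.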

Usage

Install the required libraries and make the pre-trained models available (see Notes), then call main() as follows:

from gpt import main  # assumes the script is saved as gpt.py and is importable

pdf_path = 'example.pdf'
question = 'What concerns did Trump raise about NATO during his campaign?'
answer = main(pdf_path, question)
print(f"Answer: {answer}")

Dependencies

PyMuPDF (imported as fitz)
FAISS (faiss-cpu or faiss-gpu)
Transformers (transformers)
Sentence Transformers (sentence-transformers)
PyTorch (torch)

Notes

The script assumes that the required pre-trained models ('all-MiniLM-L6-v2' and 'roberta-base-squad2') are available in the current working directory.
Answer quality depends on the quality of the PDF content and its relevance to the question asked.
For optimal results, ensure that the input question is clear and directly related to the content of the PDF.
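
Because the models are expected in the current working directory, a quick check before running can save a confusing load error. The helper below is a hypothetical sketch that assumes each model is stored as a directory named after the model; `missing_models` and `REQUIRED_MODELS` are illustrative names, not part of the script.

```python
from pathlib import Path
from typing import Iterable, List

REQUIRED_MODELS = ("all-MiniLM-L6-v2", "roberta-base-squad2")


def missing_models(names: Iterable[str] = REQUIRED_MODELS, base_dir: str = ".") -> List[str]:
    """Return the model names that have no directory under base_dir."""
    return [name for name in names if not (Path(base_dir) / name).is_dir()]
```

Calling `missing_models()` from the directory where the script runs returns an empty list when both model directories are present.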
