Add DocuSense, refactor, update tests and README.
bshastry committed Oct 10, 2023
1 parent fd3e451 commit edd3ff4
Showing 7 changed files with 236 additions and 33 deletions.
39 changes: 29 additions & 10 deletions README.md
@@ -1,21 +1,23 @@
# DocuBot
# DocuBot and DocuSense

[![Run Tests](https://github.com/bshastry/docubot/actions/workflows/tests.yml/badge.svg?branch=main)](https://github.com/bshastry/docubot/actions/workflows/tests.yml)[![Bandit Security Scan](https://github.com/bshastry/docubot/actions/workflows/bandit.yaml/badge.svg?branch=main)](https://github.com/bshastry/docubot/actions/workflows/bandit.yaml)[![Run Coverage](https://github.com/bshastry/docubot/actions/workflows/coverage.yml/badge.svg?branch=main)](https://github.com/bshastry/docubot/actions/workflows/coverage.yml)

DocuBot is a command-line chatbot that answers questions using a knowledge base of documents provided by you.
DocuBot and DocuSense are command-line tools.
DocuBot is a chatbot that answers questions using a knowledge base of documents provided by you.
It allows you to interactively get answers to questions with citations from the documents provided.
It is written in Python3.
DocuSense summarizes the document provided by you.
They are written in Python3.

## Supported Document Types

DocuBot supports the following document types:
DocuBot and DocuSense support the following document types:

- .pdf: Portable Document Format
- .docx: Microsoft Word Document
- .md: Markdown Document
- .txt: Plain Text Document

## Features
## DocuBot Features

- Session based: DocuBot remembers previous interactions within the current session.
- Citations provided: DocuBot generates answers based on information from specific documents. It provides citations to these documents, including page numbers if available.
@@ -34,10 +36,12 @@ To avoid OpenAI rate-limiting issues, it is recommended to preload funds into yo

**Note:** DocuBot provides an estimated cost of indexing documents at the beginning of the process. This helps you understand the potential cost implications before proceeding. Please review the estimated cost and ensure that you have sufficient funds in your OpenAI account to cover the indexing process.

DocuSense does not require the Pinecone API and environment keys.


## Installation

To use DocuBot, follow these steps:
To use DocuBot and DocuSense, follow these steps:

1. Clone the repository:

@@ -60,6 +64,7 @@ To use DocuBot, follow these steps:
- `OPENAI_API_KEY`: Your OpenAI API key

Make sure to replace the placeholder values with your actual API keys and ENV variables.
If you are only going to use DocuSense, providing an `OPENAI_API_KEY` is sufficient.
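
   A minimal `.env` sketch is shown below. `OPENAI_API_KEY` is the variable named above; the Pinecone entries are only needed for DocuBot, and the exact variable names shown here are placeholders, so use the names listed earlier in this step.

   ```bash
   # .env -- example values only; replace the placeholders with your real keys.
   OPENAI_API_KEY=your-openai-api-key
   # Needed for DocuBot only; these variable names are illustrative.
   PINECONE_API_KEY=your-pinecone-api-key
   PINECONE_ENV=your-pinecone-environment
   ```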


4. Collect documents you want DocuBot to work with in a local sub-directory:
@@ -77,19 +82,33 @@ To use DocuBot, follow these steps:

You could create a similar script for your specific use-case.

5. Run the `docubot.py` script:
DocuSense operates on a single document, so you can skip this step.

5. Run the script:

To use DocuBot, run

```bash
python3 docubot.py /path/to/documents/directory
```

Please replace `/path/to/documents/directory` with the path to the directory that holds documents you want DocuBot to interface with (e.g., `ethereum-docs` from the previous step)

To use DocuSense, run

```bash
python3 docusense.py /path/to/document /path/to/summary.txt [--chunk_size <chunk_size>] [--chunk_overlap <chunk_overlap>]
```
`--chunk_size` and `--chunk_overlap` are optional arguments that set the size of each document chunk and the overlap between consecutive chunks (both measured in OpenAI tokens).
`--chunk_size` defaults to 3300 tokens, and `--chunk_overlap` defaults to 100 tokens.

**Note:** DocuSense splits a large document into smaller chunks when it cannot be summarized in one shot. The chunk size and overlap affect how large documents are summarized: smaller chunk sizes and larger chunk overlaps may result in more OpenAI API calls but offer finer granularity. The defaults were chosen as a balance between summarization cost and accuracy. They may not work for every document, so you can use these parameters to arrive at a trade-off that is acceptable to you.
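
For example, a run with smaller chunks and a larger overlap might look like the following; the input and output file names are placeholders.

```bash
python3 docusense.py whitepaper.pdf whitepaper-summary.txt --chunk_size 2000 --chunk_overlap 200
```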

## Usage

Once DocuBot is running, you can start asking questions. Simply type your question and press Enter. To quit DocuBot, type "quit" or "exit".

## Examples
## DocuBot Examples

Here are some examples of questions you can ask DocuBot:

@@ -123,6 +142,6 @@ If you'd like to contribute to this project, please open an issue or submit a pu
## Liability Information
DocuBot is released under the MIT license. Please note that while DocuBot is designed to provide useful information, it should not be considered a substitute for professional advice. The developers and contributors of DocuBot shall not be held liable for any damages or losses arising from the use of this application.
DocuBot and DocuSense are released under the MIT license. Please note that while they are designed to provide useful information, they should not be considered a substitute for professional advice. The developers and contributors of DocuBot and DocuSense shall not be held liable for any damages or losses arising from the use of these applications.
It is recommended to use DocuBot responsibly and exercise caution when relying on its responses. If in doubt, it is always a good idea to consult with domain experts or refer to trusted sources for accurate information.
It is recommended to use DocuBot and DocuSense responsibly and exercise caution when relying on their responses. If in doubt, it is always a good idea to consult with domain experts or refer to trusted sources for accurate information.
13 changes: 13 additions & 0 deletions document_loaders/document_loaders.py
@@ -167,3 +167,16 @@ def chunk_data(
    )
    chunks = text_splitter.split_documents(data)
    return chunks


def merge_document(document: List[T]) -> str:
    """
    Merge a list of documents into a single string.

    Args:
        document (List[T]): A list of documents to merge.

    Returns:
        str: A single string containing the merged documents.
    """
    return "\n\n".join([page.page_content for page in document])
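
For illustration, here is a minimal sketch of how the new `merge_document` helper behaves; the `Page` class below is a stand-in for the document objects returned by the loaders and is not part of the repository.

```python
from document_loaders.document_loaders import merge_document


class Page:
    """Stand-in for a loaded document page exposing `page_content`."""

    def __init__(self, page_content: str) -> None:
        self.page_content = page_content


pages = [Page("First page."), Page("Second page.")]
# Pages are joined with a blank line between them.
assert merge_document(pages) == "First page.\n\nSecond page."
```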
147 changes: 147 additions & 0 deletions docusense.py
@@ -0,0 +1,147 @@
#!/usr/bin/env python3
"""
This script provides functionality for summarizing a given document using OpenAI's GPT-3.5-turbo model.
It includes the 'init()' function to initialize environment variables, the 'summarize()' function to generate summaries,
and the 'docusense()' function as the entry point for the script. The 'docusense()' function takes command-line arguments
for the document path, chunk size, and chunk overlap. It utilizes prompts and chains to perform the summarization process.
"""


def init():
    """
    Initializes the environment variables by loading the .env file.

    Returns:
        None
    """
    from dotenv import load_dotenv, find_dotenv

    load_dotenv(find_dotenv(), override=True)


def summarize(
    document: str,
    summary_file: str,
    chunk_size: int,
    chunk_overlap: int,
    max_single_shot_num_tokens: int = 2048,
) -> None:
    """
    Summarizes a given document using OpenAI's GPT-3.5-turbo model.

    Args:
        document (str): The path to the document to be summarized.
        summary_file (str): The path to the file the summary is written to.
        chunk_size (int): The size of each chunk of the document to be summarized.
        chunk_overlap (int): The amount of overlap between each chunk of the document.
        max_single_shot_num_tokens (int, optional): The maximum number of tokens allowed for a single-shot summarization. Defaults to 2048.

    Returns:
        None

    Raises:
        FileNotFoundError: If the specified document path does not exist.
    """
    from langchain.chat_models import ChatOpenAI
    from langchain import PromptTemplate
    from langchain.chains import LLMChain
    from langchain.chains.summarize import load_summarize_chain
    from document_loaders.document_loaders import (
        load_document,
        merge_document,
        chunk_data,
    )
    from text_utils.text_utils import num_tokens_and_cost

    llm = ChatOpenAI(temperature=0, model="gpt-3.5-turbo")

    map_prompt = """
    Write a concise summary of the following:
    Text: `{text}`
    CONCISE SUMMARY:
    """
    map_prompt_template = PromptTemplate(template=map_prompt, input_variables=["text"])
    combine_prompt = """
    Write a concise summary of the following text that covers key points.
    Add a title to the summary.
    Start the summary with an INTRODUCTION PARAGRAPH that gives an overview of the topic FOLLOWED
    by BULLET POINTS if possible AND end the summary with a CONCLUSION.
    Text: `{text}`
    """
    combine_prompt_template = PromptTemplate(
        template=combine_prompt, input_variables=["text"]
    )

    doc = load_document(document)
    num_tokens, cost = num_tokens_and_cost(doc)
    print(f"Approximate summarization cost: ${cost:.4f}")
    if num_tokens <= max_single_shot_num_tokens:
        # Short documents fit into a single prompt, so one LLM call suffices.
        chain = LLMChain(llm=llm, prompt=combine_prompt_template)
        print("Running single-shot summarization")
        summary = chain.run({"text": merge_document(doc)})
        print(f"Writing summary to {summary_file}... ", end="")
        with open(summary_file, "w") as f:
            f.write(summary)
        print("Done")
    else:
        # Longer documents are chunked and summarized with a single map-reduce chain.
        chain = load_summarize_chain(
            llm=llm,
            chain_type="map_reduce",
            map_prompt=map_prompt_template,
            combine_prompt=combine_prompt_template,
        )
        print("Running multi-shot summarization")
        summary = chain.run(
            chunk_data(data=doc, chunk_size=chunk_size, chunk_overlap=chunk_overlap)
        )
        print(f"Writing summary to {summary_file}... ", end="")
        with open(summary_file, "w") as f:
            f.write(summary)
        print("Done")


def docusense() -> None:
    """
    Takes a document path and a summary file path and summarizes the document using DocuSense.
    It also takes optional arguments for chunk size and overlap.

    Returns:
        None
    """
    import argparse

    parser = argparse.ArgumentParser(description="DocuSense")
    parser.add_argument(
        "document", type=str, help="Path to the document to be summarized."
    )
    parser.add_argument(
        "summary_file",
        type=str,
        help="Path to the file where the summary will be written.",
    )
    parser.add_argument(
        "--chunk_size", type=int, default=3300, help="Chunk size in tokens."
    )
    parser.add_argument(
        "--chunk_overlap", type=int, default=100, help="Chunk overlap in tokens."
    )
    args = parser.parse_args()
    document = args.document
    summary_file = args.summary_file
    chunk_size = args.chunk_size
    chunk_overlap = args.chunk_overlap
    print(f"Instantiating DocuSense for {document}")
    init()
    try:
        summarize(document, summary_file, chunk_size, chunk_overlap)
    except FileNotFoundError:
        print(f"File {document} not found")


if __name__ == "__main__":
    docusense()
5 changes: 3 additions & 2 deletions pinecone_utils/pinecone_utils.py
@@ -56,12 +56,13 @@ def create_vector_store(index_name: str, chunks: List[T]) -> Pinecone:
    import pinecone
    from langchain.vectorstores import Pinecone
    from langchain.embeddings.openai import OpenAIEmbeddings
    from text_utils.text_utils import embedding_cost
    from text_utils.text_utils import num_tokens_and_cost

    num_tokens, cost = num_tokens_and_cost(chunks)
    # Prompt user whether they want to continue, quit if they don't
    while True:
        user_input = input(
            f"Cost Estimate: ${embedding_cost(chunks):.4f}\n"
            f"Cost Estimate: ${cost:.4f} for {num_tokens} tokens\n"
            f"Would you like to continue? (y/n)\n"
        )
        if user_input.lower() == "y":
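
For context, the switch from `embedding_cost` to `num_tokens_and_cost` means callers now receive both the token count and the estimated cost. A rough usage sketch is shown below; the `Page` class is only a stand-in for the loader's document objects.

```python
from text_utils.text_utils import num_tokens_and_cost


class Page:
    """Stand-in for a loaded document page exposing `page_content`."""

    def __init__(self, page_content: str) -> None:
        self.page_content = page_content


document = [Page("This is the first page."), Page("This is the second page.")]
# Returns the total token count across pages and the estimated embedding cost.
num_tokens, cost = num_tokens_and_cost(document)
print(f"{num_tokens} tokens, estimated cost ${cost:.4f}")
```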
32 changes: 29 additions & 3 deletions tests/test_document_loaders.py
@@ -35,8 +35,12 @@
    load_from_wikipedia,
    load_document,
    chunk_data,
    merge_document,
)
from text_utils.text_utils import tiktoken_len
from typing import List, TypeVar

T = TypeVar("T")


class TestDocumentLoaders(unittest.TestCase):
@@ -96,21 +100,43 @@ def test_load_document(self):
        url_contents = load_document(url)
        self.assertIsInstance(url_contents, list)

    def test_chunk_data(self):
    def chunk_txt_file(self, chunk_size: int) -> List[T]:
        txt_file = "test_files/test.txt"
        txt_contents = load_document(txt_file)
        # Chunk size in tokens (not characters)
        chunk_size = 10
        # Number of tokens to overlap between chunks
        chunk_overlap = 5
        chunks = chunk_data(
            txt_contents, chunk_size=chunk_size, chunk_overlap=chunk_overlap
        )
        return chunks

    def test_chunk_data(self):
        # Chunk size in tokens
        chunk_size = 10
        chunks = self.chunk_txt_file(chunk_size=chunk_size)
        self.assertIsInstance(chunks, list)
        self.assertGreater(len(chunks), 1)
        for chunk in chunks:
            self.assertLessEqual(tiktoken_len(chunk.page_content), chunk_size)

    def test_merge_single_document(self):
        txt_file = "test_files/test.txt"
        document = load_txt_document(txt_file)
        expected_output = "This is a text file that has more than ten characters.\n"
        self.assertEqual(merge_document(document), expected_output)

    def test_merge_multiple_documents(self):
        # Chunk size in tokens
        chunk_size = 10
        chunks = self.chunk_txt_file(chunk_size=chunk_size)
        expected_output = "This is a text file that has more than ten\n\nthat has more than ten characters."
        self.assertEqual(merge_document(chunks), expected_output)

    def test_merge_empty_document(self):
        document = []
        expected_output = ""
        self.assertEqual(merge_document(document), expected_output)


if __name__ == "__main__":
    unittest.main()
22 changes: 9 additions & 13 deletions tests/test_text_utils.py
@@ -1,4 +1,8 @@
from text_utils.text_utils import tiktoken_len, embedding_cost, return_url_extension
from text_utils.text_utils import (
    tiktoken_len,
    num_tokens_and_cost,
    return_url_extension,
)
import unittest

# As of 2021-10-20, the cost of embedding a single token using OpenAI is $0.0000001
@@ -21,26 +25,18 @@ def __init__(self, page_content):
Page("This is the second page."),
Page("This is the third page."),
]
num_tokens = 0
for page in document:
num_tokens += tiktoken_len(page.page_content)
num_tokens, cost = num_tokens_and_cost(document)
self.assertEqual(num_tokens, 18)
self.assertAlmostEquals(
embedding_cost(document), num_tokens * EMBEDDING_COST_PER_TOKEN
)
self.assertAlmostEquals(cost, num_tokens * EMBEDDING_COST_PER_TOKEN)

num_tokens = 0
document = [
Page("This is a short page."),
Page("This is a longer page with more words."),
Page("This is the longest page of them all, with many many words."),
]
for page in document:
num_tokens += tiktoken_len(page.page_content)
num_tokens, cost = num_tokens_and_cost(document)
self.assertEqual(num_tokens, 29)
self.assertAlmostEqual(
embedding_cost(document), (num_tokens * EMBEDDING_COST_PER_TOKEN)
)
self.assertAlmostEqual(cost, (num_tokens * EMBEDDING_COST_PER_TOKEN))

def test_return_url_extension(self):
self.assertEqual(
