Add DocuSense, refactor, update tests and README.
bshastry committed Oct 10, 2023
1 parent fd3e451 commit edd3ff4
Showing 7 changed files with 236 additions and 33 deletions.
39 changes: 29 additions & 10 deletions README.md
@@ -1,21 +1,23 @@
# DocuBot
# DocuBot and DocuSense

[![Run Tests](https://github.com/bshastry/docubot/actions/workflows/tests.yml/badge.svg?branch=main)](https://github.com/bshastry/docubot/actions/workflows/tests.yml)[![Bandit Security Scan](https://github.com/bshastry/docubot/actions/workflows/bandit.yaml/badge.svg?branch=main)](https://github.com/bshastry/docubot/actions/workflows/bandit.yaml)[![Run Coverage](https://github.com/bshastry/docubot/actions/workflows/coverage.yml/badge.svg?branch=main)](https://github.com/bshastry/docubot/actions/workflows/coverage.yml)

DocuBot is a command-line chatbot that answers questions using a knowledge base of documents provided by you.
DocuBot and DocuSense are command-line tools.
DocuBot is a chatbot that answers questions using a knowledge base of documents provided by you.
It allows you to interactively get answers to questions with citations from the documents provided.
It is written in Python3.
DocuSense summarizes the document provided by you.
They are written in Python3.

## Supported Document Types

DocuBot supports the following document types:
DocuBot and DocuSense support the following document types:

- .pdf: Portable Document Format
- .docx: Microsoft Word Document
- .md: Markdown Document
- .txt: Plain Text Document

## Features
## DocuBot Features

- Session based: DocuBot remembers previous interactions within the current session.
- Citations provided: DocuBot generates answers based on information from specific documents. It provides citations to these documents, including page numbers if available.
@@ -34,10 +36,12 @@ To avoid OpenAI rate-limiting issues, it is recommended to preload funds into yo

**Note:** DocuBot provides an estimated cost of indexing documents at the beginning of the process. This helps you understand the potential cost implications before proceeding. Please review the estimated cost and ensure that you have sufficient funds in your OpenAI account to cover the indexing process.

DocuSense does not require the Pinecone API and environment keys.


## Installation

To use DocuBot, follow these steps:
To use DocuBot and DocuSense, follow these steps:

1. Clone the repository:

@@ -60,6 +64,7 @@ To use DocuBot, follow these steps:
- `OPENAI_API_KEY`: Your OpenAI API key

Make sure to replace the placeholder values with your actual API keys and ENV variables.
If you are only going to use DocuSense, providing an `OPENAI_API_KEY` is sufficient.
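
   A minimal `.env` sketch is shown below. `OPENAI_API_KEY` is the variable named above; the Pinecone entries are only needed for DocuBot, and the exact variable names shown here are placeholders, so use the names listed earlier in this step.

   ```bash
   # .env -- example values only; replace the placeholders with your real keys.
   OPENAI_API_KEY=your-openai-api-key
   # Needed for DocuBot only; these variable names are illustrative.
   PINECONE_API_KEY=your-pinecone-api-key
   PINECONE_ENV=your-pinecone-environment
   ```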


4. Collect documents you want DocuBot to work with in a local sub-directory:
@@ -77,19 +82,33 @@ To use DocuBot, follow these steps:

You could create a similar script for your specific use-case.

5. Run the `docubot.py` script:
DocuSense operates on a single document, so you can skip this step.

5. Run the script:

To use DocuBot, run

```bash
python3 docubot.py /path/to/documents/directory
```

Please replace `/path/to/documents/directory` with the path to the directory that holds documents you want DocuBot to interface with (e.g., `ethereum-docs` from the previous step)

To use DocuSense, run

```bash
python3 docusense.py /path/to/document /path/to/summary.txt [--chunk_size <chunk_size>] [--chunk_overlap <chunk_overlap>]
```
`--chunk_size` and `--chunk_overlap` are optional arguments that set the size of each document chunk and the overlap between consecutive chunks (both measured in OpenAI tokens).
`--chunk_size` defaults to 3300 tokens, and `--chunk_overlap` defaults to 100 tokens.

**Note:** DocuSense splits a large document into smaller chunks when it cannot be summarized in one shot. The chunk size and overlap affect how large documents are summarized: smaller chunk sizes and larger chunk overlaps may result in more OpenAI API calls but offer finer granularity. The defaults were chosen as a balance between summarization cost and accuracy. They may not work for every document, so you can use these parameters to arrive at a trade-off that is acceptable to you.
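
For example, a run with smaller chunks and a larger overlap might look like the following; the input and output file names are placeholders.

```bash
python3 docusense.py whitepaper.pdf whitepaper-summary.txt --chunk_size 2000 --chunk_overlap 200
```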

## Usage

Once DocuBot is running, you can start asking questions. Simply type your question and press Enter. To quit DocuBot, type "quit" or "exit".

## Examples
## DocuBot Examples

Here are some examples of questions you can ask DocuBot:

@@ -123,6 +142,6 @@ If you'd like to contribute to this project, please open an issue or submit a pu
## Liability Information
DocuBot is released under the MIT license. Please note that while DocuBot is designed to provide useful information, it should not be considered a substitute for professional advice. The developers and contributors of DocuBot shall not be held liable for any damages or losses arising from the use of this application.
DocuBot and DocuSense are released under the MIT license. Please note that while they are designed to provide useful information, they should not be considered a substitute for professional advice. The developers and contributors of DocuBot and DocuSense shall not be held liable for any damages or losses arising from the use of these applications.
It is recommended to use DocuBot responsibly and exercise caution when relying on its responses. If in doubt, it is always a good idea to consult with domain experts or refer to trusted sources for accurate information.
It is recommended to use DocuBot and DocuSense responsibly and exercise caution when relying on their responses. If in doubt, it is always a good idea to consult with domain experts or refer to trusted sources for accurate information.
13 changes: 13 additions & 0 deletions document_loaders/document_loaders.py
@@ -167,3 +167,16 @@ def chunk_data(
    )
    chunks = text_splitter.split_documents(data)
    return chunks


def merge_document(document: List[T]) -> str:
    """
    Merge a list of documents into a single string.

    Args:
        document (List[T]): A list of documents to merge.

    Returns:
        str: A single string containing the merged documents.
    """
    return "\n\n".join([page.page_content for page in document])
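
For illustration, here is a minimal sketch of how the new `merge_document` helper behaves; the `Page` class below is a stand-in for the document objects returned by the loaders and is not part of the repository.

```python
from document_loaders.document_loaders import merge_document


class Page:
    """Stand-in for a loaded document page exposing `page_content`."""

    def __init__(self, page_content: str) -> None:
        self.page_content = page_content


pages = [Page("First page."), Page("Second page.")]
# Pages are joined with a blank line between them.
assert merge_document(pages) == "First page.\n\nSecond page."
```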
147 changes: 147 additions & 0 deletions docusense.py
@@ -0,0 +1,147 @@
#!/usr/bin/env python3
"""
This script provides functionality for summarizing a given document using OpenAI's GPT-3.5-turbo model.
It includes the 'init()' function to initialize environment variables, the 'summarize()' function to generate summaries,
and the 'docusense()' function as the entry point for the script. The 'docusense()' function takes command-line arguments
for the document path, chunk size, and chunk overlap. It utilizes prompts and chains to perform the summarization process.
"""


def init():
    """
    Initializes the environment variables by loading the .env file.

    Returns:
        None
    """
    from dotenv import load_dotenv, find_dotenv

    load_dotenv(find_dotenv(), override=True)


def summarize(
    document: str,
    summary_file: str,
    chunk_size: int,
    chunk_overlap: int,
    max_single_shot_num_tokens: int = 2048,
) -> None:
    """
    Summarizes a given document using OpenAI's GPT-3.5-turbo model.

    Args:
        document (str): The path to the document to be summarized.
        summary_file (str): The path to the file the summary is written to.
        chunk_size (int): The size of each chunk of the document to be summarized.
        chunk_overlap (int): The amount of overlap between each chunk of the document.
        max_single_shot_num_tokens (int, optional): The maximum number of tokens allowed for a single-shot summarization. Defaults to 2048.

    Returns:
        None

    Raises:
        FileNotFoundError: If the specified document path does not exist.
    """
    from langchain.chat_models import ChatOpenAI
    from langchain import PromptTemplate
    from langchain.chains import LLMChain
    from langchain.chains.summarize import load_summarize_chain
    from document_loaders.document_loaders import (
        load_document,
        merge_document,
        chunk_data,
    )
    from text_utils.text_utils import num_tokens_and_cost

    llm = ChatOpenAI(temperature=0, model="gpt-3.5-turbo")

    map_prompt = """
    Write a concise summary of the following:
    Text: `{text}`
    CONCISE SUMMARY:
    """
    map_prompt_template = PromptTemplate(template=map_prompt, input_variables=["text"])
    combine_prompt = """
    Write a concise summary of the following text that covers key points.
    Add a title to the summary.
    Start the summary with an INTRODUCTION PARAGRAPH that gives an overview of the topic FOLLOWED
    by BULLET POINTS if possible AND end the summary with a CONCLUSION.
    Text: `{text}`
    """
    combine_prompt_template = PromptTemplate(
        template=combine_prompt, input_variables=["text"]
    )

    doc = load_document(document)
    num_tokens, cost = num_tokens_and_cost(doc)
    print(f"Approximate summarization cost: ${cost:.4f}")
    if num_tokens <= max_single_shot_num_tokens:
        # Short documents fit into a single prompt, so one LLM call suffices.
        chain = LLMChain(llm=llm, prompt=combine_prompt_template)
        print("Running single-shot summarization")
        summary = chain.run({"text": merge_document(doc)})
        print(f"Writing summary to {summary_file}... ", end="")
        with open(summary_file, "w") as f:
            f.write(summary)
        print("Done")
    else:
        # Longer documents are chunked and summarized with a single map-reduce chain.
        chain = load_summarize_chain(
            llm=llm,
            chain_type="map_reduce",
            map_prompt=map_prompt_template,
            combine_prompt=combine_prompt_template,
        )
        print("Running multi-shot summarization")
        summary = chain.run(
            chunk_data(data=doc, chunk_size=chunk_size, chunk_overlap=chunk_overlap)
        )
        print(f"Writing summary to {summary_file}... ", end="")
        with open(summary_file, "w") as f:
            f.write(summary)
        print("Done")


def docusense() -> None:
    """
    Takes a document path and a summary file path and summarizes the document using DocuSense.
    It also takes optional arguments for chunk size and overlap.

    Returns:
        None
    """
    import argparse

    parser = argparse.ArgumentParser(description="DocuSense")
    parser.add_argument(
        "document", type=str, help="Path to the document to be summarized."
    )
    parser.add_argument(
        "summary_file",
        type=str,
        help="Path to the file where the summary will be written.",
    )
    parser.add_argument(
        "--chunk_size", type=int, default=3300, help="Chunk size in tokens."
    )
    parser.add_argument(
        "--chunk_overlap", type=int, default=100, help="Chunk overlap in tokens."
    )
    args = parser.parse_args()
    document = args.document
    summary_file = args.summary_file
    chunk_size = args.chunk_size
    chunk_overlap = args.chunk_overlap
    print(f"Instantiating DocuSense for {document}")
    init()
    try:
        summarize(document, summary_file, chunk_size, chunk_overlap)
    except FileNotFoundError:
        print(f"File {document} not found")


if __name__ == "__main__":
    docusense()
5 changes: 3 additions & 2 deletions pinecone_utils/pinecone_utils.py
@@ -56,12 +56,13 @@ def create_vector_store(index_name: str, chunks: List[T]) -> Pinecone:
    import pinecone
    from langchain.vectorstores import Pinecone
    from langchain.embeddings.openai import OpenAIEmbeddings
    from text_utils.text_utils import embedding_cost
    from text_utils.text_utils import num_tokens_and_cost

    num_tokens, cost = num_tokens_and_cost(chunks)
    # Prompt user whether they want to continue, quit if they don't
    while True:
        user_input = input(
            f"Cost Estimate: ${embedding_cost(chunks):.4f}\n"
            f"Cost Estimate: ${cost:.4f} for {num_tokens} tokens\n"
            f"Would you like to continue? (y/n)\n"
        )
        if user_input.lower() == "y":
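
For context, the switch from `embedding_cost` to `num_tokens_and_cost` means callers now receive both the token count and the estimated cost. A rough usage sketch is shown below; the `Page` class is only a stand-in for the loader's document objects.

```python
from text_utils.text_utils import num_tokens_and_cost


class Page:
    """Stand-in for a loaded document page exposing `page_content`."""

    def __init__(self, page_content: str) -> None:
        self.page_content = page_content


document = [Page("This is the first page."), Page("This is the second page.")]
# Returns the total token count across pages and the estimated embedding cost.
num_tokens, cost = num_tokens_and_cost(document)
print(f"{num_tokens} tokens, estimated cost ${cost:.4f}")
```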
32 changes: 29 additions & 3 deletions tests/test_document_loaders.py
@@ -35,8 +35,12 @@
    load_from_wikipedia,
    load_document,
    chunk_data,
    merge_document,
)
from text_utils.text_utils import tiktoken_len
from typing import List, TypeVar

T = TypeVar("T")


class TestDocumentLoaders(unittest.TestCase):
@@ -96,21 +100,43 @@ def test_load_document(self):
        url_contents = load_document(url)
        self.assertIsInstance(url_contents, list)

    def test_chunk_data(self):
    def chunk_txt_file(self, chunk_size: int) -> List[T]:
        txt_file = "test_files/test.txt"
        txt_contents = load_document(txt_file)
        # Chunk size in tokens (not characters)
        chunk_size = 10
        # Number of tokens to overlap between chunks
        chunk_overlap = 5
        chunks = chunk_data(
            txt_contents, chunk_size=chunk_size, chunk_overlap=chunk_overlap
        )
        return chunks

    def test_chunk_data(self):
        # Chunk size in tokens
        chunk_size = 10
        chunks = self.chunk_txt_file(chunk_size=chunk_size)
        self.assertIsInstance(chunks, list)
        self.assertGreater(len(chunks), 1)
        for chunk in chunks:
            self.assertLessEqual(tiktoken_len(chunk.page_content), chunk_size)

    def test_merge_single_document(self):
        txt_file = "test_files/test.txt"
        document = load_txt_document(txt_file)
        expected_output = "This is a text file that has more than ten characters.\n"
        self.assertEqual(merge_document(document), expected_output)

    def test_merge_multiple_documents(self):
        # Chunk size in tokens
        chunk_size = 10
        chunks = self.chunk_txt_file(chunk_size=chunk_size)
        expected_output = "This is a text file that has more than ten\n\nthat has more than ten characters."
        self.assertEqual(merge_document(chunks), expected_output)

    def test_merge_empty_document(self):
        document = []
        expected_output = ""
        self.assertEqual(merge_document(document), expected_output)


if __name__ == "__main__":
    unittest.main()
22 changes: 9 additions & 13 deletions tests/test_text_utils.py
@@ -1,4 +1,8 @@
from text_utils.text_utils import tiktoken_len, embedding_cost, return_url_extension
from text_utils.text_utils import (
    tiktoken_len,
    num_tokens_and_cost,
    return_url_extension,
)
import unittest

# As of 2021-10-20, the cost of embedding a single token using OpenAI is $0.0000001
@@ -21,26 +25,18 @@ def __init__(self, page_content):
Page("This is the second page."),
Page("This is the third page."),
]
num_tokens = 0
for page in document:
num_tokens += tiktoken_len(page.page_content)
num_tokens, cost = num_tokens_and_cost(document)
self.assertEqual(num_tokens, 18)
self.assertAlmostEquals(
embedding_cost(document), num_tokens * EMBEDDING_COST_PER_TOKEN
)
self.assertAlmostEquals(cost, num_tokens * EMBEDDING_COST_PER_TOKEN)

num_tokens = 0
document = [
Page("This is a short page."),
Page("This is a longer page with more words."),
Page("This is the longest page of them all, with many many words."),
]
for page in document:
num_tokens += tiktoken_len(page.page_content)
num_tokens, cost = num_tokens_and_cost(document)
self.assertEqual(num_tokens, 29)
self.assertAlmostEqual(
embedding_cost(document), (num_tokens * EMBEDDING_COST_PER_TOKEN)
)
self.assertAlmostEqual(cost, (num_tokens * EMBEDDING_COST_PER_TOKEN))

def test_return_url_extension(self):
self.assertEqual(
