diff --git a/.github/workflows/docs.yml b/.github/workflows/docs.yml
new file mode 100644
index 0000000..850ef84
--- /dev/null
+++ b/.github/workflows/docs.yml
@@ -0,0 +1,27 @@
+name: documentation
+
+on: [ push, pull_request, workflow_dispatch ]
+
+permissions:
+  contents: write
+
+jobs:
+  docs:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v3
+      - uses: actions/setup-python@v3
+      - name: Install dependencies
+        run: |
+          pip install sphinx sphinx_rtd_theme myst_parser
+      - name: Sphinx build
+        run: |
+          sphinx-build doc _build
+      - name: Deploy to GitHub Pages
+        uses: peaceiris/actions-gh-pages@v3
+        if: ${{ github.event_name == 'push' && github.ref == 'refs/heads/main' }}
+        with:
+          publish_branch: gh-pages
+          github_token: ${{ secrets.GITHUB_TOKEN }}
+          publish_dir: _build/
+          force_orphan: true
\ No newline at end of file
diff --git a/src/docs/_build/doctrees/environment.pickle b/src/docs/_build/doctrees/environment.pickle
index 975ad23..4a888b4 100644
Binary files a/src/docs/_build/doctrees/environment.pickle and b/src/docs/_build/doctrees/environment.pickle differ
diff --git a/src/docs/_build/doctrees/get_started.doctree b/src/docs/_build/doctrees/get_started.doctree
index ad0888a..97c993a 100644
Binary files a/src/docs/_build/doctrees/get_started.doctree and b/src/docs/_build/doctrees/get_started.doctree differ
diff --git a/src/docs/_build/doctrees/get_started.introduction.doctree b/src/docs/_build/doctrees/get_started.introduction.doctree
index f980c11..0597075 100644
Binary files a/src/docs/_build/doctrees/get_started.introduction.doctree and b/src/docs/_build/doctrees/get_started.introduction.doctree differ
diff --git a/src/docs/_build/doctrees/get_started.llms.doctree b/src/docs/_build/doctrees/get_started.llms.doctree
index 887c658..9a20c82 100644
Binary files a/src/docs/_build/doctrees/get_started.llms.doctree and b/src/docs/_build/doctrees/get_started.llms.doctree differ
diff --git a/src/docs/_build/doctrees/get_started.parse_pdf.doctree b/src/docs/_build/doctrees/get_started.parse_pdf.doctree
new file mode 100644
index 0000000..4f8da94
Binary files /dev/null and b/src/docs/_build/doctrees/get_started.parse_pdf.doctree differ
diff --git a/src/docs/_build/doctrees/get_started.vectordb.doctree b/src/docs/_build/doctrees/get_started.vectordb.doctree
index 7ab4197..7bb2883 100644
Binary files a/src/docs/_build/doctrees/get_started.vectordb.doctree and b/src/docs/_build/doctrees/get_started.vectordb.doctree differ
diff --git a/src/docs/_build/doctrees/grag.components.doctree b/src/docs/_build/doctrees/grag.components.doctree
index 96d86fd..2bccbfc 100644
Binary files a/src/docs/_build/doctrees/grag.components.doctree and b/src/docs/_build/doctrees/grag.components.doctree differ
diff --git a/src/docs/_build/doctrees/grag.components.vectordb.doctree b/src/docs/_build/doctrees/grag.components.vectordb.doctree
index 95d2186..f0de757 100644
Binary files a/src/docs/_build/doctrees/grag.components.vectordb.doctree and b/src/docs/_build/doctrees/grag.components.vectordb.doctree differ
diff --git a/src/docs/_build/doctrees/grag.rag.doctree b/src/docs/_build/doctrees/grag.rag.doctree
index 71390fd..bb7cf58 100644
Binary files a/src/docs/_build/doctrees/grag.rag.doctree and b/src/docs/_build/doctrees/grag.rag.doctree differ
diff --git a/src/docs/_build/html/.buildinfo b/src/docs/_build/html/.buildinfo
index ff14325..071ad00 100644
--- a/src/docs/_build/html/.buildinfo
+++ b/src/docs/_build/html/.buildinfo
@@ -1,4 +1,4 @@
 # Sphinx build info version 1
 # This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
-config: 33176d1a0fbc2e489b6d5201070d328e
+config: 1ced34aae86d195057701cf655c56180
 tags: 645f666f9bcd5a90fca523b33c5a78b7
diff --git a/src/docs/_build/html/_sources/get_started.introduction.rst.txt b/src/docs/_build/html/_sources/get_started.introduction.rst.txt
index c72307f..d3c4197 100644
--- a/src/docs/_build/html/_sources/get_started.introduction.rst.txt
+++ b/src/docs/_build/html/_sources/get_started.introduction.rst.txt
@@ -3,9 +3,22 @@ GRAG Overview
 
 GRAG provides an implementation of Retrieval-Augmented Generation that is completely open-sourced. Since it does not use any external services or APIs, this enables a cost-saving solution as well as a solution to data privacy concerns.
-For more information, refer to :ref:`Test `.
+For more information, refer to `our readme `_.
 
-Retrieval-Augmented Generation
-##############################
+Retrieval-Augmented Generation (RAG)
+####################################
 
-Re
\ No newline at end of file
+Retrieval-Augmented Generation (RAG) is a machine-learning technique that enhances large language models (LLMs) by incorporating external data.
+
+In RAG, a model first retrieves relevant documents or data from a large corpus and then uses this information to guide the generation of new text. This approach allows the model to produce more informed, accurate, and contextually appropriate responses.
+
+By leveraging both the retrieval of existing knowledge and the generative capabilities of neural networks, RAG models can improve over traditional generation methods, particularly in tasks requiring deep domain-specific knowledge or factual accuracy.
+
+.. figure:: ../../_static/basic_RAG_pipeline.png
+   :width: 800
+   :alt: Basic-RAG Pipeline
+   :align: center
+
+   Illustration of a basic RAG pipeline
+
+Traditionally, RAG uses a vector database/vector store for both the retrieval and generation processes.
diff --git a/src/docs/_build/html/_sources/get_started.llms.rst.txt b/src/docs/_build/html/_sources/get_started.llms.rst.txt
index c284755..b79074e 100644
--- a/src/docs/_build/html/_sources/get_started.llms.rst.txt
+++ b/src/docs/_build/html/_sources/get_started.llms.rst.txt
@@ -1,4 +1,4 @@
- `LLMs
+LLMs
 =====
 
 GRAG offers two ways to run LLMs locally:
@@ -17,10 +17,10 @@ provide an auth token*
 To run LLMs using LlamaCPP
 #############################
 LlamaCPP requires models in the form of a `.gguf` file. You can either download these model files online,
-or
+or **quantize** the model yourself following the instructions below.
 
-How to quantize models.
-************************
+How to quantize models
+***********************
 To quantize the model, run:
 
 ``python -m grag.quantize.quantize``
@@ -34,4 +34,4 @@ After running the above command, user will be prompted with the following:
 
 * If the user has the model downloaded locally, then the user will be instructed to copy the model and input the name of the model directory.
 
-3.Finally, the user will be prompted to enter **quantization** settings (recommended Q5_K_M or Q4_K_M, etc.). For more details, check `llama.cpp/examples/quantize/quantize.cpp `_.
+3. Finally, the user will be prompted to enter **quantization** settings (recommended Q5_K_M or Q4_K_M, etc.). For more details, check `llama.cpp/examples/quantize/quantize.cpp `_.
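As a companion to the RAG description added in the get_started.introduction.rst.txt hunk above, the basic flow can be sketched in a few lines of Python. This is purely illustrative; none of the names below are GRAG's actual API:

    # Schematic of a basic RAG pipeline (illustrative names only, not GRAG's API).
    def rag_answer(query: str, retriever, llm) -> str:
        # 1. Retrieve documents relevant to the query from the vector store.
        docs = retriever.retrieve(query)
        # 2. Stuff the retrieved text into the prompt as grounding context.
        context = "\n\n".join(doc.text for doc in docs)
        prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
        # 3. Let the LLM generate a contextually grounded response.
        return llm.generate(prompt)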
diff --git a/src/docs/_build/html/_sources/get_started.parse_pdf.rst.txt b/src/docs/_build/html/_sources/get_started.parse_pdf.rst.txt
new file mode 100644
index 0000000..4ace62b
--- /dev/null
+++ b/src/docs/_build/html/_sources/get_started.parse_pdf.rst.txt
@@ -0,0 +1,61 @@
+Parse PDF
+=========
+
+Parsing and partitioning are primarily done using the unstructured.io library, which is designed for this purpose. However, for PDFs with complex layouts, such as nested tables or tax forms, the pdfplumber and pytesseract libraries are employed to improve parsing accuracy.
+
+The ParsePDF class has several attributes that control the behavior of the parsing and partitioning process.
+
+Attributes
+##########
+
+- single_text_out (bool): If True, all text elements are combined into a single output document. The default value is True.
+
+- strategy (str): The strategy for PDF partitioning. The default is "hi_res" for better accuracy.
+
+- extract_image_block_types (list): A list of element types to be extracted as image blocks. By default, it includes "Image" and "Table".
+
+- infer_table_structure (bool): Whether to infer table structure during partitioning. The default value is True.
+
+- extract_images (bool): Whether to extract images. The default value is True.
+
+- image_output_dir (str): The directory to save extracted images, if any.
+
+- add_captions_to_text (bool): Whether to include figure captions in the text output. The default value is True.
+
+- add_captions_to_blocks (bool): Whether to add captions to table and image blocks. The default value is True.
+
+- add_caption_first (bool): Whether to place captions before their corresponding image or table in the output. The default value is True.
+
+- table_as_html (bool): Whether to represent tables as HTML.
+
+Parsing Complex PDF Layouts
+###########################
+
+While unstructured.io performs well on PDFs with straightforward layouts, PDFs with complex layouts, such as nested tables or tax forms, are not parsed accurately. To address this, the pdfplumber and pytesseract libraries are employed.
+
+Table Parsing Methodology
+#########################
+
+For each page in the PDF file, the find_tables method is called with specific table settings to find the tables on that page. The table settings used are:
+
+- ``"vertical_strategy": "text"``: Detect vertical table boundaries (columns) from the alignment of the text itself.
+
+- ``"horizontal_strategy": "lines"``: Detect horizontal table boundaries (rows) from ruled lines on the page.
+
+- ``"min_words_vertical": 3``: The minimum number of vertically aligned words required to form a column boundary.
+
+**For each table found on the page, the following steps are performed:**
+
+1. The table area is cropped from the page using the crop method and the bbox (bounding box) of the table.
+
+2. The text content of the cropped table area is extracted using the `extract_text` method with `layout=True`.
+
+3. A dictionary is created with the `table_number` and `extracted_text` of the table, and it is appended to the `extracted_tables_in_page` list.
+
+After processing all the tables on the page, a dictionary is created with the `page_number` and the list of `extracted_tables_in_page`, and it is appended to the `extracted_tables` list.
+
+Finally, the `extracted_tables` list is returned, which contains all the extracted tables from the PDF file, organized by page and table number.
+
+Limitations
+###########
+
+While the table parsing methodology using `pdfplumber` can process most tables, it cannot parse every table layout accurately: the table settings need to be adjusted for different types of table layouts. Additionally, pdfplumber cannot extract figure captions, whereas `unstructured.io` can.
+
+Future work may involve developing a more robust and flexible table parsing algorithm that can handle a wider range of table layouts and integrate seamlessly with the ParsePDF class to leverage the strengths of both the unstructured.io and pdfplumber libraries.
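The table-parsing methodology in the hunk above maps directly onto pdfplumber's public API (find_tables, crop, extract_text). Below is a minimal self-contained sketch of that loop; the function name and output dictionary keys are assumptions reconstructed from the prose, not code from this commit:

    import pdfplumber

    def extract_tables(pdf_path):
        """Sketch of the described per-page table extraction loop."""
        table_settings = {
            "vertical_strategy": "text",     # column edges from text alignment
            "horizontal_strategy": "lines",  # row edges from ruled lines
            "min_words_vertical": 3,         # aligned words needed for a column edge
        }
        extracted_tables = []
        with pdfplumber.open(pdf_path) as pdf:
            for page_number, page in enumerate(pdf.pages, start=1):
                extracted_tables_in_page = []
                for table_number, table in enumerate(
                    page.find_tables(table_settings), start=1
                ):
                    # 1. Crop the page to the table's bounding box.
                    cropped = page.crop(table.bbox)
                    # 2. Extract the cropped text, preserving layout.
                    extracted_text = cropped.extract_text(layout=True)
                    # 3. Record the table number and its text.
                    extracted_tables_in_page.append(
                        {"table_number": table_number, "extracted_text": extracted_text}
                    )
                extracted_tables.append(
                    {
                        "page_number": page_number,
                        "extracted_tables_in_page": extracted_tables_in_page,
                    }
                )
        return extracted_tables

Tuning table_settings per document type is exactly the adjustment burden the Limitations section describes.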
diff --git a/src/docs/_build/html/_sources/get_started.rst.txt b/src/docs/_build/html/_sources/get_started.rst.txt
index ca80073..19a2d99 100644
--- a/src/docs/_build/html/_sources/get_started.rst.txt
+++ b/src/docs/_build/html/_sources/get_started.rst.txt
@@ -5,6 +5,7 @@ Get Started
 
    get_started.introduction
    get_started.installation
+   get_started.parse_pdf
    get_started.llms
    get_started.vectordb
diff --git a/src/docs/_build/html/_sources/get_started.vectordb.rst.txt b/src/docs/_build/html/_sources/get_started.vectordb.rst.txt
index f6f749a..02f83c9 100644
--- a/src/docs/_build/html/_sources/get_started.vectordb.rst.txt
+++ b/src/docs/_build/html/_sources/get_started.vectordb.rst.txt
@@ -1,5 +1,3 @@
-.. _Vector Stores:
-
 Vector Stores
 ===============
@@ -28,7 +26,14 @@
 Since Chroma is a server-client based vector database, make sure to run the server.
 
 * If Chroma is not run locally, change ``host`` and ``port`` under ``chroma`` in `src/config.ini`, or provide the arguments explicitly.
 
-For non-supported vectorstores, (...)
+Once you have Chroma running, just use the Chroma Client class.
+
+DeepLake
+********
+Since DeepLake is not a server-based vector store, it is much easier to get started with.
+
+Just make sure you have DeepLake installed, and use the DeepLake Client class.
+
 Embeddings
 ###########
@@ -52,4 +57,3 @@
 
 For more details on data ingestion, refer to our `cookbook `_
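To make the vector store hunk above concrete, a usage sketch might look as follows. The import paths, class names, and constructor arguments are assumptions inferred from the prose (the docs only say "Chroma Client class" and "DeepLake Client class"); check the GRAG API reference for the real signatures:

    # Hypothetical import paths and arguments; verify against the GRAG API reference.
    from grag.components.vectordb.chroma_client import ChromaClient
    from grag.components.vectordb.deeplake_client import DeepLakeClient

    # Chroma is server-client based: the server must already be running,
    # with host/port matching the chroma section of src/config.ini.
    chroma = ChromaClient(host="localhost", port=8000)

    # DeepLake needs no server, so a local dataset path is enough.
    deeplake = DeepLakeClient(store_path="data/vectordb")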
[Generated HTML diff: the Get Started navigation sidebar gains Parse PDF, Table Parsing Methodology, and Limitations entries and the renamed LLMs entry, replacing the former "To run LLMs using HuggingFace", "To run LLMs using LlamaCPP", and "How to quantize models." entries.]
diff --git a/src/docs/_build/html/get_started.installation.html b/src/docs/_build/html/get_started.installation.html
index 206e94c..d810412 100644
[Generated HTML: navigation sidebar updated as above; no substantive content changes.]
diff --git a/src/docs/_build/html/get_started.introduction.html b/src/docs/_build/html/get_started.introduction.html
index b3881fc..459f5ca 100644
[Generated HTML: rendered body mirrors the get_started.introduction.rst.txt changes above.]
diff --git a/src/docs/_build/html/get_started.llms.html b/src/docs/_build/html/get_started.llms.html
index 0fb9a66..1d94f64 100644
[Generated HTML: page title changes from "To run LLMs using HuggingFace — GRAG 0.0.1 documentation" to "LLMs — GRAG 0.0.1 documentation"; rendered body mirrors the get_started.llms.rst.txt changes above.]