Merge pull request #84 from enoch3712/83-github-page

83 GitHub page
enoch3712 · Nov 25, 2024 · b3ca396
2 parents 35d9882 + db2c4fe
commit b3ca396
Showing 44 changed files with 2,478 additions and 2 deletions.
diff --git a/.github/workflows/documentation.yml b/.github/workflows/documentation.yml
@@ -0,0 +1,28 @@
+name: documentation
+on:
+  push:
+    branches:
+      - main
+permissions:
+  contents: write
+jobs:
+  deploy:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - name: Configure Git Credentials
+        run: |
+          git config user.name github-actions[bot]
+          git config user.email 41898282+github-actions[bot]@users.noreply.github.com
+      - uses: actions/setup-python@v4
+        with:
+          python-version: 3.x
+      - run: echo "cache_id=$(date --utc '+%V')" >> "$GITHUB_ENV"
+      - uses: actions/cache@v3
+        with:
+          key: mkdocs-material-${{ env.cache_id }}
+          path: .cache
+          restore-keys: |
+            mkdocs-material-
+      - run: pip install mkdocs-material 
+      - run: mkdocs gh-deploy --force --docs-dir docs
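
Note on the workflow above: the cache key is built from `date --utc '+%V'`, the ISO 8601 week number, so the `.cache` directory is reused for up to a week and then rebuilt under a new key. The sketch below reproduces that key in Python to make the rollover visible. The helper name `weekly_cache_key` is hypothetical and is not part of the workflow.

```python
from datetime import date, datetime, timezone
from typing import Optional


def weekly_cache_key(today: Optional[date] = None,
                     prefix: str = "mkdocs-material-") -> str:
    """Build a key like the workflow's: prefix + zero-padded ISO week number."""
    if today is None:
        # Same clock the workflow uses: current date in UTC.
        today = datetime.now(timezone.utc).date()
    iso_week = today.isocalendar()[1]  # ISO 8601 week, 1..53
    return f"{prefix}{iso_week:02d}"


print(weekly_cache_key(date(2024, 11, 25)))  # mkdocs-material-48
```

Because the key changes only when the ISO week changes, every push within the same week gets an exact cache hit; the `restore-keys` prefix covers the first run of a new week by falling back to an older `mkdocs-material-` entry.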
diff --git a/docs/assets/Logo.png b/docs/assets/Logo.png
diff --git a/docs/assets/azure_image.png b/docs/assets/azure_image.png
diff --git a/docs/assets/chart_and_images.png b/docs/assets/chart_and_images.png
diff --git a/docs/assets/classification_image.png b/docs/assets/classification_image.png
diff --git a/docs/assets/classification_tree_image.png b/docs/assets/classification_tree_image.png
diff --git a/docs/assets/document_loader.png b/docs/assets/document_loader.png
diff --git a/docs/assets/extract-thinker-overview.png b/docs/assets/extract-thinker-overview.png
diff --git a/docs/assets/extractor.png b/docs/assets/extractor.png
diff --git a/docs/assets/favicon.png b/docs/assets/favicon.png
diff --git a/docs/assets/llm_image.png b/docs/assets/llm_image.png
diff --git a/docs/assets/process_image.png b/docs/assets/process_image.png
diff --git a/docs/assets/resume_image.png b/docs/assets/resume_image.png
diff --git a/docs/assets/splitter_image.png b/docs/assets/splitter_image.png
diff --git a/docs/core-concepts/classification/basic.md b/docs/core-concepts/classification/basic.md
@@ -0,0 +1,88 @@
+# Basic Classification
+
+To classify a document, its content is extracted and added to the prompt together with the list of possible classifications. ExtractThinker simplifies this process using Pydantic models and instructor.
+
+## Simple Classification
+
+The most straightforward way to classify documents:
+
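+```python
+import os
+
+from extract_thinker import Classification, Extractor
+from extract_thinker.document_loader import DocumentLoaderTesseract
+# DriverLicense and InvoiceContract are the contracts defined under "Type Mapping with Contract"
+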
+# Define classifications
+classifications = [
+    Classification(
+        name="Driver License",
+        description="This is a driver license",
+        contract=DriverLicense,  # optional. Will be added to the prompt
+    ),
+    Classification(
+        name="Invoice",
+        description="This is an invoice",
+        contract=InvoiceContract,  # optional. Will be added to the prompt
+    ),
+]
+
+# Initialize extractor
+tesseract_path = os.getenv("TESSERACT_PATH")
+document_loader = DocumentLoaderTesseract(tesseract_path)
+extractor = Extractor(document_loader)
+extractor.load_llm("gpt-4o")
+
+# Classify document (INVOICE_FILE_PATH is the path to the file being classified)
+result = extractor.classify(INVOICE_FILE_PATH, classifications)
+print(f"Document type: {result.name}, Confidence: {result.confidence}")
+```
+
+## Type Mapping with Contract
+
+Adding contract structure to the classification improves accuracy:
+
+```python
+from typing import List
+from extract_thinker.models.contract import Contract
+
+class InvoiceContract(Contract):
+    invoice_number: str
+    invoice_date: str
+    lines: List[LineItem]  # LineItem is a line-item contract defined elsewhere (not shown)
+    total_amount: float
+
+class DriverLicense(Contract):
+    name: str
+    age: int
+    license_number: str
+```
+
+The contract structure is automatically added to the prompt, helping the model understand the expected document structure.
+
+## Classification Response
+
+All classifications return a standardized response:
+
+```python
+from typing import Optional
+from pydantic import BaseModel, Field
+
+class ClassificationResponse(BaseModel):
+    name: str
+    confidence: Optional[int] = Field(
+        description="From 1 to 10. 10 being the highest confidence",
+        ge=1, 
+        le=10
+    )
+```
+
+## Best Practices
+
+- Provide clear, distinctive descriptions for each classification
+- Use contract structures when possible
+- Consider using image classification for visual documents
+- Monitor confidence scores
+- Handle low-confidence cases appropriately
+
+For more advanced classification techniques, see:
+- [Mixture of Models (MoM)](mom.md)
+- [Tree-Based Classification](tree.md)
+- [Vision Classification](vision.md) 
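
Note on the best practices in `basic.md`: "handle low-confidence cases appropriately" can be as simple as routing anything below a cut-off to manual review. The sketch below is a minimal illustration under stated assumptions: `ClassificationResult` is a hypothetical stand-in that mirrors the documented `name` and `confidence` (1 to 10) fields, and `route` and its `min_confidence` default are invented for this example rather than taken from ExtractThinker.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClassificationResult:
    """Hypothetical stand-in for the documented response: name + confidence (1-10)."""
    name: str
    confidence: Optional[int] = None


def route(result: ClassificationResult, min_confidence: int = 8) -> str:
    """Return the label to act on, or 'manual_review' when confidence is low or missing."""
    if result.confidence is None or result.confidence < min_confidence:
        return "manual_review"
    return result.name


print(route(ClassificationResult("Invoice", 9)))
print(route(ClassificationResult("Driver License", 4)))
```

Treating a missing confidence the same as a low one is a deliberately conservative choice, since the documented field is optional.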
diff --git a/docs/core-concepts/classification/index.md b/docs/core-concepts/classification/index.md
@@ -0,0 +1,63 @@
+# Document Classification
+
+In document intelligence, classification is often the crucial first step. It sets the stage for subsequent processes like data extraction and analysis. Before the rise of LLMs, this was accomplished (and often still is) with AI models trained in-house for specific use cases. Services such as Azure Document Intelligence offer this feature, but they are not dynamic and can lead to vendor lock-in.
+
+LLMs may not be the most efficient for this task, but they are agnostic and near-perfect for it.
+
+<div align="center">
+  <img src="../../assets/classification_image.png" alt="Classification Overview">
+</div>
+
+## Classification Techniques
+
+<div class="grid cards">
+    <ul>
+        <li>
+            <p><span class="twemoji"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"><path d="M19 3H5c-1.1 0-2 .9-2 2v14c0 1.1.9 2 2 2h14c1.1 0 2-.9 2-2V5c0-1.1-.9-2-2-2m0 16H5V5h14v14m-2-2H7v-2h10v2m-10-4h10v2H7v-2m10-6v2H7V7h10Z"></path></svg></span> <strong>Basic Classification</strong></p>
+            <p>Simple yet powerful classification using a single LLM with contract mapping.</p>
+            <p><a href="basic"><span class="twemoji"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 16 16"><path d="M8.22 2.97a.75.75 0 0 1 1.06 0l4.25 4.25a.75.75 0 0 1 0 1.06l-4.25 4.25a.75.75 0 0 1-1.042-.018.75.75 0 0 1-.018-1.042l2.97-2.97H3.75a.75.75 0 0 1 0-1.5h7.44L8.22 4.03a.75.75 0 0 1 0-1.06"></path></svg></span> Learn More</a></p>
+        </li>
+        <li>
+            <p><span class="twemoji"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"><path d="M16 17v2H2v-2s0-4 7-4 7 4 7 4m-7-6a4 4 0 0 0 4-4 4 4 0 0 0-4-4 4 4 0 0 0-4 4 4 4 0 0 0 4 4m8.8 4c1.2.7 2.2 1.7 2.2 3v2h3v-2s0-2.9-5.2-3M15 4a4 4 0 0 0 1.8 3.3A4 4 0 0 1 19 11c1.9 0 3-1.3 3-3a4 4 0 0 0-4-4h-3Z"></path></svg></span> <strong>Mixture of Models (MoM)</strong></p>
+            <p>Enhance accuracy by combining multiple models with different strategies.</p>
+            <p><a href="mom"><span class="twemoji"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 16 16"><path d="M8.22 2.97a.75.75 0 0 1 1.06 0l4.25 4.25a.75.75 0 0 1 0 1.06l-4.25 4.25a.75.75 0 0 1-1.042-.018.75.75 0 0 1-.018-1.042l2.97-2.97H3.75a.75.75 0 0 1 0-1.5h7.44L8.22 4.03a.75.75 0 0 1 0-1.06"></path></svg></span> Learn More</a></p>
+        </li>
+        <li>
+            <p><span class="twemoji"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"><path d="M3 3h18v2H3V3m0 16h18v2H3v-2m0-8h18v2H3v-2m0 4h8v2H3v-2m0-8h8v2H3V7m8 4h10v2H11v-2m0 8h10v2H11v-2m0-8h10v2H11V7"></path></svg></span> <strong>Tree-Based Classification</strong></p>
+            <p>Handle complex hierarchies and similar document types efficiently.</p>
+            <p><a href="tree"><span class="twemoji"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 16 16"><path d="M8.22 2.97a.75.75 0 0 1 1.06 0l4.25 4.25a.75.75 0 0 1 0 1.06l-4.25 4.25a.75.75 0 0 1-1.042-.018.75.75 0 0 1-.018-1.042l2.97-2.97H3.75a.75.75 0 0 1 0-1.5h7.44L8.22 4.03a.75.75 0 0 1 0-1.06"></path></svg></span> Learn More</a></p>
+        </li>
+        <li>
+            <p><span class="twemoji"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"><path d="M12 9a3 3 0 0 0-3 3 3 3 0 0 0 3 3 3 3 0 0 0 3-3 3 3 0 0 0-3-3m0 8a5 5 0 0 1-5-5 5 5 0 0 1 5-5 5 5 0 0 1 5 5 5 5 0 0 1-5 5m0-12.5C7 4.5 2.73 7.61 1 12c1.73 4.39 6 7.5 11 7.5s9.27-3.11 11-7.5c-1.73-4.39-6-7.5-11-7.5Z"></path></svg></span> <strong>Vision Classification</strong></p>
+            <p>Leverage visual features for better accuracy.</p>
+            <p><a href="vision"><span class="twemoji"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 16 16"><path d="M8.22 2.97a.75.75 0 0 1 1.06 0l4.25 4.25a.75.75 0 0 1 0 1.06l-4.25 4.25a.75.75 0 0 1-1.042-.018.75.75 0 0 1-.018-1.042l2.97-2.97H3.75a.75.75 0 0 1 0-1.5h7.44L8.22 4.03a.75.75 0 0 1 0-1.06"></path></svg></span> Learn More</a></p>
+        </li>
+    </ul>
+</div>
+
+## Classification Response
+
+All classification methods return a standardized response:
+
+```python
+from typing import Optional
+from pydantic import BaseModel, Field
+
+class ClassificationResponse(BaseModel):
+    name: str
+    confidence: Optional[int] = Field(
+        description="From 1 to 10. 10 being the highest confidence",
+        ge=1, 
+        le=10
+    )
+```
+
+## Available Strategies
+
+ExtractThinker supports three main classification strategies:
+
+- **CONSENSUS**: All models must agree on the classification
+- **HIGHER_ORDER**: Uses the result with highest confidence
+- **CONSENSUS_WITH_THRESHOLD**: Requires consensus and minimum confidence
+
+For detailed implementation of each technique, visit their respective pages.
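
Note on the strategies listed in `index.md`: the sketch below shows one way to read the three descriptions, applied to a list of `(name, confidence)` pairs from several models. It is an illustration of the described semantics, not ExtractThinker's implementation. The `Strategy` enum and `resolve` function are hypothetical, and it assumes that "minimum confidence" means every model must reach the threshold.

```python
from enum import Enum
from typing import List, Optional, Tuple

Result = Tuple[str, int]  # (classification name, confidence from 1 to 10)


class Strategy(Enum):
    CONSENSUS = "consensus"
    HIGHER_ORDER = "higher_order"
    CONSENSUS_WITH_THRESHOLD = "consensus_with_threshold"


def resolve(results: List[Result], strategy: Strategy,
            threshold: int = 9) -> Optional[Result]:
    """Combine per-model results; None means no acceptable answer."""
    if not results:
        return None
    if strategy is Strategy.HIGHER_ORDER:
        # Take whichever model reported the highest confidence.
        return max(results, key=lambda r: r[1])
    # Both consensus variants require every model to pick the same name.
    if len({name for name, _ in results}) != 1:
        return None
    weakest = min(results, key=lambda r: r[1])
    if strategy is Strategy.CONSENSUS_WITH_THRESHOLD and weakest[1] < threshold:
        return None
    # Report the least confident vote as a conservative summary.
    return weakest


votes = [("Invoice", 9), ("Invoice", 10)]
print(resolve(votes, Strategy.CONSENSUS_WITH_THRESHOLD, threshold=9))
```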
diff --git a/docs/core-concepts/classification/mom.md b/docs/core-concepts/classification/mom.md
@@ -0,0 +1,97 @@
+# Mixture of Models (MoM)
+
+The Mixture of Models (MoM) is a pattern that increases classification confidence by combining multiple models in parallel. This approach is particularly effective when using instructor for structured outputs.
+
+## Basic Usage
+
+```python
+import os
+
+from extract_thinker import Process, Classification, ClassificationStrategy, Extractor
+from extract_thinker.document_loader import DocumentLoaderTesseract
+
+# Define classifications
+classifications = [
+    Classification(
+        name="Driver License",
+        description="This is a driver license",
+    ),
+    Classification(
+        name="Invoice",
+        description="This is an invoice",
+    ),
+]
+
+# Initialize document loader
+tesseract_path = os.getenv("TESSERACT_PATH")
+document_loader = DocumentLoaderTesseract(tesseract_path)
+
+# Initialize multiple extractors with different models
+gpt_35_extractor = Extractor(document_loader)
+gpt_35_extractor.load_llm("gpt-3.5-turbo")
+
+claude_extractor = Extractor(document_loader)
+claude_extractor.load_llm("claude-3-haiku-20240307")
+
+gpt4_extractor = Extractor(document_loader)
+gpt4_extractor.load_llm("gpt-4o")
+
+# Create process with multiple extractors
+process = Process()
+process.add_classify_extractor([
+    [gpt_35_extractor, claude_extractor],  # First layer
+    [gpt4_extractor],                              # Second layer
+])
+
+# Classify with consensus strategy
+result = process.classify(
+    "document.pdf",
+    classifications,
+    strategy=ClassificationStrategy.CONSENSUS_WITH_THRESHOLD,
+    threshold=9
+)
+```
+
+## Available Strategies
+
+#### CONSENSUS
+All models must agree on the classification:
+
+```python
+result = process.classify(
+    "document.pdf",
+    classifications,
+    strategy=ClassificationStrategy.CONSENSUS
+)
+```
+
+#### HIGHER_ORDER
+Uses the result with the highest confidence score:
+
+```python
+result = process.classify(
+    "document.pdf",
+    classifications,
+    strategy=ClassificationStrategy.HIGHER_ORDER
+)
+```
+
+#### CONSENSUS_WITH_THRESHOLD
+Requires both consensus and minimum confidence:
+
+```python
+result = process.classify(
+    "document.pdf",
+    classifications,
+    strategy=ClassificationStrategy.CONSENSUS_WITH_THRESHOLD,
+    threshold=9
+)
+```
+
+## Best Practices
+
+- Use smaller models in the first layer for cost efficiency
+- Reserve larger models for cases where consensus isn't reached
+- Set appropriate confidence thresholds based on your use case
+- Consider using different model providers for better diversity
+- Monitor and log classification results for each model
+
+For more examples and advanced usage, check out the [examples directory](examples/) in the repository. 
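
Note on the layered setup in `mom.md`: the best practices suggest cheap models first and a larger model only when they disagree. The sketch below illustrates that escalation with plain callables standing in for extractors. The names `classify_in_layers` and `Classifier` are hypothetical, and the rule "move to the next layer when a layer does not agree" is an assumption about how the layers interact, not a description of `Process.classify`.

```python
from typing import Callable, List, Optional, Sequence

Classifier = Callable[[str], str]  # document text -> predicted label


def classify_in_layers(document: str,
                       layers: Sequence[Sequence[Classifier]]) -> Optional[str]:
    """Try each layer in order; accept the first layer whose models all agree."""
    for layer in layers:
        labels: List[str] = [classify(document) for classify in layer]
        if labels and len(set(labels)) == 1:
            return labels[0]
    return None  # no layer reached consensus


# Toy stand-ins for models: two cheap ones that disagree, one stronger fallback.
def cheap_a(text: str) -> str:
    return "Invoice"


def cheap_b(text: str) -> str:
    return "Credit Note"


def strong(text: str) -> str:
    return "Credit Note"


print(classify_in_layers("credit note 42", [[cheap_a, cheap_b], [strong]]))
```

The later layer runs only when the earlier one fails, which is what keeps the expensive model off the common path.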
diff --git a/docs/core-concepts/classification/tree.md b/docs/core-concepts/classification/tree.md
@@ -0,0 +1,101 @@
+# Tree-Based Classification
+
+<div align="center">
+  <img src="../../../assets/classification_tree_image.png" alt="Classification Overview">
+</div>
+
+In document intelligence, challenges often arise when dealing with a large number of similar document types. Tree-based classification organizes classifications into a hierarchical structure, breaking down the task into smaller, more manageable batches.
+
+## Basic Concept
+
+Tree-based classification offers:
+- **Increased Accuracy**: By narrowing down options at each step
+- **Scalability**: Easy addition of new document types
+- **Reduced Context**: Smaller context windows at each level
+
+## Implementation
+
+Here's how to implement a classification tree:
+
+```python
+from typing import List
+
+from extract_thinker import Classification, ClassificationNode, ClassificationTree, Process
+from extract_thinker.models.contract import Contract
+
+# Define contracts for each level
+class FinancialContract(Contract):
+    total_amount: int
+    document_number: str
+    document_date: str
+
+class InvoiceContract(Contract):
+    invoice_number: str
+    invoice_date: str
+    lines: List[LineItem]
+    total_amount: float
+
+class CreditNoteContract(Contract):
+    credit_note_number: str
+    credit_note_date: str
+    lines: List[LineItem]
+    total_amount: float
+
+# Create the classification tree
+financial_docs = ClassificationNode(
+    classification=Classification(
+        name="Financial Documents",
+        description="This is a financial document",
+        contract=FinancialContract,
+    ),
+    children=[
+        ClassificationNode(
+            classification=Classification(
+                name="Invoice",
+                description="This is an invoice",
+                contract=InvoiceContract,
+            )
+        ),
+        ClassificationNode(
+            classification=Classification(
+                name="Credit Note",
+                description="This is a credit note",
+                contract=CreditNoteContract,
+            )
+        )
+    ]
+)
+
+# Create the tree
+classification_tree = ClassificationTree(
+    nodes=[financial_docs]
+)
+
+# Initialize process (extractor is an Extractor set up as in Basic Classification)
+process = Process()
+process.add_classify_extractor([[extractor]])
+
+# Classify using tree
+result = process.classify(
+    "document.pdf",
+    classification_tree,
+    threshold=0.95
+)
+```
+
+## Level-Based Contracts
+
+When implementing tree-based classification, consider contract complexity at each level:
+
+- **First Level**: Use minimal fields for broad categorization
+  ```python
+  class FinancialContract(Contract):
+      total_amount: int  # Just key identifying fields
+  ```
+
+- **Second Level**: Include full field set for precise classification
+  ```python
+  class InvoiceContract(Contract):
+      invoice_number: str
+      invoice_date: str
+      lines: List[LineItem]  # Complete field set
+      total_amount: float
+  ```
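
Note on the tree approach in `tree.md`: the core idea is to choose among a few options per level and descend, so no single prompt has to list every document type. The sketch below shows that traversal with a keyword matcher in place of an LLM call. `Node`, `classify_tree`, and `keyword_chooser` are hypothetical names for illustration and do not reflect the `ClassificationTree` internals.

```python
from dataclasses import dataclass, field
from typing import Callable, List

Chooser = Callable[[str, List[str]], str]  # (document, option names) -> chosen name


@dataclass
class Node:
    name: str
    children: List["Node"] = field(default_factory=list)


def classify_tree(document: str, roots: List[Node], choose: Chooser) -> List[str]:
    """Walk from the roots to a leaf, picking one node per level; return the path."""
    path: List[str] = []
    level = roots
    while level:
        picked = choose(document, [node.name for node in level])
        node = next((n for n in level if n.name == picked), None)
        if node is None:
            raise ValueError(f"chooser returned unknown option: {picked!r}")
        path.append(node.name)
        level = node.children  # only this branch's children are considered next
    return path


def keyword_chooser(document: str, options: List[str]) -> str:
    """Toy chooser: first option whose name appears in the text, else the first option."""
    text = document.lower()
    for option in options:
        if option.lower() in text:
            return option
    return options[0]


tree = [Node("Financial Documents", [Node("Invoice"), Node("Credit Note")])]
print(classify_tree("Credit note for a returned order", tree, keyword_chooser))
```

Each call to the chooser sees only the siblings at one level, which is the reduced-context benefit described under Basic Concept.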