Merge branch 'docs' into 83-github-page

enoch3712 · Nov 25, 2024 · 82b8d66
2 parents 35d9882 + da2ba36
commit 82b8d66

Showing 35 changed files with 1,968 additions and 2 deletions.
diff --git a/.github/workflows/documentation.yml b/.github/workflows/documentation.yml
@@ -0,0 +1,28 @@
+name: documentation
+on:
+  push:
+    branches:
+      - main
+permissions:
+  contents: write
+jobs:
+  deploy:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - name: Configure Git Credentials
+        run: |
+          git config user.name github-actions[bot]
+          git config user.email 41898282+github-actions[bot]@users.noreply.github.com
+      - uses: actions/setup-python@v4
+        with:
+          python-version: 3.x
+      - run: echo "cache_id=$(date --utc '+%V')" >> $GITHUB_ENV 
+      - uses: actions/cache@v3
+        with:
+          key: mkdocs-material-${{ env.cache_id }}
+          path: .cache
+          restore-keys: |
+            mkdocs-material-
+      - run: pip install mkdocs-material 
+      - run: mkdocs gh-deploy --force 
diff --git a/docs/assets/Logo.png b/docs/assets/Logo.png
diff --git a/docs/assets/chart_and_images.png b/docs/assets/chart_and_images.png
diff --git a/docs/assets/document_loader.png b/docs/assets/document_loader.png
diff --git a/docs/assets/extract-thinker-overview.png b/docs/assets/extract-thinker-overview.png
diff --git a/docs/assets/extractor.png b/docs/assets/extractor.png
diff --git a/docs/assets/favicon.png b/docs/assets/favicon.png
diff --git a/docs/assets/llm_image.png b/docs/assets/llm_image.png
diff --git a/docs/assets/process_image.png b/docs/assets/process_image.png
diff --git a/docs/assets/splitter_image.png b/docs/assets/splitter_image.png
diff --git a/docs/core-concepts/classification/basic.md b/docs/core-concepts/classification/basic.md
@@ -0,0 +1,88 @@
+# Basic Classification
+
+Classifying a document means extracting its content and adding it to a prompt alongside the candidate classifications. ExtractThinker simplifies this process using Pydantic models and instructor.
+
+## Simple Classification
+
+The most straightforward way to classify documents:
+
+```python
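+import os
+
+from extract_thinker import Classification, Extractor
+from extract_thinker.document_loader import DocumentLoaderTesseract
+
+# DriverLicense and InvoiceContract are the contracts defined under
+# "Type Mapping with Contract"; INVOICE_FILE_PATH is the path to your document.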
+
+# Define classifications
+classifications = [
+    Classification(
+        name="Driver License",
+        description="This is a driver license",
+        contract=DriverLicense,  # optional. Will be added to the prompt
+    ),
+    Classification(
+        name="Invoice",
+        description="This is an invoice",
+        contract=InvoiceContract,  # optional. Will be added to the prompt
+    ),
+]
+
+# Initialize extractor
+tesseract_path = os.getenv("TESSERACT_PATH")
+document_loader = DocumentLoaderTesseract(tesseract_path)
+extractor = Extractor(document_loader)
+extractor.load_llm("gpt-4o")
+
+# Classify document
+result = extractor.classify(INVOICE_FILE_PATH, classifications)
+print(f"Document type: {result.name}, Confidence: {result.confidence}")
+```
+
+## Type Mapping with Contract
+
+Adding contract structure to the classification improves accuracy:
+
+```python
+from typing import List
+from extract_thinker.models.contract import Contract
+
+# LineItem is assumed to be another Contract describing a single invoice line (not shown here).
+class InvoiceContract(Contract):
+    invoice_number: str
+    invoice_date: str
+    lines: List[LineItem]
+    total_amount: float
+
+class DriverLicense(Contract):
+    name: str
+    age: int
+    license_number: str
+```
+
+The contract structure is automatically added to the prompt, helping the model understand the expected document structure.
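
To make the idea concrete, here is a minimal sketch of how classification options and contract fields could be rendered into a prompt. It is not ExtractThinker's actual prompt builder: `ClassificationOption`, `InvoiceFields`, `describe_contract`, and `build_prompt` are hypothetical names, and the contract is a plain annotated class standing in for a real `Contract`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClassificationOption:
    """Hypothetical stand-in for ExtractThinker's Classification."""
    name: str
    description: str
    contract: Optional[type] = None


class InvoiceFields:
    """Stand-in contract: a plain annotated class, not a real Contract."""
    invoice_number: str
    total_amount: float


def describe_contract(contract: type) -> str:
    """Render a class's annotated fields as 'field: type' pairs."""
    hints = getattr(contract, "__annotations__", {})
    return ", ".join(
        f"{field}: {getattr(tp, '__name__', str(tp))}" for field, tp in hints.items()
    )


def build_prompt(content: str, options: list[ClassificationOption]) -> str:
    """Assemble an illustrative prompt: candidate types first, then the document text."""
    lines = ["Classify the document into exactly one of these types:"]
    for opt in options:
        line = f"- {opt.name}: {opt.description}"
        if opt.contract is not None:
            # The expected fields give the model extra structure to match against.
            line += f" (expected fields: {describe_contract(opt.contract)})"
        lines.append(line)
    lines += ["", "Document content:", content]
    return "\n".join(lines)


prompt = build_prompt(
    "Invoice #123, total 40.00",
    [
        ClassificationOption("Invoice", "This is an invoice", InvoiceFields),
        ClassificationOption("Driver License", "This is a driver license"),
    ],
)
print(prompt)
```

Options without a contract are listed by name and description only, which mirrors the `contract` argument being optional.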
+
+## Classification Response
+
+All classifications return a standardized response:
+
+```python
+from typing import Optional
+from pydantic import BaseModel, Field
+
+class ClassificationResponse(BaseModel):
+    name: str
+    confidence: Optional[int] = Field(
+        description="From 1 to 10. 10 being the highest confidence",
+        ge=1, 
+        le=10
+    )
+```
+
+## Best Practices
+
+- Provide clear, distinctive descriptions for each classification
+- Use contract structures when possible
+- Consider using image classification for visual documents
+- Monitor confidence scores
+- Handle low-confidence cases appropriately
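
One way to handle low-confidence cases is to route them to manual review. The sketch below assumes a result object with the same `name` and `confidence` fields as `ClassificationResponse`; the `Result` class, the `route` helper, and the default of 8 are illustrative choices, not part of the library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Result:
    """Stand-in with the same two fields as ClassificationResponse."""
    name: str
    confidence: Optional[int] = None


def route(result: Result, min_confidence: int = 8) -> str:
    """Return 'accept' or 'review' for a classification result.

    min_confidence is an illustrative default, not a library setting.
    A missing confidence is treated as low confidence.
    """
    if result.confidence is None or result.confidence < min_confidence:
        return "review"
    return "accept"


print(route(Result("Invoice", 9)))         # accept
print(route(Result("Driver License", 5)))  # review
print(route(Result("Invoice")))            # review
```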
+
+For more advanced classification techniques, see:
+- [Mixture of Models (MoM)](mom.md)
+- [Tree-Based Classification](tree.md)
+- [Vision Classification](vision.md) 
diff --git a/docs/core-concepts/classification/index.md b/docs/core-concepts/classification/index.md
@@ -0,0 +1,79 @@
+# Document Classification
+
+In document intelligence, classification is often the crucial first step: it sets the stage for subsequent processes such as data extraction and analysis. Before the rise of LLMs, this was accomplished (and often still is) with AI models trained in-house for specific use cases. Services such as Azure Document Intelligence offer this feature, but they are not dynamic and can lead to vendor lock-in.
+
+LLMs may not be the most efficient tool for this task, but they are vendor-agnostic and near-perfect at it.
+
+<div align="center">
+  <img src="../../../assets/classification_overview.png" alt="Classification Overview">
+</div>
+
+## Classification Techniques
+
+<div class="grid cards">
+    <ul>
+        <li>
+            <p><span class="twemoji"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"><path d="M19 3H5c-1.1 0-2 .9-2 2v14c0 1.1.9 2 2 2h14c1.1 0 2-.9 2-2V5c0-1.1-.9-2-2-2m0 16H5V5h14v14m-2-2H7v-2h10v2m-10-4h10v2H7v-2m10-6v2H7V7h10Z"></path></svg></span> <strong>Basic Classification</strong></p>
+            <p>Simple yet powerful classification using a single LLM with contract mapping.</p>
+            <p><a href="basic"><span class="twemoji"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 16 16"><path d="M8.22 2.97a.75.75 0 0 1 1.06 0l4.25 4.25a.75.75 0 0 1 0 1.06l-4.25 4.25a.75.75 0 0 1-1.042-.018.75.75 0 0 1-.018-1.042l2.97-2.97H3.75a.75.75 0 0 1 0-1.5h7.44L8.22 4.03a.75.75 0 0 1 0-1.06"></path></svg></span> Learn More</a></p>
+        </li>
+        <li>
+            <p><span class="twemoji"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"><path d="M16 17v2H2v-2s0-4 7-4 7 4 7 4m-7-6a4 4 0 0 0 4-4 4 4 0 0 0-4-4 4 4 0 0 0-4 4 4 4 0 0 0 4 4m8.8 4c1.2.7 2.2 1.7 2.2 3v2h3v-2s0-2.9-5.2-3M15 4a4 4 0 0 0 1.8 3.3A4 4 0 0 1 19 11c1.9 0 3-1.3 3-3a4 4 0 0 0-4-4h-3Z"></path></svg></span> <strong>Mixture of Models (MoM)</strong></p>
+            <p>Enhance accuracy by combining multiple models with different strategies.</p>
+            <p><a href="mom"><span class="twemoji"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 16 16"><path d="M8.22 2.97a.75.75 0 0 1 1.06 0l4.25 4.25a.75.75 0 0 1 0 1.06l-4.25 4.25a.75.75 0 0 1-1.042-.018.75.75 0 0 1-.018-1.042l2.97-2.97H3.75a.75.75 0 0 1 0-1.5h7.44L8.22 4.03a.75.75 0 0 1 0-1.06"></path></svg></span> Learn More</a></p>
+        </li>
+        <li>
+            <p><span class="twemoji"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"><path d="M3 3h18v2H3V3m0 16h18v2H3v-2m0-8h18v2H3v-2m0 4h8v2H3v-2m0-8h8v2H3V7m8 4h10v2H11v-2m0 8h10v2H11v-2m0-8h10v2H11V7"></path></svg></span> <strong>Tree-Based Classification</strong></p>
+            <p>Handle complex hierarchies and similar document types efficiently.</p>
+            <p><a href="tree"><span class="twemoji"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 16 16"><path d="M8.22 2.97a.75.75 0 0 1 1.06 0l4.25 4.25a.75.75 0 0 1 0 1.06l-4.25 4.25a.75.75 0 0 1-1.042-.018.75.75 0 0 1-.018-1.042l2.97-2.97H3.75a.75.75 0 0 1 0-1.5h7.44L8.22 4.03a.75.75 0 0 1 0-1.06"></path></svg></span> Learn More</a></p>
+        </li>
+        <li>
+            <p><span class="twemoji"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"><path d="M12 9a3 3 0 0 0-3 3 3 3 0 0 0 3 3 3 3 0 0 0 3-3 3 3 0 0 0-3-3m0 8a5 5 0 0 1-5-5 5 5 0 0 1 5-5 5 5 0 0 1 5 5 5 5 0 0 1-5 5m0-12.5C7 4.5 2.73 7.61 1 12c1.73 4.39 6 7.5 11 7.5s9.27-3.11 11-7.5c-1.73-4.39-6-7.5-11-7.5Z"></path></svg></span> <strong>Vision Classification</strong></p>
+            <p>Leverage visual features for better accuracy.</p>
+            <p><a href="vision"><span class="twemoji"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 16 16"><path d="M8.22 2.97a.75.75 0 0 1 1.06 0l4.25 4.25a.75.75 0 0 1 0 1.06l-4.25 4.25a.75.75 0 0 1-1.042-.018.75.75 0 0 1-.018-1.042l2.97-2.97H3.75a.75.75 0 0 1 0-1.5h7.44L8.22 4.03a.75.75 0 0 1 0-1.06"></path></svg></span> Learn More</a></p>
+        </li>
+    </ul>
+</div>
+
+## Classification Response
+
+All classification methods return a standardized response:
+
+```python
+from typing import Optional
+from pydantic import BaseModel, Field
+
+class ClassificationResponse(BaseModel):
+    name: str
+    confidence: Optional[int] = Field(
+        description="From 1 to 10. 10 being the highest confidence",
+        ge=1, 
+        le=10
+    )
+```
+
+## Available Strategies
+
+ExtractThinker supports three main strategies for combining classification results from multiple models:
+
+- **CONSENSUS**: All models must agree on the classification
+- **HIGHER_ORDER**: Uses the result with the highest confidence
+- **CONSENSUS_WITH_THRESHOLD**: Requires consensus and minimum confidence
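
The three strategies can be sketched as plain functions over a list of per-model votes. This is only an illustration of the rules as described above, not ExtractThinker's implementation: the `Vote` class is hypothetical, returning `None` when a strategy fails is an assumption, and so is requiring every vote (rather than an average) to meet the threshold.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Vote:
    """One model's answer: a class name and a confidence from 1 to 10."""
    name: str
    confidence: int


def consensus(votes: list[Vote]) -> Optional[str]:
    """Return the class name only if every model chose the same one."""
    names = {v.name for v in votes}
    return names.pop() if len(names) == 1 else None


def higher_order(votes: list[Vote]) -> Optional[str]:
    """Return the class name of the single most confident vote."""
    return max(votes, key=lambda v: v.confidence).name if votes else None


def consensus_with_threshold(votes: list[Vote], threshold: int) -> Optional[str]:
    """Require agreement and (assumption) every vote at or above the threshold."""
    agreed = consensus(votes)
    if agreed is None:
        return None
    return agreed if all(v.confidence >= threshold for v in votes) else None


agree = [Vote("Invoice", 9), Vote("Invoice", 10)]
split = [Vote("Invoice", 6), Vote("Driver License", 9)]

print(consensus(agree), consensus(split))
print(higher_order(split))
print(consensus_with_threshold(agree, 9), consensus_with_threshold(agree, 10))
```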
+
+## Common Challenges
+
+1. **Large Context Windows**: More classifications mean larger contexts
+2. **Similar Documents**: Distinguishing between similar document types
+3. **Confidence Levels**: Ensuring high confidence in classifications
+4. **Scalability**: Handling a growing number of document types
+
+## Best Practices
+
+- Start with basic classification for simple cases
+- Use MoM for critical classifications
+- Implement a tree-based approach for similar documents
+- Consider vision classification for complex layouts
+- Set appropriate confidence thresholds
+- Monitor and log classification results
+
+For detailed implementation of each technique, visit their respective pages. 
diff --git a/docs/core-concepts/classification/mom.md b/docs/core-concepts/classification/mom.md
@@ -0,0 +1,97 @@
+# Mixture of Models (MoM)
+
+Mixture of Models (MoM) is a pattern that increases classification confidence by running multiple models in parallel and combining their results. This approach is particularly effective when using instructor for structured outputs.
+
+## Basic Usage
+
+```python
+import os
+
+from extract_thinker import Process, Classification, ClassificationStrategy, Extractor
+from extract_thinker.document_loader import DocumentLoaderTesseract
+
+# Define classifications
+classifications = [
+    Classification(
+        name="Driver License",
+        description="This is a driver license",
+    ),
+    Classification(
+        name="Invoice",
+        description="This is an invoice",
+    ),
+]
+
+# Initialize document loader
+tesseract_path = os.getenv("TESSERACT_PATH")
+document_loader = DocumentLoaderTesseract(tesseract_path)
+
+# Initialize multiple extractors with different models
+gpt_35_extractor = Extractor(document_loader)
+gpt_35_extractor.load_llm("gpt-3.5-turbo")
+
+claude_extractor = Extractor(document_loader)
+claude_extractor.load_llm("claude-3-haiku-20240307")
+
+gpt4_extractor = Extractor(document_loader)
+gpt4_extractor.load_llm("gpt-4o")
+
+# Create process with multiple extractors
+process = Process()
+process.add_classify_extractor([
+    [gpt_35_extractor, claude_extractor],  # First layer
+    [gpt4_extractor],                              # Second layer
+])
+
+# Classify, requiring consensus with a minimum confidence of 9
+result = process.classify(
+    "document.pdf",
+    classifications,
+    strategy=ClassificationStrategy.CONSENSUS_WITH_THRESHOLD,
+    threshold=9
+)
+```
+
+## Available Strategies
+
+#### CONSENSUS
+All models must agree on the classification:
+
+```python
+result = process.classify(
+    "document.pdf",
+    classifications,
+    strategy=ClassificationStrategy.CONSENSUS
+)
+```
+
+#### HIGHER_ORDER
+Uses the result with the highest confidence score:
+
+```python
+result = process.classify(
+    "document.pdf",
+    classifications,
+    strategy=ClassificationStrategy.HIGHER_ORDER
+)
+```
+
+#### CONSENSUS_WITH_THRESHOLD
+Requires both consensus and minimum confidence:
+
+```python
+result = process.classify(
+    "document.pdf",
+    classifications,
+    strategy=ClassificationStrategy.CONSENSUS_WITH_THRESHOLD,
+    threshold=9
+)
+```
+
+## Best Practices
+
+- Use smaller models in the first layer for cost efficiency
+- Reserve larger models for cases where consensus isn't reached
+- Set appropriate confidence thresholds based on your use case
+- Consider using different model providers for better diversity
+- Monitor and log classification results for each model
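
The layering advice can be pictured as an escalation loop: cheap models vote first, and the next layer is consulted only when they disagree or are not confident enough. The sketch below is a simplified illustration under that assumption, not how `Process` is implemented; classifiers are plain functions returning `(name, confidence)`, and `classify_in_layers` and `make` are hypothetical helpers.

```python
from typing import Callable, Optional

# A classifier takes the document text and returns (class name, confidence 1-10).
Classifier = Callable[[str], tuple[str, int]]


def classify_in_layers(
    document: str, layers: list[list[Classifier]], threshold: int
) -> Optional[str]:
    """Try each layer in order; stop at the first one that agrees confidently."""
    for layer in layers:
        votes = [clf(document) for clf in layer]
        names = {name for name, _ in votes}
        if len(names) == 1 and all(conf >= threshold for _, conf in votes):
            return names.pop()
    return None  # no layer reached a confident consensus


calls: list[str] = []


def make(label: str, name: str, conf: int) -> Classifier:
    """Build a fake classifier that records when it is called."""
    def clf(document: str) -> tuple[str, int]:
        calls.append(label)
        return name, conf
    return clf


small_a = make("small_a", "Invoice", 7)
small_b = make("small_b", "Driver License", 8)
large = make("large", "Invoice", 9)

result = classify_in_layers("document text", [[small_a, small_b], [large]], threshold=9)
print(result, calls)
```

Here the two small models disagree, so the larger model is called and its answer is used. When the first layer already agrees with enough confidence, the second layer is never invoked, which is where the cost saving comes from.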
+
+For more examples and advanced usage, check out the [examples directory](examples/) in the repository.