Merge pull request #84 from enoch3712/83-github-page
83 GitHub page
enoch3712 authored Nov 25, 2024
2 parents 35d9882 + db2c4fe commit b3ca396
Showing 44 changed files with 2,478 additions and 2 deletions.
28 changes: 28 additions & 0 deletions .github/workflows/documentation.yml
@@ -0,0 +1,28 @@
name: documentation
on:
  push:
    branches:
      - main
permissions:
  contents: write
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Configure Git Credentials
        run: |
          git config user.name github-actions[bot]
          git config user.email 41898282+github-actions[bot]@users.noreply.github.com
      - uses: actions/setup-python@v4
        with:
          python-version: 3.x
      - run: echo "cache_id=$(date --utc '+%V')" >> $GITHUB_ENV
      - uses: actions/cache@v3
        with:
          key: mkdocs-material-${{ env.cache_id }}
          path: .cache
          restore-keys: |
            mkdocs-material-
      - run: pip install mkdocs-material
      - run: mkdocs gh-deploy --force --docs-dir docs
Binary file added docs/assets/Logo.png
Binary file added docs/assets/azure_image.png
Binary file added docs/assets/chart_and_images.png
Binary file added docs/assets/classification_image.png
Binary file added docs/assets/classification_tree_image.png
Binary file added docs/assets/document_loader.png
Binary file added docs/assets/extract-thinker-overview.png
Binary file added docs/assets/extractor.png
Binary file added docs/assets/favicon.png
Binary file added docs/assets/llm_image.png
Binary file added docs/assets/process_image.png
Binary file added docs/assets/resume_image.png
Binary file added docs/assets/splitter_image.png
88 changes: 88 additions & 0 deletions docs/core-concepts/classification/basic.md
@@ -0,0 +1,88 @@
# Basic Classification

Classifying a document involves extracting its content and adding it to a prompt alongside the candidate classifications. ExtractThinker simplifies this process using Pydantic models and instructor.

## Simple Classification

The most straightforward way to classify documents:

```python
import os

from extract_thinker import Classification, Extractor
from extract_thinker.document_loader import DocumentLoaderTesseract

# DriverLicense and InvoiceContract are the contracts defined under
# "Type Mapping with Contract" below. INVOICE_FILE_PATH is the path to the
# document you want to classify.

# Define classifications
classifications = [
    Classification(
        name="Driver License",
        description="This is a driver license",
        contract=DriverLicense,  # optional. Will be added to the prompt
    ),
    Classification(
        name="Invoice",
        description="This is an invoice",
        contract=InvoiceContract,  # optional. Will be added to the prompt
    ),
]

# Initialize extractor
tesseract_path = os.getenv("TESSERACT_PATH")
document_loader = DocumentLoaderTesseract(tesseract_path)
extractor = Extractor(document_loader)
extractor.load_llm("gpt-4o")

# Classify document
result = extractor.classify(INVOICE_FILE_PATH, classifications)
print(f"Document type: {result.name}, Confidence: {result.confidence}")
```

## Type Mapping with Contract

Adding contract structure to the classification improves accuracy:

```python
from typing import List
from extract_thinker.models.contract import Contract

# LineItem is not defined in this snippet; it is assumed to be another
# Contract subclass describing a single invoice line.

class InvoiceContract(Contract):
    invoice_number: str
    invoice_date: str
    lines: List[LineItem]
    total_amount: float

class DriverLicense(Contract):
    name: str
    age: int
    license_number: str
```

The contract structure is automatically added to the prompt, helping the model understand the expected document structure.
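
To make the idea concrete, here is a minimal sketch of how a contract's fields could be rendered into a classification prompt. It is an illustration only, not ExtractThinker's actual prompt format: `DriverLicenseSketch`, `describe_fields`, and `build_prompt` are hypothetical names, and a standard-library dataclass stands in for `Contract` so the snippet runs without the library installed.

```python
from dataclasses import dataclass, fields


# Hypothetical stand-in for a contract; real contracts subclass Contract.
@dataclass
class DriverLicenseSketch:
    name: str
    age: int
    license_number: str


def describe_fields(contract_cls) -> str:
    """Render each field as '- name: type', one per line."""
    return "\n".join(
        f"- {f.name}: {getattr(f.type, '__name__', str(f.type))}"
        for f in fields(contract_cls)
    )


def build_prompt(document_text: str, candidates: dict) -> str:
    """Combine the document text with every candidate label and its fields."""
    sections = [
        f"{label}\n{describe_fields(contract_cls)}"
        for label, contract_cls in candidates.items()
    ]
    return (
        "Classify the document as one of:\n\n"
        + "\n\n".join(sections)
        + "\n\nDocument:\n"
        + document_text
    )


prompt = build_prompt("LICENSE NO. X123 ...", {"Driver License": DriverLicenseSketch})
print(prompt)
```

Listing the expected fields next to each label gives the model concrete signals to match against the document text, which is the intuition behind passing `contract` to `Classification`.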

## Classification Response

All classifications return a standardized response:

```python
from typing import Optional
from pydantic import BaseModel, Field

class ClassificationResponse(BaseModel):
    name: str
    confidence: Optional[int] = Field(
        description="From 1 to 10. 10 being the highest confidence",
        ge=1,
        le=10
    )
```

## Best Practices

- Provide clear, distinctive descriptions for each classification
- Use contract structures when possible
- Consider using image classification for visual documents
- Monitor confidence scores
- Handle low-confidence cases appropriately
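
As a rough sketch of the last two points, the snippet below routes a result to manual review when its confidence is missing or below a cutoff. `ResultSketch` is a hypothetical stand-in that mirrors the `name` and `confidence` fields of `ClassificationResponse`; the `route` helper, the `"manual_review"` label, and the default cutoff of 8 are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical stand-in mirroring ClassificationResponse (name, confidence).
@dataclass
class ResultSketch:
    name: str
    confidence: Optional[int] = None


def route(result: ResultSketch, min_confidence: int = 8) -> str:
    """Return the predicted label, or 'manual_review' when confidence is
    missing or below min_confidence (on the 1 to 10 scale)."""
    if result.confidence is None or result.confidence < min_confidence:
        return "manual_review"
    return result.name


print(route(ResultSketch("Invoice", 9)))
print(route(ResultSketch("Invoice", 5)))
```

The right cutoff depends on how costly a misclassification is for your pipeline, so treat the default here as a placeholder.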

For more advanced classification techniques, see:
- [Mixture of Models (MoM)](mom.md)
- [Tree-Based Classification](tree.md)
- [Vision Classification](vision.md)
63 changes: 63 additions & 0 deletions docs/core-concepts/classification/index.md
@@ -0,0 +1,63 @@
# Document Classification

In document intelligence, classification is often the crucial first step: it sets the stage for subsequent processes such as data extraction and analysis. Before the rise of LLMs, this was done (and often still is) with AI models trained in-house for specific use cases. Services such as Azure Document Intelligence offer this feature, but they are not dynamic and can lead to vendor lock-in.

LLMs may not be the most efficient option for this task, but they are vendor-agnostic and near-perfect at it.

<div align="center">
<img src="../../assets/classification_image.png" alt="Classification Overview">
</div>

## Classification Techniques

<div class="grid cards">
<ul>
<li>
<p><span class="twemoji"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"><path d="M19 3H5c-1.1 0-2 .9-2 2v14c0 1.1.9 2 2 2h14c1.1 0 2-.9 2-2V5c0-1.1-.9-2-2-2m0 16H5V5h14v14m-2-2H7v-2h10v2m-10-4h10v2H7v-2m10-6v2H7V7h10Z"></path></svg></span> <strong>Basic Classification</strong></p>
<p>Simple yet powerful classification using a single LLM with contract mapping.</p>
<p><a href="basic"><span class="twemoji"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 16 16"><path d="M8.22 2.97a.75.75 0 0 1 1.06 0l4.25 4.25a.75.75 0 0 1 0 1.06l-4.25 4.25a.75.75 0 0 1-1.042-.018.75.75 0 0 1-.018-1.042l2.97-2.97H3.75a.75.75 0 0 1 0-1.5h7.44L8.22 4.03a.75.75 0 0 1 0-1.06"></path></svg></span> Learn More</a></p>
</li>
<li>
<p><span class="twemoji"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"><path d="M16 17v2H2v-2s0-4 7-4 7 4 7 4m-7-6a4 4 0 0 0 4-4 4 4 0 0 0-4-4 4 4 0 0 0-4 4 4 4 0 0 0 4 4m8.8 4c1.2.7 2.2 1.7 2.2 3v2h3v-2s0-2.9-5.2-3M15 4a4 4 0 0 0 1.8 3.3A4 4 0 0 1 19 11c1.9 0 3-1.3 3-3a4 4 0 0 0-4-4h-3Z"></path></svg></span> <strong>Mixture of Models (MoM)</strong></p>
<p>Enhance accuracy by combining multiple models with different strategies.</p>
<p><a href="mom"><span class="twemoji"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 16 16"><path d="M8.22 2.97a.75.75 0 0 1 1.06 0l4.25 4.25a.75.75 0 0 1 0 1.06l-4.25 4.25a.75.75 0 0 1-1.042-.018.75.75 0 0 1-.018-1.042l2.97-2.97H3.75a.75.75 0 0 1 0-1.5h7.44L8.22 4.03a.75.75 0 0 1 0-1.06"></path></svg></span> Learn More</a></p>
</li>
<li>
<p><span class="twemoji"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"><path d="M3 3h18v2H3V3m0 16h18v2H3v-2m0-8h18v2H3v-2m0 4h8v2H3v-2m0-8h8v2H3V7m8 4h10v2H11v-2m0 8h10v2H11v-2m0-8h10v2H11V7"></path></svg></span> <strong>Tree-Based Classification</strong></p>
<p>Handle complex hierarchies and similar document types efficiently.</p>
<p><a href="tree"><span class="twemoji"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 16 16"><path d="M8.22 2.97a.75.75 0 0 1 1.06 0l4.25 4.25a.75.75 0 0 1 0 1.06l-4.25 4.25a.75.75 0 0 1-1.042-.018.75.75 0 0 1-.018-1.042l2.97-2.97H3.75a.75.75 0 0 1 0-1.5h7.44L8.22 4.03a.75.75 0 0 1 0-1.06"></path></svg></span> Learn More</a></p>
</li>
<li>
<p><span class="twemoji"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"><path d="M12 9a3 3 0 0 0-3 3 3 3 0 0 0 3 3 3 3 0 0 0 3-3 3 3 0 0 0-3-3m0 8a5 5 0 0 1-5-5 5 5 0 0 1 5-5 5 5 0 0 1 5 5 5 5 0 0 1-5 5m0-12.5C7 4.5 2.73 7.61 1 12c1.73 4.39 6 7.5 11 7.5s9.27-3.11 11-7.5c-1.73-4.39-6-7.5-11-7.5Z"></path></svg></span> <strong>Vision Classification</strong></p>
<p>Leverage visual features for better accuracy.</p>
<p><a href="vision"><span class="twemoji"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 16 16"><path d="M8.22 2.97a.75.75 0 0 1 1.06 0l4.25 4.25a.75.75 0 0 1 0 1.06l-4.25 4.25a.75.75 0 0 1-1.042-.018.75.75 0 0 1-.018-1.042l2.97-2.97H3.75a.75.75 0 0 1 0-1.5h7.44L8.22 4.03a.75.75 0 0 1 0-1.06"></path></svg></span> Learn More</a></p>
</li>
</ul>
</div>

## Classification Response

All classification methods return a standardized response:

```python
from typing import Optional
from pydantic import BaseModel, Field

class ClassificationResponse(BaseModel):
    name: str
    confidence: Optional[int] = Field(
        description="From 1 to 10. 10 being the highest confidence",
        ge=1,
        le=10
    )
```

## Available Strategies

ExtractThinker supports three main classification strategies:

- **CONSENSUS**: All models must agree on the classification
- **HIGHER_ORDER**: Uses the result with the highest confidence
- **CONSENSUS_WITH_THRESHOLD**: Requires consensus and minimum confidence
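
The three strategies can be sketched in plain Python as below. This illustrates the descriptions above and is not ExtractThinker's implementation: `Vote` and the three functions are hypothetical names, and applying the threshold to every model's confidence (rather than, say, the average) is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


# Hypothetical container for one model's answer.
@dataclass
class Vote:
    name: str
    confidence: int  # 1 to 10


def consensus(votes: List[Vote]) -> Optional[str]:
    """Return the label only if every model chose the same one."""
    names = {v.name for v in votes}
    return votes[0].name if len(names) == 1 else None


def higher_order(votes: List[Vote]) -> Optional[str]:
    """Return the label of the most confident model."""
    return max(votes, key=lambda v: v.confidence).name if votes else None


def consensus_with_threshold(votes: List[Vote], threshold: int) -> Optional[str]:
    """Require agreement and that every vote meets the threshold (assumed rule)."""
    agreed = consensus(votes)
    if agreed is None:
        return None
    return agreed if all(v.confidence >= threshold for v in votes) else None


votes = [Vote("Invoice", 9), Vote("Invoice", 7)]
print(consensus(votes))
print(consensus_with_threshold(votes, threshold=9))
print(higher_order([Vote("Invoice", 6), Vote("Driver License", 9)]))
```

Returning `None` when a strategy cannot decide leaves the caller free to escalate to another model or to manual review.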

For detailed implementation of each technique, visit their respective pages.
97 changes: 97 additions & 0 deletions docs/core-concepts/classification/mom.md
@@ -0,0 +1,97 @@
# Mixture of Models (MoM)

The Mixture of Models (MoM) is a pattern that increases classification confidence by combining multiple models in parallel. This approach is particularly effective when using instructor for structured outputs.

## Basic Usage

```python
import os

from extract_thinker import Process, Classification, ClassificationStrategy, Extractor
from extract_thinker.document_loader import DocumentLoaderTesseract

# Define classifications
classifications = [
    Classification(
        name="Driver License",
        description="This is a driver license",
    ),
    Classification(
        name="Invoice",
        description="This is an invoice",
    ),
]

# Initialize document loader
tesseract_path = os.getenv("TESSERACT_PATH")
document_loader = DocumentLoaderTesseract(tesseract_path)

# Initialize multiple extractors with different models
gpt_35_extractor = Extractor(document_loader)
gpt_35_extractor.load_llm("gpt-3.5-turbo")

claude_extractor = Extractor(document_loader)
claude_extractor.load_llm("claude-3-haiku-20240307")

gpt4_extractor = Extractor(document_loader)
gpt4_extractor.load_llm("gpt-4o")

# Create process with multiple extractors
process = Process()
process.add_classify_extractor([
    [gpt_35_extractor, claude_extractor],  # First layer
    [gpt4_extractor],  # Second layer
])

# Classify, requiring consensus and a minimum confidence
result = process.classify(
    "document.pdf",
    classifications,
    strategy=ClassificationStrategy.CONSENSUS_WITH_THRESHOLD,
    threshold=9
)
```

## Available Strategies

#### CONSENSUS
All models must agree on the classification:

```python
result = process.classify(
    "document.pdf",
    classifications,
    strategy=ClassificationStrategy.CONSENSUS
)
```

#### HIGHER_ORDER
Uses the result with the highest confidence score:

```python
result = process.classify(
    "document.pdf",
    classifications,
    strategy=ClassificationStrategy.HIGHER_ORDER
)
```

#### CONSENSUS_WITH_THRESHOLD
Requires both consensus and minimum confidence:

```python
result = process.classify(
    "document.pdf",
    classifications,
    strategy=ClassificationStrategy.CONSENSUS_WITH_THRESHOLD,
    threshold=9
)
```
```

## Best Practices

- Use smaller models in the first layer for cost efficiency
- Reserve larger models for cases where consensus isn't reached
- Set appropriate confidence thresholds based on your use case
- Consider using different model providers for better diversity
- Monitor and log classification results for each model
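
The layering advice can be illustrated with a small, library-free sketch: cheaper classifiers run first, and the next layer is consulted only when the current one fails to agree with enough confidence. Everything here is hypothetical (`classify_in_layers`, the stub models, and the rule that each model in a layer must meet the threshold); it is not how `Process` is implemented internally.

```python
from typing import Callable, List, Optional, Tuple

# A classifier takes document text and returns (label, confidence from 1 to 10).
Classifier = Callable[[str], Tuple[str, int]]


def classify_in_layers(
    document: str, layers: List[List[Classifier]], threshold: int
) -> Optional[str]:
    """Try each layer in order; accept the first layer whose models all
    agree on one label with confidence at or above the threshold."""
    for layer in layers:
        results = [classifier(document) for classifier in layer]
        labels = {label for label, _ in results}
        confident = all(confidence >= threshold for _, confidence in results)
        if len(labels) == 1 and confident:
            return results[0][0]
    return None  # no layer reached a confident consensus


# Stub models standing in for real LLM-backed extractors.
def small_model_a(_: str) -> Tuple[str, int]:
    return ("Invoice", 9)


def small_model_b(_: str) -> Tuple[str, int]:
    return ("Credit Note", 6)


def large_model(_: str) -> Tuple[str, int]:
    return ("Invoice", 10)


label = classify_in_layers(
    "document text", [[small_model_a, small_model_b], [large_model]], threshold=9
)
print(label)
```

In this run the two small models disagree, so the decision falls through to the larger model in the second layer.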

For more examples and advanced usage, check out the [examples directory](examples/) in the repository.
101 changes: 101 additions & 0 deletions docs/core-concepts/classification/tree.md
@@ -0,0 +1,101 @@
# Tree-Based Classification

<div align="center">
<img src="../../../assets/classification_tree_image.png" alt="Classification Overview">
</div>

In document intelligence, challenges often arise when dealing with a large number of similar document types. Tree-based classification organizes classifications into a hierarchical structure, breaking down the task into smaller, more manageable batches.

## Basic Concept

Tree-based classification offers:
- **Increased Accuracy**: By narrowing down options at each step
- **Scalability**: Easy addition of new document types
- **Reduced Context**: Smaller context windows at each level
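
The narrowing step can be sketched without the library: at each level only the sibling names are offered to the classifier, and the walk descends into the chosen node until it reaches a leaf. `Node`, `classify_with_tree`, and `keyword_chooser` are hypothetical names, and the keyword matcher is only a stand-in for an LLM call.

```python
from dataclasses import dataclass, field
from typing import Callable, List


# Hypothetical tree node: a label plus optional child labels.
@dataclass
class Node:
    name: str
    children: List["Node"] = field(default_factory=list)


def classify_with_tree(roots: List[Node], choose: Callable[[List[str]], str]) -> List[str]:
    """Walk the tree level by level, offering only sibling names to `choose`,
    and return the path of selected names from root to leaf."""
    path: List[str] = []
    level = roots
    while level:
        options = [node.name for node in level]
        picked = choose(options)
        node = next(n for n in level if n.name == picked)
        path.append(node.name)
        level = node.children
    return path


def keyword_chooser(text: str) -> Callable[[List[str]], str]:
    """Stand-in for a model: pick the first option whose name appears in the
    text, falling back to the first option."""
    lowered = text.lower()

    def choose(options: List[str]) -> str:
        for option in options:
            if option.lower() in lowered:
                return option
        return options[0]

    return choose


tree = [Node("Financial Documents", [Node("Invoice"), Node("Credit Note")])]
print(classify_with_tree(tree, keyword_chooser("CREDIT NOTE #42")))
```

Because each call sees only one level's options, the prompt stays small even when the full set of document types is large.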

## Implementation

Here's how to implement a classification tree:

```python
from typing import List

from extract_thinker import Classification, ClassificationNode, ClassificationTree, Process
from extract_thinker.models.contract import Contract

# LineItem is assumed to be a Contract subclass defined elsewhere, and
# `extractor` an Extractor configured as shown in Basic Classification.

# Define contracts for each level
class FinancialContract(Contract):
    total_amount: int
    document_number: str
    document_date: str

class InvoiceContract(Contract):
    invoice_number: str
    invoice_date: str
    lines: List[LineItem]
    total_amount: float

class CreditNoteContract(Contract):
    credit_note_number: str
    credit_note_date: str
    lines: List[LineItem]
    total_amount: float

# Create the classification tree
financial_docs = ClassificationNode(
    classification=Classification(
        name="Financial Documents",
        description="This is a financial document",
        contract=FinancialContract,
    ),
    children=[
        ClassificationNode(
            classification=Classification(
                name="Invoice",
                description="This is an invoice",
                contract=InvoiceContract,
            )
        ),
        ClassificationNode(
            classification=Classification(
                name="Credit Note",
                description="This is a credit note",
                contract=CreditNoteContract,
            )
        )
    ]
)

# Create the tree
classification_tree = ClassificationTree(
    nodes=[financial_docs]
)

# Initialize process
process = Process()
process.add_classify_extractor([[extractor]])

# Classify using tree
result = process.classify(
    "document.pdf",
    classification_tree,
    threshold=0.95
)
```

## Level-Based Contracts

When implementing tree-based classification, consider contract complexity at each level:

- **First Level**: Use minimal fields for broad categorization
  ```python
  class FinancialContract(Contract):
      total_amount: int  # Just key identifying fields
  ```

- **Second Level**: Include full field set for precise classification
  ```python
  class InvoiceContract(Contract):
      invoice_number: str
      invoice_date: str
      lines: List[LineItem]  # Complete field set
      total_amount: float
  ```