Merge branch 'docs' into 83-github-page
enoch3712 committed Nov 25, 2024
2 parents 35d9882 + da2ba36 commit 82b8d66
Showing 35 changed files with 1,968 additions and 2 deletions.
28 changes: 28 additions & 0 deletions .github/workflows/documentation.yml
@@ -0,0 +1,28 @@
name: documentation
on:
  push:
    branches:
      - main
permissions:
  contents: write
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Configure Git Credentials
        run: |
          git config user.name github-actions[bot]
          git config user.email 41898282+github-actions[bot]@users.noreply.github.com
      - uses: actions/setup-python@v4
        with:
          python-version: 3.x
      - run: echo "cache_id=$(date --utc '+%V')" >> $GITHUB_ENV
      - uses: actions/cache@v3
        with:
          key: mkdocs-material-${{ env.cache_id }}
          path: .cache
          restore-keys: |
            mkdocs-material-
      - run: pip install mkdocs-material
      - run: mkdocs gh-deploy --force
Binary file added docs/assets/Logo.png
Binary file added docs/assets/chart_and_images.png
Binary file added docs/assets/document_loader.png
Binary file added docs/assets/extract-thinker-overview.png
Binary file added docs/assets/extractor.png
Binary file added docs/assets/favicon.png
Binary file added docs/assets/llm_image.png
Binary file added docs/assets/process_image.png
Binary file added docs/assets/splitter_image.png
88 changes: 88 additions & 0 deletions docs/core-concepts/classification/basic.md
@@ -0,0 +1,88 @@
# Basic Classification

Classifying a document involves extracting its content and adding it to a prompt alongside the candidate classifications. ExtractThinker simplifies this process using Pydantic models and instructor.

## Simple Classification

The most straightforward way to classify documents:

```python
import os

from extract_thinker import Classification, Extractor
from extract_thinker.document_loader import DocumentLoaderTesseract

# DriverLicense and InvoiceContract are the contracts defined under
# "Type Mapping with Contract" below.

# Define classifications
classifications = [
    Classification(
        name="Driver License",
        description="This is a driver license",
        contract=DriverLicense,  # optional. Will be added to the prompt
    ),
    Classification(
        name="Invoice",
        description="This is an invoice",
        contract=InvoiceContract,  # optional. Will be added to the prompt
    ),
]

# Initialize extractor
tesseract_path = os.getenv("TESSERACT_PATH")
document_loader = DocumentLoaderTesseract(tesseract_path)
extractor = Extractor(document_loader)
extractor.load_llm("gpt-4o")

# Classify document (INVOICE_FILE_PATH is the path of the file to classify)
result = extractor.classify(INVOICE_FILE_PATH, classifications)
print(f"Document type: {result.name}, Confidence: {result.confidence}")
```

## Type Mapping with Contract

Adding contract structure to the classification improves accuracy:

```python
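from typing import List
from extract_thinker.models.contract import Contract

# LineItem is not defined in the original snippet; the fields below are
# illustrative assumptions so that the example is complete.
class LineItem(Contract):
    description: str
    quantity: int
    unit_price: float

class InvoiceContract(Contract):
    invoice_number: str
    invoice_date: str
    lines: List[LineItem]
    total_amount: float

class DriverLicense(Contract):
    name: str
    age: int
    license_number: str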
```

The contract structure is automatically added to the prompt, helping the model understand the expected document structure.
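
As a rough illustration of what adding a contract's structure to the prompt could look like, the sketch below renders a class's annotated fields as text. It uses a plain Python class as a stand-in for a `Contract` subclass, and `describe_contract` is a hypothetical helper, not part of ExtractThinker; the library's actual prompt format may differ.

```python
from typing import get_type_hints


class DriverLicenseSketch:
    """Plain-class stand-in for the DriverLicense contract (assumption)."""
    name: str
    age: int
    license_number: str


def describe_contract(contract: type) -> str:
    """Render field names and types as text lines (hypothetical helper)."""
    hints = get_type_hints(contract)
    lines = [
        f"- {field}: {getattr(tp, '__name__', str(tp))}"
        for field, tp in hints.items()
    ]
    return f"{contract.__name__} fields:\n" + "\n".join(lines)


prompt_fragment = describe_contract(DriverLicenseSketch)
print(prompt_fragment)
```

A fragment like this gives the model concrete field names to look for, which is one plausible reason the contract helps separate similar document types.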

## Classification Response

All classifications return a standardized response:

```python
from typing import Optional
from pydantic import BaseModel, Field

class ClassificationResponse(BaseModel):
    name: str
    confidence: Optional[int] = Field(
        description="From 1 to 10. 10 being the highest confidence",
        ge=1,
        le=10
    )
```

## Best Practices

- Provide clear, distinctive descriptions for each classification
- Use contract structures when possible
- Consider using image classification for visual documents
- Monitor confidence scores
- Handle low-confidence cases appropriately
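
One way to handle low-confidence cases is to route them to manual review. The sketch below is a minimal illustration under assumptions: `ResultSketch` is a stand-in that mirrors the two fields of `ClassificationResponse`, and `route` with its default cutoff of 8 is a hypothetical policy, not something ExtractThinker provides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ResultSketch:
    """Stand-in with the same two fields as ClassificationResponse."""
    name: str
    confidence: Optional[int] = None


def route(result: ResultSketch, min_confidence: int = 8) -> str:
    """Return "accept" or "manual_review" (hypothetical policy)."""
    # A missing score is treated the same way as a low one.
    if result.confidence is None or result.confidence < min_confidence:
        return "manual_review"
    return "accept"
```

The cutoff is a tuning knob: a higher value sends more documents to review in exchange for fewer misclassifications passing through.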

For more advanced classification techniques, see:
- [Mixture of Models (MoM)](mom.md)
- [Tree-Based Classification](tree.md)
- [Vision Classification](vision.md)
79 changes: 79 additions & 0 deletions docs/core-concepts/classification/index.md
@@ -0,0 +1,79 @@
# Document Classification

In document intelligence, classification is often the crucial first step: it sets the stage for subsequent processes such as data extraction and analysis. Before the rise of LLMs, classification was handled (and often still is) by AI models trained in-house for specific use cases. Services such as Azure Document Intelligence offer this feature, but they are not dynamic and can lead to vendor lock-in.

LLMs may not be the most efficient tool for this task, but they are vendor-agnostic and handle it with near-perfect accuracy.

<div align="center">
<img src="../../../assets/classification_overview.png" alt="Classification Overview">
</div>

## Classification Techniques

<div class="grid cards">
<ul>
<li>
<p><span class="twemoji"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"><path d="M19 3H5c-1.1 0-2 .9-2 2v14c0 1.1.9 2 2 2h14c1.1 0 2-.9 2-2V5c0-1.1-.9-2-2-2m0 16H5V5h14v14m-2-2H7v-2h10v2m-10-4h10v2H7v-2m10-6v2H7V7h10Z"></path></svg></span> <strong>Basic Classification</strong></p>
<p>Simple yet powerful classification using a single LLM with contract mapping.</p>
<p><a href="basic"><span class="twemoji"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 16 16"><path d="M8.22 2.97a.75.75 0 0 1 1.06 0l4.25 4.25a.75.75 0 0 1 0 1.06l-4.25 4.25a.75.75 0 0 1-1.042-.018.75.75 0 0 1-.018-1.042l2.97-2.97H3.75a.75.75 0 0 1 0-1.5h7.44L8.22 4.03a.75.75 0 0 1 0-1.06"></path></svg></span> Learn More</a></p>
</li>
<li>
<p><span class="twemoji"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"><path d="M16 17v2H2v-2s0-4 7-4 7 4 7 4m-7-6a4 4 0 0 0 4-4 4 4 0 0 0-4-4 4 4 0 0 0-4 4 4 4 0 0 0 4 4m8.8 4c1.2.7 2.2 1.7 2.2 3v2h3v-2s0-2.9-5.2-3M15 4a4 4 0 0 0 1.8 3.3A4 4 0 0 1 19 11c1.9 0 3-1.3 3-3a4 4 0 0 0-4-4h-3Z"></path></svg></span> <strong>Mixture of Models (MoM)</strong></p>
<p>Enhance accuracy by combining multiple models with different strategies.</p>
<p><a href="mom"><span class="twemoji"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 16 16"><path d="M8.22 2.97a.75.75 0 0 1 1.06 0l4.25 4.25a.75.75 0 0 1 0 1.06l-4.25 4.25a.75.75 0 0 1-1.042-.018.75.75 0 0 1-.018-1.042l2.97-2.97H3.75a.75.75 0 0 1 0-1.5h7.44L8.22 4.03a.75.75 0 0 1 0-1.06"></path></svg></span> Learn More</a></p>
</li>
<li>
<p><span class="twemoji"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"><path d="M3 3h18v2H3V3m0 16h18v2H3v-2m0-8h18v2H3v-2m0 4h8v2H3v-2m0-8h8v2H3V7m8 4h10v2H11v-2m0 8h10v2H11v-2m0-8h10v2H11V7"></path></svg></span> <strong>Tree-Based Classification</strong></p>
<p>Handle complex hierarchies and similar document types efficiently.</p>
<p><a href="tree"><span class="twemoji"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 16 16"><path d="M8.22 2.97a.75.75 0 0 1 1.06 0l4.25 4.25a.75.75 0 0 1 0 1.06l-4.25 4.25a.75.75 0 0 1-1.042-.018.75.75 0 0 1-.018-1.042l2.97-2.97H3.75a.75.75 0 0 1 0-1.5h7.44L8.22 4.03a.75.75 0 0 1 0-1.06"></path></svg></span> Learn More</a></p>
</li>
<li>
<p><span class="twemoji"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"><path d="M12 9a3 3 0 0 0-3 3 3 3 0 0 0 3 3 3 3 0 0 0 3-3 3 3 0 0 0-3-3m0 8a5 5 0 0 1-5-5 5 5 0 0 1 5-5 5 5 0 0 1 5 5 5 5 0 0 1-5 5m0-12.5C7 4.5 2.73 7.61 1 12c1.73 4.39 6 7.5 11 7.5s9.27-3.11 11-7.5c-1.73-4.39-6-7.5-11-7.5Z"></path></svg></span> <strong>Vision Classification</strong></p>
<p>Leverage visual features for better accuracy.</p>
<p><a href="vision"><span class="twemoji"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 16 16"><path d="M8.22 2.97a.75.75 0 0 1 1.06 0l4.25 4.25a.75.75 0 0 1 0 1.06l-4.25 4.25a.75.75 0 0 1-1.042-.018.75.75 0 0 1-.018-1.042l2.97-2.97H3.75a.75.75 0 0 1 0-1.5h7.44L8.22 4.03a.75.75 0 0 1 0-1.06"></path></svg></span> Learn More</a></p>
</li>
</ul>
</div>

## Classification Response

All classification methods return a standardized response:

```python
from typing import Optional
from pydantic import BaseModel, Field

class ClassificationResponse(BaseModel):
    name: str
    confidence: Optional[int] = Field(
        description="From 1 to 10. 10 being the highest confidence",
        ge=1,
        le=10
    )
```

## Available Strategies

ExtractThinker supports three main classification strategies:

- **CONSENSUS**: All models must agree on the classification
- **HIGHER_ORDER**: Uses the result with the highest confidence
- **CONSENSUS_WITH_THRESHOLD**: Requires consensus and minimum confidence
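
The three strategies can be sketched as plain functions over a list of per-model votes. This is an illustration of the descriptions above, not ExtractThinker's implementation: the `Vote` tuple and the function names are hypothetical, and requiring every vote to meet the threshold is an assumption about how "minimum confidence" is applied.

```python
from typing import List, Optional, Tuple

Vote = Tuple[str, int]  # (classification name, confidence from 1 to 10)


def consensus(votes: List[Vote]) -> Optional[str]:
    """Return the shared label if every model chose the same one."""
    names = {name for name, _ in votes}
    return names.pop() if len(names) == 1 else None


def higher_order(votes: List[Vote]) -> Optional[str]:
    """Return the label that carries the highest confidence score."""
    return max(votes, key=lambda vote: vote[1])[0] if votes else None


def consensus_with_threshold(votes: List[Vote], threshold: int) -> Optional[str]:
    """Require agreement and every confidence at or above the threshold."""
    label = consensus(votes)
    if label is not None and all(conf >= threshold for _, conf in votes):
        return label
    return None
```

Returning `None` when a strategy cannot decide leaves it to the caller to escalate to another model or to a manual check.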

## Common Challenges

1. **Large Context Windows**: More classifications mean larger contexts
2. **Similar Documents**: Distinguishing between similar document types
3. **Confidence Levels**: Ensuring high confidence in classifications
4. **Scalability**: Handling a growing number of document types

## Best Practices

- Start with basic classification for simple cases
- Use MoM for critical classifications
- Implement tree-based approach for similar documents
- Consider vision classification for complex layouts
- Set appropriate confidence thresholds
- Monitor and log classification results
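
Monitoring can start with something as small as logging each result and counting the labels that fall below a chosen threshold. The sketch below uses only the standard library; the logger name, the `(document, label, confidence)` tuple shape, and the default threshold of 8 are assumptions made for illustration.

```python
import logging
from collections import Counter
from typing import Iterable, Tuple

logger = logging.getLogger("classification_audit")  # hypothetical logger name


def log_results(
    results: Iterable[Tuple[str, str, int]], threshold: int = 8
) -> Counter:
    """Log each (document, label, confidence); count labels below threshold."""
    low_confidence: Counter = Counter()
    for document, label, confidence in results:
        logger.info("doc=%s label=%s confidence=%d", document, label, confidence)
        if confidence < threshold:
            low_confidence[label] += 1
    return low_confidence
```

Labels that show up often in the returned counter are candidates for a clearer description, a contract, or a different technique.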

For detailed implementation of each technique, visit their respective pages.
97 changes: 97 additions & 0 deletions docs/core-concepts/classification/mom.md
@@ -0,0 +1,97 @@
# Mixture of Models (MoM)

The Mixture of Models (MoM) is a pattern that increases classification confidence by combining multiple models in parallel. This approach is particularly effective when using instructor for structured outputs.

## Basic Usage

```python
import os

from extract_thinker import Classification, ClassificationStrategy, Extractor, Process
from extract_thinker.document_loader import DocumentLoaderTesseract

# Define classifications
classifications = [
    Classification(
        name="Driver License",
        description="This is a driver license",
    ),
    Classification(
        name="Invoice",
        description="This is an invoice",
    ),
]

# Initialize document loader
tesseract_path = os.getenv("TESSERACT_PATH")
document_loader = DocumentLoaderTesseract(tesseract_path)

# Initialize multiple extractors with different models
gpt_35_extractor = Extractor(document_loader)
gpt_35_extractor.load_llm("gpt-3.5-turbo")

claude_extractor = Extractor(document_loader)
claude_extractor.load_llm("claude-3-haiku-20240307")

gpt4_extractor = Extractor(document_loader)
gpt4_extractor.load_llm("gpt-4o")

# Create process with multiple extractors
process = Process()
process.add_classify_extractor([
    [gpt_35_extractor, claude_extractor],  # First layer
    [gpt4_extractor],  # Second layer
])

# Classify with consensus strategy
result = process.classify(
    "document.pdf",
    classifications,
    strategy=ClassificationStrategy.CONSENSUS_WITH_THRESHOLD,
    threshold=9
)
```

## Available Strategies

#### CONSENSUS
All models must agree on the classification:

```python
result = process.classify(
    "document.pdf",
    classifications,
    strategy=ClassificationStrategy.CONSENSUS
)
```

#### HIGHER_ORDER
Uses the result with the highest confidence score:

```python
result = process.classify(
    "document.pdf",
    classifications,
    strategy=ClassificationStrategy.HIGHER_ORDER
)
```

#### CONSENSUS_WITH_THRESHOLD
Requires both consensus and minimum confidence:

```python
result = process.classify(
    "document.pdf",
    classifications,
    strategy=ClassificationStrategy.CONSENSUS_WITH_THRESHOLD,
    threshold=9
)
```

## Best Practices

- Use smaller models in the first layer for cost efficiency
- Reserve larger models for cases where consensus isn't reached
- Set appropriate confidence thresholds based on your use case
- Consider using different model providers for better diversity
- Monitor and log classification results for each model
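
The first two practices, cheaper models first and larger models only when needed, amount to layered escalation. The sketch below shows that idea with plain callables standing in for extractors. `classify_in_layers` and the stub classifiers are hypothetical, and whether `Process` escalates between layers in exactly this way is an assumption.

```python
from typing import Callable, List, Optional, Tuple

Vote = Tuple[str, int]              # (classification name, confidence)
Classifier = Callable[[str], Vote]  # stand-in for one extractor's classify call


def classify_in_layers(
    document: str, layers: List[List[Classifier]], threshold: int
) -> Optional[str]:
    """Try each layer in order; stop at the first one that reaches consensus."""
    for layer in layers:
        votes = [classifier(document) for classifier in layer]
        names = {name for name, _ in votes}
        confident = all(conf >= threshold for _, conf in votes)
        if votes and len(names) == 1 and confident:
            return names.pop()
    return None  # no layer agreed with enough confidence


# Stub classifiers with fixed answers, standing in for real model calls.
def cheap_a(document: str) -> Vote:
    return ("Invoice", 9)


def cheap_b(document: str) -> Vote:
    return ("Driver License", 6)


def strong(document: str) -> Vote:
    return ("Invoice", 10)


# The first layer disagrees, so the second layer decides.
label = classify_in_layers("document.pdf", [[cheap_a, cheap_b], [strong]], threshold=9)
```

With this ordering the larger model is only called for documents on which the cheaper models disagree or are unsure.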

For more examples and advanced usage, check out the [examples directory](examples/) in the repository.