done #86

Merged 1 commit on Nov 25, 2024

71 changes: 71 additions & 0 deletions docs/core-concepts/document-loaders/aws-textract.md
@@ -0,0 +1,71 @@
# AWS Textract Document Loader

> AWS Textract provides advanced OCR and document analysis capabilities, extracting text, forms, and tables from documents.

## Prerequisites

You need AWS credentials with access to the Textract service, provided through the following environment variables:
- `AWS_ACCESS_KEY_ID`
- `AWS_SECRET_ACCESS_KEY`
- `AWS_DEFAULT_REGION`

```python
%pip install --upgrade --quiet extract_thinker boto3
```
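
Before initializing the loader, it can help to fail fast when a credential is missing. A minimal sketch using only the standard library (the helper name is ours, not part of ExtractThinker):

```python
import os

def check_aws_env(required=("AWS_ACCESS_KEY_ID",
                            "AWS_SECRET_ACCESS_KEY",
                            "AWS_DEFAULT_REGION")):
    """Return the names of any required environment variables that are unset."""
    return [name for name in required if not os.getenv(name)]

missing = check_aws_env()
if missing:
    print(f"Missing AWS credentials: {', '.join(missing)}")
```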

## Basic Usage

Here's a simple example of using the AWS Textract loader:

```python
import os

from extract_thinker import DocumentLoaderTextract

# Initialize the loader
loader = DocumentLoaderTextract(
    aws_access_key_id=os.getenv('AWS_ACCESS_KEY_ID'),
    aws_secret_access_key=os.getenv('AWS_SECRET_ACCESS_KEY'),
    region_name=os.getenv('AWS_DEFAULT_REGION')
)

# Load document content
result = loader.load_content_from_file("document.pdf")
```

## Response Structure

The loader returns a dictionary with the following structure:

```python
{
    "pages": [
        {
            "paragraphs": ["text content..."],
            "lines": ["line1", "line2"],
            "words": ["word1", "word2"]
        }
    ],
    "tables": [
        [["cell1", "cell2"], ["cell3", "cell4"]]
    ],
    "forms": [
        {"key": "value"}
    ],
    "layout": {
        # Document layout information
    }
}
```
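
The structure above is plain Python data, so post-processing is straightforward. For example, a small helper (a sketch of ours, not an ExtractThinker API, assuming exactly the layout shown) can flatten a response into a single text blob:

```python
def textract_to_text(result):
    """Flatten the Textract loader response above into one text blob."""
    parts = []
    for page in result.get("pages", []):
        parts.extend(page.get("paragraphs", []))
    for table in result.get("tables", []):
        for row in table:
            parts.append("\t".join(row))  # tab-separate table cells
    return "\n".join(parts)

sample = {
    "pages": [{"paragraphs": ["Invoice #123"], "lines": [], "words": []}],
    "tables": [[["Item", "Price"], ["Widget", "9.99"]]],
    "forms": [],
}
print(textract_to_text(sample))
```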

## Best Practices

1. **Document Preparation**
    - Use high-quality scans
    - Supported formats: PDF, JPEG, PNG
    - Consider file size limits

2. **Performance**
    - Cache results when possible
    - Process pages individually for large documents
    - Monitor API quotas and costs

For more examples and implementation details, check out the [examples directory](examples/) in the repository.
82 changes: 82 additions & 0 deletions docs/core-concepts/document-loaders/google-document-ai.md
@@ -0,0 +1,82 @@
# Google Document AI Document Loader

> Google Document AI transforms unstructured document data into structured, actionable insights using machine learning.

## Prerequisites

You need Google Cloud credentials and a Document AI processor, provided through the following environment variables:
- `DOCUMENTAI_GOOGLE_CREDENTIALS`
- `DOCUMENTAI_LOCATION`
- `DOCUMENTAI_PROCESSOR_NAME`

```python
%pip install --upgrade --quiet extract_thinker google-cloud-documentai
```

## Basic Usage

Here's a simple example of using the Google Document AI loader:

```python
import os

from extract_thinker import DocumentLoaderDocumentAI

# Initialize the loader
loader = DocumentLoaderDocumentAI(
    credentials=os.getenv("DOCUMENTAI_GOOGLE_CREDENTIALS"),
    location=os.getenv("DOCUMENTAI_LOCATION"),
    processor_name=os.getenv("DOCUMENTAI_PROCESSOR_NAME")
)

# Load CV/Resume content
content = loader.load_content_from_file("CV_Candidate.pdf")
```

## Response Structure

The loader returns a dictionary containing:
```python
{
    "pages": [
        {
            "content": "Full text content of the page",
            "paragraphs": ["Paragraph 1", "Paragraph 2"],
            "tables": [
                [
                    ["Header 1", "Header 2"],
                    ["Value 1", "Value 2"]
                ]
            ]
        }
    ]
}
```
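
Because tables come back as nested lists, they are easy to re-serialize. The helper below is a sketch of ours (not an ExtractThinker API) that assumes the layout shown above:

```python
import csv
import io

def tables_to_csv(content):
    """Serialize every table in the response above to a CSV string."""
    out = []
    for page in content.get("pages", []):
        for table in page.get("tables", []):
            buf = io.StringIO()
            csv.writer(buf, lineterminator="\n").writerows(table)
            out.append(buf.getvalue())
    return out

sample = {"pages": [{"content": "", "paragraphs": [],
                     "tables": [[["Header 1", "Header 2"],
                                 ["Value 1", "Value 2"]]]}]}
print(tables_to_csv(sample)[0])
```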

## Processing Different Document Types

```python
# Process forms with tables
content = loader.load_content_from_file("form_with_tables.pdf")

# Process from stream
with open("document.pdf", "rb") as f:
    content = loader.load_content_from_stream(
        stream=f,
        mime_type="application/pdf"
    )
```

## Best Practices

1. **Document Types**
    - Use the appropriate processor for the document type
    - Ensure the correct MIME type for streams
    - Validate content structure

2. **Performance**
    - Process in batches when possible
    - Cache results for repeated access
    - Monitor API quotas

Document AI supports PDF, TIFF, GIF, JPEG, and PNG files, with a maximum file size of 20 MB or 2,000 pages.

For more examples and implementation details, check out the [examples directory](examples/) in the repository.
43 changes: 43 additions & 0 deletions docs/core-concepts/document-loaders/pdf-plumber.md
@@ -0,0 +1,43 @@
# PDF Plumber Document Loader

PDF Plumber is a Python library for extracting text and tables from PDFs. ExtractThinker's PDF Plumber loader provides a simple interface for working with this library.

## Basic Usage

Here's how to use the PDF Plumber loader:

```python
from extract_thinker.document_loader import DocumentLoaderPdfPlumber

# Initialize the loader
loader = DocumentLoaderPdfPlumber()

# Load document content
result = loader.load_content_from_file("document.pdf")

# Access extracted content
text = result["text"] # List of text content by page
tables = result["tables"] # List of tables found in document
```

## Features

- Text extraction with positioning
- Table detection and extraction
- Image location detection
- Character-level text properties

## Best Practices

1. **Document Preparation**
    - Ensure PDFs are not scanned images
    - Use well-structured PDFs
    - Check for text encoding issues

2. **Performance**
    - Process pages individually for large documents
    - Cache results for repeated access
    - Consider memory usage for large files
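
The page-by-page advice above can be sketched with a small batching generator (ours, not part of the library), applied to a per-page list such as `result["text"]`:

```python
def iter_page_batches(pages, batch_size=10):
    """Yield fixed-size batches of pages so a large document can be
    processed (and cached) incrementally instead of all at once."""
    for start in range(0, len(pages), batch_size):
        yield pages[start:start + batch_size]

# e.g. for batch in iter_page_batches(result["text"], batch_size=10): ...
```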

For more examples and implementation details, check out the [examples directory](examples/) in the repository.
42 changes: 42 additions & 0 deletions docs/core-concepts/document-loaders/pypdf.md
@@ -0,0 +1,42 @@
# PyPDF Document Loader

PyPDF is a pure-Python library for reading and writing PDFs. ExtractThinker's PyPDF loader provides a simple interface for text extraction.

## Basic Usage

Here's how to use the PyPDF loader:

```python
from extract_thinker.document_loader import DocumentLoaderPyPdf

# Initialize the loader
loader = DocumentLoaderPyPdf()

# Load document content
content = loader.load_content_from_file("document.pdf")

# Access text content
text = content["text"] # List of text content by page
```

## Features

- Basic text extraction
- Page-by-page processing
- Metadata extraction
- Low memory footprint

## Best Practices

1. **Document Handling**
    - Use for text-based PDFs
    - Consider alternatives for scanned documents
    - Check PDF version compatibility

2. **Performance**
    - Process large documents in chunks
    - Cache results when appropriate
    - Monitor memory usage
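
One way to act on the "consider alternatives for scanned documents" advice is to count words per page in the `content["text"]` list: pages that come back empty are often scanned images. The helper is a sketch of ours, not an ExtractThinker API:

```python
def page_word_counts(pages):
    """Rough word count per page from the per-page text list above."""
    return [len(page.split()) for page in pages]

counts = page_word_counts(["Hello world", "", "One two three"])
# Pages with zero words are probably scanned images -- route those
# to an OCR-capable loader (e.g. Textract) instead.
empty_pages = [i for i, n in enumerate(counts) if n == 0]
```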

For more examples and implementation details, check out the [examples directory](examples/) in the repository.
42 changes: 42 additions & 0 deletions docs/core-concepts/document-loaders/spreadsheet.md
@@ -0,0 +1,42 @@
# Spreadsheet Document Loader

The Spreadsheet loader in ExtractThinker handles Excel, CSV, and other tabular data formats.

## Basic Usage

Here's how to use the Spreadsheet loader:

```python
from extract_thinker.document_loader import DocumentLoaderSpreadsheet

# Initialize the loader
loader = DocumentLoaderSpreadsheet()

# Load Excel file
excel_content = loader.load_content_from_file("data.xlsx")

# Load CSV file
csv_content = loader.load_content_from_file("data.csv")
```

## Features

- Excel file support (.xlsx, .xls)
- CSV file support
- Multiple sheet handling
- Data type preservation

## Best Practices

1. **Data Preparation**
    - Use consistent data formats
    - Clean data before processing
    - Handle missing values appropriately

2. **Performance**
    - Process large files in chunks
    - Use appropriate data types
    - Consider memory limitations
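
Handling missing values can be as simple as normalizing blank cells before extraction. A sketch of ours, assuming rows come back as lists of cell values:

```python
def normalize_rows(rows, placeholder=None):
    """Replace blank or whitespace-only string cells with `placeholder`
    so downstream extraction sees missing values consistently."""
    return [
        [placeholder if isinstance(c, str) and not c.strip() else c for c in row]
        for row in rows
    ]

cleaned = normalize_rows([["a", "  ", "b"], ["", "1"]])
```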

For more examples and implementation details, check out the [examples directory](examples/) in the repository.
44 changes: 44 additions & 0 deletions docs/core-concepts/document-loaders/web-loader.md
@@ -0,0 +1,44 @@
# Web Document Loader

The Web loader in ExtractThinker uses BeautifulSoup to extract content from web pages and HTML documents.

## Basic Usage

Here's how to use the Web loader:

```python
from extract_thinker.document_loader import DocumentLoaderBeautifulSoup

# Initialize the loader
loader = DocumentLoaderBeautifulSoup(
    header_handling="summarize"  # Options: summarize, extract, ignore
)

# Load content from URL
content = loader.load_content_from_file("https://example.com")

# Access extracted content
text = content["content"]
```

## Features

- HTML content extraction
- Header/footer handling
- Link extraction
- Image reference extraction

## Best Practices

1. **URL Handling**
    - Validate URLs before processing
    - Handle redirects appropriately
    - Respect robots.txt

2. **Content Processing**
    - Clean HTML before extraction
    - Handle different character encodings
    - Consider rate limiting for multiple URLs
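
Rate limiting across multiple URLs can be handled with a tiny fixed-interval limiter; this sketch is ours and not tied to any ExtractThinker API:

```python
import time

class RateLimiter:
    """Minimal fixed-interval limiter for polite crawling."""
    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval
        self._last = 0.0

    def wait(self):
        # Sleep just long enough to keep min_interval between calls.
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()

limiter = RateLimiter(min_interval=1.0)
for url in ["https://example.com/a", "https://example.com/b"]:
    limiter.wait()
    # content = loader.load_content_from_file(url)
```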

For more examples and implementation details, check out the [examples directory](examples/) in the repository.
1 change: 0 additions & 1 deletion mkdocs.yml
@@ -38,7 +38,6 @@ nav:
- PyPDF: core-concepts/document-loaders/pypdf.md
- Spreadsheet: core-concepts/document-loaders/spreadsheet.md
- Web Loader: core-concepts/document-loaders/web-loader.md
- Docling: core-concepts/document-loaders/docling.md
- Adobe PDF Services: '#'
- ABBYY FineReader: '#'
- PaddleOCR: '#'