diff --git a/docs/core-concepts/document-loaders/aws-textract.md b/docs/core-concepts/document-loaders/aws-textract.md
new file mode 100644
index 0000000..266fb78
--- /dev/null
+++ b/docs/core-concepts/document-loaders/aws-textract.md
@@ -0,0 +1,73 @@
+# AWS Textract Document Loader
+
+> AWS Textract provides advanced OCR and document analysis capabilities, extracting text, forms, and tables from documents.
+
+## Prerequisites
+
+You need AWS credentials with access to the Textract service. You will need:
+- `AWS_ACCESS_KEY_ID`
+- `AWS_SECRET_ACCESS_KEY`
+- `AWS_DEFAULT_REGION`
+
+```python
+%pip install --upgrade --quiet extract_thinker boto3
+```
+
+## Basic Usage
+
+Here's a simple example of using the AWS Textract loader:
+
+```python
+import os
+
+from extract_thinker import DocumentLoaderTextract
+
+# Initialize the loader
+loader = DocumentLoaderTextract(
+    aws_access_key_id=os.getenv('AWS_ACCESS_KEY_ID'),
+    aws_secret_access_key=os.getenv('AWS_SECRET_ACCESS_KEY'),
+    region_name=os.getenv('AWS_DEFAULT_REGION')
+)
+
+# Load document content
+result = loader.load_content_from_file("document.pdf")
+```
+
+## Response Structure
+
+The loader returns a dictionary with the following structure:
+
+```python
+{
+    "pages": [
+        {
+            "paragraphs": ["text content..."],
+            "lines": ["line1", "line2"],
+            "words": ["word1", "word2"]
+        }
+    ],
+    "tables": [
+        [["cell1", "cell2"], ["cell3", "cell4"]]
+    ],
+    "forms": [
+        {"key": "value"}
+    ],
+    "layout": {
+        # Document layout information
+    }
+}
+```
+
+## Best Practices
+
+1. **Document Preparation**
+    - Use high-quality scans
+    - Supported formats: PDF, JPEG, PNG
+    - Consider file size limits
+
+2. **Performance**
+    - Cache results when possible
+    - Process pages individually for large documents
+    - Monitor API quotas and costs
+
+For more examples and implementation details, check out the [examples directory](examples/) in the repository.
\ No newline at end of file
diff --git a/docs/core-concepts/document-loaders/google-document-ai.md b/docs/core-concepts/document-loaders/google-document-ai.md
new file mode 100644
index 0000000..1c2e5ef
--- /dev/null
+++ b/docs/core-concepts/document-loaders/google-document-ai.md
@@ -0,0 +1,84 @@
+# Google Document AI Document Loader
+
+> Google Document AI transforms unstructured document data into structured, actionable insights using machine learning.
+
+## Prerequisites
+
+You need Google Cloud credentials and a Document AI processor. You will need:
+- `DOCUMENTAI_GOOGLE_CREDENTIALS`
+- `DOCUMENTAI_LOCATION`
+- `DOCUMENTAI_PROCESSOR_NAME`
+
+```python
+%pip install --upgrade --quiet extract_thinker google-cloud-documentai
+```
+
+## Basic Usage
+
+Here's a simple example of using the Google Document AI loader:
+
+```python
+import os
+
+from extract_thinker import DocumentLoaderDocumentAI
+
+# Initialize the loader
+loader = DocumentLoaderDocumentAI(
+    credentials=os.getenv("DOCUMENTAI_GOOGLE_CREDENTIALS"),
+    location=os.getenv("DOCUMENTAI_LOCATION"),
+    processor_name=os.getenv("DOCUMENTAI_PROCESSOR_NAME")
+)
+
+# Load CV/Resume content
+content = loader.load_content_from_file("CV_Candidate.pdf")
+```
+
+## Response Structure
+
+The loader returns a dictionary containing:
+```python
+{
+    "pages": [
+        {
+            "content": "Full text content of the page",
+            "paragraphs": ["Paragraph 1", "Paragraph 2"],
+            "tables": [
+                [
+                    ["Header 1", "Header 2"],
+                    ["Value 1", "Value 2"]
+                ]
+            ]
+        }
+    ]
+}
+```
+
+## Processing Different Document Types
+
+```python
+# Process forms with tables
+content = loader.load_content_from_file("form_with_tables.pdf")
+
+# Process from stream
+with open("document.pdf", "rb") as f:
+    content = loader.load_content_from_stream(
+        stream=f,
+        mime_type="application/pdf"
+    )
+```
+
+## Best Practices
+
+1. **Document Types**
+    - Use the appropriate processor for the document type
+    - Ensure the correct MIME type for streams
+    - Validate the content structure
+
+2. **Performance**
+    - Process in batches when possible
+    - Cache results for repeated access
+    - Monitor API quotas
+
+Document AI supports PDF, TIFF, GIF, JPEG, and PNG, with a maximum file size of 20MB or 2,000 pages.
+
+For more examples and implementation details, check out the [examples directory](examples/) in the repository.
\ No newline at end of file
diff --git a/docs/core-concepts/document-loaders/pdf-plumber.md b/docs/core-concepts/document-loaders/pdf-plumber.md
new file mode 100644
index 0000000..8faf73b
--- /dev/null
+++ b/docs/core-concepts/document-loaders/pdf-plumber.md
@@ -0,0 +1,43 @@
+# PDF Plumber Document Loader
+
+PDF Plumber is a Python library for extracting text and tables from PDFs. ExtractThinker's PDF Plumber loader provides a simple interface for working with this library.
+
+## Basic Usage
+
+Here's how to use the PDF Plumber loader:
+
+```python
+from extract_thinker import Extractor
+from extract_thinker.document_loader import DocumentLoaderPdfPlumber
+
+# Initialize the loader
+loader = DocumentLoaderPdfPlumber()
+
+# Load document content
+result = loader.load_content_from_file("document.pdf")
+
+# Access extracted content
+text = result["text"]  # List of text content by page
+tables = result["tables"]  # List of tables found in the document
+```
+
+## Features
+
+- Text extraction with positioning
+- Table detection and extraction
+- Image location detection
+- Character-level text properties
+
+## Best Practices
+
+1. **Document Preparation**
+    - Ensure PDFs are not scanned images
+    - Use well-structured PDFs
+    - Check for text encoding issues
+
+2. **Performance**
+    - Process pages individually for large documents
+    - Cache results for repeated access
+    - Consider memory usage for large files
+
+For more examples and implementation details, check out the [examples directory](examples/) in the repository.
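The "process pages individually" advice above can be sketched with a small batching helper. This is a minimal sketch under the assumption that the loader's `result["text"]` is a per-page list, as described in Basic Usage; the `iter_page_batches` name is illustrative, not part of ExtractThinker.

```python
from itertools import islice


def iter_page_batches(pages, batch_size=10):
    """Yield pages in fixed-size batches.

    Works on any iterable, so a large document's pages can be
    handled a batch at a time instead of all at once.
    """
    it = iter(pages)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


# Usage, assuming `result` came from loader.load_content_from_file:
#   for batch in iter_page_batches(result["text"], batch_size=25):
#       process(batch)
```

Because the helper consumes a plain iterator, it also pairs naturally with a generator that lazily extracts one page at a time, keeping peak memory bounded by the batch size.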
\ No newline at end of file
diff --git a/docs/core-concepts/document-loaders/pypdf.md b/docs/core-concepts/document-loaders/pypdf.md
new file mode 100644
index 0000000..22fb9d2
--- /dev/null
+++ b/docs/core-concepts/document-loaders/pypdf.md
@@ -0,0 +1,42 @@
+# PyPDF Document Loader
+
+PyPDF is a pure-Python library for reading and writing PDFs. ExtractThinker's PyPDF loader provides a simple interface for text extraction.
+
+## Basic Usage
+
+Here's how to use the PyPDF loader:
+
+```python
+from extract_thinker import Extractor
+from extract_thinker.document_loader import DocumentLoaderPyPdf
+
+# Initialize the loader
+loader = DocumentLoaderPyPdf()
+
+# Load document content
+content = loader.load_content_from_file("document.pdf")
+
+# Access text content
+text = content["text"]  # List of text content by page
+```
+
+## Features
+
+- Basic text extraction
+- Page-by-page processing
+- Metadata extraction
+- Low memory footprint
+
+## Best Practices
+
+1. **Document Handling**
+    - Use for text-based PDFs
+    - Consider alternatives for scanned documents
+    - Check PDF version compatibility
+
+2. **Performance**
+    - Process large documents in chunks
+    - Cache results when appropriate
+    - Monitor memory usage
+
+For more examples and implementation details, check out the [examples directory](examples/) in the repository.
\ No newline at end of file
diff --git a/docs/core-concepts/document-loaders/spreadsheet.md b/docs/core-concepts/document-loaders/spreadsheet.md
new file mode 100644
index 0000000..f62a2c8
--- /dev/null
+++ b/docs/core-concepts/document-loaders/spreadsheet.md
@@ -0,0 +1,42 @@
+# Spreadsheet Document Loader
+
+The Spreadsheet loader in ExtractThinker handles Excel, CSV, and other tabular data formats.
+
+## Basic Usage
+
+Here's how to use the Spreadsheet loader:
+
+```python
+from extract_thinker import Extractor
+from extract_thinker.document_loader import DocumentLoaderSpreadsheet
+
+# Initialize the loader
+loader = DocumentLoaderSpreadsheet()
+
+# Load Excel file
+excel_content = loader.load_content_from_file("data.xlsx")
+
+# Load CSV file
+csv_content = loader.load_content_from_file("data.csv")
+```
+
+## Features
+
+- Excel file support (.xlsx, .xls)
+- CSV file support
+- Multiple sheet handling
+- Data type preservation
+
+## Best Practices
+
+1. **Data Preparation**
+    - Use consistent data formats
+    - Clean data before processing
+    - Handle missing values appropriately
+
+2. **Performance**
+    - Process large files in chunks
+    - Use appropriate data types
+    - Consider memory limitations
+
+For more examples and implementation details, check out the [examples directory](examples/) in the repository.
\ No newline at end of file
diff --git a/docs/core-concepts/document-loaders/web-loader.md b/docs/core-concepts/document-loaders/web-loader.md
new file mode 100644
index 0000000..75f4f7e
--- /dev/null
+++ b/docs/core-concepts/document-loaders/web-loader.md
@@ -0,0 +1,44 @@
+# Web Document Loader
+
+The Web loader in ExtractThinker uses BeautifulSoup to extract content from web pages and HTML documents.
+
+## Basic Usage
+
+Here's how to use the Web loader:
+
+```python
+from extract_thinker import Extractor
+from extract_thinker.document_loader import DocumentLoaderBeautifulSoup
+
+# Initialize the loader
+loader = DocumentLoaderBeautifulSoup(
+    header_handling="summarize"  # Options: summarize, extract, ignore
+)
+
+# Load content from a URL
+content = loader.load_content_from_file("https://example.com")
+
+# Access extracted content
+text = content["content"]
+```
+
+## Features
+
+- HTML content extraction
+- Header/footer handling
+- Link extraction
+- Image reference extraction
+
+## Best Practices
+
+1. **URL Handling**
+    - Validate URLs before processing
+    - Handle redirects appropriately
+    - Respect robots.txt
+
+2. **Content Processing**
+    - Clean HTML before extraction
+    - Handle different character encodings
+    - Consider rate limiting for multiple URLs
+
+For more examples and implementation details, check out the [examples directory](examples/) in the repository.
\ No newline at end of file
diff --git a/mkdocs.yml b/mkdocs.yml
index 0839bf7..41f7e51 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -38,7 +38,6 @@ nav:
           - PyPDF: core-concepts/document-loaders/pypdf.md
           - Spreadsheet: core-concepts/document-loaders/spreadsheet.md
           - Web Loader: core-concepts/document-loaders/web-loader.md
-          - Docling: core-concepts/document-loaders/docling.md
           - Adobe PDF Services: '#'
          - ABBYY FineReader: '#'
           - PaddleOCR: '#'