DocsIngest

Turn any document directory into a prompt-friendly text ingest for LLMs, with a focus on compliance and comprehensive context generation.

🚀 Features

Multi-Format Document Support
- Ingests PDF, DOCX, Markdown, TXT files
- Automatic encoding detection
- Intelligent file type handling
- NEW: Extended support for .xlsx, .xls, .pptx, .json, .csv, .xml
Compliance-Focused Ingestion
- Pre-configured Compliance Officer prompt
- Customizable AI agent context
- Designed with compliance in mind
Smart File Processing
- Skips system and configuration files
- Handles temporary and hidden files
- Supports complex directory structures
Metadata and Reporting
- Generates comprehensive directory structure tree
- Counts total files and tokens
- Provides summary statistics
Semantic Compression (NEW)
- Intelligently reduce document size while maintaining core meaning
- Configurable compression levels
- Preserves full original content
- Optional compressed view for AI processing
Flexible Usage
- Command-line interface
- Importable as a Python package
- Configurable output options

📦 Installation

# Clone the repository
git clone https://github.com/marc-shade/docsingest.git
cd docsingest

# Recommended: Create and activate a virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows, use `venv\Scripts\activate`

# Install dependencies
pip install -r requirements.txt

# Install the package in editable mode
pip install -e .

Requirements

Python Version: 3.7 - 3.12 recommended
Dependencies: All dependencies will be automatically installed via pip
System Requirements:
- Basic Python development tools
- pip package manager
- Internet connection for initial setup

🚀 Usage

Basic Document Ingestion

# Basic usage
docsingest /path/to/documents

# Output to a specific file
docsingest /path/to/documents -o my_report.md

# Verbose mode for detailed logging
docsingest /path/to/documents -v

Advanced Features

Complete Options List

usage: docsingest [-h] [-o OUTPUT] [--agent AGENT] [-p PROMPT] [--no-pii-analysis] [-v] [--compress] [--compression-level COMPRESSION_LEVEL] directory

Ingest documents from a directory for AI context.

positional arguments:
  directory             Path to the directory containing documents

options:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        Output markdown file path (default: document_context.md)
  --agent AGENT         Initial AI agent prompt (default: Comprehensive Compliance Prompt)
  -p PROMPT, --prompt PROMPT
                        Alternate initial AI agent prompt
  --no-pii-analysis     Disable PII analysis
  -v, --verbose         Enable verbose output
  --compress            Compress document content
  --compression-level COMPRESSION_LEVEL
                        Compression level (0-1)

Content Compression

# Enable content compression
docsingest /path/to/documents --compress

# Specify compression level (0.0 to 1.0)
docsingest /path/to/documents --compress --compression-level 0.7
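
This README does not describe how compression works internally. To give a feel for what a level between 0.0 and 1.0 could mean, here is a minimal sketch of one simple extractive approach: score sentences by word frequency and drop the lowest-scoring ones. The function name `compress_text` and the interpretation of `level` as "fraction of sentences to drop" are assumptions made for illustration, not DocsIngest's actual algorithm.

```python
import re
from collections import Counter


def compress_text(text: str, level: float = 0.5) -> str:
    """Hypothetical extractive compression: drop the lowest-scoring sentences.

    ASSUMPTION: `level` is the fraction of sentences to drop (0.0 keeps all).
    Illustration only; not DocsIngest's implementation.
    """
    if not 0.0 <= level <= 1.0:
        raise ValueError("level must be between 0.0 and 1.0")

    # Naive sentence split on terminal punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return ""

    # Document-wide word frequencies drive the sentence scores.
    freq = Counter(re.findall(r"[a-z0-9']+", text.lower()))

    def score(sentence: str) -> float:
        tokens = re.findall(r"[a-z0-9']+", sentence.lower())
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    # Always keep at least one sentence.
    keep_count = max(1, round(len(sentences) * (1.0 - level)))
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:keep_count])  # restore original sentence order
    return " ".join(sentences[i] for i in keep)
```

With a level of 0.0 the text comes back unchanged (apart from whitespace normalization between sentences); higher levels keep progressively fewer sentences.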

Ignore Files and Directories

Create a .docsingest_ignore file in your document directory to exclude specific files and directories:

# Example .docsingest_ignore
*.log       # Ignore all log files
.git/       # Ignore git directories
node_modules/  # Ignore dependency directories

Ignore Pattern Features

Support for regex-based file and directory exclusion
Flexible pattern matching
Supports comments with #
Ignore system, hidden, and temporary files
Prevent processing of unnecessary directories
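
As a rough illustration of how an ignore file like the example above could be applied, the sketch below strips `#` comments and matches the remaining patterns with Python's `fnmatch` module. It assumes gitignore-style glob semantics, as the example patterns suggest; the feature list also mentions regex-based exclusion, so the real matching rules may differ. The helper names `parse_ignore_lines` and `is_ignored` are hypothetical.

```python
import fnmatch
from pathlib import PurePosixPath


def parse_ignore_lines(lines):
    """Strip comments (everything after '#') and blank lines."""
    patterns = []
    for line in lines:
        pattern = line.split("#", 1)[0].strip()
        if pattern:
            patterns.append(pattern)
    return patterns


def is_ignored(relative_path, patterns):
    """Return True if a POSIX-style relative path matches any pattern.

    ASSUMED semantics: a pattern ending in '/' matches a directory name
    anywhere in the path; any other pattern is a glob on the file name.
    """
    parts = PurePosixPath(relative_path).parts
    if not parts:
        return False
    for pattern in patterns:
        if pattern.endswith("/"):
            directory = pattern.rstrip("/")
            # Check directory components only, i.e. everything but the file name.
            if any(fnmatch.fnmatchcase(part, directory) for part in parts[:-1]):
                return True
        elif fnmatch.fnmatchcase(parts[-1], pattern):
            return True
    return False


patterns = parse_ignore_lines([
    "# Example .docsingest_ignore",
    "*.log       # Ignore all log files",
    ".git/       # Ignore git directories",
    "node_modules/  # Ignore dependency directories",
])
print(is_ignored("logs/app.log", patterns))     # log file
print(is_ignored("docs/policy.pdf", patterns))  # regular document
```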

Compliance and PII Analysis

# Disable PII analysis
docsingest /path/to/documents --no-pii-analysis

# Custom analysis prompt
docsingest /path/to/documents -p "Analyze these documents for project research"

🐛 Python Package Usage

from docsingest import ingest

# Basic usage
summary, tree, content = ingest("/path/to/documents")

# Custom agent prompt
summary, tree, content = ingest(
    "/path/to/documents", 
    agent_prompt="Specialized Compliance Analyst"
)
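
If you want to save the result yourself rather than use the CLI's `-o` option, something like the following sketch could work. It assumes that `summary`, `tree`, and `content` are plain strings, which this README does not state explicitly; `write_report` is a hypothetical helper, not part of the package.

```python
from pathlib import Path


def write_report(summary: str, tree: str, content: str, output_path: str) -> Path:
    """Combine the three values (ASSUMED to be strings) into one text file."""
    sections = [
        ("Summary", summary),
        ("Directory Structure", tree),
        ("Content", content),
    ]
    body = "\n\n".join(f"# {title}\n\n{text.strip()}" for title, text in sections)
    path = Path(output_path)
    path.write_text(body + "\n", encoding="utf-8")
    return path
```

A call such as `write_report(summary, tree, content, "document_context.md")` would then produce a single file with one section per returned value.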

🛠️ Supported File Types

PDF
Microsoft Word (.docx)
Microsoft Excel (.xlsx, .xls)
Microsoft PowerPoint (.pptx)
Markdown (.md)
Plain Text (.txt)
CSV
XML
JSON

🚫 Automatically Skipped Files

.DS_Store
Temporary Office files (~$)
Temporary files (.tmp)
Log files
Git-related files and directories
IDE configuration directories
Python cache and virtual environment files
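
The skip rules above can be pictured as a simple predicate over a file's relative path. The sketch below is one possible way to express them; the specific directory names, file names, and suffixes are assumptions chosen to match the categories in the list, and `should_skip` is a hypothetical name rather than the package's own function.

```python
from pathlib import PurePosixPath

# ASSUMED concrete names for the categories listed above.
SKIPPED_DIRECTORIES = {".git", ".idea", ".vscode", "__pycache__", "venv", ".venv"}
SKIPPED_FILE_NAMES = {".DS_Store", ".gitignore", ".gitattributes"}
SKIPPED_SUFFIXES = {".tmp", ".log"}


def should_skip(relative_path: str) -> bool:
    """Return True if a POSIX-style relative path falls under a skip rule."""
    parts = PurePosixPath(relative_path).parts
    if not parts:
        return True
    name = parts[-1]
    # Office writes lock files such as "~$report.docx" while a document is open.
    if name in SKIPPED_FILE_NAMES or name.startswith("~$"):
        return True
    if PurePosixPath(name).suffix.lower() in SKIPPED_SUFFIXES:
        return True
    # Skip anything that lives inside an excluded directory.
    return any(part in SKIPPED_DIRECTORIES for part in parts[:-1])
```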

🔍 Regulatory Compliance Framework

DocsIngest provides a robust, multi-layered approach to regulatory compliance and document risk management:

🛡️ Comprehensive Compliance Features

Regulatory Compliance Overview

Multi-Jurisdiction Support: Designed to handle compliance requirements across various regulatory landscapes
Adaptive Compliance Scanning: Intelligent detection of sensitive information and potential regulatory risks
Configurable Compliance Profiles: Customizable settings for different industry standards and regulations

Risk Assessment Workflow

Document Ingestion Analysis
- Automatic classification of document types
- Identification of sensitive and regulated content
- Contextual risk scoring
Compliance Risk Evaluation
- Detect potential regulatory violations
- Flag documents with high-risk content
- Generate detailed compliance reports
Proactive Monitoring
- Continuous document scanning
- Real-time alerts for compliance breaches
- Audit trail generation

🔒 Supported Compliance Domains

GDPR (General Data Protection Regulation)
HIPAA (Health Insurance Portability and Accountability Act)
CCPA (California Consumer Privacy Act)
SOX (Sarbanes-Oxley Act)
PCI DSS (Payment Card Industry Data Security Standard)
NIST Framework
ISO 27001 Information Security Management

🚨 Key Compliance Capabilities

Advanced PII Detection
- Identify sensitive personal information
- Support for multiple PII categories:
  - Names
  - Email addresses
  - Phone numbers
  - Social Security Numbers
  - Credit card numbers
Intelligent Redaction
- Automatic masking of sensitive information
- Configurable redaction levels
Comprehensive Compliance Reporting
- Detailed risk assessment
- Actionable compliance recommendations
Multi-Regulation Support
- Compliance checks for GDPR, FERPA, COPPA
- Proactive regulatory alignment
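
To make the detection and redaction steps more concrete, here is a minimal regex-based sketch covering email addresses, US-style phone numbers, Social Security Numbers, and credit card numbers (checked with the Luhn algorithm). It is an assumption-laden illustration, not DocsIngest's detector: the patterns, the `find_pii` and `redact` names, and the replacement format are all hypothetical, and names are left out because they generally need more than regular expressions.

```python
import re

# Illustrative patterns only; real-world PII detection needs broader coverage.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b\d(?:[ -]?\d){12,18}\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}


def luhn_valid(number: str) -> bool:
    """Luhn checksum, used to filter out digit runs that are not card numbers."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:  # double every second digit from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return bool(digits) and total % 10 == 0


def find_pii(text: str):
    """Return non-overlapping (category, start, end) tuples in text order."""
    candidates = []
    for category, pattern in PII_PATTERNS.items():
        for match in pattern.finditer(text):
            if category == "credit_card" and not luhn_valid(match.group()):
                continue
            candidates.append((category, match.start(), match.end()))
    candidates.sort(key=lambda f: (f[1], -f[2]))
    findings, last_end = [], 0
    for finding in candidates:
        if finding[1] >= last_end:  # drop matches overlapping an earlier one
            findings.append(finding)
            last_end = finding[2]
    return findings


def redact(text: str) -> str:
    """Mask each finding; replace from the end so earlier offsets stay valid."""
    for category, start, end in reversed(find_pii(text)):
        text = text[:start] + f"[{category.upper()} REDACTED]" + text[end:]
    return text
```

Validating card candidates with the Luhn checksum is a common way to reduce false positives from arbitrary long digit sequences.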

🔍 Compliance Verification Process

Document Ingestion
Automated PII Scanning
Risk Assessment and Scoring
Compliance Reporting
Optional Redaction

Note: While DocsIngest provides powerful compliance tools, it is not a substitute for professional legal or compliance advice. Always consult with compliance experts for your specific regulatory requirements.

📦 Version Information

Current Version: 1.1.1
Last Updated: 2025-01-06
Maintained by: Marc Shade (marc@2acrestudios.com)

🚀 Roadmap

Support for additional file formats (e.g., .rtf, .odt)
Cloud storage integration
Advanced AI prompt customization

🔧 Development

# Clone the repository
git clone https://github.com/marc-shade/docsingest.git
cd docsingest

# Create virtual environment
python3 -m venv .venv
source .venv/bin/activate

# Install dependencies
pip install -r requirements.txt
pip install -r requirements-dev.txt

# Run tests
pytest tests/

🤝 Contributing

Please read CONTRIBUTING.md for details on our code of conduct and the process for submitting pull requests.

📄 License

MIT License
