Fake Data Generator for Email Client Database

# Fake Data Generator for Email Client Database

## Overview
Create a fake data generator to populate a SQLite3 database with realistic email data simulating 4-5 email accounts with thousands of emails and related data points.

---

## Database Location

The SQLite database will be stored in the user's OS configuration directory:

- **Linux/macOS**: `~/.config/dwata/db.sqlite3`
- **Windows**: `%APPDATA%\dwata\db.sqlite3`

Example path: `/home/username/.config/dwata/db.sqlite3`

### Directory Creation
- The generator **must create** the `dwata` directory if it doesn't exist
- Use Rust's `std::fs::create_dir_all()` or `dirs` crate to get platform-specific config directory
- Ensure proper error handling if directory creation fails (permissions, disk space, etc.)

---

## Requirements

### Data Volume
- **Accounts**: 4-5 email accounts (e.g., `user1@gmail.com`, `user2@yahoo.com`, etc.)
- **Emails**: ~2,000 emails total across all accounts
- **Folders**: Standard folders (Inbox, Sent, Drafts, Trash, Archive) + custom folders per account
- **Labels**: ~20-30 labels across accounts
- **Contacts**: ~200-500 contacts with realistic data
- **Attachments**: Metadata only (filename, size, mime-type, hash) - no actual file content

---

## Database Schema

### Tables Structure

#### accounts
```sql
CREATE TABLE accounts (
    id INTEGER PRIMARY KEY,
    email TEXT NOT NULL UNIQUE,
    display_name TEXT,
    provider TEXT,
    created_at INTEGER
);
```
- `provider`: gmail, yahoo, outlook, etc.
- `created_at`: unix timestamp

#### folders
```sql
CREATE TABLE folders (
    id INTEGER PRIMARY KEY,
    account_id INTEGER NOT NULL,
    name TEXT NOT NULL,
    folder_type TEXT,
    parent_folder_id INTEGER,
    FOREIGN KEY (account_id) REFERENCES accounts(id),
    FOREIGN KEY (parent_folder_id) REFERENCES folders(id),
    UNIQUE(account_id, name)
);
```
- `folder_type`: inbox, sent, drafts, trash, archive, custom
- `parent_folder_id`: nullable, for nested folders

#### labels
```sql
CREATE TABLE labels (
    id INTEGER PRIMARY KEY,
    account_id INTEGER NOT NULL,
    name TEXT NOT NULL,
    color TEXT,
    FOREIGN KEY (account_id) REFERENCES accounts(id),
    UNIQUE(account_id, name)
);
```
- `color`: hex color code (e.g., #FF5733)

#### contacts
```sql
CREATE TABLE contacts (
    id INTEGER PRIMARY KEY,
    email TEXT NOT NULL,
    display_name TEXT,
    first_name TEXT,
    last_name TEXT,
    account_id INTEGER,
    created_at INTEGER,
    updated_at INTEGER,
    FOREIGN KEY (account_id) REFERENCES accounts(id),
    UNIQUE(account_id, email)
);
```
- `account_id`: which account owns this contact

#### emails
```sql
CREATE TABLE emails (
    id INTEGER PRIMARY KEY,
    account_id INTEGER NOT NULL,
    folder_id INTEGER NOT NULL,
    message_id TEXT UNIQUE,
    subject TEXT NOT NULL,
    from_email TEXT NOT NULL,
    from_name TEXT,
    to_emails TEXT,
    cc_emails TEXT,
    bcc_emails TEXT,
    body_plain TEXT,
    body_html TEXT,
    is_read INTEGER,
    is_starred INTEGER,
    is_draft INTEGER,
    received_at INTEGER,
    sent_at INTEGER,
    created_at INTEGER,
    FOREIGN KEY (account_id) REFERENCES accounts(id),
    FOREIGN KEY (folder_id) REFERENCES folders(id)
);

CREATE INDEX idx_emails_account_folder_received ON emails(account_id, folder_id, received_at);
CREATE INDEX idx_emails_subject ON emails(subject);
```
- `message_id`: RFC 5322 Message-ID format
- `to_emails`, `cc_emails`, `bcc_emails`: JSON arrays
- `is_read`, `is_starred`, `is_draft`: boolean (0/1)
- `received_at`, `sent_at`, `created_at`: unix timestamps

#### email_labels
```sql
CREATE TABLE email_labels (
    email_id INTEGER NOT NULL,
    label_id INTEGER NOT NULL,
    FOREIGN KEY (email_id) REFERENCES emails(id),
    FOREIGN KEY (label_id) REFERENCES labels(id),
    PRIMARY KEY (email_id, label_id)
);
```
- Junction table for many-to-many relationship

#### attachments
```sql
CREATE TABLE attachments (
    id INTEGER PRIMARY KEY,
    email_id INTEGER NOT NULL,
    filename TEXT NOT NULL,
    mime_type TEXT,
    size_bytes INTEGER,
    content_hash TEXT,
    created_at INTEGER,
    FOREIGN KEY (email_id) REFERENCES emails(id)
);

CREATE INDEX idx_attachments_email ON attachments(email_id);
```
- `content_hash`: SHA256 hash (simulated)
- `mime_type`: e.g., application/pdf, image/jpeg

#### file_index
```sql
CREATE TABLE file_index (
    id INTEGER PRIMARY KEY,
    attachment_id INTEGER UNIQUE,
    indexed_at INTEGER,
    search_tokens TEXT,
    FOREIGN KEY (attachment_id) REFERENCES attachments(id)
);
```
- `search_tokens`: tokenized filename for search functionality
- Separate indexing for all attachments

---

## Data Generation Strategy

### Realistic Data Patterns
- **Temporal Distribution**: Emails spread over last 6-12 months
  - More recent emails (last 30 days: ~30% of emails)
  - Gradual decrease in older periods
  
- **Email Distribution per Account**:
  - Accounts should have varying activity levels
  - Some accounts busier than others (realistic usage)
  
- **Folder Distribution**:
  - Inbox: 40-50% of emails
  - Sent: 20-30%
  - Other folders: remaining
  
- **Conversation Threads**:
  - Group related emails using subject patterns (Re: Fwd:)
  - Simulate email chains with reply timestamps
  
- **Attachments**:
  - ~30% of emails have attachments
  - 1-3 attachments per email (when present)
  - Realistic file types: .pdf, .docx, .xlsx, .jpg, .png, .zip
  - File sizes: 10KB - 10MB range
  
- **Labels**:
  - 2-5 labels per email (for ~40% of emails)
  - Common labels: Work, Personal, Important, Follow-up, etc.

### Data Sources
- Use Rust faker libraries (e.g., `fake` crate) for:
  - Names (first, last, full)
  - Email addresses
  - Email subjects (sentences, phrases)
  - Email bodies (paragraphs)
  - Companies/organizations
  - File names
  
- Generate realistic patterns:
  - Business hours for sent emails (9 AM - 6 PM weighted)
  - Weekend vs weekday patterns
  - Common email subject patterns

---

## Implementation Requirements

### Code Structure
```
src/
├── db/
│   ├── mod.rs
│   ├── schema.rs      # SQL schema definitions and table creation
│   ├── models.rs      # Rust structs for each entity
│   └── fake_data.rs   # Data generation logic
└── main.rs            # CLI to run generator
```

### Database Path Helper
```rust
use std::path::PathBuf;

fn get_db_path() -> Result<PathBuf, Box<dyn std::error::Error>> {
    // Use dirs crate or similar for cross-platform config dir
    let config_dir = dirs::config_dir()
        .ok_or("Could not determine config directory")?;
    
    let dwata_dir = config_dir.join("dwata");
    
    // Create directory if it doesn't exist
    std::fs::create_dir_all(&dwata_dir)?;
    
    Ok(dwata_dir.join("db.sqlite3"))
}
```

### Rust Structs (models.rs)
```rust
pub struct Account {
    pub id: Option<i64>,
    pub email: String,
    pub display_name: String,
    pub provider: String,
    pub created_at: i64,
}

pub struct Email {
    pub id: Option<i64>,
    pub account_id: i64,
    pub folder_id: i64,
    pub message_id: String,
    pub subject: String,
    pub from_email: String,
    pub from_name: Option<String>,
    pub to_emails: Vec<String>,
    pub cc_emails: Vec<String>,
    pub bcc_emails: Vec<String>,
    pub body_plain: String,
    pub body_html: Option<String>,
    pub is_read: bool,
    pub is_starred: bool,
    pub is_draft: bool,
    pub received_at: i64,
    pub sent_at: Option<i64>,
    pub created_at: i64,
}

// Similar structs for Folder, Label, Contact, Attachment, etc.
```

### Generator Features
- CLI interface to run data generation
  ```bash
  cargo run --bin fake-data-generator -- --accounts 5 --emails 2000
  ```
  
- Configurable parameters:
  - Number of accounts
  - Number of emails per account
  - Date range for emails
  - Custom database path (optional, defaults to config dir)
  
- Idempotent: Can safely re-run (drops and recreates tables)
  
- Progress reporting:
  - Show progress during generation
  - Display database path being used
  - Report statistics when complete (total emails, attachments, etc.)

### Database Setup
- Use `rusqlite` crate for SQLite operations
- Create all tables with proper schema
- Enable foreign key constraints
- Create indexes after data insertion (for performance)

---

## Deliverables

- [ ] Complete SQL schema implementation
- [ ] Rust data models (structs) for all entities
- [ ] Platform-specific config directory path handling
- [ ] Directory creation logic with error handling
- [ ] Fake data generator with realistic patterns
- [ ] CLI tool to run generator with configurable options
- [ ] Generated SQLite database file at `~/.config/dwata/db.sqlite3`
- [ ] README documentation:
  - How to run the generator
  - Where database is stored
  - Database schema documentation
  - Sample queries to verify data
  
---

## Testing & Verification

### Sample Queries
```sql
-- Verify account count
SELECT COUNT(*) FROM accounts;

-- Verify email distribution by account
SELECT a.email, COUNT(e.id) as email_count 
FROM accounts a 
LEFT JOIN emails e ON a.id = e.account_id 
GROUP BY a.id;

-- Verify attachments
SELECT COUNT(*) FROM attachments;

-- Check email date range
SELECT 
    datetime(MIN(received_at), 'unixepoch') as earliest,
    datetime(MAX(received_at), 'unixepoch') as latest 
FROM emails;

-- Verify labels
SELECT l.name, COUNT(el.email_id) as email_count 
FROM labels l 
LEFT JOIN email_labels el ON l.id = el.label_id 
GROUP BY l.id 
ORDER BY email_count DESC 
LIMIT 10;
```

---

## Success Criteria
- [ ] Config directory created at correct platform-specific location
- [ ] Database file created at `~/.config/dwata/db.sqlite3` (Linux/macOS)
- [ ] Database created with all tables and constraints
- [ ] 4-5 accounts with realistic email addresses
- [ ] ~2,000 emails with proper temporal distribution
- [ ] 200-500 contacts
- [ ] ~600 attachments (30% of emails)
- [ ] All foreign key relationships valid
- [ ] Indexes created for performance
- [ ] Generator completes in <30 seconds
- [ ] Database file size reasonable (<200MB)
- [ ] Proper error handling for directory/file creation failures

---

## Technical Stack
- **Language**: Rust (latest stable)
- **Database**: SQLite3
- **Crates**:
  - `rusqlite` - SQLite interface
  - `fake` or `faker_rand` - Fake data generation
  - `chrono` - Date/time handling
  - `serde_json` - JSON serialization for email arrays
  - `rand` - Random number generation
  - `clap` - CLI argument parsing
  - `dirs` - Platform-specific directory paths

Fake Data Generator for Email Client Database #173

Description

Fake Data Generator for Email Client Database

Overview

Database Location

Directory Creation

Requirements

Data Volume

Database Schema

Tables Structure

accounts

folders

labels

contacts

emails

email_labels

attachments

file_index

Data Generation Strategy

Realistic Data Patterns

Data Sources

Implementation Requirements

Code Structure

Database Path Helper

Rust Structs (models.rs)

Generator Features

Database Setup

Deliverables

Testing & Verification

Sample Queries

Success Criteria

Technical Stack

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions