Skip to content

Fake Data Generator for Email Client Database #173

@brainless

Description

@brainless

Fake Data Generator for Email Client Database

Overview

Create a fake data generator to populate a SQLite3 database with realistic email data simulating 4-5 email accounts with thousands of emails and related data points.


Database Location

The SQLite database will be stored in the user's OS configuration directory:

  • Linux/macOS: ~/.config/dwata/db.sqlite3
  • Windows: %APPDATA%\dwata\db.sqlite3

Example path: /home/username/.config/dwata/db.sqlite3

Directory Creation

  • The generator must create the dwata directory if it doesn't exist
  • Use Rust's std::fs::create_dir_all() or dirs crate to get platform-specific config directory
  • Ensure proper error handling if directory creation fails (permissions, disk space, etc.)

Requirements

Data Volume

  • Accounts: 4-5 email accounts (e.g., user1@gmail.com, user2@yahoo.com, etc.)
  • Emails: ~2,000 emails total across all accounts
  • Folders: Standard folders (Inbox, Sent, Drafts, Trash, Archive) + custom folders per account
  • Labels: ~20-30 labels across accounts
  • Contacts: ~200-500 contacts with realistic data
  • Attachments: Metadata only (filename, size, mime-type, hash) - no actual file content

Database Schema

Tables Structure

accounts

CREATE TABLE accounts (
    id INTEGER PRIMARY KEY,
    email TEXT NOT NULL UNIQUE,
    display_name TEXT,
    provider TEXT,
    created_at INTEGER
);
  • provider: gmail, yahoo, outlook, etc.
  • created_at: unix timestamp

folders

CREATE TABLE folders (
    id INTEGER PRIMARY KEY,
    account_id INTEGER NOT NULL,
    name TEXT NOT NULL,
    folder_type TEXT,
    parent_folder_id INTEGER,
    FOREIGN KEY (account_id) REFERENCES accounts(id),
    FOREIGN KEY (parent_folder_id) REFERENCES folders(id),
    UNIQUE(account_id, name)
);
  • folder_type: inbox, sent, drafts, trash, archive, custom
  • parent_folder_id: nullable, for nested folders

labels

CREATE TABLE labels (
    id INTEGER PRIMARY KEY,
    account_id INTEGER NOT NULL,
    name TEXT NOT NULL,
    color TEXT,
    FOREIGN KEY (account_id) REFERENCES accounts(id),
    UNIQUE(account_id, name)
);
  • color: hex color code (e.g., #FF5733)

contacts

CREATE TABLE contacts (
    id INTEGER PRIMARY KEY,
    email TEXT NOT NULL,
    display_name TEXT,
    first_name TEXT,
    last_name TEXT,
    account_id INTEGER,
    created_at INTEGER,
    updated_at INTEGER,
    FOREIGN KEY (account_id) REFERENCES accounts(id),
    UNIQUE(account_id, email)
);
  • account_id: which account owns this contact

emails

CREATE TABLE emails (
    id INTEGER PRIMARY KEY,
    account_id INTEGER NOT NULL,
    folder_id INTEGER NOT NULL,
    message_id TEXT UNIQUE,
    subject TEXT NOT NULL,
    from_email TEXT NOT NULL,
    from_name TEXT,
    to_emails TEXT,
    cc_emails TEXT,
    bcc_emails TEXT,
    body_plain TEXT,
    body_html TEXT,
    is_read INTEGER,
    is_starred INTEGER,
    is_draft INTEGER,
    received_at INTEGER,
    sent_at INTEGER,
    created_at INTEGER,
    FOREIGN KEY (account_id) REFERENCES accounts(id),
    FOREIGN KEY (folder_id) REFERENCES folders(id)
);

CREATE INDEX idx_emails_account_folder_received ON emails(account_id, folder_id, received_at);
CREATE INDEX idx_emails_subject ON emails(subject);
  • message_id: RFC 5322 Message-ID format
  • to_emails, cc_emails, bcc_emails: JSON arrays
  • is_read, is_starred, is_draft: boolean (0/1)
  • received_at, sent_at, created_at: unix timestamps

email_labels

CREATE TABLE email_labels (
    email_id INTEGER NOT NULL,
    label_id INTEGER NOT NULL,
    FOREIGN KEY (email_id) REFERENCES emails(id),
    FOREIGN KEY (label_id) REFERENCES labels(id),
    PRIMARY KEY (email_id, label_id)
);
  • Junction table for many-to-many relationship

attachments

CREATE TABLE attachments (
    id INTEGER PRIMARY KEY,
    email_id INTEGER NOT NULL,
    filename TEXT NOT NULL,
    mime_type TEXT,
    size_bytes INTEGER,
    content_hash TEXT,
    created_at INTEGER,
    FOREIGN KEY (email_id) REFERENCES emails(id)
);

CREATE INDEX idx_attachments_email ON attachments(email_id);
  • content_hash: SHA256 hash (simulated)
  • mime_type: e.g., application/pdf, image/jpeg

file_index

CREATE TABLE file_index (
    id INTEGER PRIMARY KEY,
    attachment_id INTEGER UNIQUE,
    indexed_at INTEGER,
    search_tokens TEXT,
    FOREIGN KEY (attachment_id) REFERENCES attachments(id)
);
  • search_tokens: tokenized filename for search functionality
  • Separate indexing for all attachments

Data Generation Strategy

Realistic Data Patterns

  • Temporal Distribution: Emails spread over last 6-12 months

    • More recent emails (last 30 days: ~30% of emails)
    • Gradual decrease in older periods
  • Email Distribution per Account:

    • Accounts should have varying activity levels
    • Some accounts busier than others (realistic usage)
  • Folder Distribution:

    • Inbox: 40-50% of emails
    • Sent: 20-30%
    • Other folders: remaining
  • Conversation Threads:

    • Group related emails using subject patterns (Re: Fwd:)
    • Simulate email chains with reply timestamps
  • Attachments:

    • ~30% of emails have attachments
    • 1-3 attachments per email (when present)
    • Realistic file types: .pdf, .docx, .xlsx, .jpg, .png, .zip
    • File sizes: 10KB - 10MB range
  • Labels:

    • 2-5 labels per email (for ~40% of emails)
    • Common labels: Work, Personal, Important, Follow-up, etc.

Data Sources

  • Use Rust faker libraries (e.g., fake crate) for:

    • Names (first, last, full)
    • Email addresses
    • Email subjects (sentences, phrases)
    • Email bodies (paragraphs)
    • Companies/organizations
    • File names
  • Generate realistic patterns:

    • Business hours for sent emails (9 AM - 6 PM weighted)
    • Weekend vs weekday patterns
    • Common email subject patterns

Implementation Requirements

Code Structure

src/
├── db/
│   ├── mod.rs
│   ├── schema.rs      # SQL schema definitions and table creation
│   ├── models.rs      # Rust structs for each entity
│   └── fake_data.rs   # Data generation logic
└── main.rs            # CLI to run generator

Database Path Helper

use std::path::PathBuf;

fn get_db_path() -> Result<PathBuf, Box<dyn std::error::Error>> {
    // Use dirs crate or similar for cross-platform config dir
    let config_dir = dirs::config_dir()
        .ok_or("Could not determine config directory")?;
    
    let dwata_dir = config_dir.join("dwata");
    
    // Create directory if it doesn't exist
    std::fs::create_dir_all(&dwata_dir)?;
    
    Ok(dwata_dir.join("db.sqlite3"))
}

Rust Structs (models.rs)

pub struct Account {
    pub id: Option<i64>,
    pub email: String,
    pub display_name: String,
    pub provider: String,
    pub created_at: i64,
}

pub struct Email {
    pub id: Option<i64>,
    pub account_id: i64,
    pub folder_id: i64,
    pub message_id: String,
    pub subject: String,
    pub from_email: String,
    pub from_name: Option<String>,
    pub to_emails: Vec<String>,
    pub cc_emails: Vec<String>,
    pub bcc_emails: Vec<String>,
    pub body_plain: String,
    pub body_html: Option<String>,
    pub is_read: bool,
    pub is_starred: bool,
    pub is_draft: bool,
    pub received_at: i64,
    pub sent_at: Option<i64>,
    pub created_at: i64,
}

// Similar structs for Folder, Label, Contact, Attachment, etc.

Generator Features

  • CLI interface to run data generation

    cargo run --bin fake-data-generator -- --accounts 5 --emails 2000
  • Configurable parameters:

    • Number of accounts
    • Number of emails per account
    • Date range for emails
    • Custom database path (optional, defaults to config dir)
  • Idempotent: Can safely re-run (drops and recreates tables)

  • Progress reporting:

    • Show progress during generation
    • Display database path being used
    • Report statistics when complete (total emails, attachments, etc.)

Database Setup

  • Use rusqlite crate for SQLite operations
  • Create all tables with proper schema
  • Enable foreign key constraints
  • Create indexes after data insertion (for performance)

Deliverables

  • Complete SQL schema implementation
  • Rust data models (structs) for all entities
  • Platform-specific config directory path handling
  • Directory creation logic with error handling
  • Fake data generator with realistic patterns
  • CLI tool to run generator with configurable options
  • Generated SQLite database file at ~/.config/dwata/db.sqlite3
  • README documentation:
    • How to run the generator
    • Where database is stored
    • Database schema documentation
    • Sample queries to verify data

Testing & Verification

Sample Queries

-- Verify account count
SELECT COUNT(*) FROM accounts;

-- Verify email distribution by account
SELECT a.email, COUNT(e.id) as email_count 
FROM accounts a 
LEFT JOIN emails e ON a.id = e.account_id 
GROUP BY a.id;

-- Verify attachments
SELECT COUNT(*) FROM attachments;

-- Check email date range
SELECT 
    datetime(MIN(received_at), 'unixepoch') as earliest,
    datetime(MAX(received_at), 'unixepoch') as latest 
FROM emails;

-- Verify labels
SELECT l.name, COUNT(el.email_id) as email_count 
FROM labels l 
LEFT JOIN email_labels el ON l.id = el.label_id 
GROUP BY l.id 
ORDER BY email_count DESC 
LIMIT 10;

Success Criteria

  • Config directory created at correct platform-specific location
  • Database file created at ~/.config/dwata/db.sqlite3 (Linux/macOS)
  • Database created with all tables and constraints
  • 4-5 accounts with realistic email addresses
  • ~2,000 emails with proper temporal distribution
  • 200-500 contacts
  • ~600 attachments (30% of emails)
  • All foreign key relationships valid
  • Indexes created for performance
  • Generator completes in <30 seconds
  • Database file size reasonable (<200MB)
  • Proper error handling for directory/file creation failures

Technical Stack

  • Language: Rust (latest stable)
  • Database: SQLite3
  • Crates:
    • rusqlite - SQLite interface
    • fake or faker_rand - Fake data generation
    • chrono - Date/time handling
    • serde_json - JSON serialization for email arrays
    • rand - Random number generation
    • clap - CLI argument parsing
    • dirs - Platform-specific directory paths

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions