Embucket
diff --git a/benchmark/README.md b/benchmark/README.md
Lines changed: 86 additions & 31 deletions
diff --git a/benchmark/benchmark.py b/benchmark/benchmark.py
Lines changed: 4 additions & 1 deletion
diff --git a/benchmark/clickbench/__init__.py b/benchmark/clickbench/__init__.py
Lines changed: 45 additions & 0 deletions
diff --git a/benchmark/clickbench/clickbench_ddl.py b/benchmark/clickbench/clickbench_ddl.py
Lines changed: 134 additions & 0 deletions
@@ -1,6 +1,6 @@
 ## Overview
 
-This benchmark tool executes queries derived from TPC-H against both Snowflake and Embucket with cache-clearing operations to ensure clean, cache-free performance measurements. For Snowflake, it uses warehouse suspend/resume operations. For Embucket, it restarts the Docker container before each query to eliminate internal caching. It provides detailed timing metrics including compilation time, execution time, and total elapsed time.
+This benchmark tool runs queries from multiple benchmark suites (queries derived from TPC-H, ClickBench, and, once implemented, queries derived from TPC-DS) against both Snowflake and Embucket. When `--no-cache` is passed, it clears caches before each query to give cache-free measurements: for Snowflake it suspends and resumes the warehouse, and for Embucket it restarts the Docker container. It reports detailed timing metrics, including compilation time, execution time, and total elapsed time.
 
 ## TPC Legal Considerations
 
@@ -14,9 +14,12 @@ Throughout this document and when talking about these benchmarks, you will see t
 
 ## Features
 
+- **Multiple Benchmark Types**: Supports the TPC-H and ClickBench suites; `tpcds` is accepted as a benchmark type, but `benchmark.py` currently raises `NotImplementedError` for it
 - **Cache Isolation**:
   - **Snowflake**: Suspends and resumes warehouse before each query
   - **Embucket**: Restarts Docker container before each query to clear internal cache
+- **Flexible Caching Options**: Runs with caching enabled by default; pass `--no-cache` to force cache clearing before each query
+- **Command Line Interface**: Full CLI support for system selection, benchmark type, and run configuration
 - **Result Cache Disabled**: Ensures no result caching affects benchmark results
 - **Comprehensive Metrics**: Tracks compilation time, execution time, and row counts
 - **CSV Export**: Saves results to CSV files for further analysis
@@ -51,37 +54,79 @@ SNOWFLAKE_WAREHOUSE=your_warehouse
 
 **For Embucket (when using infrastructure):**
 ```bash
-EMBUCKET_SQL_HOST=your_ec2_instance_ip
-EMBUCKET_SQL_PORT=3000
-EMBUCKET_SQL_PROTOCOL=http
+EMBUCKET_HOST=your_ec2_instance_ip
+EMBUCKET_PORT=3000
+EMBUCKET_PROTOCOL=http
 EMBUCKET_USER=embucket
 EMBUCKET_PASSWORD=embucket
 EMBUCKET_ACCOUNT=embucket
-EMBUCKET_DATABASE=embucket
-EMBUCKET_SCHEMA=public
+EMBUCKET_DATABASE=benchmark_database
+EMBUCKET_SCHEMA=benchmark_schema
 EMBUCKET_INSTANCE=your_instance_name
-EMBUCKET_DATASET=your_dataset_name
 SSH_KEY_PATH=~/.ssh/id_rsa
 ```
 
+**Benchmark Configuration:**
+```bash
+BENCHMARK_TYPE=tpch  # Options: tpch, clickbench, tpcds
+DATASET_S3_BUCKET=embucket-testdata
+DATASET_PATH=tpch/01  # Path within S3 bucket
+SNOWFLAKE_WAREHOUSE_SIZE=XSMALL
+AWS_ACCESS_KEY_ID=your_aws_access_key_id
+AWS_SECRET_ACCESS_KEY=your_aws_secret_access_key
+```
+
 ## Usage
 
-Run the benchmark:
+### Command Line Interface
+
+The benchmark script accepts command-line options for system selection, benchmark type, and run configuration:
+
 ```bash
+# Run both Snowflake and Embucket with TPC-H (default)
 python benchmark.py
+
+# Run only Embucket with TPC-H
+python benchmark.py --system embucket
+
+# Run only Snowflake with TPC-H
+python benchmark.py --system snowflake
+
+# Run ClickBench on both systems
+python benchmark.py --benchmark-type clickbench
+
+# Run TPC-DS on Embucket only (accepted by the CLI, but currently raises NotImplementedError)
+python benchmark.py --system embucket --benchmark-type tpcds
+
+# Run with caching enabled (the default: no container restarts/warehouse suspends)
+python benchmark.py --system embucket
+
+# Run with caching disabled (force cache clearing)
+python benchmark.py --system embucket --no-cache
+
+# Custom number of runs and dataset path
+python benchmark.py --runs 5 --dataset-path tpch/100
 ```
 
-**Current Behavior**: By default, the benchmark runs **only Embucket** benchmarks for 3 iterations. To run both Snowflake and Embucket with comparisons, you need to modify the `__main__` section in `benchmark.py` to call `run_benchmark(i + 1)` instead of `run_embucket_benchmark(i + 1)`.
+### Command Line Arguments
+
+- `--system`: Choose platform (`snowflake`, `embucket`, `both`) - default: `both`
+- `--runs`: Number of benchmark runs - default: `3`
+- `--benchmark-type`: Benchmark suite (`tpch`, `clickbench`, `tpcds`) - default: the `BENCHMARK_TYPE` environment variable if set, otherwise `tpch`
+- `--dataset-path`: Override the `DATASET_PATH` environment variable
+- `--no-cache`: Force cache clearing (warehouse suspend and `USE_CACHED_RESULT=False` for Snowflake, container restart for Embucket) - default: off
+
+### Benchmark Process
 
 The benchmark will:
-1. Connect to the configured platform (Embucket by default, or both if modified)
-2. Execute each query derived from TPC-H with cache-clearing operations:
-   - **Snowflake**: Warehouse suspend/resume before each query
-   - **Embucket**: Docker container restart before each query
+1. Connect to the configured platform(s)
+2. Execute each query from the selected benchmark suite, clearing caches first when `--no-cache` is set:
+   - **Snowflake**: Warehouse suspend/resume before each query
+   - **Embucket**: Docker container restart before each query
 3. Collect performance metrics from query history
 4. Display results and comparisons (if both platforms are run)
 5. Save detailed results to CSV files
-6. Calculate averages after 3 runs are completed
+6. Calculate averages after all runs are completed
 
 ## Embucket Container Restart Functionality
 
@@ -95,8 +140,8 @@ For Embucket benchmarks, the system automatically restarts the Docker container
 - Creates a fresh database connection and executes the query
 
 **Requirements:**
-- `EMBUCKET_SQL_HOST` set to your EC2 instance IP
-- `EMBUCKET_INSTANCE` and `EMBUCKET_DATASET` for result organization
+- `EMBUCKET_HOST` set to your EC2 instance IP
+- `EMBUCKET_INSTANCE` for result organization
 - `SSH_KEY_PATH` pointing to your private key (default: `~/.ssh/id_rsa`)
 - SSH access to the EC2 instance running Embucket
 
@@ -115,37 +160,47 @@ The benchmark provides:
 - **Total Times**: Aggregated compilation and execution times
 
 **File Organization:**
-- Snowflake results: `snowflake_tpch_results/{schema}/{warehouse}/`
-- Embucket results: `embucket_tpch_results/{dataset}/{instance}/`
+- Snowflake results: `snowflake_{benchmark_type}_results/{schema}/{warehouse}/`
+- Embucket results: `embucket_{benchmark_type}_results/{dataset}/{instance}/`
+
+Where `{benchmark_type}` is one of: `tpch`, `clickbench`, or `tpcds`
 
 ## Files
 
 - `benchmark.py` - Main benchmark script with restart functionality
 - `docker_manager.py` - Docker container management for Embucket restarts
 - `utils.py` - Connection utilities for Snowflake and Embucket
-- `tpch_queries.py` - Query definitions derived from TPC-H
-- `tpcds_queries.py` - Query definitions derived from TPC-DS (for future use)
+- `tpch/` - TPC-H benchmark utilities package (queries, DDL, table names)
+- `clickbench/` - ClickBench benchmark utilities package (queries, DDL, table names)
+- `tpcds/` - TPC-DS benchmark utilities package (queries, DDL, table names)
 - `calculate_average.py` - Result averaging and analysis
 - `config.py` - Configuration utilities
 - `data_preparation.py` - Data preparation utilities
 - `requirements.txt` - Python dependencies
 - `env_example` - Example environment configuration file
 - `infrastructure/` - Terraform infrastructure for EC2/Embucket deployment
 - `tpch-datagen/` - TPC-H data generation infrastructure
-- `tpch/` - TPC-H benchmark utilities package (queries, DDL, table names)
-- `tpcds_ddl/` - TPC-DS table definitions for Embucket
 
-## Customizing Benchmark Behavior
+## Benchmark Types
 
-**Default**: The benchmark runs only Embucket tests for 3 iterations.
+### TPC-H (Default)
+Derived from the TPC-H decision support benchmark. Includes 22 complex analytical queries testing various aspects of data warehousing performance.
 
-**To run both Snowflake and Embucket with comparisons**: Modify the `__main__` section in `benchmark.py`:
-```python
-if __name__ == "__main__":
-    for i in range(3):
-        print(f"Run {i + 1} of 3")
-        run_benchmark(i + 1)  # Change from run_embucket_benchmark(i + 1)
-```
+### ClickBench
+Single-table analytical benchmark focusing on aggregation performance. Uses the `hits` table with web analytics data.
+
+### TPC-DS
+Derived from the TPC-DS decision support benchmark. More complex than TPC-H, with 99 queries testing advanced analytical scenarios. Not yet implemented: selecting `tpcds` currently raises `NotImplementedError` in `benchmark.py`.
+
+## Environment Variables
+
+The benchmark behavior can be controlled through environment variables in your `.env` file:
+
+- `BENCHMARK_TYPE`: Default benchmark type (`tpch`, `clickbench`, `tpcds`)
+- `DATASET_PATH`: Path within S3 bucket for dataset location
+- `DATASET_S3_BUCKET`: S3 bucket containing benchmark datasets
+- `EMBUCKET_HOST`: EC2 instance IP for Embucket connection
+- `SSH_KEY_PATH`: Path to SSH private key for container restarts
 
 ## Requirements
 
 
@@ -7,6 +7,7 @@
 from utils import create_snowflake_connection
 from utils import create_embucket_connection
 from tpch import parametrize_tpch_queries
+from clickbench import parametrize_clickbench_queries
 from docker_manager import create_docker_manager
 from constants import SystemType
 
@@ -286,6 +287,8 @@ def get_queries_for_benchmark(benchmark_type: str, for_embucket: bool) -> List[T
     """Get appropriate queries based on the benchmark type."""
     if benchmark_type == "tpch":
         return parametrize_tpch_queries(fully_qualified_names_for_embucket=for_embucket)
+    elif benchmark_type == "clickbench":
+        return parametrize_clickbench_queries(fully_qualified_names_for_embucket=for_embucket)
     elif benchmark_type == "tpcds":
         raise NotImplementedError("TPC-DS benchmarks not yet implemented")
     else:
@@ -433,7 +436,7 @@ def parse_args():
     parser = argparse.ArgumentParser(description="Run benchmarks on Snowflake and/or Embucket")
     parser.add_argument("--system", choices=["snowflake", "embucket", "both"], default="both")
     parser.add_argument("--runs", type=int, default=3)
-    parser.add_argument("--benchmark-type", choices=["tpch", "tpcds"], default=os.environ.get("BENCHMARK_TYPE", "tpch"))
+    parser.add_argument("--benchmark-type", choices=["tpch", "clickbench", "tpcds"], default=os.environ.get("BENCHMARK_TYPE", "tpch"))
     parser.add_argument("--dataset-path", help="Override the DATASET_PATH environment variable")
     parser.add_argument("--no-cache", action="store_true", help="Disable caching (force warehouse suspend and USE_CACHED_RESULT=False for Snowflake, force container restart for Embucket)")
     return parser.parse_args()
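
Note on the hunk above: the default for `--benchmark-type` is read from the `BENCHMARK_TYPE` environment variable when the parser is built, so the README's "default: `tpch`" only holds when that variable is unset. The following is a minimal, self-contained sketch of that precedence. It copies the argument definitions from the hunk, but `build_parser` and its `env` parameter are hypothetical names added here so the behaviour can be shown without touching the real environment.

```python
import argparse
import os


def build_parser(env=None):
    """Build a parser with the same options as parse_args() in the hunk above.

    `env` is a hypothetical injection point (defaults to os.environ) so the
    example is deterministic; the real script reads os.environ directly.
    """
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(description="Run benchmarks on Snowflake and/or Embucket")
    parser.add_argument("--system", choices=["snowflake", "embucket", "both"], default="both")
    parser.add_argument("--runs", type=int, default=3)
    parser.add_argument(
        "--benchmark-type",
        choices=["tpch", "clickbench", "tpcds"],
        default=env.get("BENCHMARK_TYPE", "tpch"),
    )
    parser.add_argument("--dataset-path")
    parser.add_argument("--no-cache", action="store_true")
    return parser


# No flags and no environment override: the built-in defaults apply.
args = build_parser(env={}).parse_args([])
print(args.system, args.runs, args.benchmark_type, args.no_cache)  # both 3 tpch False

# BENCHMARK_TYPE changes the default; an explicit flag still takes precedence.
env = {"BENCHMARK_TYPE": "clickbench"}
print(build_parser(env).parse_args([]).benchmark_type)  # clickbench
print(build_parser(env).parse_args(["--benchmark-type", "tpch"]).benchmark_type)  # tpch
```

Because `--no-cache` is a `store_true` flag, omitting it leaves caching enabled, which matches the "caching enabled" example in the README hunk.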
 
@@ -0,0 +1,45 @@
+"""
+ClickBench benchmark utilities package.
+
+This package contains all ClickBench related functionality including:
+- Table name configuration and parametrization
+- Query definitions with parametrized table names
+- DDL statements with parametrized table names
+
+Main exports:
+- parametrize_clickbench_queries: Parametrize ClickBench queries (requires explicit parameter)
+- parametrize_clickbench_ddl: Parametrize ClickBench DDL statements (requires explicit parameter)
+- CLICKBENCH_TABLE_NAMES: Raw table name mappings
+- get_table_names: Get parametrized table names (requires explicit parameter)
+- parametrize_clickbench_statements: Generic parametrization function (requires explicit parameter)
+
+Note: All functions require explicit fully_qualified_names_for_embucket parameter.
+No pre-computed constants are provided to enforce explicit parameter usage.
+"""
+
+from .clickbench_table_names import (
+    CLICKBENCH_TABLE_NAMES,
+    get_table_names,
+    parametrize_clickbench_statements
+)
+
+from .clickbench_queries import (
+    parametrize_clickbench_queries,
+)
+
+from .clickbench_ddl import (
+    parametrize_clickbench_ddl,
+)
+
+__all__ = [
+    # Table names and core functions
+    'CLICKBENCH_TABLE_NAMES',
+    'get_table_names',
+    'parametrize_clickbench_statements',
+
+    # Query functions
+    'parametrize_clickbench_queries',
+
+    # DDL functions
+    'parametrize_clickbench_ddl',
+]
@@ -0,0 +1,134 @@
+import os
+
+from .clickbench_table_names import parametrize_clickbench_statements
+
+# ClickBench DDL statement with parametrized table name
+_CLICKBENCH_DDL_RAW = [
+    (
+        "hits",
+        """
+        -- Snowflake-like DDL for ClickBench hits table
+        CREATE OR REPLACE TABLE {HITS_TABLE} (
+          WatchID BIGINT,
+          JavaEnable SMALLINT,
+          Title VARCHAR,
+          GoodEvent SMALLINT,
+          EventTime BIGINT,
+          EventDate SMALLINT,
+          CounterID INTEGER,
+          ClientIP INTEGER,
+          RegionID INTEGER,
+          UserID BIGINT,
+          CounterClass SMALLINT,
+          OS SMALLINT,
+          UserAgent SMALLINT,
+          URL VARCHAR,
+          Referer VARCHAR,
+          IsRefresh SMALLINT,
+          RefererCategoryID SMALLINT,
+          RefererRegionID INTEGER,
+          URLCategoryID SMALLINT,
+          URLRegionID INTEGER,
+          ResolutionWidth SMALLINT,
+          ResolutionHeight SMALLINT,
+          ResolutionDepth SMALLINT,
+          FlashMajor SMALLINT,
+          FlashMinor SMALLINT,
+          FlashMinor2 VARCHAR,
+          NetMajor SMALLINT,
+          NetMinor SMALLINT,
+          UserAgentMajor SMALLINT,
+          UserAgentMinor VARCHAR,
+          CookieEnable SMALLINT,
+          JavascriptEnable SMALLINT,
+          IsMobile SMALLINT,
+          MobilePhone SMALLINT,
+          MobilePhoneModel VARCHAR,
+          Params VARCHAR,
+          IPNetworkID INTEGER,
+          TraficSourceID SMALLINT,
+          SearchEngineID SMALLINT,
+          SearchPhrase VARCHAR,
+          AdvEngineID SMALLINT,
+          IsArtifical SMALLINT,
+          WindowClientWidth SMALLINT,
+          WindowClientHeight SMALLINT,
+          ClientTimeZone SMALLINT,
+          ClientEventTime BIGINT,
+          SilverlightVersion1 SMALLINT,
+          SilverlightVersion2 SMALLINT,
+          SilverlightVersion3 INTEGER,
+          SilverlightVersion4 SMALLINT,
+          PageCharset VARCHAR,
+          CodeVersion INTEGER,
+          IsLink SMALLINT,
+          IsDownload SMALLINT,
+          IsNotBounce SMALLINT,
+          FUniqID BIGINT,
+          OriginalURL VARCHAR,
+          HID INTEGER,
+          IsOldCounter SMALLINT,
+          IsEvent SMALLINT,
+          IsParameter SMALLINT,
+          DontCountHits SMALLINT,
+          WithHash SMALLINT,
+          HitColor VARCHAR,
+          LocalEventTime BIGINT,
+          Age SMALLINT,
+          Sex SMALLINT,
+          Income SMALLINT,
+          Interests SMALLINT,
+          Robotness SMALLINT,
+          RemoteIP INTEGER,
+          WindowName INTEGER,
+          OpenerName INTEGER,
+          HistoryLength SMALLINT,
+          BrowserLanguage VARCHAR,
+          BrowserCountry VARCHAR,
+          SocialNetwork VARCHAR,
+          SocialAction VARCHAR,
+          HTTPError SMALLINT,
+          SendTiming INTEGER,
+          DNSTiming INTEGER,
+          ConnectTiming INTEGER,
+          ResponseStartTiming INTEGER,
+          ResponseEndTiming INTEGER,
+          FetchTiming INTEGER,
+          SocialSourceNetworkID SMALLINT,
+          SocialSourcePage VARCHAR,
+          ParamPrice BIGINT,
+          ParamOrderID VARCHAR,
+          ParamCurrency VARCHAR,
+          ParamCurrencyID SMALLINT,
+          OpenstatServiceName VARCHAR,
+          OpenstatCampaignID VARCHAR,
+          OpenstatAdID VARCHAR,
+          OpenstatSourceID VARCHAR,
+          UTMSource VARCHAR,
+          UTMMedium VARCHAR,
+          UTMCampaign VARCHAR,
+          UTMContent VARCHAR,
+          UTMTerm VARCHAR,
+          FromTag VARCHAR,
+          HasGCLID SMALLINT,
+          RefererHash BIGINT,
+          URLHash BIGINT,
+          CLID INTEGER
+        );
+        """
+    ),
+]
+
+
+def parametrize_clickbench_ddl(fully_qualified_names_for_embucket):
+    """
+    Replace table name placeholders in ClickBench DDL statements with actual table names.
+
+    Args:
+        fully_qualified_names_for_embucket (bool): Required. If True, use EMBUCKET_DATABASE.EMBUCKET_SCHEMA.tablename format.
+                                                   If False, use just the default table names.
+
+    Returns:
+        list: A list of (table_name, parametrized_ddl) tuples.
+    """
+    return parametrize_clickbench_statements(_CLICKBENCH_DDL_RAW, fully_qualified_names_for_embucket)
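
Note: `clickbench_table_names.py` is not part of the hunks shown here, so the behaviour of `parametrize_clickbench_statements` can only be inferred from the `{HITS_TABLE}` placeholder and the docstring above. The sketch below is one plausible way such a helper could work, not the actual implementation. `TABLE_NAMES`, `qualify_table_names`, and `parametrize_statements` are hypothetical names, and the database and schema are passed in as plain arguments here rather than read from `EMBUCKET_DATABASE` and `EMBUCKET_SCHEMA`.

```python
# Hypothetical placeholder-to-table mapping; the real CLICKBENCH_TABLE_NAMES
# is not shown in this diff.
TABLE_NAMES = {"HITS_TABLE": "hits"}


def qualify_table_names(fully_qualified, database, schema):
    """Return the placeholder mapping, optionally prefixed with database.schema."""
    if not fully_qualified:
        return dict(TABLE_NAMES)
    return {key: f"{database}.{schema}.{name}" for key, name in TABLE_NAMES.items()}


def parametrize_statements(raw_statements, fully_qualified,
                           database="benchmark_database", schema="benchmark_schema"):
    """Fill the {PLACEHOLDER} fields in each (label, sql) pair via str.format."""
    names = qualify_table_names(fully_qualified, database, schema)
    return [(label, sql.format(**names)) for label, sql in raw_statements]


# A shortened stand-in for the DDL entry above.
raw = [("hits", "CREATE OR REPLACE TABLE {HITS_TABLE} (WatchID BIGINT);")]
print(parametrize_statements(raw, fully_qualified=False))
print(parametrize_statements(raw, fully_qualified=True))
```

One caveat if the real helper also relies on `str.format`: any literal `{` or `}` in a statement would need to be doubled. The `hits` DDL above contains none, so it would pass through unchanged apart from the table name.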