11## Overview
22
3- This benchmark tool executes queries derived from TPC-H against both Snowflake and Embucket with cache-clearing operations to ensure clean, cache-free performance measurements. For Snowflake, it uses warehouse suspend/resume operations. For Embucket, it restarts the Docker container before each query to eliminate internal caching. It provides detailed timing metrics including compilation time, execution time, and total elapsed time.
3+ This benchmark tool executes queries from multiple benchmark suites (TPC-H, ClickBench, TPC-DS) against both Snowflake and Embucket, with optional cache-clearing operations for clean, cache-free performance measurements. For Snowflake, cache clearing uses warehouse suspend/resume operations. For Embucket, it restarts the Docker container before each query to eliminate internal caching. The tool reports detailed timing metrics, including compilation time, execution time, and total elapsed time.
44
55## TPC Legal Considerations
66
@@ -14,9 +14,12 @@ Throughout this document and when talking about these benchmarks, you will see t
1414
1515## Features
1616
17+ - **Multiple Benchmark Types**: Supports TPC-H, ClickBench, and TPC-DS benchmark suites
1718- **Cache Isolation**:
1819 - **Snowflake**: Suspends and resumes warehouse before each query
1920 - **Embucket**: Restarts Docker container before each query to clear internal cache
21+ - **Flexible Caching Options**: Can run with or without cache clearing (`--cold-runs` flag)
22+ - **Command Line Interface**: Full CLI support for system selection, benchmark type, and run configuration
2023- **Result Cache Disabled**: Ensures no result caching affects benchmark results
2124- **Comprehensive Metrics**: Tracks compilation time, execution time, and row counts
2225- **CSV Export**: Saves results to CSV files for further analysis
@@ -51,37 +54,128 @@ SNOWFLAKE_WAREHOUSE=your_warehouse
5154
5255**For Embucket (when using infrastructure):**
5356``` bash
54- EMBUCKET_SQL_HOST=your_ec2_instance_ip
55- EMBUCKET_SQL_PORT=3000
56- EMBUCKET_SQL_PROTOCOL=http
57+ EMBUCKET_HOST=your_ec2_instance_ip
58+ EMBUCKET_PORT=3000
59+ EMBUCKET_PROTOCOL=http
5760EMBUCKET_USER=embucket
5861EMBUCKET_PASSWORD=embucket
5962EMBUCKET_ACCOUNT=embucket
60- EMBUCKET_DATABASE=embucket
61- EMBUCKET_SCHEMA=public
63+ EMBUCKET_DATABASE=benchmark_database
64+ EMBUCKET_SCHEMA=benchmark_schema
6265EMBUCKET_INSTANCE=your_instance_name
63- EMBUCKET_DATASET=your_dataset_name
6466SSH_KEY_PATH=~/.ssh/id_rsa
6567```
6668
69+ **Benchmark Configuration:**
70+ ``` bash
71+ BENCHMARK_TYPE=tpch # Options: tpch, clickbench, tpcds
72+ DATASET_S3_BUCKET=embucket-testdata
73+ DATASET_PATH=tpch/01 # Path within S3 bucket
74+ SNOWFLAKE_WAREHOUSE_SIZE=XSMALL
75+ AWS_ACCESS_KEY_ID=your_aws_access_key_id
76+ AWS_SECRET_ACCESS_KEY=your_aws_secret_access_key
77+ ```
78+
6779## Usage
6880
69- Run the benchmark:
81+ ### Command Line Interface
82+
83+ The benchmark supports comprehensive command-line options:
84+
7085``` bash
86+ # Run both Snowflake and Embucket with TPC-H (default)
7187python benchmark.py
88+
89+ # Run only Embucket with TPC-H
90+ python benchmark.py --system embucket
91+
92+ # Run only Snowflake with TPC-H
93+ python benchmark.py --system snowflake
94+
95+ # Run ClickBench on both systems
96+ python benchmark.py --benchmark-type clickbench
97+
98+ # Run TPC-DS on Embucket only
99+ python benchmark.py --system embucket --benchmark-type tpcds
100+
101+ # Run with caching enabled (no container restarts/warehouse suspends)
102+ python benchmark.py --system embucket
103+
104+ # Run with caching disabled (force cache clearing)
105+ python benchmark.py --system embucket --cold-runs
106+
107+ # Custom number of runs and dataset path
108+ python benchmark.py --runs 5 --dataset-path tpch/100
72109```
73110
74- **Current Behavior**: By default, the benchmark runs **only Embucket** benchmarks for 3 iterations. To run both Snowflake and Embucket with comparisons, you need to modify the `__main__` section in `benchmark.py` to call `run_benchmark(i + 1)` instead of `run_embucket_benchmark(i + 1)`.
111+ ### Command Line Arguments
112+
113+ - `--system`: Choose platform (`snowflake`, `embucket`, `both`) - default: `both`
114+ - `--runs`: Number of benchmark runs - default: `3`
115+ - `--benchmark-type`: Benchmark suite (`tpch`, `clickbench`, `tpcds`) - default: `tpch`
116+ - `--dataset-path`: Override the `DATASET_PATH` environment variable
117+ - `--cold-runs`: Force cache clearing (warehouse suspend for Snowflake, container restart for Embucket)
118+ - `--disable-result-cache`: Disable Snowflake's result cache only (`USE_CACHED_RESULT=FALSE`); no effect on Embucket
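
The options listed above could be declared with Python's standard `argparse` module. The following is only a minimal sketch: the function name `build_parser` and the description string are illustrative, and the actual parser in `benchmark.py` may be structured differently (for example, it may take the `--benchmark-type` default from `BENCHMARK_TYPE`).

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(description="Run benchmark queries")
    parser.add_argument("--system", choices=["snowflake", "embucket", "both"],
                        default="both", help="platform(s) to benchmark")
    parser.add_argument("--runs", type=int, default=3,
                        help="number of benchmark runs")
    parser.add_argument("--benchmark-type", choices=["tpch", "clickbench", "tpcds"],
                        default="tpch", help="benchmark suite to execute")
    parser.add_argument("--dataset-path", default=None,
                        help="overrides the DATASET_PATH environment variable")
    parser.add_argument("--cold-runs", action="store_true",
                        help="clear caches before each query")
    parser.add_argument("--disable-result-cache", action="store_true",
                        help="disable Snowflake's result cache only")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch is testable.
args = build_parser().parse_args(["--system", "embucket", "--cold-runs"])
print(args.system, args.runs, args.benchmark_type, args.cold_runs)
```

Because `argparse` converts dashes to underscores, `--benchmark-type` is read back as `args.benchmark_type` and `--cold-runs` as `args.cold_runs`.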
119+
120+ ## Caching Configurations
121+
122+ ### Snowflake Caching Options
123+
124+ - **Cold run**: `--cold-runs`
125+   - Suspends warehouse between queries
126+   - Automatically disables result cache
127+   - Results stored in `cold/` folder
128+
129+ - **Warm run with result cache**: *(default, no flags)*
130+   - Keeps warehouse active between queries
131+   - Enables result cache (`USE_CACHED_RESULT=TRUE`)
132+   - Results stored in `warm/` folder
133+
134+ - **Warm run without result cache**: `--disable-result-cache`
135+   - Keeps warehouse active between queries
136+   - Disables result cache (`USE_CACHED_RESULT=FALSE`)
137+   - Results stored in `warm_no_result_cache/` folder
138+
139+ ### Embucket Caching Options
140+
141+ - **Cold run**: `--cold-runs`
142+   - Restarts container between queries
143+   - Results stored in `cold/` folder
144+
145+ - **Warm run**: *(default, no flags)*
146+   - Keeps container running between queries
147+   - Results stored in `warm/` folder
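
The mapping from flags to results folder described above can be summarized in a few lines of Python. This is a hedged sketch, not the project's code: `results_subfolder` is a hypothetical helper name, and letting `--cold-runs` take precedence when both flags are given is an assumption based on the note that cold runs disable the result cache automatically.

```python
def results_subfolder(system: str, cold_runs: bool = False,
                      disable_result_cache: bool = False) -> str:
    """Return the results subfolder name for one system and flag combination."""
    if system not in ("snowflake", "embucket"):
        raise ValueError(f"unknown system: {system!r}")
    if cold_runs:
        # Assumption: a cold run wins over --disable-result-cache.
        return "cold"
    if system == "snowflake" and disable_result_cache:
        # The result-cache flag only affects Snowflake.
        return "warm_no_result_cache"
    return "warm"


# Print the folder chosen for each documented configuration.
for system in ("snowflake", "embucket"):
    for cold, no_result_cache in ((True, False), (False, False), (False, True)):
        print(system, cold, no_result_cache, "->",
              results_subfolder(system, cold, no_result_cache))
```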
148+
149+ ### Example Usage
150+
151+ ``` bash
152+ # Default: warm run (caching enabled) for both systems
153+ python benchmark.py
154+
155+ # Cold run (cache clearing) for both systems
156+ python benchmark.py --cold-runs
157+
158+ # Warm run with result cache disabled for Snowflake
159+ python benchmark.py --system snowflake --disable-result-cache
160+
161+ # Cold run for Embucket only
162+ python benchmark.py --system embucket --cold-runs
163+
164+ # Multiple runs with warm caching for both systems
165+ python benchmark.py --runs 5
166+ ```
167+
168+ ### Benchmark Process
75169
76170The benchmark will:
77- 1. Connect to the configured platform (Embucket by default, or both if modified)
78- 2. Execute each query derived from TPC-H with cache-clearing operations:
79-    - **Snowflake**: Warehouse suspend/resume before each query
80-    - **Embucket**: Docker container restart before each query
171+ 1. Connect to the configured platform(s)
172+ 2. Execute each query from the selected benchmark suite, clearing caches first when `--cold-runs` is set:
173+    - **Snowflake**: Warehouse suspend/resume before each query (if `--cold-runs`)
174+    - **Embucket**: Docker container restart before each query (if `--cold-runs`)
81175 3. Collect performance metrics from query history
82176 4. Display results and comparisons (if both platforms are run)
83177 5. Save detailed results to CSV files
84- 6. Calculate averages after 3 runs are completed
178+ 6. Calculate averages after all runs are completed
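
The steps above can be pictured as a nested loop over runs and queries, with averaging at the end. The sketch below is an assumption-laden illustration rather than the logic in `benchmark.py`: `run_benchmark_sketch`, `execute`, and `clear_cache` are hypothetical names, and the timing source is abstracted into a callable that returns seconds instead of reading query history.

```python
from statistics import mean
from typing import Callable, Dict, List, Optional


def run_benchmark_sketch(
    queries: Dict[str, str],
    runs: int,
    execute: Callable[[str], float],
    clear_cache: Optional[Callable[[], None]] = None,
) -> Dict[str, float]:
    """Run every query `runs` times and return the mean time per query.

    `execute` returns the elapsed seconds for one query; `clear_cache`, when
    given, stands in for a warehouse suspend/resume or a container restart.
    """
    if runs < 1:
        raise ValueError("runs must be at least 1")
    timings: Dict[str, List[float]] = {name: [] for name in queries}
    for _ in range(runs):
        for name, sql in queries.items():
            if clear_cache is not None:
                clear_cache()  # cold run: reset caches before each query
            timings[name].append(execute(sql))
    # Averages are computed only after all runs have completed.
    return {name: mean(values) for name, values in timings.items()}


# Demo with a fake executor, so no database connection is needed.
demo = run_benchmark_sketch({"q1": "SELECT 1"}, runs=3, execute=lambda sql: 0.5)
print(demo)
```

Passing the cache-clearing step as an optional callable keeps warm and cold runs on the same code path, which is one plausible way to structure such a loop.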
85179
86180## Embucket Container Restart Functionality
87181
@@ -95,8 +189,8 @@ For Embucket benchmarks, the system automatically restarts the Docker container
95189- Creates a fresh database connection and executes the query
96190
97191** Requirements:**
98- - `EMBUCKET_SQL_HOST` set to your EC2 instance IP
99- - `EMBUCKET_INSTANCE` and `EMBUCKET_DATASET` for result organization
192+ - `EMBUCKET_HOST` set to your EC2 instance IP
193+ - `EMBUCKET_INSTANCE` for result organization
100194- `SSH_KEY_PATH` pointing to your private key (default: `~/.ssh/id_rsa`)
101195- SSH access to the EC2 instance running Embucket
102196
@@ -115,37 +209,47 @@ The benchmark provides:
115209- ** Total Times** : Aggregated compilation and execution times
116210
117211** File Organization:**
118- - Snowflake results: `snowflake_tpch_results/{schema}/{warehouse}/`
119- - Embucket results: `embucket_tpch_results/{dataset}/{instance}/`
212+ - Snowflake results: `snowflake_{benchmark_type}_results/{schema}/{warehouse}/`
213+ - Embucket results: `embucket_{benchmark_type}_results/{dataset}/{instance}/`
214+
215+ Where `{benchmark_type}` is one of: `tpch`, `clickbench`, or `tpcds`.
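
As an illustration of the layout above, the directory for a result set could be assembled as follows. This is a sketch under assumptions: `results_dir` and its parameter names are hypothetical, and the cold/warm subfolder is left out because the listing does not show where it sits in the path.

```python
from pathlib import PurePosixPath

BENCHMARK_TYPES = ("tpch", "clickbench", "tpcds")


def results_dir(system: str, benchmark_type: str, group: str, target: str) -> PurePosixPath:
    """Build `{system}_{benchmark_type}_results/{group}/{target}`.

    For Snowflake, `group` is the schema and `target` the warehouse;
    for Embucket, `group` is the dataset and `target` the instance.
    """
    if system not in ("snowflake", "embucket"):
        raise ValueError(f"unknown system: {system!r}")
    if benchmark_type not in BENCHMARK_TYPES:
        raise ValueError(f"unknown benchmark type: {benchmark_type!r}")
    return PurePosixPath(f"{system}_{benchmark_type}_results") / group / target


# The schema, warehouse, dataset, and instance names below are placeholders.
print(results_dir("snowflake", "tpch", "my_schema", "my_warehouse"))
print(results_dir("embucket", "clickbench", "my_dataset", "my_instance"))
```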
120216
121217## Files
122218
123219- `benchmark.py` - Main benchmark script with restart functionality
124220- `docker_manager.py` - Docker container management for Embucket restarts
125221- `utils.py` - Connection utilities for Snowflake and Embucket
126- - `tpch_queries.py` - Query definitions derived from TPC-H
127- - `tpcds_queries.py` - Query definitions derived from TPC-DS (for future use)
222+ - `tpch/` - TPC-H benchmark utilities package (queries, DDL, table names)
223+ - `clickbench/` - ClickBench benchmark utilities package (queries, DDL, table names)
224+ - `tpcds/` - TPC-DS benchmark utilities package (queries, DDL, table names)
128225- `calculate_average.py` - Result averaging and analysis
129226- `config.py` - Configuration utilities
130227- `data_preparation.py` - Data preparation utilities
131228- `requirements.txt` - Python dependencies
132229- `env_example` - Example environment configuration file
133230- `infrastructure/` - Terraform infrastructure for EC2/Embucket deployment
134231- `tpch-datagen/` - TPC-H data generation infrastructure
135- - `tpch/` - TPC-H benchmark utilities package (queries, DDL, table names)
136- - `tpcds_ddl/` - TPC-DS table definitions for Embucket
137232
138- ## Customizing Benchmark Behavior
233+ ## Benchmark Types
139234
140- **Default**: The benchmark runs only Embucket tests for 3 iterations.
235+ ### TPC-H (Default)
236+ Derived from the TPC-H decision support benchmark. Includes 22 complex analytical queries testing various aspects of data warehousing performance.
141237
142- **To run both Snowflake and Embucket with comparisons**: Modify the `__main__` section in `benchmark.py`:
143- ```python
144- if __name__ == "__main__":
145-     for i in range(3):
146-         print(f"Run {i + 1} of 3")
147-         run_benchmark(i + 1)  # Change from run_embucket_benchmark(i + 1)
148- ```
238+ ### ClickBench
239+ Single-table analytical benchmark focusing on aggregation performance. Uses the ` hits ` table with web analytics data.
240+
241+ ### TPC-DS
242+ Derived from the TPC-DS decision support benchmark. More complex than TPC-H with 99 queries testing advanced analytical scenarios.
243+
244+ ## Environment Variables
245+
246+ The benchmark behavior can be controlled through environment variables in your ` .env ` file:
247+
248+ - `BENCHMARK_TYPE`: Default benchmark type (`tpch`, `clickbench`, `tpcds`)
249+ - `DATASET_PATH`: Path within S3 bucket for dataset location
250+ - `DATASET_S3_BUCKET`: S3 bucket containing benchmark datasets
251+ - `EMBUCKET_HOST`: EC2 instance IP for Embucket connection
252+ - `SSH_KEY_PATH`: Path to SSH private key for container restarts
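
One way to read these variables in Python is sketched below. It assumes the values from `.env` have already been loaded into the process environment; `BenchmarkSettings` and `load_settings` are hypothetical names, and the project's own `config.py` may work differently. The `tpch` and `~/.ssh/id_rsa` fallbacks follow the defaults stated in this README.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class BenchmarkSettings:
    benchmark_type: str
    dataset_path: Optional[str]
    dataset_s3_bucket: Optional[str]
    embucket_host: Optional[str]
    ssh_key_path: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> BenchmarkSettings:
    """Read benchmark settings from `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    return BenchmarkSettings(
        benchmark_type=env.get("BENCHMARK_TYPE", "tpch"),
        dataset_path=env.get("DATASET_PATH"),
        dataset_s3_bucket=env.get("DATASET_S3_BUCKET"),
        embucket_host=env.get("EMBUCKET_HOST"),
        # Expand "~" so the key path can be handed to an SSH client directly.
        ssh_key_path=os.path.expanduser(env.get("SSH_KEY_PATH", "~/.ssh/id_rsa")),
    )


# Passing a plain dict makes the sketch easy to try without touching os.environ.
settings = load_settings({"BENCHMARK_TYPE": "clickbench", "SSH_KEY_PATH": "/tmp/key"})
print(settings.benchmark_type, settings.ssh_key_path)
```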
149253
150254## Requirements
151255