Commit eb86f5b

andheroe and DanCodedThis
authored and committed
Add ClickBench benchmark + use Embucket experimental build (#1782)
1 parent 1d9f9d0 commit eb86f5b

14 files changed

+855
-67
lines changed

benchmark/README.md

Lines changed: 86 additions & 31 deletions
@@ -1,6 +1,6 @@
11
## Overview
22

3-
This benchmark tool executes queries derived from TPC-H against both Snowflake and Embucket with cache-clearing operations to ensure clean, cache-free performance measurements. For Snowflake, it uses warehouse suspend/resume operations. For Embucket, it restarts the Docker container before each query to eliminate internal caching. It provides detailed timing metrics including compilation time, execution time, and total elapsed time.
3+
This benchmark tool executes queries from multiple benchmark suites (TPC-H, ClickBench, TPC-DS) against both Snowflake and Embucket with cache-clearing operations to ensure clean, cache-free performance measurements. For Snowflake, it uses warehouse suspend/resume operations. For Embucket, it restarts the Docker container before each query to eliminate internal caching. It provides detailed timing metrics including compilation time, execution time, and total elapsed time.
44

55
## TPC Legal Considerations
66

@@ -14,9 +14,12 @@ Throughout this document and when talking about these benchmarks, you will see t
1414

1515
## Features
1616

17+
- **Multiple Benchmark Types**: Supports TPC-H, ClickBench, and TPC-DS benchmark suites
1718
- **Cache Isolation**:
1819
- **Snowflake**: Suspends and resumes warehouse before each query
1920
- **Embucket**: Restarts Docker container before each query to clear internal cache
21+
- **Flexible Caching Options**: Can run with or without cache clearing (`--no-cache` flag)
22+
- **Command Line Interface**: Full CLI support for system selection, benchmark type, and run configuration
2023
- **Result Cache Disabled**: Ensures no result caching affects benchmark results
2124
- **Comprehensive Metrics**: Tracks compilation time, execution time, and row counts
2225
- **CSV Export**: Saves results to CSV files for further analysis
@@ -51,37 +54,79 @@ SNOWFLAKE_WAREHOUSE=your_warehouse
5154

5255
**For Embucket (when using infrastructure):**
5356
```bash
54-
EMBUCKET_SQL_HOST=your_ec2_instance_ip
55-
EMBUCKET_SQL_PORT=3000
56-
EMBUCKET_SQL_PROTOCOL=http
57+
EMBUCKET_HOST=your_ec2_instance_ip
58+
EMBUCKET_PORT=3000
59+
EMBUCKET_PROTOCOL=http
5760
EMBUCKET_USER=embucket
5861
EMBUCKET_PASSWORD=embucket
5962
EMBUCKET_ACCOUNT=embucket
60-
EMBUCKET_DATABASE=embucket
61-
EMBUCKET_SCHEMA=public
63+
EMBUCKET_DATABASE=benchmark_database
64+
EMBUCKET_SCHEMA=benchmark_schema
6265
EMBUCKET_INSTANCE=your_instance_name
63-
EMBUCKET_DATASET=your_dataset_name
6466
SSH_KEY_PATH=~/.ssh/id_rsa
6567
```
6668

69+
**Benchmark Configuration:**
70+
```bash
71+
BENCHMARK_TYPE=tpch # Options: tpch, clickbench, tpcds
72+
DATASET_S3_BUCKET=embucket-testdata
73+
DATASET_PATH=tpch/01 # Path within S3 bucket
74+
SNOWFLAKE_WAREHOUSE_SIZE=XSMALL
75+
AWS_ACCESS_KEY_ID=your_aws_access_key_id
76+
AWS_SECRET_ACCESS_KEY=your_aws_secret_access_key
77+
```
78+
6779
## Usage
6880

69-
Run the benchmark:
81+
### Command Line Interface
82+
83+
The benchmark supports comprehensive command-line options:
84+
7085
```bash
86+
# Run both Snowflake and Embucket with TPC-H (default)
7187
python benchmark.py
88+
89+
# Run only Embucket with TPC-H
90+
python benchmark.py --system embucket
91+
92+
# Run only Snowflake with TPC-H
93+
python benchmark.py --system snowflake
94+
95+
# Run ClickBench on both systems
96+
python benchmark.py --benchmark-type clickbench
97+
98+
# Run TPC-DS on Embucket only
99+
python benchmark.py --system embucket --benchmark-type tpcds
100+
101+
# Run with caching enabled (no container restarts/warehouse suspends)
102+
python benchmark.py --system embucket
103+
104+
# Run with caching disabled (force cache clearing)
105+
python benchmark.py --system embucket --no-cache
106+
107+
# Custom number of runs and dataset path
108+
python benchmark.py --runs 5 --dataset-path tpch/100
72109
```
73110

74-
**Current Behavior**: By default, the benchmark runs **only Embucket** benchmarks for 3 iterations. To run both Snowflake and Embucket with comparisons, you need to modify the `__main__` section in `benchmark.py` to call `run_benchmark(i + 1)` instead of `run_embucket_benchmark(i + 1)`.
111+
### Command Line Arguments
112+
113+
- `--system`: Choose platform (`snowflake`, `embucket`, `both`) - default: `both`
114+
- `--runs`: Number of benchmark runs - default: `3`
115+
- `--benchmark-type`: Benchmark suite (`tpch`, `clickbench`, `tpcds`) - default: `tpch`
116+
- `--dataset-path`: Override DATASET_PATH environment variable
117+
- `--no-cache`: Force cache clearing (warehouse suspend for Snowflake, container restart for Embucket)
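The flags listed above can be sketched as a small `argparse` parser. This is only an illustration of the documented defaults and choices, not the project's actual code: `build_parser` and `resolve_dataset_path` are hypothetical names, and the rule that `--dataset-path` wins over `DATASET_PATH` is taken from the "Override" wording in the argument list.

```python
import argparse
import os


def build_parser(env=None):
    """Build a parser mirroring the documented flags (illustrative sketch)."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(description="Benchmark CLI sketch")
    parser.add_argument("--system", choices=["snowflake", "embucket", "both"], default="both")
    parser.add_argument("--runs", type=int, default=3)
    parser.add_argument(
        "--benchmark-type",
        choices=["tpch", "clickbench", "tpcds"],
        # BENCHMARK_TYPE supplies the default when the flag is omitted.
        default=env.get("BENCHMARK_TYPE", "tpch"),
    )
    parser.add_argument("--dataset-path", default=None)
    parser.add_argument("--no-cache", action="store_true")
    return parser


def resolve_dataset_path(args, env=None):
    """Hypothetical helper: the CLI flag wins, otherwise fall back to DATASET_PATH."""
    env = os.environ if env is None else env
    return args.dataset_path or env.get("DATASET_PATH")
```

Passing the environment in as a plain mapping keeps the sketch easy to test without touching the real process environment.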
118+
119+
### Benchmark Process
75120

76121
The benchmark will:
77-
1. Connect to the configured platform (Embucket by default, or both if modified)
78-
2. Execute each query derived from TPC-H with cache-clearing operations:
79-
- **Snowflake**: Warehouse suspend/resume before each query
80-
- **Embucket**: Docker container restart before each query
122+
1. Connect to the configured platform(s)
123+
2. Execute each query from the selected benchmark suite with cache-clearing operations:
124+
- **Snowflake**: Warehouse suspend/resume before each query (if `--no-cache`)
125+
- **Embucket**: Docker container restart before each query (if `--no-cache`)
81126
3. Collect performance metrics from query history
82127
4. Display results and comparisons (if both platforms are run)
83128
5. Save detailed results to CSV files
84-
6. Calculate averages after 3 runs are completed
129+
6. Calculate averages after all runs are completed
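The per-query loop described in the steps above can be sketched roughly as follows. The names `run_suite`, `execute`, and `clear_cache` are hypothetical stand-ins, and the wall-clock timing here is only a placeholder: the README says the real tool collects its metrics from query history.

```python
import time
from typing import Callable, Dict, List, Tuple


def run_suite(
    queries: List[Tuple[int, str]],
    execute: Callable[[str], int],
    clear_cache: Callable[[], None],
    no_cache: bool,
) -> List[Dict[str, float]]:
    """Run each query once, optionally clearing caches before every query."""
    results = []
    for number, sql in queries:
        if no_cache:
            # Stand-in for a warehouse suspend/resume or a container restart.
            clear_cache()
        started = time.perf_counter()
        rows = execute(sql)  # assumed to return the row count
        elapsed_ms = (time.perf_counter() - started) * 1000.0
        results.append({"query": number, "rows": rows, "elapsed_ms": elapsed_ms})
    return results
```

Injecting `execute` and `clear_cache` as callables lets the same loop serve both platforms in this sketch; how the actual script structures this is not shown in the diff.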
85130

86131
## Embucket Container Restart Functionality
87132

@@ -95,8 +140,8 @@ For Embucket benchmarks, the system automatically restarts the Docker container
95140
- Creates a fresh database connection and executes the query
96141

97142
**Requirements:**
98-
- `EMBUCKET_SQL_HOST` set to your EC2 instance IP
99-
- `EMBUCKET_INSTANCE` and `EMBUCKET_DATASET` for result organization
143+
- `EMBUCKET_HOST` set to your EC2 instance IP
144+
- `EMBUCKET_INSTANCE` for result organization
100145
- `SSH_KEY_PATH` pointing to your private key (default: `~/.ssh/id_rsa`)
101146
- SSH access to the EC2 instance running Embucket
102147

@@ -115,37 +160,47 @@ The benchmark provides:
115160
- **Total Times**: Aggregated compilation and execution times
116161

117162
**File Organization:**
118-
- Snowflake results: `snowflake_tpch_results/{schema}/{warehouse}/`
119-
- Embucket results: `embucket_tpch_results/{dataset}/{instance}/`
163+
- Snowflake results: `snowflake_{benchmark_type}_results/{schema}/{warehouse}/`
164+
- Embucket results: `embucket_{benchmark_type}_results/{dataset}/{instance}/`
165+
166+
Where `{benchmark_type}` is one of: `tpch`, `clickbench`, or `tpcds`
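As an illustration of the directory layout above, a minimal sketch of a path builder might look like this. `results_dir` is a hypothetical helper, not a function from the repository; `group` and `target` stand for `{schema}`/`{warehouse}` on Snowflake and `{dataset}`/`{instance}` on Embucket.

```python
from pathlib import PurePosixPath

SYSTEMS = ("snowflake", "embucket")
BENCHMARK_TYPES = ("tpch", "clickbench", "tpcds")


def results_dir(system: str, benchmark_type: str, group: str, target: str) -> str:
    """Compose '<system>_<benchmark_type>_results/<group>/<target>'."""
    if system not in SYSTEMS:
        raise ValueError(f"unknown system: {system}")
    if benchmark_type not in BENCHMARK_TYPES:
        raise ValueError(f"unknown benchmark type: {benchmark_type}")
    return str(PurePosixPath(f"{system}_{benchmark_type}_results") / group / target)
```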
120167

121168
## Files
122169

123170
- `benchmark.py` - Main benchmark script with restart functionality
124171
- `docker_manager.py` - Docker container management for Embucket restarts
125172
- `utils.py` - Connection utilities for Snowflake and Embucket
126-
- `tpch_queries.py` - Query definitions derived from TPC-H
127-
- `tpcds_queries.py` - Query definitions derived from TPC-DS (for future use)
173+
- `tpch/` - TPC-H benchmark utilities package (queries, DDL, table names)
174+
- `clickbench/` - ClickBench benchmark utilities package (queries, DDL, table names)
175+
- `tpcds/` - TPC-DS benchmark utilities package (queries, DDL, table names)
128176
- `calculate_average.py` - Result averaging and analysis
129177
- `config.py` - Configuration utilities
130178
- `data_preparation.py` - Data preparation utilities
131179
- `requirements.txt` - Python dependencies
132180
- `env_example` - Example environment configuration file
133181
- `infrastructure/` - Terraform infrastructure for EC2/Embucket deployment
134182
- `tpch-datagen/` - TPC-H data generation infrastructure
135-
- `tpch/` - TPC-H benchmark utilities package (queries, DDL, table names)
136-
- `tpcds_ddl/` - TPC-DS table definitions for Embucket
137183

138-
## Customizing Benchmark Behavior
184+
## Benchmark Types
139185

140-
**Default**: The benchmark runs only Embucket tests for 3 iterations.
186+
### TPC-H (Default)
187+
Derived from the TPC-H decision support benchmark. Includes 22 complex analytical queries testing various aspects of data warehousing performance.
141188

142-
**To run both Snowflake and Embucket with comparisons**: Modify the `__main__` section in `benchmark.py`:
143-
```python
144-
if __name__ == "__main__":
145-
for i in range(3):
146-
print(f"Run {i + 1} of 3")
147-
run_benchmark(i + 1) # Change from run_embucket_benchmark(i + 1)
148-
```
189+
### ClickBench
190+
Single-table analytical benchmark focusing on aggregation performance. Uses the `hits` table with web analytics data.
191+
192+
### TPC-DS
193+
Derived from the TPC-DS decision support benchmark. More complex than TPC-H, with 99 queries testing advanced analytical scenarios. Note that `benchmark.py` currently raises `NotImplementedError` when `tpcds` is selected.
194+
195+
## Environment Variables
196+
197+
The benchmark behavior can be controlled through environment variables in your `.env` file:
198+
199+
- `BENCHMARK_TYPE`: Default benchmark type (`tpch`, `clickbench`, `tpcds`)
200+
- `DATASET_PATH`: Path within S3 bucket for dataset location
201+
- `DATASET_S3_BUCKET`: S3 bucket containing benchmark datasets
202+
- `EMBUCKET_HOST`: EC2 instance IP for Embucket connection
203+
- `SSH_KEY_PATH`: Path to SSH private key for container restarts
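A minimal sketch of reading these variables into one object, assuming they are already present in the process environment (for example after the `.env` file has been loaded). `BenchmarkEnv` and `load_env` are hypothetical names; the only defaults used are those stated elsewhere in this README (`tpch` for the benchmark type and `~/.ssh/id_rsa` for the SSH key).

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class BenchmarkEnv:
    benchmark_type: str
    dataset_path: Optional[str]
    dataset_s3_bucket: Optional[str]
    embucket_host: Optional[str]
    ssh_key_path: str


def load_env(env: Optional[Mapping[str, str]] = None) -> BenchmarkEnv:
    """Collect the documented variables; unset ones without a known default become None."""
    env = os.environ if env is None else env
    return BenchmarkEnv(
        benchmark_type=env.get("BENCHMARK_TYPE", "tpch"),
        dataset_path=env.get("DATASET_PATH"),
        dataset_s3_bucket=env.get("DATASET_S3_BUCKET"),
        embucket_host=env.get("EMBUCKET_HOST"),
        ssh_key_path=os.path.expanduser(env.get("SSH_KEY_PATH", "~/.ssh/id_rsa")),
    )
```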
149204

150205
## Requirements
151206

benchmark/benchmark.py

Lines changed: 4 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -7,6 +7,7 @@
77
from utils import create_snowflake_connection
88
from utils import create_embucket_connection
99
from tpch import parametrize_tpch_queries
10+
from clickbench import parametrize_clickbench_queries
1011
from docker_manager import create_docker_manager
1112
from constants import SystemType
1213

@@ -286,6 +287,8 @@ def get_queries_for_benchmark(benchmark_type: str, for_embucket: bool) -> List[T
286287
"""Get appropriate queries based on the benchmark type."""
287288
if benchmark_type == "tpch":
288289
return parametrize_tpch_queries(fully_qualified_names_for_embucket=for_embucket)
290+
elif benchmark_type == "clickbench":
291+
return parametrize_clickbench_queries(fully_qualified_names_for_embucket=for_embucket)
289292
elif benchmark_type == "tpcds":
290293
raise NotImplementedError("TPC-DS benchmarks not yet implemented")
291294
else:
@@ -433,7 +436,7 @@ def parse_args():
433436
parser = argparse.ArgumentParser(description="Run benchmarks on Snowflake and/or Embucket")
434437
parser.add_argument("--system", choices=["snowflake", "embucket", "both"], default="both")
435438
parser.add_argument("--runs", type=int, default=3)
436-
parser.add_argument("--benchmark-type", choices=["tpch", "tpcds"], default=os.environ.get("BENCHMARK_TYPE", "tpch"))
439+
parser.add_argument("--benchmark-type", choices=["tpch", "clickbench", "tpcds"], default=os.environ.get("BENCHMARK_TYPE", "tpch"))
437440
parser.add_argument("--dataset-path", help="Override the DATASET_PATH environment variable")
438441
parser.add_argument("--no-cache", action="store_true", help="Disable caching (force warehouse suspend and USE_CACHED_RESULT=False for Snowflake, force container restart for Embucket)")
439442
return parser.parse_args()

benchmark/clickbench/__init__.py

Lines changed: 45 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,45 @@
1+
"""
2+
ClickBench benchmark utilities package.
3+
4+
This package contains all ClickBench related functionality including:
5+
- Table name configuration and parametrization
6+
- Query definitions with parametrized table names
7+
- DDL statements with parametrized table names
8+
9+
Main exports:
10+
- parametrize_clickbench_queries: Parametrize ClickBench queries (requires explicit parameter)
11+
- parametrize_clickbench_ddl: Parametrize ClickBench DDL statements (requires explicit parameter)
12+
- CLICKBENCH_TABLE_NAMES: Raw table name mappings
13+
- get_table_names: Get parametrized table names (requires explicit parameter)
14+
- parametrize_clickbench_statements: Generic parametrization function (requires explicit parameter)
15+
16+
Note: All functions require explicit fully_qualified_names_for_embucket parameter.
17+
No pre-computed constants are provided to enforce explicit parameter usage.
18+
"""
19+
20+
from .clickbench_table_names import (
21+
CLICKBENCH_TABLE_NAMES,
22+
get_table_names,
23+
parametrize_clickbench_statements
24+
)
25+
26+
from .clickbench_queries import (
27+
parametrize_clickbench_queries,
28+
)
29+
30+
from .clickbench_ddl import (
31+
parametrize_clickbench_ddl,
32+
)
33+
34+
__all__ = [
35+
# Table names and core functions
36+
'CLICKBENCH_TABLE_NAMES',
37+
'get_table_names',
38+
'parametrize_clickbench_statements',
39+
40+
# Query functions
41+
'parametrize_clickbench_queries',
42+
43+
# DDL functions
44+
'parametrize_clickbench_ddl',
45+
]
Lines changed: 134 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,134 @@
1+
import os
2+
3+
from .clickbench_table_names import parametrize_clickbench_statements
4+
5+
# ClickBench DDL statement with parametrized table name
6+
_CLICKBENCH_DDL_RAW = [
7+
(
8+
"hits",
9+
"""
10+
-- Snowflake-like DDL for ClickBench hits table
11+
CREATE OR REPLACE TABLE {HITS_TABLE} (
12+
WatchID BIGINT,
13+
JavaEnable SMALLINT,
14+
Title VARCHAR,
15+
GoodEvent SMALLINT,
16+
EventTime BIGINT,
17+
EventDate SMALLINT,
18+
CounterID INTEGER,
19+
ClientIP INTEGER,
20+
RegionID INTEGER,
21+
UserID BIGINT,
22+
CounterClass SMALLINT,
23+
OS SMALLINT,
24+
UserAgent SMALLINT,
25+
URL VARCHAR,
26+
Referer VARCHAR,
27+
IsRefresh SMALLINT,
28+
RefererCategoryID SMALLINT,
29+
RefererRegionID INTEGER,
30+
URLCategoryID SMALLINT,
31+
URLRegionID INTEGER,
32+
ResolutionWidth SMALLINT,
33+
ResolutionHeight SMALLINT,
34+
ResolutionDepth SMALLINT,
35+
FlashMajor SMALLINT,
36+
FlashMinor SMALLINT,
37+
FlashMinor2 VARCHAR,
38+
NetMajor SMALLINT,
39+
NetMinor SMALLINT,
40+
UserAgentMajor SMALLINT,
41+
UserAgentMinor VARCHAR,
42+
CookieEnable SMALLINT,
43+
JavascriptEnable SMALLINT,
44+
IsMobile SMALLINT,
45+
MobilePhone SMALLINT,
46+
MobilePhoneModel VARCHAR,
47+
Params VARCHAR,
48+
IPNetworkID INTEGER,
49+
TraficSourceID SMALLINT,
50+
SearchEngineID SMALLINT,
51+
SearchPhrase VARCHAR,
52+
AdvEngineID SMALLINT,
53+
IsArtifical SMALLINT,
54+
WindowClientWidth SMALLINT,
55+
WindowClientHeight SMALLINT,
56+
ClientTimeZone SMALLINT,
57+
ClientEventTime BIGINT,
58+
SilverlightVersion1 SMALLINT,
59+
SilverlightVersion2 SMALLINT,
60+
SilverlightVersion3 INTEGER,
61+
SilverlightVersion4 SMALLINT,
62+
PageCharset VARCHAR,
63+
CodeVersion INTEGER,
64+
IsLink SMALLINT,
65+
IsDownload SMALLINT,
66+
IsNotBounce SMALLINT,
67+
FUniqID BIGINT,
68+
OriginalURL VARCHAR,
69+
HID INTEGER,
70+
IsOldCounter SMALLINT,
71+
IsEvent SMALLINT,
72+
IsParameter SMALLINT,
73+
DontCountHits SMALLINT,
74+
WithHash SMALLINT,
75+
HitColor VARCHAR,
76+
LocalEventTime BIGINT,
77+
Age SMALLINT,
78+
Sex SMALLINT,
79+
Income SMALLINT,
80+
Interests SMALLINT,
81+
Robotness SMALLINT,
82+
RemoteIP INTEGER,
83+
WindowName INTEGER,
84+
OpenerName INTEGER,
85+
HistoryLength SMALLINT,
86+
BrowserLanguage VARCHAR,
87+
BrowserCountry VARCHAR,
88+
SocialNetwork VARCHAR,
89+
SocialAction VARCHAR,
90+
HTTPError SMALLINT,
91+
SendTiming INTEGER,
92+
DNSTiming INTEGER,
93+
ConnectTiming INTEGER,
94+
ResponseStartTiming INTEGER,
95+
ResponseEndTiming INTEGER,
96+
FetchTiming INTEGER,
97+
SocialSourceNetworkID SMALLINT,
98+
SocialSourcePage VARCHAR,
99+
ParamPrice BIGINT,
100+
ParamOrderID VARCHAR,
101+
ParamCurrency VARCHAR,
102+
ParamCurrencyID SMALLINT,
103+
OpenstatServiceName VARCHAR,
104+
OpenstatCampaignID VARCHAR,
105+
OpenstatAdID VARCHAR,
106+
OpenstatSourceID VARCHAR,
107+
UTMSource VARCHAR,
108+
UTMMedium VARCHAR,
109+
UTMCampaign VARCHAR,
110+
UTMContent VARCHAR,
111+
UTMTerm VARCHAR,
112+
FromTag VARCHAR,
113+
HasGCLID SMALLINT,
114+
RefererHash BIGINT,
115+
URLHash BIGINT,
116+
CLID INTEGER
117+
);
118+
"""
119+
),
120+
]
121+
122+
123+
def parametrize_clickbench_ddl(fully_qualified_names_for_embucket):
124+
"""
125+
Replace table name placeholders in ClickBench DDL statements with actual table names.
126+
127+
Args:
128+
fully_qualified_names_for_embucket (bool): Required. If True, use EMBUCKET_DATABASE.EMBUCKET_SCHEMA.tablename format.
129+
If False, use just the default table names.
130+
131+
Returns:
132+
list: A list of (table_name, parametrized_ddl) tuples.
133+
"""
134+
return parametrize_clickbench_statements(_CLICKBENCH_DDL_RAW, fully_qualified_names_for_embucket)
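The body of `parametrize_clickbench_statements` is not part of this diff, so the following is only a sketch of how the `{HITS_TABLE}` placeholder could be filled, based on the docstring above. The names `table_names` and `parametrize_statements`, and the `database`/`schema` keyword arguments, are assumptions made for illustration; the real module may read `EMBUCKET_DATABASE` and `EMBUCKET_SCHEMA` from the environment instead.

```python
from typing import Dict, List, Tuple

# Assumed mapping from placeholder name to bare table name.
_BASE_TABLE_NAMES = {"HITS_TABLE": "hits"}


def table_names(fully_qualified: bool, database: str = "benchmark_database",
                schema: str = "benchmark_schema") -> Dict[str, str]:
    """Return placeholder -> table name, optionally as database.schema.table."""
    if not fully_qualified:
        return dict(_BASE_TABLE_NAMES)
    return {key: f"{database}.{schema}.{name}" for key, name in _BASE_TABLE_NAMES.items()}


def parametrize_statements(statements: List[Tuple[str, str]], fully_qualified: bool,
                           **names_kwargs: str) -> List[Tuple[str, str]]:
    """Fill the placeholders in each (label, sql) pair using str.format."""
    names = table_names(fully_qualified, **names_kwargs)
    return [(label, sql.format(**names)) for label, sql in statements]
```

One caveat of this approach is that `str.format` treats every brace as a placeholder, so SQL containing literal `{` or `}` would need them doubled or a different substitution scheme.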
