
Commit cf0f3af

Authored by DanCodedThis, andheroe, rampage644, claude, and YevheniiNiestierov
experimental: Fix rebase (#1804)
* Add ClickBench benchmark + use Embucket experimental build (#1782)

* Add missing #[test] attribute to test_make_cors_middleware (#1783)

  The test function was not being recognized by the test runner because it was missing the #[test] attribute.

  🤖 Generated with [Claude Code](https://claude.com/claude-code)

  Co-authored-by: Claude <noreply@anthropic.com>

* api ui queries schema fix (#1789)

* feat: add option to disable result caching for Snowflake benchmarks, … (#1786)

  * feat: add option to disable result caching for Snowflake benchmarks, update logging
  * feat: improve run type handling and update result paths logic
  * feat: update path name logic
  * feat: update path name in get_results_path logic here as well
  * Update benchmark/benchmark.py
    Co-authored-by: andheroe <3786879+andheroe@users.noreply.github.com>
  * feat: make oneliner from previous results_folder logic

  ---------

  Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
  Co-authored-by: andheroe <3786879+andheroe@users.noreply.github.com>

* docs: update README to clarify caching options for benchmarks (#1797)

  Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* api: UI OpenAPI snake_case schema fix (#1799)

  * schema fixes + requested changes from pr #1749
  * more schema fixes
  * camelCase
  * camelCase for ErrorResponse
  * flaky test

* [UI] Codegen rerun, NCU (#1791)

  * codegen
  * API_URL env
  * openapi
  * ncu

* CI: Generate build artifacts (dist.tar) [skip ci]

* Basic benchmarking crate (#1784)

  * Merge lock
  * Basic benchmarking crate
  * Basic benchmarking crate
  * Basic benchmarking crate
  * Fix readme
  * Fix versions
  * Create catalog
  * Fix clippy
  * Fix cargo

* workflow complete (#1792)

* remove `dedicated_executor` (#1772)

* Yaro/slatedb durability config2 (#1773)

  * update slatedb to v0.8.2
  * use less durable but faster option when put history items

* [UI] Static hostname issue fix (Run-time Placeholder solution) (#1770)

* CI: Generate build artifacts (dist.tar) [skip ci]

  ---------

  Co-authored-by: Nikita Striuk <32720808+nikitastryuk@users.noreply.github.com>
  Co-authored-by: github-actions[bot] <1310417+github-actions[bot]@users.noreply.github.com>

* push main into experimental (#1775)

  * [UI] Static hostname issue fix (Run-time Placeholder solution) (#1770)
  * CI: Generate build artifacts (dist.tar) [skip ci]

  ---------

  Co-authored-by: Nikita Striuk <32720808+nikitastryuk@users.noreply.github.com>
  Co-authored-by: github-actions[bot] <1310417+github-actions[bot]@users.noreply.github.com>

---------

Co-authored-by: andheroe <3786879+andheroe@users.noreply.github.com>
Co-authored-by: Sergei Turukin <rampage644@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Yevhenii Niestierov <123905136+YevheniiNiestierov@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Nikita Striuk <32720808+nikitastryuk@users.noreply.github.com>
Co-authored-by: github-actions[bot] <1310417+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Artem Osipov <59066880+osipovartem@users.noreply.github.com>
Co-authored-by: Yaroslav Litvinov <yaroslav@embucket.com>
1 parent 1d9f9d0 commit cf0f3af

File tree: 200 files changed, +5556 additions, −2924 deletions


.env.example

Lines changed: 1 addition & 0 deletions
@@ -7,6 +7,7 @@ CORS_ENABLED=true
 CORS_ALLOW_ORIGIN=http://localhost:8080
 # Needed for UI to work
 JWT_SECRET=secret
+API_URL=http://localhost:3000
 
 # Option 1 (Memory)
 # SlateDB storage settings
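The new `API_URL` entry lines up with the "API_URL env" item under "[UI] Codegen rerun, NCU (#1791)" in the commit message: the UI appears to take its API base URL from the environment rather than a baked-in hostname. A minimal local setup following this example file might look like the sketch below; the values are the example defaults, and launching `embucketd` from a local release build is an assumption, not something the diff prescribes.

```bash
# Sketch only: export the example values into the shell, then start the server
# binary built by `cargo build --release --bin embucketd`. Assumes you copied
# .env.example to .env and that its values stay shell-compatible, as shown.
set -a            # auto-export every variable assigned below
source .env
set +a
./target/release/embucketd
```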
New file: GitHub Actions workflow "Experimental rebase on main"

Lines changed: 55 additions & 0 deletions

@@ -0,0 +1,55 @@
+name: Experimental rebase on main
+
+on:
+  push:
+    branches:
+      - main
+
+permissions:
+  contents: write
+
+concurrency:
+  group: rebase-experimental
+  cancel-in-progress: true
+
+jobs:
+  rebase-experimental:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Checkout full history
+        uses: actions/checkout@v4
+        with:
+          fetch-depth: 0
+
+      - name: Configure git
+        run: |
+          git config --global user.name "github-actions[bot]"
+          git config --global user.email "41898282+github-actions[bot]@users.noreply.github.com"
+
+      - name: Ensure experimental exists
+        run: |
+          if ! git ls-remote --heads origin experimental | grep -q refs/heads/experimental; then
+            git checkout -b experimental
+            git push origin experimental
+          fi
+
+      - name: Fetch refs
+        run: |
+          git fetch origin main
+          git fetch origin experimental
+
+      - name: Rebase experimental onto main and push (force-with-lease)
+        run: |
+          set -euo pipefail
+          git checkout -B experimental origin/experimental
+          if git rebase --rebase-merges --no-rebase-merges origin/main; then
+            if ! git diff --quiet origin/experimental...HEAD; then
+              git push --force-with-lease origin experimental
+            else
+              echo "No changes to push."
+            fi
+          else
+            git rebase --abort || true
+            echo "Rebase would conflict. Not pushing."
+            exit 1
+          fi
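The job's final step only force-pushes when the rebase succeeds and actually moves `experimental`, and it uses `--force-with-lease` so the push is refused if someone else updated the branch in the meantime. For troubleshooting, a rough local equivalent of that step is sketched below; it uses a plain `git rebase` instead of the workflow's `--rebase-merges --no-rebase-merges` pair, so treat it as a sketch rather than an exact reproduction.

```bash
# Sketch: replay the workflow's rebase-and-push logic from a clone whose
# `origin` remote has both `main` and `experimental`.
set -euo pipefail
git fetch origin main experimental
git checkout -B experimental origin/experimental
if git rebase origin/main; then
  # Only push when the rebase actually changed the branch tip.
  if ! git diff --quiet origin/experimental...HEAD; then
    git push --force-with-lease origin experimental
  else
    echo "No changes to push."
  fi
else
  git rebase --abort || true
  echo "Rebase would conflict. Not pushing."
fi
```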

Cargo.lock

Lines changed: 1 addition & 2 deletions
Some generated files are not rendered by default.

Cargo.toml

Lines changed: 2 additions & 1 deletion
@@ -16,7 +16,8 @@ members = [
     "crates/core-history",
     "crates/core-metastore",
     "crates/core-utils",
-    "crates/api-sessions"
+    "crates/api-sessions",
+    "crates/benchmarks"
 ]
 resolver = "2"
 package.license-file = "LICENSE"

Dockerfile

Lines changed: 14 additions & 6 deletions
@@ -11,22 +11,28 @@ RUN apt-get update && apt-get install -y \
     ca-certificates \
     && rm -rf /var/lib/apt/lists/*
 
-# Copy source code
+# Copy all source code, including the pre-built frontend and entrypoint script
 COPY . .
 
 # Build the application with optimizations
 RUN cargo build --release --bin embucketd
 
-# Stage 4: Final runtime image
+# Stage 2: Final runtime image
 FROM gcr.io/distroless/cc-debian12 AS runtime
 
-# Set working directory
-USER nonroot:nonroot
 WORKDIR /app
 
-# Copy the binary and required files
+# Copy the compiled binary, API spec, frontend build, and entrypoint script
 COPY --from=builder /app/target/release/embucketd ./embucketd
 COPY --from=builder /app/rest-catalog-open-api.yaml ./rest-catalog-open-api.yaml
+COPY --from=builder /app/frontend/dist ./dist
+COPY --from=builder /app/entrypoint.sh /usr/local/bin/entrypoint.sh
+
+# Make the script executable and ensure the nonroot user can modify app files
+RUN chmod +x /usr/local/bin/entrypoint.sh && chown -R nonroot:nonroot /app
+
+# Switch to a non-privileged user
+USER nonroot:nonroot
 
 # Expose port (adjust as needed)
 EXPOSE 8080
@@ -37,5 +43,7 @@ ENV FILE_STORAGE_PATH=data/
 ENV BUCKET_HOST=0.0.0.0
 ENV JWT_SECRET=63f4945d921d599f27ae4fdf5bada3f1
 
-# Default command
+# Set the entrypoint to our script
+ENTRYPOINT ["/usr/local/bin/entrypoint.sh"]
+
 CMD ["./embucketd"]
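With these changes the runtime stage ships the pre-built frontend and hands startup to `entrypoint.sh`, so a container can be configured through environment variables at `docker run` time. A minimal sketch follows; the `embucket:local` tag and the chosen values are illustrative assumptions, not taken from the diff.

```bash
# Build the image from this Dockerfile and run it. The environment variables
# correspond to the ENV defaults above plus the API_URL added to .env.example;
# the tag and values here are assumptions for illustration.
docker build -t embucket:local .
docker run --rm \
  -p 8080:8080 \
  -e JWT_SECRET=secret \
  -e API_URL=http://localhost:3000 \
  embucket:local
```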

benchmark/README.md

Lines changed: 135 additions & 31 deletions
@@ -1,6 +1,6 @@
 ## Overview
 
-This benchmark tool executes queries derived from TPC-H against both Snowflake and Embucket with cache-clearing operations to ensure clean, cache-free performance measurements. For Snowflake, it uses warehouse suspend/resume operations. For Embucket, it restarts the Docker container before each query to eliminate internal caching. It provides detailed timing metrics including compilation time, execution time, and total elapsed time.
+This benchmark tool executes queries from multiple benchmark suites (TPC-H, ClickBench, TPC-DS) against both Snowflake and Embucket with cache-clearing operations to ensure clean, cache-free performance measurements. For Snowflake, it uses warehouse suspend/resume operations. For Embucket, it restarts the Docker container before each query to eliminate internal caching. It provides detailed timing metrics including compilation time, execution time, and total elapsed time.
 
 ## TPC Legal Considerations
 
@@ -14,9 +14,12 @@ Throughout this document and when talking about these benchmarks, you will see t
 
 ## Features
 
+- **Multiple Benchmark Types**: Supports TPC-H, ClickBench, and TPC-DS benchmark suites
 - **Cache Isolation**:
   - **Snowflake**: Suspends and resumes warehouse before each query
   - **Embucket**: Restarts Docker container before each query to clear internal cache
+- **Flexible Caching Options**: Can run with or without cache clearing (`--no-cache` flag)
+- **Command Line Interface**: Full CLI support for system selection, benchmark type, and run configuration
 - **Result Cache Disabled**: Ensures no result caching affects benchmark results
 - **Comprehensive Metrics**: Tracks compilation time, execution time, and row counts
 - **CSV Export**: Saves results to CSV files for further analysis
@@ -51,37 +54,128 @@ SNOWFLAKE_WAREHOUSE=your_warehouse
 
 **For Embucket (when using infrastructure):**
 ```bash
-EMBUCKET_SQL_HOST=your_ec2_instance_ip
-EMBUCKET_SQL_PORT=3000
-EMBUCKET_SQL_PROTOCOL=http
+EMBUCKET_HOST=your_ec2_instance_ip
+EMBUCKET_PORT=3000
+EMBUCKET_PROTOCOL=http
 EMBUCKET_USER=embucket
 EMBUCKET_PASSWORD=embucket
 EMBUCKET_ACCOUNT=embucket
-EMBUCKET_DATABASE=embucket
-EMBUCKET_SCHEMA=public
+EMBUCKET_DATABASE=benchmark_database
+EMBUCKET_SCHEMA=benchmark_schema
 EMBUCKET_INSTANCE=your_instance_name
-EMBUCKET_DATASET=your_dataset_name
 SSH_KEY_PATH=~/.ssh/id_rsa
 ```
 
+**Benchmark Configuration:**
+```bash
+BENCHMARK_TYPE=tpch # Options: tpch, clickbench, tpcds
+DATASET_S3_BUCKET=embucket-testdata
+DATASET_PATH=tpch/01 # Path within S3 bucket
+SNOWFLAKE_WAREHOUSE_SIZE=XSMALL
+AWS_ACCESS_KEY_ID=your_aws_access_key_id
+AWS_SECRET_ACCESS_KEY=your_aws_secret_access_key
+```
+
 ## Usage
 
-Run the benchmark:
+### Command Line Interface
+
+The benchmark supports comprehensive command-line options:
+
 ```bash
+# Run both Snowflake and Embucket with TPC-H (default)
 python benchmark.py
+
+# Run only Embucket with TPC-H
+python benchmark.py --system embucket
+
+# Run only Snowflake with TPC-H
+python benchmark.py --system snowflake
+
+# Run ClickBench on both systems
+python benchmark.py --benchmark-type clickbench
+
+# Run TPC-DS on Embucket only
+python benchmark.py --system embucket --benchmark-type tpcds
+
+# Run with caching enabled (no container restarts/warehouse suspends)
+python benchmark.py --system embucket
+
+# Run with caching disabled (force cache clearing)
+python benchmark.py --system embucket --no-cache
+
+# Custom number of runs and dataset path
+python benchmark.py --runs 5 --dataset-path tpch/100
 ```
 
-**Current Behavior**: By default, the benchmark runs **only Embucket** benchmarks for 3 iterations. To run both Snowflake and Embucket with comparisons, you need to modify the `__main__` section in `benchmark.py` to call `run_benchmark(i + 1)` instead of `run_embucket_benchmark(i + 1)`.
+### Command Line Arguments
+
+- `--system`: Choose platform (`snowflake`, `embucket`, `both`) - default: `both`
+- `--runs`: Number of benchmark runs - default: `3`
+- `--benchmark-type`: Benchmark suite (`tpch`, `clickbench`, `tpcds`) - default: `tpch`
+- `--dataset-path`: Override DATASET_PATH environment variable
+- `--cold-runs`: Force cache clearing (warehouse suspend for Snowflake, container restart for Embucket)
+- `--disable-result-cache`: Disable Snowflake's result cache only (USE_CACHED_RESULT=FALSE), no effect on Embucket
+
+## Caching Configurations
+
+### Snowflake Caching Options
+
+- **Cold run**: `--cold-runs`
+  - Suspends warehouse between queries
+  - Automatically disables result cache
+  - Results stored in `cold/` folder
+
+- **Warm run with result cache**: *(default, no flags)*
+  - Keeps warehouse active between queries
+  - Enables result cache (USE_CACHED_RESULT=TRUE)
+  - Results stored in `warm/` folder
+
+- **Warm run without result cache**: `--disable-result-cache`
+  - Keeps warehouse active between queries
+  - Disables result cache (USE_CACHED_RESULT=FALSE)
+  - Results stored in `warm_no_result_cache/` folder
+
+### Embucket Caching Options
+
+- **Cold run**: `--cold-runs`
+  - Restarts container between queries
  - Results stored in `cold/` folder
+
+- **Warm run**: *(default, no flags)*
+  - Keeps container running between queries
+  - Results stored in `warm/` folder
+
+### Example Usage
+
+```bash
+# Default: warm run (caching enabled) for both systems
+python benchmark.py
+
+# Cold run (cache clearing) for both systems
+python benchmark.py --cold-runs
+
+# Warm run with result cache disabled for Snowflake
+python benchmark.py --system snowflake --disable-result-cache
+
+# Cold run for Embucket only
+python benchmark.py --system embucket --cold-runs
+
+# Multiple runs with warm caching for both systems
+python benchmark.py --runs 5
+```
+
+### Benchmark Process
 
 The benchmark will:
-1. Connect to the configured platform (Embucket by default, or both if modified)
-2. Execute each query derived from TPC-H with cache-clearing operations:
-   - **Snowflake**: Warehouse suspend/resume before each query
-   - **Embucket**: Docker container restart before each query
+1. Connect to the configured platform(s)
+2. Execute each query from the selected benchmark suite with cache-clearing operations:
+   - **Snowflake**: Warehouse suspend/resume before each query (if `--no-cache`)
+   - **Embucket**: Docker container restart before each query (if `--no-cache`)
 3. Collect performance metrics from query history
 4. Display results and comparisons (if both platforms are run)
 5. Save detailed results to CSV files
-6. Calculate averages after 3 runs are completed
+6. Calculate averages after all runs are completed
 
 ## Embucket Container Restart Functionality
 
@@ -95,8 +189,8 @@ For Embucket benchmarks, the system automatically restarts the Docker container
 - Creates a fresh database connection and executes the query
 
 **Requirements:**
-- `EMBUCKET_SQL_HOST` set to your EC2 instance IP
-- `EMBUCKET_INSTANCE` and `EMBUCKET_DATASET` for result organization
+- `EMBUCKET_HOST` set to your EC2 instance IP
+- `EMBUCKET_INSTANCE` for result organization
 - `SSH_KEY_PATH` pointing to your private key (default: `~/.ssh/id_rsa`)
 - SSH access to the EC2 instance running Embucket
 
@@ -115,37 +209,47 @@ The benchmark provides:
 - **Total Times**: Aggregated compilation and execution times
 
 **File Organization:**
-- Snowflake results: `snowflake_tpch_results/{schema}/{warehouse}/`
-- Embucket results: `embucket_tpch_results/{dataset}/{instance}/`
+- Snowflake results: `snowflake_{benchmark_type}_results/{schema}/{warehouse}/`
+- Embucket results: `embucket_{benchmark_type}_results/{dataset}/{instance}/`
+
+Where `{benchmark_type}` is one of: `tpch`, `clickbench`, or `tpcds`
 
 ## Files
 
 - `benchmark.py` - Main benchmark script with restart functionality
 - `docker_manager.py` - Docker container management for Embucket restarts
 - `utils.py` - Connection utilities for Snowflake and Embucket
-- `tpch_queries.py` - Query definitions derived from TPC-H
-- `tpcds_queries.py` - Query definitions derived from TPC-DS (for future use)
+- `tpch/` - TPC-H benchmark utilities package (queries, DDL, table names)
+- `clickbench/` - ClickBench benchmark utilities package (queries, DDL, table names)
+- `tpcds/` - TPC-DS benchmark utilities package (queries, DDL, table names)
 - `calculate_average.py` - Result averaging and analysis
 - `config.py` - Configuration utilities
 - `data_preparation.py` - Data preparation utilities
 - `requirements.txt` - Python dependencies
 - `env_example` - Example environment configuration file
 - `infrastructure/` - Terraform infrastructure for EC2/Embucket deployment
 - `tpch-datagen/` - TPC-H data generation infrastructure
-- `tpch/` - TPC-H benchmark utilities package (queries, DDL, table names)
-- `tpcds_ddl/` - TPC-DS table definitions for Embucket
 
-## Customizing Benchmark Behavior
+## Benchmark Types
 
-**Default**: The benchmark runs only Embucket tests for 3 iterations.
+### TPC-H (Default)
+Derived from the TPC-H decision support benchmark. Includes 22 complex analytical queries testing various aspects of data warehousing performance.
 
-**To run both Snowflake and Embucket with comparisons**: Modify the `__main__` section in `benchmark.py`:
-```python
-if __name__ == "__main__":
-    for i in range(3):
-        print(f"Run {i + 1} of 3")
-        run_benchmark(i + 1) # Change from run_embucket_benchmark(i + 1)
-```
+### ClickBench
+Single-table analytical benchmark focusing on aggregation performance. Uses the `hits` table with web analytics data.
+
+### TPC-DS
+Derived from the TPC-DS decision support benchmark. More complex than TPC-H with 99 queries testing advanced analytical scenarios.
+
+## Environment Variables
+
+The benchmark behavior can be controlled through environment variables in your `.env` file:
+
+- `BENCHMARK_TYPE`: Default benchmark type (`tpch`, `clickbench`, `tpcds`)
+- `DATASET_PATH`: Path within S3 bucket for dataset location
+- `DATASET_S3_BUCKET`: S3 bucket containing benchmark datasets
+- `EMBUCKET_HOST`: EC2 instance IP for Embucket connection
+- `SSH_KEY_PATH`: Path to SSH private key for container restarts
 
 ## Requirements
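The CLI flags documented in the updated README compose freely, so a single invocation can pick the suite, the platform, the run count, and the caching mode at once. A sketch of such a combined run is shown below; the dataset path value is an illustrative assumption, not one documented in the diff.

```bash
# Cold ClickBench run against Embucket only, five iterations, with an explicit
# dataset path (the path value here is an illustrative assumption).
python benchmark.py \
  --system embucket \
  --benchmark-type clickbench \
  --cold-runs \
  --runs 5 \
  --dataset-path clickbench/hits
```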

0 commit comments
