---
#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
#
title: "StarRocks and Apache Polaris Integration: Building a Unified, High-Performance Data Lakehouse"
date: 2025-10-21
author: Wayne Yang
---

## Introduction: Why StarRocks + Apache Polaris?

### Modern Data-Architecture Pain Points: Silos & Engine Lock-in

Today’s data-driven enterprises face two chronic ailments:

* **Data silos**: transactional data sits in an RDBMS, click-streams in an S3 data lake, CRM data in a SaaS vault, etc. Cross-domain analysis requires expensive ETL and still arrives stale.

* **Engine lock-in**: every OLAP engine optimises for its own metadata layer and file layout. Migrating to a faster or cheaper engine means re-formatting, re-writing and re-governing years of data.

An open, interoperable architecture is no longer a luxury; it is a survival requirement.

### StarRocks at a Glance: Speed & Simplicity

[StarRocks](https://www.starrocks.io/) is a modern MPP database designed for high-performance analytics and real-time data processing. Key features include:

* MPP architecture with fully-vectorised execution and a cost-based optimizer (CBO) that thrives on complex multi-table joins.

* Sub-second response on TB-scale data without pre-aggregation.

* Compute-storage separation since v3.0: scale stateless compute pods in seconds and keep data in cheap object storage.

### Apache Polaris (Incubating): Vendor-Neutral Iceberg Catalog

* 100% open-source implementation of the Iceberg REST Catalog API.

* Pluggable metadata backend (PostgreSQL, in-memory) and multi-cloud storage support.

* Fine-grained RBAC plus credential vending (temporary STS tokens) for secure, governed sharing across engines.

### Technical Benefits

* Keep **ONE** copy of data in Iceberg on S3, queryable by Spark, Flink, Trino, and StarRocks concurrently.

* Maintain **ONE** set of role-based permissions in Polaris that apply to every engine.

* Use **StarRocks** to deliver BI dashboards, run ad-hoc exploration, and perform lightweight ETL on the same up-to-date dataset, with zero data movement and zero vendor lock-in.

## Architecture

![](/img/blog/2025/10/21/fig1-polaris-starrocks-architecture.png)

**Polaris** acts as the single source of truth for:

* Table schema, partitions, snapshots

* Role-based access control (RBAC)

* Short-lived, scoped cloud credentials (credential vending)

**StarRocks** acts as a stateless compute layer that:

* Discovers Iceberg metadata via REST calls

* Reads Parquet/ORC files directly from cloud storage using the vended credentials

* Applies its own CBO and vectorised execution for query acceleration
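
To make the REST handshake concrete, here is a minimal sketch of the two calls an engine like StarRocks performs against Polaris. It assumes a local deployment on port 8181, the default `root`/`s3cr3t` credentials, and `jq` installed; adjust these to your own deployment. The client first exchanges the principal's credentials for a bearer token, then fetches the catalog configuration, exactly as any Iceberg REST client does at startup:

```bash
# Exchange client credentials for a short-lived bearer token.
# Host, port, and credentials below are assumptions; adjust to your deployment.
TOKEN=$(curl -s -X POST "http://localhost:8181/api/catalog/v1/oauth/tokens" \
  -d "grant_type=client_credentials" \
  -d "client_id=root" \
  -d "client_secret=s3cr3t" \
  -d "scope=PRINCIPAL_ROLE:ALL" | jq -r '.access_token')

# Fetch the catalog configuration, as an Iceberg REST client does on startup.
curl -s -H "Authorization: Bearer ${TOKEN}" \
  "http://localhost:8181/api/catalog/v1/config?warehouse=polaris_catalog"
```
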
## Deploy and Configure Polaris

You can refer to the [Polaris Quickstart](https://polaris.apache.org/releases/1.1.0/getting-started/) to deploy Polaris.
Here we will build from source and run Polaris as a standalone process.

### Download Source Code and Start Polaris

1. Download the source code of a released version

You can find the latest released version at https://github.com/apache/polaris/releases
```bash
# download the latest released version 1.1.0-incubating
wget https://dlcdn.apache.org/incubator/polaris/1.1.0-incubating/apache-polaris-1.1.0-incubating.tar.gz
tar -xvzf apache-polaris-1.1.0-incubating.tar.gz
cd apache-polaris-1.1.0-incubating
```

2. Build Polaris
```bash
./gradlew \
  :polaris-server:assemble \
  :polaris-server:quarkusAppPartsBuild --rerun \
  :polaris-admin:assemble \
  :polaris-admin:quarkusAppPartsBuild --rerun
```

3. Run Polaris

Ensure you have Java 21+, and export your AWS access key and secret key first.
```bash
export AWS_ACCESS_KEY_ID=<access_key>
export AWS_SECRET_ACCESS_KEY=<secret_key>

./gradlew run
```

When Polaris is run using the `./gradlew run` command, the root principal credentials are **root** and **s3cr3t** for the **CLIENT_ID** and **CLIENT_SECRET**, respectively.

A Gradle-launched Polaris instance stores entities only **in-memory**, which means that any entities you define will be destroyed when Polaris is shut down.

We suggest referring to the [Metastores documentation](https://polaris.apache.org/releases/1.1.0/metastores/) to configure a metastore that persists Polaris entities.

```bash
export POLARIS_PERSISTENCE_TYPE=relational-jdbc
export QUARKUS_DATASOURCE_USERNAME=<your-username>
export QUARKUS_DATASOURCE_PASSWORD=<your-password>
export QUARKUS_DATASOURCE_JDBC_URL=<jdbc-url-of-postgres>

export AWS_ACCESS_KEY_ID=<access_key>
export AWS_SECRET_ACCESS_KEY=<secret_key>

./gradlew run
```

Use the [Admin Tool](https://polaris.apache.org/releases/1.1.0/admin-tool/) to bootstrap realms and create the necessary principal credentials for the Polaris server.

For example, to bootstrap the POLARIS realm and create its root principal credential with the client ID **root** and client secret **root_secret**, you can run the following command:

```bash
java -jar runtime/admin/build/quarkus-app/quarkus-run.jar bootstrap -r POLARIS -c POLARIS,root,root_secret
```

### Creating a Principal and Assigning it Privileges

Use the **Polaris CLI** (already built in the same folder):

1. Export CLIENT_ID and CLIENT_SECRET
```bash
export CLIENT_ID=root
export CLIENT_SECRET=root_secret
```

2. Create catalog
```bash
./polaris \
  --client-id ${CLIENT_ID} \
  --client-secret ${CLIENT_SECRET} \
  catalogs create \
  --storage-type s3 \
  --default-base-location ${DEFAULT_BASE_LOCATION} \
  --role-arn ${ROLE_ARN} \
  polaris_catalog
```

The DEFAULT_BASE_LOCATION you provide will be the default location where objects in this catalog are stored, and the ROLE_ARN you provide should be a Role ARN with access to read and write data in that location.
These credentials will be provided to engines reading data from the catalog once they have authenticated with Polaris using credentials that have access to those resources.
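
As a purely illustrative example, the two variables might look like the following; the bucket name, account ID, and role name are placeholders, not values from this setup:

```bash
# Hypothetical values -- substitute your own bucket and IAM role.
export DEFAULT_BASE_LOCATION=s3://my-lakehouse-bucket/polaris/
export ROLE_ARN=arn:aws:iam::123456789012:role/polaris-storage-role
```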

3. Create a principal, a principal role, and a catalog role

Use the commands below to create a principal, a principal role, and a catalog role:
```bash
./polaris \
  --client-id ${CLIENT_ID} \
  --client-secret ${CLIENT_SECRET} \
  principals \
  create \
  jack

./polaris \
  --client-id ${CLIENT_ID} \
  --client-secret ${CLIENT_SECRET} \
  principal-roles \
  create \
  test_user_role

./polaris \
  --client-id ${CLIENT_ID} \
  --client-secret ${CLIENT_SECRET} \
  catalog-roles \
  create \
  --catalog polaris_catalog \
  test_catalog_role
```

When the **principals create** command succeeds, it returns the credentials for the new principal; save them.
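
For convenience in the following steps, you can keep the returned values in environment variables; the variable names below are just an illustration:

```bash
# Hypothetical helper: keep jack's returned credentials at hand for later use
# (e.g. when configuring the StarRocks catalog below).
export JACK_CLIENT_ID=<client-id-from-output>
export JACK_CLIENT_SECRET=<client-secret-from-output>
```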

Use the commands below to grant privileges:

```bash
./polaris \
  --client-id ${CLIENT_ID} \
  --client-secret ${CLIENT_SECRET} \
  principal-roles \
  grant \
  --principal jack \
  test_user_role

./polaris \
  --client-id ${CLIENT_ID} \
  --client-secret ${CLIENT_SECRET} \
  catalog-roles \
  grant \
  --catalog polaris_catalog \
  --principal-role test_user_role \
  test_catalog_role

./polaris \
  --client-id ${CLIENT_ID} \
  --client-secret ${CLIENT_SECRET} \
  privileges \
  catalog \
  grant \
  --catalog polaris_catalog \
  --catalog-role test_catalog_role \
  CATALOG_MANAGE_CONTENT
```

We assign the principal role `test_user_role` to principal `jack`, assign the catalog role `test_catalog_role` to principal role `test_user_role`, and grant the **CATALOG_MANAGE_CONTENT** privilege to the catalog role `test_catalog_role`.
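
To sanity-check the setup, the CLI can list the privileges a catalog role holds. A quick check along the following lines (the subcommand and flags follow the 1.1.0 CLI documentation; verify against your build) should show CATALOG_MANAGE_CONTENT:

```bash
# List the privileges held by test_catalog_role on polaris_catalog.
./polaris \
  --client-id ${CLIENT_ID} \
  --client-secret ${CLIENT_SECRET} \
  privileges \
  list \
  --catalog polaris_catalog \
  --catalog-role test_catalog_role
```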

## Configure StarRocks Iceberg Catalog

First, you need to have a StarRocks cluster up and running. Please refer to the [StarRocks Quick Start Guide](https://docs.starrocks.io/docs/quick_start) for instructions on setting up a StarRocks cluster.
Then you can create an external Iceberg catalog in StarRocks that connects to Polaris.

### Create External Catalog
1. Use credential vending

It's recommended to use Polaris's credential vending feature, which enhances security by avoiding long-lived static credentials.

Here is an example of creating an external catalog in StarRocks that connects to Polaris using credential vending:

```sql
CREATE EXTERNAL CATALOG polaris_catalog
PROPERTIES (
    "iceberg.catalog.uri" = "http://<POLARIS_HOST>:<POLARIS_PORT>/api/catalog",
    "type" = "iceberg",
    "iceberg.catalog.type" = "rest",
    "iceberg.catalog.warehouse" = "polaris_catalog",
    "iceberg.catalog.security" = "oauth2",
    "iceberg.catalog.oauth2.credential" = "<jack_client_id>:<jack_client_secret>",
    "iceberg.catalog.oauth2.scope" = "PRINCIPAL_ROLE:ALL",
    "aws.s3.region" = "us-west-2",
    "iceberg.catalog.vended-credentials-enabled" = "true"
);
```
We use jack's credentials (client ID and client secret) created above to access polaris_catalog.

Users will have the permissions of user jack when accessing Iceberg tables in polaris_catalog.

2. Use S3 storage credentials

If you prefer to use static S3 credentials instead of credential vending, you can create the external catalog in StarRocks as follows:
```sql
CREATE EXTERNAL CATALOG polaris_catalog
PROPERTIES (
    "iceberg.catalog.uri" = "http://<POLARIS_HOST>:<POLARIS_PORT>/api/catalog",
    "type" = "iceberg",
    "iceberg.catalog.type" = "rest",
    "iceberg.catalog.warehouse" = "polaris_catalog",
    "iceberg.catalog.security" = "oauth2",
    "iceberg.catalog.oauth2.credential" = "<jack_client_id>:<jack_client_secret>",
    "iceberg.catalog.oauth2.scope" = "PRINCIPAL_ROLE:ALL",
    "aws.s3.region" = "us-west-2",
    "aws.s3.access_key" = "<access_key>",
    "aws.s3.secret_key" = "<secret_key>",
    "iceberg.catalog.vended-credentials-enabled" = "false"
);
```
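Either way, a quick way to confirm the catalog is wired up is to list it and its databases from any MySQL-protocol client. This is a hypothetical check, not part of the original walkthrough; the host is a placeholder and 9030 is the default FE query port:

```bash
# Connect to the StarRocks FE over the MySQL protocol (default query port 9030)
# and confirm the external catalog is visible.
mysql -h <starrocks_fe_host> -P 9030 -u root -e "
  SHOW CATALOGS;
  SHOW DATABASES FROM polaris_catalog;
"
```
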
### Manage Iceberg Tables through StarRocks
Connect to StarRocks and run the following commands to create and query an Iceberg table through the Polaris catalog:

```sql
-- switch to external iceberg catalog
StarRocks>set catalog polaris_catalog;
Query OK, 0 rows affected (0.00 sec)

-- create database
StarRocks>create database polaris_db;
Query OK, 0 rows affected (0.06 sec)

StarRocks>use polaris_db;
Database changed

-- create iceberg table taxis
StarRocks>CREATE TABLE taxis
    -> (
    ->     trip_id bigint,
    ->     trip_distance float,
    ->     fare_amount double,
    ->     store_and_fwd_flag string,
    ->     vendor_id bigint
    -> )
    -> PARTITION BY (vendor_id);
Query OK, 0 rows affected

-- insert data
StarRocks>INSERT INTO taxis
    -> VALUES (1000371, 1.8, 15.32, 'N', 1), (1000372, 2.5, 22.15, 'N', 2), (1000373, 0.9, 9.01, 'N', 2), (1000374, 8.4, 42.13, 'Y', 1);
Query OK, 4 rows affected

-- query iceberg table
StarRocks>select * from taxis;
+---------+---------------+-------------+--------------------+-----------+
| trip_id | trip_distance | fare_amount | store_and_fwd_flag | vendor_id |
+---------+---------------+-------------+--------------------+-----------+
| 1000372 |           2.5 |       22.15 | N                  |         2 |
| 1000373 |           0.9 |        9.01 | N                  |         2 |
| 1000371 |           1.8 |       15.32 | N                  |         1 |
| 1000374 |           8.4 |       42.13 | Y                  |         1 |
+---------+---------------+-------------+--------------------+-----------+
4 rows in set
```
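
Because Polaris is engine-neutral, the table just created by StarRocks is immediately visible to other engines through the same REST endpoint. As an illustrative sketch only (the Iceberg runtime version, host, and credentials below are assumptions, not tested values from this setup), a Spark session could read it like this:

```bash
# Hypothetical: read the same Iceberg table from Spark via the same Polaris
# REST endpoint -- no data copy, no re-registration.
spark-sql \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.9.0 \
  --conf spark.sql.catalog.polaris=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.polaris.type=rest \
  --conf spark.sql.catalog.polaris.uri=http://<POLARIS_HOST>:<POLARIS_PORT>/api/catalog \
  --conf spark.sql.catalog.polaris.warehouse=polaris_catalog \
  --conf spark.sql.catalog.polaris.credential=<jack_client_id>:<jack_client_secret> \
  --conf spark.sql.catalog.polaris.scope=PRINCIPAL_ROLE:ALL \
  --conf spark.sql.catalog.polaris.header.X-Iceberg-Access-Delegation=vended-credentials \
  -e "SELECT count(*) FROM polaris.polaris_db.taxis"
```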

For more information about using Iceberg tables with StarRocks, please refer to the [Iceberg catalog documentation](https://docs.starrocks.io/docs/data_source/catalog/iceberg/iceberg_catalog).