|
| 1 | +--- |
| 2 | +# |
| 3 | +# Licensed to the Apache Software Foundation (ASF) under one |
| 4 | +# or more contributor license agreements. See the NOTICE file |
| 5 | +# distributed with this work for additional information |
| 6 | +# regarding copyright ownership. The ASF licenses this file |
| 7 | +# to you under the Apache License, Version 2.0 (the |
| 8 | +# "License"); you may not use this file except in compliance |
| 9 | +# with the License. You may obtain a copy of the License at |
| 10 | +# |
| 11 | +# http://www.apache.org/licenses/LICENSE-2.0 |
| 12 | +# |
| 13 | +# Unless required by applicable law or agreed to in writing, |
| 14 | +# software distributed under the License is distributed on an |
| 15 | +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| 16 | +# KIND, either express or implied. See the License for the |
| 17 | +# specific language governing permissions and limitations |
| 18 | +# under the License. |
| 19 | +# |
| 20 | +title: "StarRocks and Apache Polaris Integration: Building a Unified, High-Performance Data Lakehouse" |
| 21 | +date: 2025-10-21 |
| 22 | +author: Wayne Yang |
| 23 | +--- |
| 24 | +## Introduction: Why StarRocks + Apache Polaris? |
| 25 | + |
| 26 | +### Modern Data-Architecture Pain Points: Silos & Engine Lock-in |
| 27 | +Today’s data-driven enterprises face two chronic ailments: |
| 28 | + |
| 29 | +* **Data silos**: transactional data sits in an RDBMS, clickstreams in an S3 data lake, CRM data in a SaaS vault, and so on. Cross-domain analysis requires expensive ETL, and the results still arrive stale. |
| 30 | + |
| 31 | +* **Engine lock-in**: every OLAP engine optimises for its own metadata layer and file layout. Migrating to a faster or cheaper engine means re-formatting, re-writing and re-governing years of data. |
| 32 | + |
| 33 | +An open, interoperable architecture is no longer a luxury; it is a survival requirement. |
| 34 | + |
| 35 | +### StarRocks at a Glance: Speed & Simplicity |
| 36 | +[StarRocks](https://www.starrocks.io/) is a modern MPP database designed for high-performance analytics and real-time data processing. Key features include: |
| 37 | + |
| 38 | +* MPP architecture with fully vectorised execution and a cost-based optimizer (CBO) that handles complex multi-table joins well. |
| 39 | + |
| 40 | +* Sub-second response on TB-scale data without pre-aggregation. |
| 41 | + |
| 42 | +* Compute-storage separation since v3.0: scale stateless compute pods in seconds, keep data in cheap object storage. |
| 43 | + |
| 44 | +### Apache Polaris (Incubating): Vendor-Neutral Iceberg Catalog |
| 45 | +* 100% open-source implementation of the Iceberg REST Catalog API. |
| 46 | + |
| 47 | +* Pluggable metadata backend (PostgreSQL, in-memory) and multi-cloud storage support. |
| 48 | + |
| 49 | +* Fine-grained RBAC + credential-vending (temporary STS tokens) for secure, governed sharing across engines. |
| 50 | + |
| 51 | +### Technical Benefits |
| 52 | +* Keep **ONE** copy of data in Iceberg on S3, queryable by Spark, Flink, Trino, and StarRocks concurrently. |
| 53 | + |
| 54 | +* Maintain **ONE** set of role-based permissions in Polaris that apply to every engine. |
| 55 | + |
| 56 | +* Use **StarRocks** to deliver BI dashboards, run ad-hoc exploration, and perform lightweight ETL on the same up-to-date dataset, with zero data movement and zero vendor lock-in. |
| 57 | + |
| 58 | +## Architecture |
| 59 | + |
| 60 | + |
| 61 | +**Polaris** acts as the single source of truth for: |
| 62 | + |
| 63 | +* Table schema, partitions, snapshots |
| 64 | + |
| 65 | +* Role-based access control (RBAC) |
| 66 | + |
| 67 | +* Short-lived, scoped cloud credentials (credential vending) |
| 68 | + |
| 69 | +**StarRocks** acts as a stateless compute layer: |
| 70 | + |
| 71 | +* Discovers Iceberg metadata via REST calls |
| 72 | + |
| 73 | +* Reads Parquet/ORC files directly from cloud storage using the vended credentials |
| 74 | + |
| 75 | +* Applies its own CBO and vectorised execution for query acceleration |
| 76 | + |
| 77 | +## Deploy and Configure Polaris |
| 78 | + |
| 79 | +You can follow the [Polaris Quickstart](https://polaris.apache.org/releases/1.1.0/getting-started/) to deploy Polaris. |
| 80 | +Here, we build Polaris from source and run it as a standalone process. |
| 81 | + |
| 82 | +### Clone Source Code and Start Polaris |
| 83 | + |
| 84 | +1. Download the source code of a released version |
| 85 | + |
| 86 | +See https://github.com/apache/polaris/releases for the latest released version. |
| 87 | +```bash |
| 88 | +# download the latest released version 1.1.0-incubating |
| 89 | +wget https://dlcdn.apache.org/incubator/polaris/1.1.0-incubating/apache-polaris-1.1.0-incubating.tar.gz |
| 90 | +tar -xvzf apache-polaris-1.1.0-incubating.tar.gz |
| 91 | +cd apache-polaris-1.1.0-incubating |
| 92 | +``` |
| 93 | + |
| 94 | +2. Build Polaris |
| 95 | +```bash |
| 96 | +./gradlew \ |
| 97 | + :polaris-server:assemble \ |
| 98 | + :polaris-server:quarkusAppPartsBuild --rerun \ |
| 99 | + :polaris-admin:assemble \ |
| 100 | + :polaris-admin:quarkusAppPartsBuild --rerun |
| 101 | +``` |
| 102 | + |
| 103 | +3. Run Polaris |
| 104 | + |
| 105 | +Ensure you have Java 21 or later installed, and export your AWS access key and secret key first. |
| 106 | +```bash |
| 107 | +export AWS_ACCESS_KEY_ID=<access_key> |
| 108 | +export AWS_SECRET_ACCESS_KEY=<secret_key> |
| 109 | + |
| 110 | +./gradlew run |
| 111 | +``` |
| 112 | + |
| 113 | +When Polaris is started with the `./gradlew run` command, the root principal credentials are **root** and **s3cr3t** for the **CLIENT_ID** and **CLIENT_SECRET**, respectively. |
| 114 | + |
| 115 | +A Gradle-launched Polaris instance stores entities **in memory** only, so any entities you define are lost when Polaris shuts down. |
| 116 | + |
| 117 | +To persist Polaris entities, configure a metastore as described in the [metastores documentation](https://polaris.apache.org/releases/1.1.0/metastores/), for example: |
| 118 | + |
| 119 | +```bash |
| 120 | +export POLARIS_PERSISTENCE_TYPE=relational-jdbc |
| 121 | +export QUARKUS_DATASOURCE_USERNAME=<your-username> |
| 122 | +export QUARKUS_DATASOURCE_PASSWORD=<your-password> |
| 123 | +export QUARKUS_DATASOURCE_JDBC_URL=<jdbc-url-of-postgres> |
| 124 | + |
| 125 | +export AWS_ACCESS_KEY_ID=<access_key> |
| 126 | +export AWS_SECRET_ACCESS_KEY=<secret_key> |
| 127 | + |
| 128 | +./gradlew run |
| 129 | +``` |
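
The `<jdbc-url-of-postgres>` placeholder above follows the standard PostgreSQL JDBC URL format. A minimal sketch, assuming a PostgreSQL instance on `localhost:5432` with a database named `polaris`; these values and the `pg_jdbc_url` helper are hypothetical and only illustrate the shape of the URL:

```shell
# Hypothetical helper: build a PostgreSQL JDBC URL from host, port, and database.
pg_jdbc_url() {
  printf 'jdbc:postgresql://%s:%s/%s' "$1" "$2" "$3"
}

# Assumed example values; replace them with your own PostgreSQL endpoint.
QUARKUS_DATASOURCE_JDBC_URL="$(pg_jdbc_url localhost 5432 polaris)"
export QUARKUS_DATASOURCE_JDBC_URL
echo "${QUARKUS_DATASOURCE_JDBC_URL}"
```
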
| 130 | + |
| 131 | +Use the [**Admin Tool**](https://polaris.apache.org/releases/1.1.0/admin-tool/) to bootstrap realms and create the necessary principal credentials for the Polaris server. |
| 132 | + |
| 133 | +For example, to bootstrap the POLARIS realm and create its root principal credential with the client ID **root** and client secret **root_secret**, you can run the following command: |
| 134 | + |
| 135 | +```bash |
| 136 | +java -jar runtime/admin/build/quarkus-app/quarkus-run.jar bootstrap -r POLARIS -c POLARIS,root,root_secret |
| 137 | +``` |
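
Once the server is running, one way to confirm that the root credentials work is to request an OAuth2 token from Polaris. The sketch below assumes Polaris listens on `http://localhost:8181` and serves the token endpoint at `/api/catalog/v1/oauth/tokens`; adjust both if your deployment differs. The function names are hypothetical helpers, and the body is not URL-encoded, so it only suits simple IDs and secrets.

```shell
# Assumed base URL of the Polaris server; override via POLARIS_URL if needed.
POLARIS_URL="${POLARIS_URL:-http://localhost:8181}"

# Build the form-encoded body of an OAuth2 client-credentials request.
# Note: values are inserted as-is (no URL encoding).
token_request_body() {
  printf 'grant_type=client_credentials&client_id=%s&client_secret=%s&scope=PRINCIPAL_ROLE:ALL' \
    "$1" "$2"
}

# Send the request and print the raw JSON response from Polaris.
request_token() {
  curl -sS -X POST "${POLARIS_URL}/api/catalog/v1/oauth/tokens" \
    -d "$(token_request_body "$1" "$2")"
}
```

For the realm bootstrapped above, `request_token root root_secret` should return a JSON document containing an access token if the credentials are valid.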
| 138 | + |
| 139 | +### Creating a Principal and Assigning it Privileges |
| 140 | + |
| 141 | +Use the **Polaris CLI** (already built in the same folder): |
| 142 | + |
| 143 | +1. Export CLIENT_ID and CLIENT_SECRET |
| 144 | +```bash |
| 145 | +export CLIENT_ID=root |
| 146 | +export CLIENT_SECRET=root_secret |
| 147 | +``` |
| 148 | + |
| 149 | +2. Create catalog |
| 150 | +```bash |
| 151 | +./polaris \ |
| 152 | + --client-id ${CLIENT_ID} \ |
| 153 | + --client-secret ${CLIENT_SECRET} \ |
| 154 | + catalogs create \ |
| 155 | + --storage-type s3 \ |
| 156 | + --default-base-location ${DEFAULT_BASE_LOCATION} \ |
| 157 | + --role-arn ${ROLE_ARN} \ |
| 158 | + polaris_catalog |
| 159 | +``` |
| 160 | + |
| 161 | +`DEFAULT_BASE_LOCATION` is the default location where objects in this catalog are stored, and `ROLE_ARN` should be a role ARN with read and write access to that location. |
| 162 | +Polaris provides credentials for this role to engines reading data from the catalog, once they have authenticated with Polaris as a principal that has access to those resources. |
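
The catalog command expects `DEFAULT_BASE_LOCATION` and `ROLE_ARN` to be set in the shell. A minimal sketch of setting and sanity-checking them follows; the bucket name, account ID, role name, and the `check_catalog_env` helper are all hypothetical, and the check only looks at the general shape of the values.

```shell
# Hypothetical example values; replace them with your own bucket and IAM role.
export DEFAULT_BASE_LOCATION="s3://my-polaris-bucket/warehouse/"
export ROLE_ARN="arn:aws:iam::123456789012:role/polaris-catalog-access"

# Hypothetical helper: fail early if either variable is missing or malformed.
check_catalog_env() {
  case "${DEFAULT_BASE_LOCATION:-}" in
    s3://?*) ;;
    *) echo "DEFAULT_BASE_LOCATION must start with s3://" >&2; return 1 ;;
  esac
  case "${ROLE_ARN:-}" in
    arn:aws*:iam::*:role/?*) ;;
    *) echo "ROLE_ARN must look like arn:aws:iam::<account-id>:role/<name>" >&2; return 1 ;;
  esac
}

check_catalog_env && echo "catalog environment looks valid"
```
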
| 163 | + |
| 164 | +3. Create a principal and assign it privileges |
| 165 | + |
| 166 | +Use the commands below to create a principal, a principal role, and a catalog role: |
| 167 | +```bash |
| 168 | +./polaris \ |
| 169 | + --client-id ${CLIENT_ID} \ |
| 170 | + --client-secret ${CLIENT_SECRET} \ |
| 171 | + principals \ |
| 172 | + create \ |
| 173 | + jack |
| 174 | + |
| 175 | +./polaris \ |
| 176 | + --client-id ${CLIENT_ID} \ |
| 177 | + --client-secret ${CLIENT_SECRET} \ |
| 178 | + principal-roles \ |
| 179 | + create \ |
| 180 | + test_user_role |
| 181 | + |
| 182 | +./polaris \ |
| 183 | + --client-id ${CLIENT_ID} \ |
| 184 | + --client-secret ${CLIENT_SECRET} \ |
| 185 | + catalog-roles \ |
| 186 | + create \ |
| 187 | + --catalog polaris_catalog \ |
| 188 | + test_catalog_role |
| 189 | +``` |
| 190 | + |
| 191 | +When the **principals create** command succeeds, it returns the credentials for the new principal. Save them, because they are needed later to connect StarRocks to Polaris. |
| 192 | + |
| 193 | +Use the commands below to grant privileges: |
| 194 | + |
| 195 | +```bash |
| 196 | +./polaris \ |
| 197 | + --client-id ${CLIENT_ID} \ |
| 198 | + --client-secret ${CLIENT_SECRET} \ |
| 199 | + principal-roles \ |
| 200 | + grant \ |
| 201 | + --principal jack \ |
| 202 | + test_user_role |
| 203 | + |
| 204 | +./polaris \ |
| 205 | + --client-id ${CLIENT_ID} \ |
| 206 | + --client-secret ${CLIENT_SECRET} \ |
| 207 | + catalog-roles \ |
| 208 | + grant \ |
| 209 | + --catalog polaris_catalog \ |
| 210 | + --principal-role test_user_role \ |
| 211 | + test_catalog_role |
| 212 | + |
| 213 | +./polaris \ |
| 214 | + --client-id ${CLIENT_ID} \ |
| 215 | + --client-secret ${CLIENT_SECRET} \ |
| 216 | + privileges \ |
| 217 | + catalog \ |
| 218 | + grant \ |
| 219 | + --catalog polaris_catalog \ |
| 220 | + --catalog-role test_catalog_role \ |
| 221 | + CATALOG_MANAGE_CONTENT |
| 222 | +``` |
| 223 | + |
| 224 | +These commands assign the principal role `test_user_role` to the principal `jack`, assign the catalog role `test_catalog_role` to that principal role, and grant the **CATALOG_MANAGE_CONTENT** privilege to the catalog role. |
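
Every command in this section repeats the same two credential flags. One possible way to cut the repetition is a small wrapper function. `polaris_admin` and the `POLARIS_CLI` variable are hypothetical names introduced here for illustration; they are not part of the Polaris CLI.

```shell
# Hypothetical helper: forward all arguments to the Polaris CLI, prepending
# the admin credentials taken from CLIENT_ID and CLIENT_SECRET.
# POLARIS_CLI is an assumed override for the CLI path (defaults to ./polaris).
polaris_admin() {
  "${POLARIS_CLI:-./polaris}" \
    --client-id "${CLIENT_ID}" \
    --client-secret "${CLIENT_SECRET}" \
    "$@"
}
```

With this in place, creating the principal shortens to `polaris_admin principals create jack`, and the other commands shorten in the same way.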
| 225 | + |
| 226 | +## Configure StarRocks Iceberg Catalog |
| 227 | + |
| 228 | +First, you need to have a StarRocks cluster up and running. Please refer to the [StarRocks Quick Start Guide](https://docs.starrocks.io/docs/quick_start) for instructions on setting up a StarRocks cluster. |
| 229 | +Then you can create an external Iceberg catalog in StarRocks that connects to Polaris. |
| 230 | + |
| 231 | +### Create External Catalog |
| 232 | +1. Use credential vending |
| 233 | + |
| 234 | +It's recommended to use Polaris's credential vending feature to enhance security by avoiding long-lived static credentials. |
| 235 | + |
| 236 | +Here is an example of creating an external catalog in StarRocks that connects to Polaris using credential vending: |
| 237 | + |
| 238 | +```sql |
| 239 | +CREATE EXTERNAL CATALOG polaris_catalog |
| 240 | +PROPERTIES ( |
| 241 | + "iceberg.catalog.uri" = "http://<POLARIS_HOST>:<POLARIS_PORT>/api/catalog", |
| 242 | + "type" = "iceberg", |
| 243 | + "iceberg.catalog.type" = "rest", |
| 244 | + "iceberg.catalog.warehouse" = "polaris_catalog", |
| 245 | + "iceberg.catalog.security" = "oauth2", |
| 246 | + "iceberg.catalog.oauth2.credential" = "<jack_client_id>:<jack_client_secret>", |
| 247 | + "iceberg.catalog.oauth2.scope" = "PRINCIPAL_ROLE:ALL", |
| 248 | + "aws.s3.region" = "us-west-2", |
| 249 | + "iceberg.catalog.vended-credentials-enabled" = "true" |
| 250 | + ); |
| 251 | +``` |
| 252 | +This uses the credentials (client ID and client secret) of the principal `jack` created above to access polaris_catalog. |
| 253 | + |
| 254 | +StarRocks users therefore have the permissions of `jack` when accessing Iceberg tables in polaris_catalog. |
| 255 | + |
| 256 | +2. Use S3 storage credentials |
| 257 | + |
| 258 | +If you prefer to use static S3 credentials instead of credential vending, you can create the external catalog in StarRocks as follows: |
| 259 | +```sql |
| 260 | +CREATE EXTERNAL CATALOG polaris_catalog |
| 261 | +PROPERTIES ( |
| 262 | + "iceberg.catalog.uri" = "http://<POLARIS_HOST>:<POLARIS_PORT>/api/catalog", |
| 263 | + "type" = "iceberg", |
| 264 | + "iceberg.catalog.type" = "rest", |
| 265 | + "iceberg.catalog.warehouse" = "polaris_catalog", |
| 266 | + "iceberg.catalog.security" = "oauth2", |
| 267 | + "iceberg.catalog.oauth2.credential" = "<jack_client_id>:<jack_client_secret>", |
| 268 | + "iceberg.catalog.oauth2.scope" = "PRINCIPAL_ROLE:ALL", |
| 269 | + "aws.s3.region" = "us-west-2", |
| 270 | + "aws.s3.access_key" = "<access_key>", |
| 271 | + "aws.s3.secret_key" = "<secret_key>", |
| 272 | + "iceberg.catalog.vended-credentials-enabled" = "false" |
| 273 | + ); |
| 274 | +``` |
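
After creating the catalog with either option, a quick check from the command line can confirm that StarRocks sees it and can list its databases through Polaris. This sketch assumes the `mysql` client is installed and that the StarRocks FE accepts MySQL-protocol connections on `127.0.0.1:9030` as user `root` without a password; the helper names are hypothetical.

```shell
# Assumed connection settings for the StarRocks FE; override as needed.
STARROCKS_HOST="${STARROCKS_HOST:-127.0.0.1}"
STARROCKS_PORT="${STARROCKS_PORT:-9030}"
STARROCKS_USER="${STARROCKS_USER:-root}"

# Build the SQL that lists all catalogs and the databases of one catalog.
catalog_check_sql() {
  printf 'SHOW CATALOGS; SHOW DATABASES FROM %s;' "$1"
}

# Run the check against StarRocks with the mysql client.
check_catalog() {
  mysql -h "${STARROCKS_HOST}" -P "${STARROCKS_PORT}" -u "${STARROCKS_USER}" \
    -e "$(catalog_check_sql "$1")"
}
```

Running `check_catalog polaris_catalog` should list `polaris_catalog` among the catalogs; a failure at the second statement usually points to a connection or credential problem between StarRocks and Polaris.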
| 275 | + |
| 276 | +### Manage Iceberg Tables through StarRocks |
| 277 | +Connect to StarRocks and run the following commands to create and query an Iceberg table through the Polaris catalog: |
| 278 | + |
| 279 | +```sql |
| 280 | +-- switch to external iceberg catalog |
| 281 | +StarRocks>set catalog polaris_catalog; |
| 282 | +Query OK, 0 rows affected (0.00 sec) |
| 283 | + |
| 284 | +-- create database |
| 285 | +StarRocks>create database polaris_db; |
| 286 | +Query OK, 0 rows affected (0.06 sec) |
| 287 | + |
| 288 | +StarRocks>use polaris_db; |
| 289 | +Database changed |
| 290 | + |
| 291 | +-- create iceberg table taxis |
| 292 | +StarRocks>CREATE TABLE taxis |
| 293 | + -> ( |
| 294 | + -> trip_id bigint, |
| 295 | + -> trip_distance float, |
| 296 | + -> fare_amount double, |
| 297 | + -> store_and_fwd_flag string, |
| 298 | + -> vendor_id bigint |
| 299 | + -> ) |
| 300 | + -> PARTITION BY (vendor_id); |
| 301 | +Query OK, 0 rows affected |
| 302 | + |
| 303 | +-- insert data |
| 304 | +StarRocks>INSERT INTO taxis |
| 305 | + -> VALUES (1000371, 1.8, 15.32, 'N', 1), (1000372, 2.5, 22.15, 'N', 2), (1000373, 0.9, 9.01, 'N', 2), (1000374, 8.4, 42.13, 'Y', 1); |
| 306 | +Query OK, 4 rows affected |
| 307 | + |
| 308 | +-- query iceberg table |
| 309 | +StarRocks>select * from taxis; |
| 310 | ++---------+---------------+-------------+--------------------+-----------+ |
| 311 | +| trip_id | trip_distance | fare_amount | store_and_fwd_flag | vendor_id | |
| 312 | ++---------+---------------+-------------+--------------------+-----------+ |
| 313 | +| 1000372 | 2.5 | 22.15 | N | 2 | |
| 314 | +| 1000373 | 0.9 | 9.01 | N | 2 | |
| 315 | +| 1000371 | 1.8 | 15.32 | N | 1 | |
| 316 | +| 1000374 | 8.4 | 42.13 | Y | 1 | |
| 317 | ++---------+---------------+-------------+--------------------+-----------+ |
| 318 | +4 rows in set |
| 319 | + |
| 320 | +``` |
| 321 | + |
| 322 | +For more information about using Iceberg tables with StarRocks, please refer to: |
| 323 | + |
| 324 | +https://docs.starrocks.io/docs/data_source/catalog/iceberg/iceberg_catalog |