[AMORO-1907] Correct and supplement documentation #1995

Merged · 6 commits · Sep 19, 2023
Changes from 2 commits
62 changes: 45 additions & 17 deletions docs/admin-guides/deployment.md
@@ -23,13 +23,13 @@ You can choose to download the stable release package from [download page](../..

## Download the distribution

All released package can be downaloded from [download page](../../download/).
All released packages can be downloaded from the [download page](../../download/).
You can download amoro-x.y.z-bin.zip (x.y.z is the release number), and you can also download the runtime packages for each engine version according to the engine you are using.
Unzip it to create the amoro-x.y.z directory in the same directory, and then go to the amoro-x.y.z directory.
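
As a rough sketch, the download and unzip steps might look like the following; the mirror URL and the version number are placeholders, so substitute the actual link from the download page.

```shell
# Placeholder version and mirror URL -- use the real link from the download page
AMORO_VERSION=x.y.z
wget "https://example-mirror.org/amoro/amoro-${AMORO_VERSION}-bin.zip"

# Unzip next to the archive and enter the new directory
unzip "amoro-${AMORO_VERSION}-bin.zip"
cd "amoro-${AMORO_VERSION}"
```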

## Source code compilation

You can build based on the master branch without compiling Trino. The compilation method and the directory of results are described below
You can build based on the master branch without compiling Trino. The compilation method and the directory of the build results are described below.

```shell
git clone https://github.com/NetEase/amoro.git
@@ -38,7 +38,7 @@ base_dir=$(pwd)
mvn clean package -DskipTests -pl '!Trino'
cd dist/target/
ls
amoro-x.y.z-bin.zip # AMS release pakcage
amoro-x.y.z-bin.zip # AMS release package
dist-x.y.z-tests.jar
dist-x.y.z.jar
archive-tmp/
@@ -53,14 +53,14 @@ maven-archiver/

cd ${base_dir}/spark/v3.1/spark-runtime/target
ls
amoro-spark-3.1-runtime-0.4.0.jar # Spark v3.1 runtime package)
amoro-spark-3.1-runtime-0.4.0-tests.jar
amoro-spark-3.1-runtime-0.4.0-sources.jar
original-amoro-spark-3.1-runtime-0.4.0.jar
amoro-spark-3.1-runtime-x.y.z.jar # Spark v3.1 runtime package
amoro-spark-3.1-runtime-x.y.z-tests.jar
amoro-spark-3.1-runtime-x.y.z-sources.jar
original-amoro-spark-3.1-runtime-x.y.z.jar
```

If you need to compile the Trino module at the same time, you need to install jdk17 locally and configure `toolchains.xml` in the user's ${user.home}/.m2/ directory, then run mvn
package -P toolchain to compile the entire project.
If you need to compile the Trino module at the same time, you need to install jdk17 locally and configure `toolchains.xml` in the user's `${user.home}/.m2/` directory,
then run `mvn package -P toolchain` to compile the entire project.

```xml
<?xml version="1.0" encoding="UTF-8"?>
@@ -80,14 +80,14 @@ package -P toolchain to compile the entire project.

## Configuration

If you want to use AMS in a production environment, it is recommended to modify `{ARCTIC_HOME}/conf/config.yaml` by referring to the following configuration steps.
If you want to use AMS in a production environment, it is recommended to modify `{AMORO_HOME}/conf/config.yaml` by referring to the following configuration steps.

### Configure the service address

- The `ams.server-bind-host` configuration specifies the host to which AMS is bound. The default value, `0.0.0.0,` indicates binding to all network interfaces.
- The `ams.server-expose-host` configuration specifies the host exposed by AMS that the compute engine and optimizer use to connect to AMS. You can configure a specific IP address on the machine or an IP prefix. When AMS starts up, it will find the first host that matches this prefix.
- The `ams.thrift-server.table-service.bind-port` configuration specifies the binding port of the Thrift Server that provides the table service. The compute engine accesses AMS through this port, and the default value is 1260.
- The `ams.thrift-server.optimizing-service.bind-port` configuration specifies the binding port of the Thrift Server that provides the optimizing service. The optimizers accesses AMS through this port, and the default value is 1261.
- The `ams.server-expose-host` configuration specifies the host exposed by AMS that the computing engines and optimizers use to connect to AMS. You can configure a specific IP address on the machine, or an IP prefix. When AMS starts up, it will find the first host that matches this prefix.
- The `ams.thrift-server.table-service.bind-port` configuration specifies the binding port of the Thrift Server that provides the table service. The computing engines access AMS through this port, and the default value is 1260.
- The `ams.thrift-server.optimizing-service.bind-port` configuration specifies the binding port of the Thrift Server that provides the optimizing service. The optimizers access AMS through this port, and the default value is 1261.
- The `ams.http-server.bind-port` configuration specifies the port to which the HTTP service is bound. The Dashboard and Open API are bound to this port, and the default value is 1630.

```yaml
@@ -106,12 +106,12 @@ ams:
```

{{< hint info >}}
make sure the port is not used before configuring it
Make sure the port is not used before configuring it.
{{< /hint >}}
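
For example, assuming a Linux host where `ss` is available, a quick check that the default ports are still free might look like this:

```shell
# Check whether the default AMS ports are already in use before starting AMS.
# A matching line means the port is occupied by another process.
for port in 1260 1261 1630; do
  if ss -ltn | grep -q ":${port} "; then
    echo "port ${port} is already in use"
  else
    echo "port ${port} appears to be free"
  fi
done
```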

### Configure system database

Users can use MySQL/PostgreSQL as the system database instead of Derby.
You can use MySQL/PostgreSQL as the system database instead of the default Derby.

Create an empty database in MySQL/PostgreSQL; AMS will then automatically create the table structures in this database when it starts for the first time.
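
For example, preparing an empty MySQL database for AMS could look like the sketch below; the database name, user, and password are placeholders rather than values required by AMS.

```shell
# Create an empty database and a dedicated user for AMS (names and password are placeholders)
mysql -h 127.0.0.1 -u root -p <<'EOF'
CREATE DATABASE IF NOT EXISTS amoro DEFAULT CHARACTER SET utf8mb4;
CREATE USER IF NOT EXISTS 'amoro'@'%' IDENTIFIED BY 'amoro-password';
GRANT ALL PRIVILEGES ON amoro.* TO 'amoro'@'%';
FLUSH PRIVILEGES;
EOF
```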

@@ -150,7 +150,7 @@ ams:
zookeeper-address: 127.0.0.1:2181,127.0.0.1:2182,127.0.0.1:2183 # ZooKeeper server address.
```

### Configure containers
### Configure optimizer containers

To scale out the optimizer through AMS, container configuration is required.
If you choose to manually start an external optimizer, no additional container configuration is required. AMS will initialize a container named `external` by default to store all externally started optimizers.
@@ -204,4 +204,32 @@ You can also restart/stop AMS with the following command:

```shell
bin/ams.sh restart/stop
```

## Upgrade AMS

### Upgrade system databases

You can find all the upgrade SQL scripts under `{ARCTIC_HOME}/conf/mysql/` with name pattern `upgrade-a.b.c-to-x.y.z.sql`.
Execute the upgrade SQL scripts one by one to your system database based on your starting and target versions.
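
A minimal sketch of applying the scripts with the MySQL client is shown below; the connection settings and the database name `amoro` are placeholders, and the script name follows the pattern above.

```shell
# Apply each upgrade script in version order against the AMS system database.
# Run every intermediate script if you are upgrading across several versions.
cd ${ARCTIC_HOME}/conf/mysql
mysql -h 127.0.0.1 -u amoro -p amoro < upgrade-a.b.c-to-x.y.z.sql
```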

### Replace all libs and plugins

Replace all contents in the original `{ARCTIC_HOME}/lib` directory with the contents in the lib directory of the new installation package.
Replace all contents in the original `{ARCTIC_HOME}/plugin` directory with the contents in the plugin directory of the new installation package.

{{< hint info >}}
Back up the old content before replacing it, so that you can roll back the upgrade operation if necessary.
{{< /hint >}}
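
An illustrative sketch of the backup-and-replace steps, assuming the default directory layout and an arbitrary backup location:

```shell
# Back up the old directories first so the upgrade can be rolled back if needed
backup_dir=/tmp/amoro-backup-$(date +%Y%m%d)
mkdir -p "${backup_dir}"
cp -r ${ARCTIC_HOME}/lib "${backup_dir}/lib"
cp -r ${ARCTIC_HOME}/plugin "${backup_dir}/plugin"

# Replace the contents with the lib/plugin directories from the new release package
new_package_dir=/path/to/new/amoro-x.y.z   # placeholder path to the unpacked new package
rm -rf ${ARCTIC_HOME}/lib/* ${ARCTIC_HOME}/plugin/*
cp -r "${new_package_dir}"/lib/* ${ARCTIC_HOME}/lib/
cp -r "${new_package_dir}"/plugin/* ${ARCTIC_HOME}/plugin/
```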

### Configure new parameters

The old configuration file `{ARCTIC_HOME}/conf/config.yaml` is usually compatible with the new version, but the new version may introduce new parameters. Compare the old and new configuration files, and configure any newly introduced parameters as needed.

### Restart AMS

Restart AMS with the following commands:
```shell
bin/ams.sh restart
```

2 changes: 1 addition & 1 deletion docs/admin-guides/managing-catalogs.md
@@ -54,7 +54,7 @@ Common properties include:
We recommend users to create a Catalog following the guidelines below:

- If you want to use it in conjunction with HMS, choose `External Catalog` for the `Type` and `Hive Metastore` for the `Metastore`, and choose the table format based on your needs, Mixed-Hive or Iceberg.
- If you want to use Mixed-Iceberg provided by amoro, choose `Internal Catalog` for the `Type` and `Mixed-Iceberg` for the table format.
- If you want to use Mixed-Iceberg provided by Amoro, choose `Internal Catalog` for the `Type` and `Mixed-Iceberg` for the table format.

## Delete catalog
When a user needs to delete a Catalog, they can go to the details page of the Catalog and click the Remove button at the bottom of the page to perform the deletion.
24 changes: 16 additions & 8 deletions docs/admin-guides/managing-optimizers.md
@@ -17,10 +17,10 @@ The optimizer is the execution unit for performing self-optimizing tasks on a table
* Optimizer: The specific unit that performs optimizing tasks, usually with multiple concurrent units.

## Optimizer container
Before using self-optimizing, you need to configure the container information in the configuration file. Opimizer container represents a specific set of runtime environment configuration, and the scheduling scheme of optimizer in that runtime environment. container includes three types: flink, local, and external.
Before using self-optimizing, you need to configure the container information in the configuration file. An optimizer container represents a specific set of runtime environment configurations and the scheduling scheme for optimizers in that runtime environment. Containers include three types: flink, local, and external.

### Local container
Local conatiner is a way to start Optimizer by local process and supports multi-threaded execution of Optimizer tasks. It is recommended to be used only in demo or local deployment scenarios. If the environment variable for jdk is not configured, the user can configure java_home to point to the jdk root directory. If already configured, this configuration item can be ignored.
Local container is a way to start the Optimizer as a local process and supports multi-threaded execution of Optimizer tasks. It is recommended only for demo or local deployment scenarios. If the JDK environment variable is not configured, you can set java_home to point to the JDK root directory; if it is already configured, this configuration item can be ignored.

```yaml
containers:
@@ -42,8 +42,8 @@ in the "export.{env_arg}" property of the container's properties.
with the hadoop compatible package flink-shaded-hadoop-2-uber-x.y.z.jar, you need to download it and copy it to the
FLINK_HOME/lib directory. The flink-shaded-hadoop-2-uber-2.7.5-10.0.jar is generally sufficient and can be downloaded
at: https://repo.maven.apache.org/maven2/org/apache/flink/flink-shaded-hadoop-2-uber/2.7.5-10.0/flink-shaded-hadoop-2-uber-2.7.5-10.0.jar
- HADOOP_CONF_DIR, which holds the configuration files for the hadoop cluster (including hdfs-site.xml, core-site.xml, yarn-site.xml ). If the hadoop cluster has kerberos authentication enabled, you need to prepare an additional krb5.conf and a keytab file for the user to submit tasks
- JVM_ARGS, you can configure flink to run additional configuration parameters, here is an example of configuring krb5.conf, specify the address of krb5.conf to be used by Flink when committing via -Djava.security.krb5.conf=/opt/krb5.conf
- HADOOP_CONF_DIR, which holds the configuration files for the hadoop cluster (including hdfs-site.xml, core-site.xml, yarn-site.xml ). If the hadoop cluster has kerberos authentication enabled, you need to prepare an additional `krb5.conf` and a keytab file for the user to submit tasks
- JVM_ARGS, you can configure flink to run additional configuration parameters, here is an example of configuring krb5.conf, specify the address of krb5.conf to be used by Flink when committing via `-Djava.security.krb5.conf=/opt/krb5.conf`
- HADOOP_USER_NAME, the username used to submit tasks to yarn
- FLINK_CONF_DIR, the directory where flink_conf.yaml is located
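
For instance, the environment described in this list could be prepared as in the sketch below before wiring the values into the container's `export.*` properties; the install paths and the Hadoop user are assumptions for illustration, and the jar URL is the one referenced above.

```shell
# Placeholder locations for the Flink distribution and the Hadoop client configuration
export FLINK_HOME=/opt/flink
export HADOOP_CONF_DIR=/etc/hadoop/conf
export HADOOP_USER_NAME=amoro   # placeholder user for submitting tasks to yarn

# Download the hadoop-compatible shaded jar into FLINK_HOME/lib if your Flink
# release package does not already bundle it
wget -P ${FLINK_HOME}/lib \
  https://repo.maven.apache.org/maven2/org/apache/flink/flink-shaded-hadoop-2-uber/2.7.5-10.0/flink-shaded-hadoop-2-uber-2.7.5-10.0.jar
```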

@@ -87,9 +87,15 @@ The optimizer group supports the following properties:
| Property | Container type | Required | Default | Description |
|---------------------|----------------|----------|---------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| scheduling-policy | All | No | quota | The scheduler group scheduling policy, the default value is `quota`, it will be scheduled according to the quota resources configured for each table, the larger the table quota is, the more optimizer resources it can take. There is also a configuration `balanced` that will balance the scheduling of each table, the longer the table has not been optimized, the higher the scheduling priority will be. |
| flink-conf.* | flink | No | N/A | Any configuration for `flink on yarn` mode, like `flink-conf.taskmanager.memory.process.size` or `flink-conf.jobmanager.memory.process.size`. The value in `conf/flink-conf.yaml` will be used if not setted here. You can find more supported property in [Flink Configuration](https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/) |
| flink-conf.* | flink | No | N/A | Any configuration for `flink on yarn` mode, like `flink-conf.taskmanager.memory.process.size` or `flink-conf.jobmanager.memory.process.size`. The value in `conf/flink-conf.yaml` will be used if not set here. You can find more supported properties in [Flink Configuration](https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/) |
| memory | local | Yes | N/A | The memory size of the local optimizer Java process. |

{{< hint info >}}
To better utilize the resources of Flink Optimizer, it is recommended to add the following configuration to the Flink Optimizer Group:
* Set `flink-conf.taskmanager.memory.managed.size` to `32mb`, since the Flink optimizer has no computation logic and does not need managed memory.
* Set `flink-conf.taskmanager.memory.network.max` to `32mb`, since there is no need for communication between operators in the Flink optimizer.
{{< /hint >}}

### Edit optimizer group

You can click the `edit` button on the `Optimizer Groups` page to modify the configuration of the Optimizer group.
@@ -115,7 +121,7 @@ You can click the `Release` button on the `Optimizer` page to release the optimizer
![release optimizer](../images/admin/optimizer_release.png)

{{< hint info >}}
Currently, only pptimizer scaled through the dashboard can be released on dashboard.
Currently, only optimizers scaled through the dashboard can be released on the dashboard.
{{< /hint >}}

### Deploy external optimizer
@@ -124,8 +130,10 @@ You can submit optimizer in your own Flink task development platform or local Flink

```shell
./bin/flink run-application -t yarn-application \
-Djobmanager.memory.process.size=1024m \
-Dtaskmanager.memory.process.size=2048m \
-Djobmanager.memory.process.size=1024mb \
-Dtaskmanager.memory.process.size=2048mb \
-Dtaskmanager.memory.managed.size=32mb \
-Dtaskmanager.memory.network.max=32mb \
-c com.netease.arctic.optimizer.flink.FlinkOptimizer \
${ARCTIC_HOME}/plugin/optimize/OptimizeJob.jar \
-a 127.0.0.1:1261 \
4 changes: 2 additions & 2 deletions docs/concepts/table-watermark.md
@@ -18,7 +18,7 @@ However, in high-freshness streaming data warehouses, massive small files and frequent
freshness, the greater the impact on performance. To achieve the required performance, users must incur higher costs. Thus, for streaming data
warehouses, data freshness, query performance, and cost form a tripartite paradox.

<img src="../images/concepts/fressness_cost_performance.png" alt="Fressness, cost and performance" width="60%" height="60%">
<img src="../images/concepts/freshness_cost_performance.png" alt="Freshness, cost and performance" width="60%" height="60%">

Amoro offers a resolution to the tripartite paradox for users by utilizing AMS management functionality and a self-optimizing mechanism. Unlike
traditional data warehouses, Lakehouse tables are utilized in a multitude of data pipelines, AI, and BI scenarios. Measuring data freshness is
@@ -58,4 +58,4 @@ greater flexibility:
SHOW TBLPROPERTIES test_db.test_log_store ('watermark.base');
```

You can learn about how to use Watermark in detail by referring to [Managing tables](../managing-tables/).
You can learn about how to use Watermark in detail by referring to [Managing tables](../using-tables/).
12 changes: 6 additions & 6 deletions docs/engines/flink/flink-ddl.md
@@ -189,16 +189,16 @@ Not supported at the moment
| BIGINT | BIGINT |
| FLOAT | FLOAT |
| DOUBLE | DOUBLE |
| DECIAML(p, s) | DECIAML(p, s) |
| DECIMAL(p, s) | DECIMAL(p, s) |
| DATE | DATE |
| TIMESTAMP(6) | TIMESTAMP |
| VARBINARY | BYNARY |
| VARBINARY | BINARY |
| ARRAY<T> | ARRAY<T> |
| MAP<K, V> | MAP<K, V> |
| ROW | STRUCT |


### Mixed-Iceberg daata types
### Mixed-Iceberg data types
| Flink Data Type | Mixed-Iceberg Data Type |
|-----------------------------------|-------------------------|
| CHAR(p) | STRING |
@@ -211,13 +211,13 @@ Not supported at the moment
| BIGINT | LONG |
| FLOAT | FLOAT |
| DOUBLE | DOUBLE |
| DECIAML(p, s) | DECIAML(p, s) |
| DECIMAL(p, s) | DECIMAL(p, s) |
| DATE | DATE |
| TIMESTAMP(6) | TIMESTAMP |
| TIMESTAMP(6) WITH LCOAL TIME ZONE | TIMESTAMPTZ |
| TIMESTAMP(6) WITH LOCAL TIME ZONE | TIMESTAMPTZ |
| BINARY(p) | FIXED(p) |
| BINARY(16) | UUID |
| VARBINARY | BYNARY |
| VARBINARY | BINARY |
| ARRAY<T> | ARRAY<T> |
| MAP<K, V> | MAP<K, V> |
| ROW | STRUCT |