[#1135] improvement(docs): Add docs about tables advanced feature like partitioning #1203

Merged
merged 22 commits into from
Jan 2, 2024
Changes from 6 commits
3049470
Add docs about tables advanced feature like partitioning
yuqi1129 Dec 19, 2023
1ac2270
Add docs about tables advanced feature like partitioning
yuqi1129 Dec 19, 2023
31677a9
Resolve discussion
yuqi1129 Dec 19, 2023
164ddf0
Resolve discussion
yuqi1129 Dec 19, 2023
bfd2802
Resolve discussion again
yuqi1129 Dec 19, 2023
af0b348
Update doc again
yuqi1129 Dec 19, 2023
d4c086f
Polish docs
yuqi1129 Dec 21, 2023
41582dd
Resolve discussion again
yuqi1129 Dec 25, 2023
a08a184
Remove the source type and result type column
yuqi1129 Dec 25, 2023
ae6b3c3
Merge branch 'main' of github.com:datastrato/graviton into issue_1135
yuqi1129 Dec 25, 2023
31ddcd4
Add description about default null ordering value
yuqi1129 Dec 25, 2023
b70b394
Use a separate doc to describe partitioning, bucketing and sorted table
yuqi1129 Dec 25, 2023
6e37e14
Add document header for table-partitioning-bucketing-sort-order.md
yuqi1129 Dec 25, 2023
3f6c622
Add descriptions about default value of sort direction.
yuqi1129 Dec 25, 2023
993fdff
Change some improper variants naming
yuqi1129 Dec 25, 2023
b1d3db6
Fix discussion again
yuqi1129 Dec 25, 2023
108117a
Optimize code.
yuqi1129 Dec 27, 2023
c0503f8
Fix Jerry's comments and format some code
yuqi1129 Jan 2, 2024
b993c01
Polish docs again
yuqi1129 Jan 2, 2024
a266e95
1. Add the necessary messages needed by table partitioning
yuqi1129 Jan 2, 2024
cc5c454
Change to use api method
yuqi1129 Jan 2, 2024
983dbab
Update table-partitioning-bucketing-sort-order.md
jerryshao Jan 2, 2024
153 changes: 153 additions & 0 deletions docs/manage-metadata-using-gravitino.md
@@ -733,6 +733,159 @@ In addition to the basic settings, Gravitino supports the following features:
| Bucketed table | Equal to `CLUSTERED BY` in Apache Hive; some engines may use different terms to describe it. | [Distribution](pathname:///docs/0.3.0/api/java/com/datastrato/gravitino/rel/expressions/distributions/Distribution.html) |
| Sorted order table | Equal to `SORTED BY` in Apache Hive; some engines may use different terms to describe it. | [SortOrder](pathname:///docs/0.3.0/api/java/com/datastrato/gravitino/rel/expressions/sorts/SortOrder.html) |

#### Partitioned table

Currently, Gravitino supports the following partitioning strategies:

:::note
`score`, `dt`, and `city` are field names in the table.
:::

| Function strategy | Description | Source types | Result type | Json example | Java example | Equivalent SQL semantics |
Contributor

Actually, Gravitino does not care about the source type and result type; the type limitation depends on the underlying catalog.

Contributor Author

@qqqttt123 What's your opinion?

Contributor

For a transform, source type and result type are important. Gravitino may not care. But users will care about it.

Contributor

How about adding links to different catalog partitioning docs?

For the same type and partitioning strategy, it may be feasible in catalogA but prohibited in catalogB, as this is likely dependent on the catalog's implicit type conversion strategy.

Contributor

OK for me.

Contributor Author

This may require a few more PRs to refine it, so I will add an issue about it later.

|-------------------|--------------------------------------------------------------|-------------------------------------------------------------------------------------|-------------|------------------------------------------------------------------|-------------------------------------------------|------------------------------------|
| `identity` | Source value, unmodified | Any | Source type | `{"strategy":"identity","fieldName":["score"]}` | `Transforms.identity("score")` | `PARTITION BY score` |
| `hour` | Extract a timestamp hour, as hours from 1970-01-01 00:00:00 | timestamp, timestamptz | int | `{"strategy":"hour","fieldName":["score"]}` | `Transforms.hour("score")` | `PARTITION BY hour(score)` |
| `day` | Extract a date or timestamp day, as days from 1970-01-01 | date, timestamp, timestamptz | int | `{"strategy":"day","fieldName":["score"]}` | `Transforms.day("score")` | `PARTITION BY day(score)` |
| `month` | Extract a date or timestamp month, as months from 1970-01-01 | date, timestamp, timestamptz | int | `{"strategy":"month","fieldName":["score"]}` | `Transforms.month("score")` | `PARTITION BY month(score)` |
| `year` | Extract a date or timestamp year, as years from 1970 | date, timestamp, timestamptz | int | `{"strategy":"year","fieldName":["score"]}` | `Transforms.year("score")` | `PARTITION BY year(score)` |
| `bucket[N]` | Hash of value, mod N | int, long, decimal, date, time, timestamp, timestamptz, string, uuid, fixed, binary | int | `{"strategy":"bucket","numBuckets":10,"fieldNames":[["score"]]}` | `Transforms.bucket(10, "score")` | `PARTITION BY bucket(10, score)` |
| `truncate[W]` | Value truncated to width W | int, long, decimal, string | Source type | `{"strategy":"truncate","width":20,"fieldName":["score"]}` | `Transforms.truncate(20, "score")` | `PARTITION BY truncate(20, score)` |
| `list` | Partition the table by a list value | Any | Any | `{"strategy":"list","fieldNames":[["dt"],["city"]]}` | `Transforms.list(new String[] {"dt", "city"})` | `PARTITION BY list(dt, city)` |
| `range` | Partition the table by a range value | Any | Any | `{"strategy":"range","fieldName":["dt"]}` | `Transforms.range(new String[] {"dt"})` | `PARTITION BY range(dt)` |

In addition to the strategies above, you can partition the table with other function strategies, for example `{"strategy":"functionName","fieldName":["score"]}`. The `functionName` can be any function name that you can use in SQL. For example, `{"strategy":"toDate","fieldName":["score"]}` is equivalent to `PARTITION BY toDate(score)` in SQL.
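
The epoch-based `hour`, `day`, `month`, and `year` strategies in the table above can be read as simple date arithmetic. The following is a minimal sketch of that arithmetic, not Gravitino code: the class name `EpochTransformSketch` is hypothetical, time zones are ignored, and it assumes timestamps at or after 1970-01-01. The value an engine actually produces depends on the underlying catalog.

```java
import java.time.LocalDateTime;
import java.time.temporal.ChronoUnit;

// Illustration only (hypothetical class, not part of Gravitino).
// Mirrors the table's wording for timestamps on or after the epoch.
final class EpochTransformSketch {
  private static final LocalDateTime EPOCH = LocalDateTime.of(1970, 1, 1, 0, 0, 0);

  // hour: whole hours elapsed since 1970-01-01 00:00:00
  static long hour(LocalDateTime ts) {
    return ChronoUnit.HOURS.between(EPOCH, ts);
  }

  // day: whole days elapsed since 1970-01-01
  static long day(LocalDateTime ts) {
    return ChronoUnit.DAYS.between(EPOCH, ts);
  }

  // month: whole months elapsed since 1970-01-01
  static long month(LocalDateTime ts) {
    return ChronoUnit.MONTHS.between(EPOCH, ts);
  }

  // year: whole years elapsed since 1970
  static long year(LocalDateTime ts) {
    return ChronoUnit.YEARS.between(EPOCH, ts);
  }

  public static void main(String[] args) {
    LocalDateTime ts = LocalDateTime.of(2024, 1, 2, 10, 30, 0);
    System.out.println("hour=" + hour(ts));   // hour=473386
    System.out.println("day=" + day(ts));     // day=19724
    System.out.println("month=" + month(ts)); // month=648
    System.out.println("year=" + year(ts));   // year=54
  }
}
```

Rows that share the same transform result land in the same partition, so every timestamp within 2024-01-02 maps to day `19724` in this sketch.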
Contributor

You should add this to the document.

All transforms must return null for a null input value.

Contributor Author

I'm not so sure about this point, as it currently applies only to Iceberg. I need to check this for Hive.

Contributor

OK, if we don't follow this, should we explain null input?

Contributor Author

Got it.

Contributor Author

I'm not so sure about this point, as it currently applies only to Iceberg. I need to check this for Hive.

Can you help confirm this, @mchades?

Contributor

I believe this is not determined by Gravitino, but rather depends on the underlying catalog.

For complex functions, please refer to `FunctionPartitioningDTO`.

The following is an example of creating a partitioned table:

<Tabs>
<TabItem value="Json" label="Json">

```json
[
  {
    "strategy": "identity",
    "fieldName": [
      "score"
    ]
  }
]
```

</TabItem>
<TabItem value="java" label="Java">

```java
new Transform[] {
    // Partition by score
    Transforms.identity("score")
}
```

</TabItem>
</Tabs>
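
The `bucket[N]` and `truncate[W]` strategies listed in the partitioning table are described only as "hash of value, mod N" and "value truncated to width W". The sketch below shows one plausible reading of those descriptions. It is not Gravitino code: the class name `PartitionValueSketch` is hypothetical, `Object.hashCode()` stands in for whatever hash the underlying catalog uses, integer truncation is assumed to round down to a multiple of W, and null handling is left out because it depends on the catalog.

```java
// Illustration only (hypothetical class, not part of Gravitino).
final class PartitionValueSketch {

  // bucket[N]: hash of the value, mod N.
  // ASSUMPTION: hashCode() is a stand-in; real catalogs define their own hash.
  static int bucket(int numBuckets, Object value) {
    // floorMod keeps the result in [0, numBuckets) even for negative hashes
    return Math.floorMod(value.hashCode(), numBuckets);
  }

  // truncate[W] for integers.
  // ASSUMPTION: round down to the nearest multiple of width.
  static int truncate(int width, int value) {
    return value - Math.floorMod(value, width);
  }

  // truncate[W] for strings.
  // ASSUMPTION: keep at most the first `width` characters.
  static String truncate(int width, String value) {
    return value.length() <= width ? value : value.substring(0, width);
  }

  public static void main(String[] args) {
    System.out.println(bucket(10, 47));             // 7
    System.out.println(truncate(20, 47));           // 40
    System.out.println(truncate(3, "gravitino"));   // gra
  }
}
```

Under these assumptions, scores 40 through 59 all truncate to `40` with a width of 20, so they share a partition.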


#### Bucketed table

- Strategy. It defines how the table is bucketed.

| Bucket strategy | Description | Source types | Result type | Json | Java |
|-----------------|----------------------------------------------------------------------------------------------------------------------|--------------|-------------|---------|------------------|
| hash | Bucket table using hash. The data will be distributed into buckets based on the hash value of the key. | Any | int | `hash` | `Strategy.HASH` |
| range | Bucket table using range. The data will be divided into buckets based on a specified range or interval of values. | Any | Source type | `range` | `Strategy.RANGE` |
| even | Bucket table using even. The data will be evenly distributed into buckets, ensuring an equal distribution of data. | Any | Source type | `even` | `Strategy.EVEN` |

- Number. It defines how many buckets you use to bucket the table.
- Function arguments. It defines which field or function should be used to bucket the table. Gravitino supports the following three kinds of arguments. For more complex function arguments, refer to the Java classes `FunctionArg` and `DistributionDTO`.

| Expression type | Json example | Java example | Equivalent SQL semantics | Description |
|-----------------|-------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------|--------------------------|--------------------------------|
| field | `{"type":"field","fieldName":["score"]}` | `FieldReferenceDTO.of("score")` | `score` | field reference value `score` |
| function | `{"type":"function","functionName":"hour","fieldName":["score"]}` | `new FuncExpressionDTO.Builder()<br/>.withFunctionName("hour")<br/>.withFunctionArgs("score").build()` | `hour(score)` | function value `hour(score)` |
| constant | `{"type":"constant","value":10, "dataType": "integer"}` | `new LiteralDTO.Builder()<br/>.withValue("10")<br/>.withDataType(Types.IntegerType.get())<br/>.build()` | `10` | Integer constant `10` |


<Tabs>
<TabItem value="Json" label="Json">

```json
{
  "strategy": "hash",
  "number": 4,
  "funcArgs": [
    {
      "type": "field",
      "fieldName": ["score"]
    }
  ]
}
```

</TabItem>
<TabItem value="java" label="Java">

```java
new DistributionDTO.Builder()
    .withStrategy(Strategy.HASH)
    .withNumber(4)
    .withArgs(FieldReferenceDTO.of("score"))
    .build()
```

</TabItem>
</Tabs>
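
To make the difference between the `hash`, `range`, and `even` strategies concrete, the sketch below assigns a row to a bucket in each of the three ways. It is only an illustration of the descriptions in the strategy table, not Gravitino code: the class name `BucketAssignmentSketch`, the use of `hashCode()`, the explicit upper bounds for `range`, and round-robin placement for `even` are all assumptions. The actual placement is decided by the underlying catalog or engine.

```java
// Illustration only (hypothetical class, not part of Gravitino).
final class BucketAssignmentSketch {

  // hash: the bucket is derived from the hash value of the key.
  // ASSUMPTION: hashCode() stands in for the engine's hash function.
  static int hashBucket(Object key, int numBuckets) {
    return Math.floorMod(key.hashCode(), numBuckets);
  }

  // range: the bucket is chosen by the interval the key falls into.
  // ASSUMPTION: upperBounds is sorted ascending and each bound is exclusive;
  // keys at or above the last bound go to one extra, final bucket.
  static int rangeBucket(int key, int[] upperBounds) {
    for (int i = 0; i < upperBounds.length; i++) {
      if (key < upperBounds[i]) {
        return i;
      }
    }
    return upperBounds.length;
  }

  // even: rows are spread evenly regardless of their values.
  // ASSUMPTION: round-robin by row position.
  static int evenBucket(long rowIndex, int numBuckets) {
    return (int) (rowIndex % numBuckets);
  }

  public static void main(String[] args) {
    int[] bounds = {10, 20, 30};
    System.out.println(hashBucket(9, 4));        // 1
    System.out.println(rangeBucket(15, bounds)); // 1
    System.out.println(evenBucket(5, 4));        // 1
  }
}
```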


#### Sorted order table

To define a sorted order table, use the following three components.

- Direction. It defines the direction in which the table is sorted.

| Direction | Json | Java | Description |
|------------| ------ | -------------------------- |-------------------------------------------|
| ascending | `asc` | `SortDirection.ASCENDING` | Sorted by a field or a function ascending |
| descending | `desc` | `SortDirection.DESCENDING` | Sorted by a field or a function descending |

- Null ordering. It describes how null values are handled when ordering.

| Null ordering type | Json | Java | Description |
|--------------------| ------------- | -------------------------- |-----------------------------------|
| nulls_first | `nulls_first` | `NullOrdering.NULLS_FIRST` | Place null values first |
| nulls_last | `nulls_last` | `NullOrdering.NULLS_LAST` | Place null values last |

- Sort term. It defines which field or function should be used to sort the table. Please refer to the `Expression type` table in the bucketed table section.

<Tabs>
<TabItem value="Json" label="Json">

```json
{
  "direction": "asc",
  "nullOrder": "nulls_last",
  "sortTerm": {
    "type": "field",
    "fieldName": ["score"]
  }
}
```

</TabItem>
<TabItem value="java" label="Java">

```java
new SortOrderDTO.Builder()
.withDirection(SortDirection.ASCENDING)
.withNullOrder(NullOrdering.NULLS_LAST)
.withSortTerm(FieldReferenceDTO.of("score"))
.build()
```

</TabItem>
</Tabs>
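
The effect of combining a direction with a null ordering can be previewed with the standard Java `Comparator` API. This is a sketch of the semantics only, not Gravitino code: the class name `SortOrderSketch` and its boolean parameters are hypothetical, and the actual sort is performed by the underlying engine.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Illustration only (hypothetical class, not part of Gravitino).
final class SortOrderSketch {

  // Builds a comparator for the given direction and null ordering.
  static Comparator<Integer> comparator(boolean ascending, boolean nullsFirst) {
    Comparator<Integer> base =
        ascending ? Comparator.<Integer>naturalOrder() : Comparator.<Integer>reverseOrder();
    return nullsFirst ? Comparator.nullsFirst(base) : Comparator.nullsLast(base);
  }

  // Returns a sorted copy; the input list is left untouched.
  static List<Integer> sort(List<Integer> scores, boolean ascending, boolean nullsFirst) {
    List<Integer> copy = new ArrayList<>(scores);
    copy.sort(comparator(ascending, nullsFirst));
    return copy;
  }

  public static void main(String[] args) {
    List<Integer> scores = Arrays.asList(3, null, 1);
    // Matches the example above: ascending, nulls last
    System.out.println(sort(scores, true, false));  // [1, 3, null]
    // Descending, nulls first
    System.out.println(sort(scores, false, true));  // [null, 3, 1]
  }
}
```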


:::tip
**Not all catalogs support these features.** Please refer to the related documentation for more details.
:::