Update data ingestion overview about batch #55

Merged
merged 16 commits into from
Nov 25, 2024
46 changes: 46 additions & 0 deletions delivery/overview.mdx
@@ -124,3 +124,49 @@ WITH (
<Note>
File sink currently supports only append-only mode, so please change the query to `append-only` and specify this explicitly after the `FORMAT ... ENCODE ...` statement.
</Note>

## Batching strategy for file sink

RisingWave implements batching strategies for file sinks to optimize file management and prevent the generation of many small files. Batching is available for the Parquet, JSON, and CSV encode formats.

### Categories

- **Batching based on row count**:
RisingWave tracks the number of rows written and finalizes the file once the maximum row count threshold is reached.

- **Batching based on rollover interval**:
RisingWave checks whether the rollover interval has elapsed each time a chunk is about to be written and whenever a barrier is encountered, and finalizes the file once the interval has passed.

If no batching strategy is specified, RisingWave defaults to writing a new file every 10 seconds.

<Note>Batching conditions are checked at a relatively coarse granularity, so the actual number of rows in a file or the exact timing of its completion may deviate from the specified thresholds. This flexibility is intentional and prioritizes efficient file management.</Note>
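
For instance, to batch purely by row count, you could set only `max_row_count` and omit `rollover_seconds` (a minimal sketch reusing the parameters from the example below; the sink name is hypothetical, and it assumes the two thresholds can be configured independently):

```sql
CREATE SINK row_count_sink
FROM t
WITH (
    connector = 's3',
    -- finalize each file once it reaches roughly 500 rows;
    -- rollover_seconds is omitted, so no explicit time-based threshold is set here
    max_row_count = '500',
    type = 'append-only'
) FORMAT PLAIN ENCODE PARQUET (force_append_only=true);
```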

### File organization

You can use `path_partition_prefix` to organize files into subdirectories based on their creation time. The available options are `month`, `day`, or `hour`. If not specified, files are stored directly in the root directory without any time-based subdirectories.

Files currently follow the naming pattern `/Option<path_partition_prefix>/executor_id + timestamp.suffix`, where the timestamp differentiates files batched by the rollover interval. For example, in `47244640257_1727072046.parquet` below, `47244640257` is the executor ID and `1727072046` is the timestamp.

The output files look like this:

```
path/2024-09-20/47244640257_1727072046.parquet
path/2024-09-20/47244640257_1727072055.parquet
```

### Example

```sql
CREATE SINK s1
FROM t
WITH (
connector = 's3',
max_row_count = '100',
rollover_seconds = '10',
type = 'append-only',
path_partition_prefix = 'day'
) FORMAT PLAIN ENCODE PARQUET (force_append_only=true);
```

In this example, a file is finalized once it contains more than 100 rows or has been written to for more than 10 seconds.
Once finalized, the file becomes visible in the downstream sink system.
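
As a variation on the example above, you could group the output files by hour instead of by day by changing only the `path_partition_prefix` value (a sketch; the sink name `s2` is hypothetical):

```sql
CREATE SINK s2
FROM t
WITH (
    connector = 's3',
    max_row_count = '100',
    rollover_seconds = '10',
    type = 'append-only',
    -- 'hour' places files into hourly subdirectories instead of daily ones
    path_partition_prefix = 'hour'
) FORMAT PLAIN ENCODE PARQUET (force_append_only=true);
```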
1 change: 1 addition & 0 deletions mint.json
@@ -167,6 +167,7 @@
{"source": "/docs/current/architecture", "destination": "/reference/architecture"},
{"source": "/docs/current/fault-tolerance", "destination": "/reference/fault-tolerance"},
{"source": "/docs/current/limitations", "destination": "/reference/limitations"},
{"source": "/docs/current/sources", "destination": "/integrations/sources/overview"},
{"source": "/docs/current/sql-alter-connection", "destination": "/sql/commands/sql-alter-connection"},
{"source": "/docs/current/sql-alter-database", "destination": "/sql/commands/sql-alter-database"},
{"source": "/docs/current/sql-alter-function", "destination": "/sql/commands/sql-alter-function"},