refactor: deprecate `StreamChunkWithState` for source state #14384
Background and Motivation
In the previous implementation, we use `StreamChunkWithState`, which contains the payload as a `StreamChunk` and an offset mapping (from `split_id` to the latest offset per partition). We truncate the offsets per batch and keep only the last one. The limitation is that, once a chunk has entered the source executor, we cannot truncate it at an arbitrary point while preserving exactly-once semantics.

Based on additional columns (#14215), I propose to derive `partition` (`file` for fs source) and `offset` for all sources, and to read the partition and offset info from the stream chunk itself rather than from a separate struct. This means the chunk no longer has to be treated as a whole; both throttling (#13800) and reusable source (risingwavelabs/rfcs#72) can benefit from it.

After the refactor, we load `split_id` and `offset` from these columns and no longer need `StreamChunkWithState`.
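To illustrate why per-row partition and offset columns lift the truncation limit, here is a minimal sketch. The `Row` type, its field names, the numeric offset, and `latest_offsets` are all hypothetical simplifications for this issue, not RisingWave APIs; the real `StreamChunk` is columnar.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified stand-in for one row of a source chunk.
/// The real `StreamChunk` is columnar; a row form is used here for brevity.
#[derive(Debug, Clone, PartialEq)]
pub struct Row {
    pub payload: String,
    /// Value of the partition column, i.e. the split id (assumed name).
    pub split_id: String,
    /// Value of the offset column (assumed to be numeric for this sketch).
    pub offset: u64,
}

/// Recompute the latest offset per split from the rows themselves.
/// Assumes rows appear in read order within each split, so the last
/// row seen for a split carries its latest offset.
pub fn latest_offsets(rows: &[Row]) -> HashMap<String, u64> {
    let mut latest = HashMap::new();
    for row in rows {
        latest.insert(row.split_id.clone(), row.offset);
    }
    latest
}
```

Because the state is recomputed from whatever rows are kept, calling `latest_offsets(&rows[..n])` on any prefix of a chunk yields offsets that match exactly the rows that were emitted, which a single per-batch mapping cannot provide.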
Compatibility
I don't want to make breaking changes to the table catalog, but the additional columns are required to be part of the schema.
@BugenZhao and I found a way to do this when building the source schema: manually add the two columns to it, then prune them before yielding.
Note that if users specify `include partition` or `include offset` in SQL, those columns are already in the table schema, so we don't have to manually add and prune them.

This approach does not require changing any data in meta. All changes happen at runtime, and the new "hidden columns" do not have to be materialized, i.e. in a table with connector.
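The add-then-prune idea can be sketched roughly as follows. Everything here is an assumption made for illustration: the `ColumnDesc` struct, the column names `_rw_partition` and `_rw_offset`, matching columns by name, and both helper functions. None of it is the actual `SourceDesc` code.

```rust
/// Hypothetical column descriptor; not the real catalog type.
#[derive(Debug, Clone, PartialEq)]
pub struct ColumnDesc {
    pub name: String,
    /// True when the column was added at runtime and is absent from the catalog.
    pub hidden: bool,
}

// Assumed names for the two extra columns, chosen only for this sketch.
pub const PARTITION_COL: &str = "_rw_partition";
pub const OFFSET_COL: &str = "_rw_offset";

/// Build the runtime source schema from the catalog columns, appending the
/// partition/offset columns as hidden ones unless the user already included them.
pub fn build_source_schema(catalog: &[&str]) -> Vec<ColumnDesc> {
    let mut schema: Vec<ColumnDesc> = catalog
        .iter()
        .map(|name| ColumnDesc { name: name.to_string(), hidden: false })
        .collect();
    for extra in [PARTITION_COL, OFFSET_COL] {
        if !schema.iter().any(|c| c.name == extra) {
            schema.push(ColumnDesc { name: extra.to_string(), hidden: true });
        }
    }
    schema
}

/// Drop the values of hidden columns from a row before yielding it downstream.
pub fn prune_hidden(schema: &[ColumnDesc], row: Vec<String>) -> Vec<String> {
    schema
        .iter()
        .zip(row)
        .filter(|(col, _)| !col.hidden)
        .map(|(_, value)| value)
        .collect()
}
```

With this shape the catalog in meta is never touched: a column the user asked for stays visible, and only the columns added at runtime are removed before the row leaves the executor.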
Implementation
`SourceDesc::column_catalogs_to_source_column_descs`)
`SourceDesc` and `FsSourceDesc`
risingwave/src/stream/src/executor/source/source_executor.rs
Lines 571 to 589 in 7c84d06
risingwave/src/stream/src/executor/source/source_executor.rs
Line 609 in 7c84d06