cdc: move changefeed options logic from encoders to cdcevent pkg #103309

jayshrivastava · 2023-05-15T15:27:14Z

As seen in #103129, we do not support certain options for parquet. The reason is that every time we add a new format, you need to add code to the encoder to support them. These options are not encoder-specific, so we should remove the need to write code for them in each encoder. It would be much better to write the logic once in the cdcevent package - it should be a simple matter of tacking on extra columns to the events before they are sent to the encoders.

The list of columns which can be handled by the cdc event descriptor are

Jira issue: CRDB-27937

The text was updated successfully, but these errors were encountered:

blathers-crl · 2023-05-15T15:27:17Z

cc @cockroachdb/cdc

jayshrivastava · 2023-05-24T22:02:50Z

I have a WIP here where I tried adding some metadata columns into the cdcevent.Row. 74a2387

I have some concerns with doing this.

Options are treated differently by different encoders - so you can't just tack on extra columns to the row and treat them as regular columns. Ex. The JSON encoder passes a before field if diff is specified. Parquet will instead create a tuple column containing the previous row.

Another approach is to pass (updatedRow, previousRow, metadata) to the encoders so the encoders access each metadata field and use them as required. This way the code used to calculate the field is not duplicated in each decoder. The problem is that now that in addition to the encoders, the `cdcevent.Row needs to be given a same copy of the changefeed options. This is unnecessary action at a distance - if the options are different for any reason, then the decoder will expect different metadata than what is passed. The current code only requires the encoders to know about options and they can choose to deal with them as they wish.

The last problem is that cdcevent.Row is currently tied closely to the schema of the underlying table. Some code straight up assumes that the columns have an underlying table. There is also a lot of action at a distance right now where if the schema changes, any encoder of cdcevent.Row will know that theres a schema change and needs to update itself accordingly. For example, the cloudstorage sink will create a new file with a new schema whenever it observes the schema timestamp change. If the options change, then there is no such mechanism that prevents us from writing rows with the wrong schema to a file (well, luckily we can only change options when the changefeed is paused which results in new files being created on resume, but there are no strong guarantees like we have around schema changes).

TLDR:
I think the best solution is to have some util functions in the cdcevent.Row which calculate simple metadata for you as a function of the updatedRow and prevRow. For example, updated and mvcc_timestamp are very easy to calculate. These functions can be stored in a central place like cdcevent. The encoders should just call these with the updatedRow and prevRow they already get in the current code.

jayshrivastava added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-cdc Change Data Capture T-cdc labels May 15, 2023

jayshrivastava self-assigned this May 15, 2023

exalate-issue-sync bot unassigned jayshrivastava Aug 12, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cdc: move changefeed options logic from encoders to cdcevent pkg #103309

cdc: move changefeed options logic from encoders to cdcevent pkg #103309

jayshrivastava commented May 15, 2023 •

edited by exalate-issue-sync bot

Loading

blathers-crl bot commented May 15, 2023

jayshrivastava commented May 24, 2023

cdc: move changefeed options logic from encoders to cdcevent pkg #103309

cdc: move changefeed options logic from encoders to cdcevent pkg #103309

Comments

jayshrivastava commented May 15, 2023 • edited by exalate-issue-sync bot Loading

blathers-crl bot commented May 15, 2023

jayshrivastava commented May 24, 2023

jayshrivastava commented May 15, 2023 •

edited by exalate-issue-sync bot

Loading