Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cdc: move changefeed options logic from encoders to cdcevent pkg #103309

Open
6 tasks
jayshrivastava opened this issue May 15, 2023 · 2 comments
Open
6 tasks

cdc: move changefeed options logic from encoders to cdcevent pkg #103309

jayshrivastava opened this issue May 15, 2023 · 2 comments
Labels
A-cdc Change Data Capture C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-cdc

Comments

@jayshrivastava
Copy link
Contributor

jayshrivastava commented May 15, 2023

As seen in #103129, we do not support certain options for parquet. The reason is that every time we add a new format, you need to add code to the encoder to support them. These options are not encoder-specific, so we should remove the need to write code for them in each encoder. It would be much better to write the logic once in the cdcevent package - it should be a simple matter of tacking on extra columns to the events before they are sent to the encoders.

The list of columns which can be handled by the cdc event descriptor are

  • updated
  • mvcc_timestamp
  • diff
  • key in value
  • topic in value
  • resolved

Jira issue: CRDB-27937

@jayshrivastava jayshrivastava added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-cdc Change Data Capture T-cdc labels May 15, 2023
@jayshrivastava jayshrivastava self-assigned this May 15, 2023
@blathers-crl
Copy link

blathers-crl bot commented May 15, 2023

cc @cockroachdb/cdc

@jayshrivastava
Copy link
Contributor Author

I have a WIP here where I tried adding some metadata columns into the cdcevent.Row. 74a2387

I have some concerns with doing this.

Options are treated differently by different encoders - so you can't just tack on extra columns to the row and treat them as regular columns. Ex. The JSON encoder passes a before field if diff is specified. Parquet will instead create a tuple column containing the previous row.

Another approach is to pass (updatedRow, previousRow, metadata) to the encoders so the encoders access each metadata field and use them as required. This way the code used to calculate the field is not duplicated in each decoder. The problem is that now that in addition to the encoders, the `cdcevent.Row needs to be given a same copy of the changefeed options. This is unnecessary action at a distance - if the options are different for any reason, then the decoder will expect different metadata than what is passed. The current code only requires the encoders to know about options and they can choose to deal with them as they wish.

The last problem is that cdcevent.Row is currently tied closely to the schema of the underlying table. Some code straight up assumes that the columns have an underlying table. There is also a lot of action at a distance right now where if the schema changes, any encoder of cdcevent.Row will know that theres a schema change and needs to update itself accordingly. For example, the cloudstorage sink will create a new file with a new schema whenever it observes the schema timestamp change. If the options change, then there is no such mechanism that prevents us from writing rows with the wrong schema to a file (well, luckily we can only change options when the changefeed is paused which results in new files being created on resume, but there are no strong guarantees like we have around schema changes).

TLDR:
I think the best solution is to have some util functions in the cdcevent.Row which calculate simple metadata for you as a function of the updatedRow and prevRow. For example, updated and mvcc_timestamp are very easy to calculate. These functions can be stored in a central place like cdcevent. The encoders should just call these with the updatedRow and prevRow they already get in the current code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-cdc Change Data Capture C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-cdc
Projects
None yet
Development

No branches or pull requests

1 participant