As seen in #103129, we do not support certain options for parquet. The reason is that every time we add a new format, we need to add code to the encoder to support them. These options are not encoder-specific, so we should remove the need to write code for them in each encoder. It would be much better to write the logic once in the cdcevent package - it should be a simple matter of tacking extra columns onto the events before they are sent to the encoders.
The list of columns which can be handled by the cdc event descriptor is
I have a WIP here (74a2387) where I tried adding some metadata columns to the cdcevent.Row.
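Roughly, the direction that explores looks something like the sketch below. This is illustrative only - the Row/Column types and the helper name are placeholders, not the real cdcevent API or the actual contents of the commit:

```go
package main

import "fmt"

// Column and Row are stand-ins for the real cdcevent types.
type Column struct {
	Name  string
	Value string
}

type Row struct {
	Columns []Column
}

// withMetadataColumns tacks encoder-agnostic metadata columns (e.g.
// mvcc_timestamp, updated) onto a row so that individual encoders would not
// need any option-specific code of their own.
func withMetadataColumns(r Row, mvccTS, updatedTS string) Row {
	out := Row{Columns: append([]Column(nil), r.Columns...)}
	out.Columns = append(out.Columns,
		Column{Name: "mvcc_timestamp", Value: mvccTS},
		Column{Name: "updated", Value: updatedTS},
	)
	return out
}

func main() {
	row := Row{Columns: []Column{{Name: "id", Value: "1"}}}
	fmt.Println(withMetadataColumns(row, "1712345678.0000000001", "1712345678.0000000001"))
}
```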
I have some concerns with doing this.
Options are treated differently by different encoders, so you can't just tack extra columns onto the row and treat them as regular columns. For example, the JSON encoder emits a before field when diff is specified, whereas the parquet encoder instead creates a tuple column containing the previous row.
Another approach is to pass (updatedRow, previousRow, metadata) to the encoders so that each encoder can access the metadata fields and use them as required. This way, the code used to calculate each field is not duplicated in every encoder. The problem is that, in addition to the encoders, `cdcevent.Row` now needs to be given the same copy of the changefeed options. This is unnecessary action at a distance - if the options differ for any reason, the encoder will expect different metadata than what is passed. The current code only requires the encoders to know about the options, and they can choose to deal with them as they wish.
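For concreteness, the shape of that alternative would be roughly the following. The types and the method name are made up for the sketch, not the real changefeedccl Encoder interface:

```go
package main

import "context"

// Row stands in for cdcevent.Row.
type Row struct{}

// Metadata would be computed once, upstream of the encoders, based on the
// changefeed options. The hazard described above: cdcevent and the encoders
// must be handed the same options, or an encoder will look for metadata that
// was never populated.
type Metadata struct {
	Updated       string // set only when the updated option is on
	MVCCTimestamp string // set only when the mvcc_timestamp option is on
}

// Encoder is a sketch of the alternative signature.
type Encoder interface {
	EncodeValue(ctx context.Context, updatedRow, previousRow Row, meta Metadata) ([]byte, error)
}
```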
The last problem is that cdcevent.Row is currently tied closely to the schema of the underlying table. Some code outright assumes that the columns have an underlying table. There is also a lot of action at a distance right now: if the schema changes, any encoder of cdcevent.Row knows there's a schema change and needs to update itself accordingly. For example, the cloudstorage sink will create a new file with a new schema whenever it observes the schema timestamp change. If the options change, there is no such mechanism preventing us from writing rows with the wrong schema to a file (luckily, we can only change options while the changefeed is paused, which results in new files being created on resume, but there are no strong guarantees like the ones we have around schema changes).
TLDR:
I think the best solution is to have some util functions for cdcevent.Row which calculate simple metadata for you as a function of the updatedRow and prevRow. For example, updated and mvcc_timestamp are very easy to calculate. These functions can live in a central place like the cdcevent package, and the encoders should just call them with the updatedRow and prevRow they already get in the current code.
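As a minimal sketch of what I mean (placeholder types and field names - the real cdcevent.Row carries hlc timestamps rather than strings):

```go
package main

import "fmt"

// Row stands in for cdcevent.Row; these fields stand in for the timestamps
// the real row already carries.
type Row struct {
	MVCCTimestamp    string
	UpdatedTimestamp string
}

// MetadataFor computes the simple, encoder-agnostic fields purely as a
// function of the updated and previous rows. Encoders already receive both
// rows, so they can call this without being handed a copy of the changefeed
// options. prevRow is unused here, but it is where diff-style fields would
// come from in a fuller version.
func MetadataFor(updatedRow, prevRow Row) map[string]string {
	return map[string]string{
		"mvcc_timestamp": updatedRow.MVCCTimestamp,
		"updated":        updatedRow.UpdatedTimestamp,
	}
}

func main() {
	updated := Row{MVCCTimestamp: "1712345678.0000000001", UpdatedTimestamp: "1712345678.0000000001"}
	prev := Row{MVCCTimestamp: "1712345600.0000000001", UpdatedTimestamp: "1712345600.0000000001"}
	fmt.Println(MetadataFor(updated, prev))
}
```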
Jira issue: CRDB-27937