Postgres CDC #2156

Closed
7 tasks done
cgardens opened this issue Feb 22, 2021 · 11 comments
Labels
cdc type/enhancement New feature or request

Comments

@cgardens
Contributor

cgardens commented Feb 22, 2021

Tell us about the problem you're trying to solve

  • CDC is one of the most sought-after features among our users.

Describe the solution you’d like

  • Define what features are included in CDC
  • Spec out how we will implement it in Airbyte
  • POC to see if debezium is a good choice (issue)
  • Consensus on CDC's interaction with the AB protocol (issue)
  • Implement debezium jdbc source
  • Handle state management (issue)
  • Handle logic on when to close debezium (issue)

Spec

Related

@cgardens cgardens added the type/enhancement New feature or request label Feb 22, 2021
@sherifnada
Contributor

TIL Debezium supports embedded CDC. I don't know if this is a good solution, but it's relevant since we brought up Debezium last week.

Relevant section from the README:

Applications that don't need or want this level of fault tolerance, performance, scalability, and reliability can instead use Debezium's embedded connector engine to run a connector directly within the application space. They still want the same data change events, but prefer to have the connectors send them directly to the application rather than persist them inside Kafka.

@cgardens
Contributor Author

cgardens commented Mar 3, 2021

wal2json is the library that I spent some time looking at a little while ago. I found it because it is what the Singer tap-postgres uses. The thing to note here is that it only handles writing postgres WALs to a json object formatted specifically for wal2json. It is not obviously extensible to other dbs. It may still be useful for just extracting the WAL from a postgres db (even if we have to do further transformation).

I spent some time trying to set this up on 2 local databases but got hung up and never got back to finishing it. The key here is that it's non-trivial to set up because you have to mess with your db settings. I think this is likely going to be true no matter what tool we use.
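
To make the "further transformation" point concrete, here is a minimal sketch of flattening one wal2json transaction into per-row change dicts. It assumes the format-version-1 layout (`change`, `kind`, `columnnames`, `columnvalues`, `oldkeys`); the sample table and values are made up, and `flatten_changes` is a hypothetical helper, not part of wal2json or tap-postgres.

```python
import json

# Assumed wal2json (format version 1) payload; the table and values are invented.
SAMPLE = """
{"change": [
  {"kind": "insert", "schema": "public", "table": "users",
   "columnnames": ["id", "name"], "columntypes": ["integer", "text"],
   "columnvalues": [1, "ada"]},
  {"kind": "delete", "schema": "public", "table": "users",
   "oldkeys": {"keynames": ["id"], "keytypes": ["integer"], "keyvalues": [1]}}
]}
"""


def flatten_changes(payload: str) -> list:
    """Turn one wal2json transaction into a list of flat row-change dicts."""
    rows = []
    for change in json.loads(payload).get("change", []):
        if change["kind"] == "delete":
            # Deletes carry only the old key columns, not the full row.
            keys = change["oldkeys"]
            data = dict(zip(keys["keynames"], keys["keyvalues"]))
        else:
            data = dict(zip(change["columnnames"], change["columnvalues"]))
        rows.append({
            "table": f'{change["schema"]}.{change["table"]}',
            "kind": change["kind"],
            "data": data,
        })
    return rows
```

Calling `flatten_changes(SAMPLE)` yields one dict per changed row. Anything db-agnostic would still need its own mapping on top of this, which is the extensibility concern raised above.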

@michel-tricot
Contributor

That's really cool to get an MVP up and running!

@michel-tricot
Contributor

Yep regarding the setup complexity. CDC requires the database to write logical replication logs, which isn't the default.
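
As a rough illustration of the setup burden for Postgres specifically, the sketch below checks a dict of server settings (for example, values read via `SHOW`) against what logical decoding needs. `wal_level = logical` is a hard PostgreSQL requirement; the "at least 1" thresholds for the other two settings and the helper name `cdc_setting_problems` are assumptions made for this sketch.

```python
def cdc_setting_problems(settings: dict) -> list:
    """Return human-readable problems that would block logical decoding."""
    problems = []
    # Logical decoding only works when the WAL carries logical information.
    if settings.get("wal_level") != "logical":
        problems.append("wal_level must be 'logical'")
    # A replication slot and a WAL sender are needed to stream changes out.
    for name in ("max_replication_slots", "max_wal_senders"):
        if int(settings.get(name, 0)) < 1:
            problems.append(f"{name} must be at least 1")
    return problems
```

An empty list means the server looks ready. Changing `wal_level` requires a server restart, which is part of why the setup is non-trivial.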

@cgardens cgardens added this to the Core - 2021-03-12 milestone Mar 8, 2021
@ChristopheDuong
Contributor

@sherifnada
Contributor

sherifnada commented Mar 8, 2021

Separately, some questions I have about CDC:

  1. Is it going to be only within DBs (e.g. PG to PG) or is there a world where it makes sense to have it work across DBs as well? We don't need to implement anything more than single-DB CDC out of the gate, but it may affect our approach if the latter idea is desirable.
  2. How will it impact the protocol? An assumption we had previously is that you can hook up any source to any destination. With CDC this assumption changes.

It may make sense to address this in a quick RFC

@cgardens
Contributor Author

cgardens commented Mar 22, 2021

proposal on how to handle CDC metadata. Ultimately we are not going in this direction, but documenting it here for posterity.

Based on a conversation with michel, jared, sherif, and me, we agreed that for the first iteration the CDC metadata fields will go in the data section of the airbyte record. We will document the common naming conventions used for those metadata fields.

Reasoning:

  1. more clearly a 2-way door, and less additional upfront work.
  2. we can handle marking a column as a deletion field as a new feature in the future in the same way we do for cursor field and primary key.
  3. keeps the destination CDC-unaware
  4. ultimately it is the prerogative of sources to add "metadata" in their output already. ultimately cdc deleted, etc fall under this jurisdiction.
  5. any more invasive changes to the protocol to support this open lots of cans of worms that we do not have the data to make well informed choices on. since there are alternatives it is better to punt on making those decisions.
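
A minimal sketch of what "metadata fields in the data section" could look like. The record envelope (`type`, `record.stream`, `record.data`, `record.emitted_at`) follows the airbyte record message; the three `_cdc_*` field names and the `to_record_message` helper are hypothetical, since the naming convention is not fixed in this thread.

```python
import time

# Hypothetical metadata field names; the thread does not settle the convention.
CDC_LSN = "_cdc_lsn"
CDC_UPDATED_AT = "_cdc_updated_at"
CDC_DELETED_AT = "_cdc_deleted_at"


def to_record_message(stream, row, lsn, changed_at, deleted):
    """Wrap a changed row as a RECORD message with CDC metadata inside data."""
    data = dict(row)  # copy so the caller's row is not mutated
    data[CDC_LSN] = lsn
    data[CDC_UPDATED_AT] = changed_at
    # A non-null deletion timestamp marks the row as deleted at the source.
    data[CDC_DELETED_AT] = changed_at if deleted else None
    return {
        "type": "RECORD",
        "record": {
            "stream": stream,
            "data": data,
            "emitted_at": int(time.time() * 1000),
        },
    }
```

Because the metadata are ordinary columns, a destination can write them without knowing anything about CDC, which is the point of reason 3.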

@cgardens
Contributor Author

updated the description with progress from last week and with new to dos for this week.

@jrhizor
Contributor

jrhizor commented Mar 29, 2021

Remaining:

  • configuring table includes
  • updating catalog for metadata fields
  • allowing replication modes based on primary key existence
  • set up replication slot as a oneOf field
  • handle termination
  • make sure full refresh edge cases are handled
  • make sure WAL expiration edge cases are handled
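
For the "replication slot as a oneOf field" item, one possible shape is sketched below. The option titles, the `method` discriminator, and the `replication_slot` property name are all hypothetical, and `matching_options` is a deliberately tiny check, not a JSON Schema validator.

```python
# Hypothetical spec fragment: titles and property names are illustrative only.
REPLICATION_METHOD = {
    "type": "object",
    "oneOf": [
        {
            "title": "Standard",
            "required": ["method"],
            "properties": {"method": {"const": "standard"}},
        },
        {
            "title": "Logical replication (CDC)",
            "required": ["method", "replication_slot"],
            "properties": {
                "method": {"const": "cdc"},
                "replication_slot": {"type": "string"},
            },
        },
    ],
}


def matching_options(config: dict) -> list:
    """Return titles of the oneOf options that this config satisfies."""
    titles = []
    for option in REPLICATION_METHOD["oneOf"]:
        has_required = all(key in config for key in option["required"])
        method_ok = config.get("method") == option["properties"]["method"]["const"]
        if has_required and method_ok:
            titles.append(option["title"])
    return titles
```

With this layout a config selects exactly one option, and the slot name is only required when the CDC option is chosen.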

@jrhizor
Contributor

jrhizor commented Apr 12, 2021

@cgardens should we rename this to postgres / mark it as a duplicate of #957? Then create separate tickets for other CDC sources?

@cgardens
Contributor Author

cgardens commented Apr 12, 2021 via email

@jrhizor jrhizor changed the title from "Implement CDC" to "Add Postgres CDC" Apr 12, 2021
@jrhizor jrhizor changed the title from "Add Postgres CDC" to "Postgres CDC" Apr 12, 2021
@jrhizor jrhizor closed this as completed Apr 12, 2021
5 participants