Airbyte data source [CU-15xp9xt] #493

Merged: 31 commits into master, Jul 21, 2021

Conversation

mildbyte (Contributor)

Intro

A Splitgraph DataSource implementation for connectors that use the Airbyte standard.

Similar to Singer (some Airbyte sources wrap Singer taps), but with a few benefits:

  • Parameters come with a pre-specified JSONSchema
  • Connectors ship as Docker images (no need to mess around with virtualenvs and PEXes)
  • Support for custom cursor and PK fields
  • Normalization (converting raw data streams into actual table schemas) is decoupled from ingestion into a separate step, which also allows custom dbt transforms (not supported here)

This isn't exposed publicly: it's just an AirbyteDataSource class that can be inherited from, with a few overrides, to make a Splitgraph-compatible data source.

Example for MySQL:

# Import path assumed from this PR's module layout:
from splitgraph.ingestion.airbyte.data_source import AirbyteDataSource

class MySQLAirbyteDataSource(AirbyteDataSource):
    docker_image = "airbyte/source-mysql:latest"
    airbyte_name = "airbyte-mysql"

    credentials_schema = {"type": "object", "properties": {"password": {"type": "string"}}}
    params_schema = {
        "type": "object",
        "properties": {
            "host": {"type": "string"},
            "port": {"type": "integer"},
            "database": {"type": "string"},
            "username": {"type": "string"},
            "replication_method": {"type": "string"},
        },
        "required": ["host", "port", "database", "username", "replication_method"],
    }

    @classmethod
    def get_name(cls) -> str:
        return "MySQL (Airbyte)"

    @classmethod
    def get_description(cls) -> str:
        return "MySQL (Airbyte)"
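
For illustration, here's a hypothetical usage sketch. The constructor and method names (introspect, load) follow the generic Splitgraph DataSource interface and are assumptions here rather than something spelled out in this PR:

from splitgraph.core.repository import Repository
from splitgraph.engine import get_engine

# Hypothetical usage; signatures assumed from the generic DataSource API.
source = MySQLAirbyteDataSource(
    get_engine(),
    credentials={"password": "secret"},
    params={
        "host": "localhost",
        "port": 3306,
        "database": "mydb",
        "username": "user",
        "replication_method": "STANDARD",
    },
)

# Introspection runs the source in discovery mode and returns a
# best-effort schema for each stream.
tables = source.introspect()

# Initial load (full refresh) into a repository.
repo = Repository.from_schema("myns/mysql_data")
source.load(repo)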

Implementation:

  • JSONSchema: we could theoretically run the source with --spec to get this, but it's currently done out-of-band.
  • Introspection: run the source in discovery mode, parse the catalog and convert it to a Splitgraph schema. Note that there's no guarantee the actual schema will match what we output: we have no control over Airbyte's normalization and can't easily infer what it'll do or whether it'll split some substreams into separate tables. The only thing this lets the user do is ignore individual streams (which are named the same as the tables).
  • Load/sync:
    • Start the source container and the Postgres receiver container and pipe data between them (see the sketch after this list).
    • At the end of it, we get a schema with tables named _airbyte_raw_[streamname].
    • These are always append- or truncate-only, so we don't do an sgr checkout: we just manually link these tables to the previous version (extending the Splitgraph table).
    • Normalization: This converts the raw tables into the actual final tables using dbt. We check the raw tables out using LQ (which makes the dbt normalization step faster, as it always reads the whole table), run the Airbyte normalization container and then commit the new tables into the image.
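
Conceptually, the source-to-destination piping looks like the following. This is a simplified sketch that shells out to plain docker run; the image names, mounts and config paths are illustrative, and the real implementation also scrapes state and log messages from the streams:

import subprocess

# The source emits Airbyte JSON messages on stdout; the destination
# reads them on stdin and writes _airbyte_raw_* tables into Postgres.
# Config/catalog files are assumed to be mounted into both containers.
source = subprocess.Popen(
    ["docker", "run", "--rm", "-i", "-v", "/tmp/airbyte:/secrets",
     "airbyte/source-mysql:latest",
     "read", "--config", "/secrets/config.json",
     "--catalog", "/secrets/catalog.json"],
    stdout=subprocess.PIPE,
)
destination = subprocess.Popen(
    ["docker", "run", "--rm", "-i", "-v", "/tmp/airbyte:/secrets",
     "airbyte/destination-postgres:latest",
     "write", "--config", "/secrets/destination_config.json",
     "--catalog", "/secrets/catalog.json"],
    stdin=source.stdout,
)
source.stdout.close()  # let the destination see EOF when the source exits
source.wait()
destination.wait()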

Limitations

  • Normalization is a bit crude and doesn't detect some things like timestamps (Airbyte recommends writing custom dbt models for the raw data instead; we don't currently support those either)
  • The discovery step sometimes has issues when scraping the container logs for the catalog. I think this is a Docker-level problem and I haven't been able to reproduce it for the last couple of days
  • Our load maps to a series of Airbyte-level settings that are configurable separately per-stream but not through our interface:
    • sync mode: full_refresh (source ignores state)
    • destination sync mode: overwrite (always delete raw tables)
  • Similarly, our sync maps to:
    • sync mode: incremental (source uses the state)
    • destination sync mode: append_dedup (this appends to the raw tables and, when using dbt to normalize, always uses the primary key)
  • Sometimes the sources don't come with well-defined parameters:
    • primary key: required for deduplication at normalization time; if missing, normalization will break for append_dedup.
    • cursor: required for incremental loads; if missing, normalization will break for append/append_dedup.
    • Added a class-level override for these (see the sketch after this list); we might need to expose it in the plugin params instead.
  • We use a horrid regex hack to find out which sync mode Airbyte chose for each "raw" table (we can't infer it directly, because the algorithm for converting stream names into table names involves a lot of slugification and Unicode normalization).
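
As a sketch, the class-level cursor/PK override mentioned above might look like the following. The attribute names here are illustrative assumptions, not the exact API (per the commit log below, these were later also exposed as airbyte_cursor_field / airbyte_primary_key table params):

# Illustrative only: the override attribute names here are assumed.
class OrdersAirbyteDataSource(AirbyteDataSource):
    docker_image = "airbyte/source-some-api:latest"
    airbyte_name = "airbyte-some-api"

    # Per-stream fallbacks for sources that don't declare these:
    cursor_overrides = {"orders": ["updated_at"]}      # enables incremental sync
    primary_key_overrides = {"orders": ["order_id"]}   # enables append_dedup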


mildbyte added 22 commits July 12, 2021 16:24
…at can be referenced by non-Singer data sources.
…d use it for source containers too (in test we want to hit the MySQL Docker container from the host).
…it them; ignore log lines that aren't Airbyte messages.
…for the destination, we want to override the namespace since otherwise PG will write out to the wrong schema).
… from it into the codebase, since that's all we need. Update the Poetry lockfile.
…ar imports (since plugins are loaded in the commandline module).
@mildbyte mildbyte force-pushed the feature/airbyte-data-source-CU-15xp9xt branch from 7be2128 to 5302bef on July 20, 2021 11:58
mildbyte added 5 commits July 20, 2021 15:23
…hout various image creation routines instead of ad hoc numbers.
…ta sources (through `airbyte_cursor_field` and `airbyte_primary_key` table params). Also, report the plugin's default cursor/PK back to the user at introspection time (as suggested default table params).
@mildbyte mildbyte force-pushed the feature/airbyte-data-source-CU-15xp9xt branch from 6696b32 to 6cd734a on July 21, 2021 11:40
@mildbyte mildbyte merged commit 1affeba into master Jul 21, 2021
@mildbyte mildbyte deleted the feature/airbyte-data-source-CU-15xp9xt branch July 21, 2021 13:46
mildbyte added a commit that referenced this pull request Jul 26, 2021
  * API functionality to get the raw URL for a data source (#457)
  * LQ scan / filtering simplification to speed up writes / Singer loads (#464, #489)
  * API functionality for Airbyte support (`AirbyteDataSource` class, #493)
  * Speed up `sgr cloud load` by bulk API calls (#500)

Full set of changes: `v0.2.14...v0.2.15`