AirbyteLib: Support write strategies: 'merge' and 'auto' #34592
Conversation
os.environ["AIRBYTE_LOCAL_REGISTRY"] = "./tests/integration_tests/fixtures/registry.json"
os.environ["DO_NOT_TRACK"] = "true"
Moved to a function-scoped fixture and a locally-scoped patch method to avoid leaking 'AIRBYTE_LOCAL_REGISTRY' into other test files.
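For reference, a minimal sketch of that pattern using pytest's built-in, function-scoped `monkeypatch` fixture (the fixture name here is illustrative):

```python
import pytest

@pytest.fixture
def local_registry_env(monkeypatch: pytest.MonkeyPatch) -> None:
    # monkeypatch is function-scoped: both variables are restored after
    # each test, so AIRBYTE_LOCAL_REGISTRY never leaks into other files.
    monkeypatch.setenv(
        "AIRBYTE_LOCAL_REGISTRY",
        "./tests/integration_tests/fixtures/registry.json",
    )
    monkeypatch.setenv("DO_NOT_TRACK", "true")
```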
Still reviewing, but left some comments so far. It would be great to have a merge test for Snowflake and Postgres to make sure it works as expected.
Theoretically it should be covered by SQLAlchemy, but this seems important enough to warrant an e2e test.
airbyte-lib/pyproject.toml
Outdated
@@ -45,7 +45,7 @@ ruff = "^0.1.11"
types-jsonschema = "^4.20.0.0"
google-cloud-secret-manager = "^2.17.0"
types-requests = "2.31.0.4"
freezegun = "^1.4.0"
source-faker = {path = "../airbyte-integrations/connectors/source-faker", develop = true}
Why do we need source-faker installed like this? Shouldn't airbyte-lib handle it via a separate venv? It should also pull the local version.
Before deciding to include it, I checked the dependencies of `source-faker`. It only depends on `mimesis` and the CDK, and `mimesis` has no hard dependencies.
Given the light footprint, and that this is only used as a test dependency, the benefits are:
- Stateless execution. There's no time at which the connector is not installed.
- Faster test invocation. Tests can start instantly, without having to install the source or verify that its installation fixture is ready.
- Simplicity. Following from the statelessness of not needing fixtures to install it, the overall test framework is simpler. We can expect that `source-faker` is on `PATH`, and if not, it's an error.
- Tests the BYO model for connector executables.

I'm glad we have the `source-faker` example which installs itself and validates that part of the process. Given that we already have those checks, adding it as a dev dependency lets us focus on the value we want to get from this connector (running it) without additional complexity.

Sidebar: I see that `source-faker` is now published to PyPI 🙌, so we can pin to that specific published PyPI version instead of using the relative reference. This will ensure our tests are stable even if the faker source changes.
Tested around a bit and it looks mostly good; there is one important comment about primary keys not being set on the final table. Maybe we should add a test validating that as well? Seems worth it.
@@ -415,10 +441,15 @@ def _create_table(
    self,
    table_name: str,
    column_definition_str: str,
    primary_keys: list[str] | None = None,
Primary keys are not set on the final table.
I think it's because this argument is never passed in. It should be passed on line 404, where the final table is created: `self._create_table(table_name, column_definition_str, self._get_primary_keys(stream_name))`.
I'm not sure whether there is an advantage to setting the primary key on the loading tables (I would guess not; maybe the join required for the merge would be slightly faster, but OTOH the temporary tables are pretty small, so that shouldn't change a thing).
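To illustrate the suggestion, here is a hedged sketch of how a `primary_keys` argument could be folded into the CREATE TABLE DDL (a standalone helper for illustration only, not the actual airbyte-lib implementation):

```python
def build_create_table_sql(
    table_name: str,
    column_definition_str: str,
    primary_keys: list[str] | None = None,
) -> str:
    """Illustrative only: fold an optional PRIMARY KEY clause into the DDL."""
    pk_clause = (
        f",\n  PRIMARY KEY ({', '.join(primary_keys)})" if primary_keys else ""
    )
    return f"CREATE TABLE {table_name} (\n  {column_definition_str}{pk_clause}\n)"

# build_create_table_sql("products", "id INTEGER, name VARCHAR", ["id"])
# returns a CREATE TABLE statement whose last line is: PRIMARY KEY (id)
```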
assert catalog.streams[1].primary_key

read_result = source_faker_seed_a.read(duckdb_cache, write_strategy="append")
assert read_result.cache._get_primary_keys("products") == ["id"]
It would be nice to have a test for composite primary keys, as that logic is never called at the moment.
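Such a test might look roughly like this (the stream name, fixtures, and composite key are hypothetical; only `_get_primary_keys` comes from the diff above):

```python
def test_composite_primary_keys(source_with_composite_pk, duckdb_cache) -> None:
    # Hypothetical: assumes a stream declaring two primary-key columns.
    result = source_with_composite_pk.read(duckdb_cache, write_strategy="merge")
    assert result.cache._get_primary_keys("order_items") == ["order_id", "product_id"]
```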
assert len(list(result.cache.streams["purchases"])) == FAKER_SCALE_A

# Third run, seed B - should increase record count to the scale of B, which is greater than A.
# TODO: See if we can reliably predict the exact number of records, since we use fixed seeds.
From what I can tell, we can rely on this (especially now with the pinned version of source-faker).
…r-integration-tests
Co-authored-by: Joe Reuter <joe@airbyte.io>
This adds the write strategies 'merge' and 'auto'. It also adds primary key support for streams.

There's also a placeholder for `force_full_refresh`. Currently, all operations are full refresh, but it was necessary to add this option in order to make `replace` a viable write strategy. The `replace` write strategy will fail unless `force_full_refresh` is also set, because we don't want to replace previous sync results with a new incremental (aka partial) one. When we add state/incremental support, the `force_full_refresh` option will give a way to ignore (not pass) any state which was previously tracked.

This PR also adds primary key definitions to source-faker. While `source-faker` does have `id` columns, it didn't previously declare these as primary keys. The needed changes to `source-faker` have now been migrated to this PR:

Unfortunately, this PR can't merge until the above merges. Otherwise, we'll need to create a new test source with primary keys.
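For illustration, here is how the strategies described above might be exercised from the caller's side (a sketch based on the test snippets in this thread; the exact airbyte-lib entry points may differ):

```python
import airbyte_lib as ab

# Assumed entry points; the read() signature matches the tests above.
source = ab.get_connector("source-faker", config={"count": 100})
cache = ab.new_local_cache()

# 'auto' resolves to 'merge' when the stream declares primary keys,
# and falls back to 'append' otherwise.
source.read(cache, write_strategy="auto")

# 'replace' refuses to run without force_full_refresh, so a partial
# (incremental) sync can never silently overwrite previous results.
source.read(cache, write_strategy="replace", force_full_refresh=True)
```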
Implementation Details
There are two code paths for merge upsert behavior. As of this PR, we support a generic two-step merge upsert which should work on all SQL DB platforms. Essentially, this runs an UPDATE and then an INSERT operation. Because SQL dialects differ, it is written in SQLAlchemy abstractions, which are translated into the dialect of the respective database.
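A sketch of what that two-step emulation looks like in SQLAlchemy Core (table objects and the helper name are illustrative; this is not the exact airbyte-lib code):

```python
from sqlalchemy import Table, and_, not_, select
from sqlalchemy.engine import Connection

def emulated_merge_upsert(
    conn: Connection,
    final_table: Table,
    temp_table: Table,
    primary_keys: list[str],
    columns: list[str],
) -> None:
    pk_match = and_(*(final_table.c[pk] == temp_table.c[pk] for pk in primary_keys))

    # Step 1: UPDATE rows that already exist, pulling values from the temp table.
    conn.execute(
        final_table.update()
        .values({col: temp_table.c[col] for col in columns})
        .where(pk_match)
    )

    # Step 2: INSERT rows whose primary key is not yet in the final table.
    new_rows = select(*(temp_table.c[col] for col in columns)).where(
        not_(select(final_table.c[primary_keys[0]]).where(pk_match).exists())
    )
    conn.execute(final_table.insert().from_select(columns, new_rows))
```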
The second code path is the one-step native merge upsert. The base class does have an implementation of this, based on the Postgres dialect. However, if a cache does not declare `supports_merge_insert = True`, it will automatically fall back to the emulated (two-step) operation. To optimize loads, caches can set `supports_merge_insert = True` and use the base implementation, if compatible, or they can override the merge upsert method to match their native dialect.
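The dispatch described above might be structured roughly like this (class and method names are illustrative, not the actual airbyte-lib identifiers):

```python
class SqlCacheBase:
    # Dialects opt in when the base one-step implementation (written
    # against Postgres-style syntax) is compatible, or they override
    # _merge_insert with their own native MERGE statement.
    supports_merge_insert = False

    def write_merge(self, conn, final_table, temp_table, primary_keys, columns) -> None:
        if self.supports_merge_insert:
            self._merge_insert(conn, final_table, temp_table, primary_keys, columns)
        else:
            # Generic fallback: the two-step UPDATE + INSERT emulation above.
            self._emulated_merge_upsert(conn, final_table, temp_table, primary_keys, columns)

    def _merge_insert(self, conn, final_table, temp_table, primary_keys, columns) -> None:
        ...  # one-step native merge upsert (base: Postgres-style)

    def _emulated_merge_upsert(self, conn, final_table, temp_table, primary_keys, columns) -> None:
        ...  # two-step emulation, portable across dialects


class PostgresCache(SqlCacheBase):
    # Compatible with the base implementation, so simply opt in.
    supports_merge_insert = True
```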