AirbyteLib: Support write strategies: 'merge' and 'auto' #34592

Merged
merged 44 commits into from
Feb 6, 2024

Conversation

aaronsteers
Collaborator

@aaronsteers aaronsteers commented Jan 29, 2024

This adds write strategies 'merge' and 'auto'. This also adds primary key support for streams.

There's also a placeholder for force_full_refresh. Currently, all operations are full refresh, but it was necessary to add this option in order to make replace a viable write strategy. The replace write strategy will fail unless force_full_refresh is also set, because we don't want to replace previous sync results with a new incremental (aka partial) one.

When we add state/incremental support, the force_full_refresh option will provide a way to ignore (not pass) any state which was previously tracked.
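The guard described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `WriteStrategy` and `resolve_write_strategy` are hypothetical names, and the behavior shown for 'auto' (merge when a primary key exists, otherwise append) is an assumption made for the example.

```python
from enum import Enum


class WriteStrategy(str, Enum):
    """Hypothetical enum of the strategies named in this PR."""

    APPEND = "append"
    MERGE = "merge"
    REPLACE = "replace"
    AUTO = "auto"


def resolve_write_strategy(
    requested: WriteStrategy,
    *,
    has_primary_key: bool,
    force_full_refresh: bool,
) -> WriteStrategy:
    """Pick a concrete strategy and enforce the 'replace' guard (illustrative only)."""
    if requested == WriteStrategy.REPLACE and not force_full_refresh:
        # Replacing prior results with a partial (incremental) sync would lose data.
        raise ValueError("'replace' requires force_full_refresh=True")
    if requested == WriteStrategy.MERGE and not has_primary_key:
        raise ValueError("'merge' requires the stream to declare a primary key")
    if requested == WriteStrategy.AUTO:
        # Assumption for this sketch: prefer merge when a primary key is available.
        return WriteStrategy.MERGE if has_primary_key else WriteStrategy.APPEND
    return requested
```

With this shape, a caller asking for 'replace' without `force_full_refresh=True` gets an immediate error instead of silently overwriting earlier results.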

This PR also adds primary key definitions to source-faker. While source-faker does have id columns, it didn't previously declare these as primary keys. The needed changes to source-faker have now been migrated to this PR:

Unfortunately, this PR can't merge until the above merges. Otherwise, we'll need to create a new test source with primary keys.

Implementation Details

There are two code paths for merge upsert behavior. As of this PR, we support a 2-step merge upsert which is fully generic and should work for all SQL DB platforms. Essentially, this runs an UPDATE and then an INSERT operation. Because SQL dialects differ, this is written in SQLAlchemy abstractions, which are translated into the dialect of the respective database.
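To make the UPDATE-then-INSERT idea concrete, here is a minimal sketch using the standard library's sqlite3 with raw SQL. Note that it swaps in plain SQL strings for the SQLAlchemy abstractions the PR uses, and that `emulated_merge_upsert`, the table names, and the columns are all hypothetical.

```python
import sqlite3


def emulated_merge_upsert(conn, final_table, staging_table, pk, columns):
    """Two-step upsert sketch: UPDATE rows that match on pk, then INSERT the rest."""
    non_pk = [c for c in columns if c != pk]
    col_list = ", ".join(columns)
    with conn:  # commit both steps together, or roll back on error
        if non_pk:
            set_clause = ", ".join(
                f"{c} = (SELECT s.{c} FROM {staging_table} AS s "
                f"WHERE s.{pk} = {final_table}.{pk})"
                for c in non_pk
            )
            # Step 1: overwrite existing rows that also appear in staging.
            conn.execute(
                f"UPDATE {final_table} SET {set_clause} "
                f"WHERE {pk} IN (SELECT {pk} FROM {staging_table})"
            )
        # Step 2: insert staging rows whose key is not yet in the final table.
        conn.execute(
            f"INSERT INTO {final_table} ({col_list}) "
            f"SELECT {col_list} FROM {staging_table} "
            f"WHERE {pk} NOT IN (SELECT {pk} FROM {final_table})"
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE products_staging (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO products VALUES (?, ?)", [(1, "old"), (2, "keep")])
conn.executemany("INSERT INTO products_staging VALUES (?, ?)", [(1, "new"), (3, "added")])

emulated_merge_upsert(conn, "products", "products_staging", "id", ["id", "name"])
rows = conn.execute("SELECT id, name FROM products ORDER BY id").fetchall()
```

After the call, row 1 is updated, row 2 is untouched, and row 3 is newly inserted. The sketch interpolates identifiers directly into SQL, which is only acceptable here because the names are fixed in the example.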

The second code path is the one-step native merge upsert. The base class has an implementation of this, based on the Postgres dialect. However, caches that do not declare supports_merge_insert = True automatically fall back to the emulated (2-step) operation. To optimize loads, a cache can set supports_merge_insert = True and use the base implementation, if compatible, or override the merge upsert method to match its native dialect.
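The fallback logic could look roughly like the sketch below. Only the `supports_merge_insert` flag comes from the description above; the class and method names are hypothetical stand-ins, and the methods return labels instead of running SQL so the dispatch is easy to follow.

```python
class SQLCacheBase:
    """Hypothetical stand-in for the SQL cache base class."""

    # Caches opt in to the native one-step path by overriding this flag.
    supports_merge_insert = False

    def merge_temp_table_to_final(self, stream_name: str) -> str:
        if self.supports_merge_insert:
            return self._native_merge(stream_name)
        # Default: the generic UPDATE-then-INSERT emulation.
        return self._emulated_merge(stream_name)

    def _native_merge(self, stream_name: str) -> str:
        return f"native merge for {stream_name}"

    def _emulated_merge(self, stream_name: str) -> str:
        return f"emulated 2-step merge for {stream_name}"


class NativeMergeCache(SQLCacheBase):
    """Uses the base one-step implementation as-is."""

    supports_merge_insert = True


class CustomDialectCache(SQLCacheBase):
    """Overrides the native path to match its own dialect."""

    supports_merge_insert = True

    def _native_merge(self, stream_name: str) -> str:
        return f"dialect-specific merge for {stream_name}"
```

The design choice here is that the safe, generic path is the default, so a new cache works without any dialect-specific code and only opts in to the faster path deliberately.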


@aaronsteers aaronsteers changed the title [Do not merge]: Airbyte Lib: source-faker merge and append strategies AirbyteLib: Support write strategies: 'merge' and 'auto' Jan 30, 2024
@aaronsteers aaronsteers requested review from flash1293 and evantahler and removed request for evantahler January 30, 2024 05:11
Comment on lines 40 to 49
os.environ["AIRBYTE_LOCAL_REGISTRY"] = "./tests/integration_tests/fixtures/registry.json"
os.environ["DO_NOT_TRACK"] = "true"

Collaborator Author

Moved to a function-scoped fixture and locally-scoped patch method to avoid bleeding 'AIRBYTE_LOCAL_REGISTRY' to other files.
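A minimal sketch of the scoping idea, using only the standard library: the environment variables are patched inside a context manager and restored on exit, so they cannot leak into other test files. The helper name `local_registry_env` is hypothetical. In a pytest suite the same generator body could be wrapped in a function-scoped fixture.

```python
import os
from contextlib import contextmanager
from unittest.mock import patch


@contextmanager
def local_registry_env(registry_path: str):
    """Temporarily set the registry-related environment variables (illustrative helper)."""
    overrides = {
        "AIRBYTE_LOCAL_REGISTRY": registry_path,
        "DO_NOT_TRACK": "true",
    }
    # patch.dict restores os.environ to its previous contents when the block exits.
    with patch.dict(os.environ, overrides):
        yield
```

Each test that needs the local registry enters the context for its own duration, which keeps the setup stateless from the point of view of every other test module.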

Contributor

@flash1293 flash1293 left a comment

Still reviewing, but left some comments so far. It would be great to have a test for snowflake and postgres for merge to make sure it works as expected.

Theoretically it should be covered by sqlalchemy, but that seems important enough to have an e2e test for

airbyte-lib/airbyte_lib/_processors.py (outdated)
airbyte-lib/airbyte_lib/_processors.py
airbyte-lib/pyproject.toml
@@ -45,7 +45,7 @@ ruff = "^0.1.11"
types-jsonschema = "^4.20.0.0"
google-cloud-secret-manager = "^2.17.0"
types-requests = "2.31.0.4"
freezegun = "^1.4.0"
source-faker = {path = "../airbyte-integrations/connectors/source-faker", develop = true}
Contributor

Why do we need source-faker installed like this? Shouldn't airbyte-lib handle it via separate venv? It should also pull the local version

Collaborator Author

Before deciding to include it, I checked the dependencies of source-faker. It only depends on mimesis and the CDK. And mimesis has no hard dependencies.

Given the light footprint, and that this is only used as a test dependency, the benefits are:

  1. Stateless execution. There's no time at which the connector is not installed.
  2. Faster test invocation. Tests can start instantly without having to install the source or verify that its installation fixture is ready.
  3. Simplicity. Following from the statelessness of not needing the fixtures to install it, the overall test framework is simpler. We can expect that source-faker is on PATH and if not, it's an error.
  4. Tests the BYO (bring-your-own) model for connector executables.

I'm glad we have the source-faker example which installs itself and validates that part of the process. Given that we already have those checks, adding it as a dev dependency just focuses on the value we want to get from this connector (running it) without additional complexity.

Sidebar: I see that source-faker is now published to pypi 🙌, so we can pin to that specific published pypi version instead of using the relative reference. This will ensure our tests are stable even if the faker source changes.

Contributor

@flash1293 flash1293 left a comment

Tested around a bit and looks mostly good, there is one important comment about primary keys being set on the final table. Maybe we should add a test validating that as well? Seems worth it.

@@ -415,10 +441,15 @@ def _create_table(
self,
table_name: str,
column_definition_str: str,
primary_keys: list[str] | None = None,
Contributor

Primary keys are not set on the final table.

I think it's because this argument is never passed in. It should be passed on line 404, where the final table is created (self._create_table(table_name, column_definition_str, self._get_primary_keys(stream_name)))

Not sure whether there is an advantage to setting the primary key for the loading tables (I would guess not; maybe the join that's required for the merge would be slightly faster, but OTOH the temporary tables are pretty small so that shouldn't change a thing).
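As a rough illustration of the point, the sketch below builds a CREATE TABLE statement that includes a PRIMARY KEY clause when keys are passed in, covering the composite case too. `build_create_table_sql` and `primary_key_columns` are hypothetical helpers that only mirror the `_create_table` parameters shown in the diff, and sqlite3 is used just to check the result.

```python
import sqlite3


def build_create_table_sql(table_name, column_definition_str, primary_keys=None):
    """Return a CREATE TABLE statement, adding a PRIMARY KEY clause if keys are given."""
    pk_clause = ""
    if primary_keys:
        # A single clause handles both one-column and composite keys.
        pk_clause = f", PRIMARY KEY ({', '.join(primary_keys)})"
    return f"CREATE TABLE {table_name} ({column_definition_str}{pk_clause})"


def primary_key_columns(conn, table_name):
    """Read the declared primary key columns back from SQLite, in key order."""
    info = conn.execute(f"PRAGMA table_info({table_name})").fetchall()
    # Each row is (cid, name, type, notnull, dflt_value, pk); pk is 0 for non-key columns.
    keyed = sorted((row[5], row[1]) for row in info if row[5] > 0)
    return [name for _, name in keyed]


conn = sqlite3.connect(":memory:")
conn.execute(build_create_table_sql("products", "id INTEGER, name TEXT", ["id"]))
conn.execute(
    build_create_table_sql(
        "purchases",
        "user_id INTEGER, product_id INTEGER, quantity INTEGER",
        ["user_id", "product_id"],
    )
)
conn.execute(build_create_table_sql("loading_tmp", "id INTEGER, name TEXT"))
```

A test along these lines would catch the case raised here, where the argument exists but is never passed for the final table.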

assert catalog.streams[1].primary_key

read_result = source_faker_seed_a.read(duckdb_cache, write_strategy="append")
assert read_result.cache._get_primary_keys("products") == ["id"]
Contributor

Would be nice to have a test for the composite primary keys, as that logic is never called at the moment.

assert len(list(result.cache.streams["purchases"])) == FAKER_SCALE_A

# Third run, seed B - should increase record count to the scale of B, which is greater than A.
# TODO: See if we can reliably predict the exact number of records, since we use fixed seeds.
Contributor

from what I can tell we can rely on this (especially now with the pinned version of faker)

@aaronsteers aaronsteers merged commit 22b63c7 into master Feb 6, 2024
19 checks passed
@aaronsteers aaronsteers deleted the aj/airbyte-lib/faker-integration-tests branch February 6, 2024 08:25
xiaohansong pushed a commit that referenced this pull request Feb 13, 2024
jatinyadav-cc pushed a commit to ollionorg/datapipes-airbyte that referenced this pull request Feb 21, 2024
jatinyadav-cc pushed a commit to ollionorg/datapipes-airbyte that referenced this pull request Feb 26, 2024
jatinyadav-cc pushed a commit to ollionorg/datapipes-airbyte that referenced this pull request Feb 26, 2024
xiaohansong pushed a commit that referenced this pull request Feb 27, 2024