Expose staging tables truncation to config #1717
Very good already!
We need a proper test (look at test_stage_loading.py):
- make sure that the staging destination is not truncated (flag at default) and is truncated (flag set to false); at the same time, make sure that destination tables are NOT truncated
- exclude Athena if not forcing Iceberg (or make sure that the staging destination was not truncated)
- use some kind of simple dataset (not GitHub, which takes a lot of time)
@VioletM feel free to push this task to @rudolfix or @sh-rp or someone else...
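The checks requested above can be sketched without any real destination. The following is a minimal illustration only: `FakeStaging`, `FakeDestination`, `run_load`, and the `truncate_staging` parameter are hypothetical stand-ins invented for this sketch, not part of `dlt`. It shows the two assertions the test needs: staging files accumulate or get cleared depending on the flag, and destination tables keep their rows either way.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FakeStaging:
    """Hypothetical in-memory staging bucket: table name -> staged file names."""

    files: Dict[str, List[str]] = field(default_factory=dict)

    def truncate(self, table: str) -> None:
        # Drop every previously staged file for this table.
        self.files[table] = []

    def put(self, table: str, file_name: str) -> None:
        self.files.setdefault(table, []).append(file_name)


@dataclass
class FakeDestination:
    """Hypothetical in-memory destination: table name -> loaded rows."""

    rows: Dict[str, List[dict]] = field(default_factory=dict)


def run_load(
    staging: FakeStaging,
    destination: FakeDestination,
    table: str,
    rows: List[dict],
    load_id: str,
    truncate_staging: bool,
) -> None:
    """One load: optionally truncate staging, stage a file, append rows to the destination."""
    if truncate_staging:
        staging.truncate(table)
    staging.put(table, f"{table}.{load_id}.jsonl")
    # The destination table is appended to and never truncated here.
    destination.rows.setdefault(table, []).extend(rows)


def two_loads(truncate_staging: bool) -> tuple:
    """Run two consecutive loads and report (staged file count, destination row count)."""
    staging, destination = FakeStaging(), FakeDestination()
    run_load(staging, destination, "items", [{"id": 1}], "load_1", truncate_staging)
    run_load(staging, destination, "items", [{"id": 2}], "load_2", truncate_staging)
    return len(staging.files["items"]), len(destination.rows["items"])
```

With truncation off, both staged files remain after the second load; with truncation on, only the latest one does. The destination holds both rows in either case.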
Added a test which runs on
I think we should leave the default behavior as is: truncated (flag set to True by default). Otherwise, after the update, users would start to see an increase in staging storage usage.
This is very good! Thanks for the tests and the docs update. Your staging test is not marked as essential (OK!), so it won't run on a few destinations. Nevertheless, we can merge it and fix any remaining tests just before release.
(I expect that Athena non-Iceberg will fail here.)
Hm, but this test uses
@@ -96,4 +100,21 @@ In essence, you need to set up two destinations and then pass them to `dlt.pipel

Run the pipeline script as usual.

> 💡 Please note that `dlt` does not delete loaded files from the staging storage after the load is complete.
:::tip
I'd run this whole new section through ChatGPT to correct the grammar (or run the grammar checker on this page).
# check there are two staging files
_, staging_client = pipeline._get_destination_clients(pipeline.default_schema)
with staging_client:
    assert len(staging_client.list_table_files(table_name)) == 2  # type: ignore[attr-defined]
Note for later: we should probably allow fs_client on the pipeline to also return the staging filesystem client when a flag is set.
@@ -503,6 +503,9 @@ def _should_autodetect_schema(self, table_name: str) -> bool:
            self.schema._schema_tables, table_name, AUTODETECT_SCHEMA_HINT, allow_none=True
        ) or (self.config.autodetect_schema and table_name not in self.schema.dlt_table_names())

    def should_truncate_table_before_load_on_staging_destination(self, table: TTableSchema) -> bool:
IMHO, having the exact same line in all implementations is not DRY. We can keep it like this for now, but I would rather implement this in the superclass, probably with hasattr and isinstance to get the config and verify its type.
Did you see the previous version? At least here this is simple, and mypy forces you to write the overrides. Previously you had to set up cooperative calling of super() and keep state in a class that should be an interface/trait. Also, if you forgot about that, the config would have no effect and you would never know about it.
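The trade-off discussed in these two comments can be sketched as follows. This is an illustration under assumptions, not the code from the PR: only the method name `should_truncate_table_before_load_on_staging_destination` comes from the diff, while `StagingConfig`, its `truncate_staging` field, and the three classes are hypothetical names made up for the example.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class StagingConfig:
    # Hypothetical config object; the field name is an assumption for this sketch.
    truncate_staging: bool = True


class WithStagingDataset(ABC):
    """Interface-style base: every client must answer the question explicitly."""

    @abstractmethod
    def should_truncate_table_before_load_on_staging_destination(self, table: dict) -> bool:
        ...


class ExplicitClient(WithStagingDataset):
    """Variant kept in the PR discussion: one repeated line per client, but a checked override."""

    def __init__(self, config: StagingConfig) -> None:
        self.config = config

    def should_truncate_table_before_load_on_staging_destination(self, table: dict) -> bool:
        return self.config.truncate_staging


class ForgetfulClient(WithStagingDataset):
    """A client that forgot the override; it cannot even be instantiated."""


class IntrospectingBase:
    """Alternative: a single superclass implementation using getattr/isinstance."""

    def __init__(self, config: object) -> None:
        self.config = config

    def should_truncate_table_before_load_on_staging_destination(self, table: dict) -> bool:
        config = getattr(self, "config", None)
        if isinstance(config, StagingConfig):
            return config.truncate_staging
        # Unknown config type: silently fall back, so a misconfiguration goes unnoticed.
        return True
```

The abstract-method variant repeats one line per client, yet a missing override fails loudly (`TypeError` at instantiation, and a type-checker error before that). The introspecting variant is DRY, but a config of the wrong type falls through to the default without any signal.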
    "destination_config", destinations_configs(all_staging_configs=True), ids=lambda x: x.name
)
def test_truncate_staging_dataset(destination_config: DestinationTestConfiguration) -> None:
    """This test checks if tables truncation on staging destination done according to the configuration.
The test is good, but to make it great we should also test whether keeping the staging files around causes the data to be loaded again when it shouldn't be. But for now I'd say it's good.
good point, should be easy to do
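A rough sketch of what such a follow-up check could look like, using plain dictionaries instead of a real staging bucket. The two loader functions below are hypothetical and say nothing about how `dlt` destinations actually copy staged files; they only contrast the safe case (copy the files named by the current load) with the risky one (copy everything under a table prefix) that the extra test would guard against.

```python
from typing import Dict, List

# Hypothetical staging contents: file name -> rows stored in that file.
Staging = Dict[str, List[dict]]


def load_by_reference(staging: Staging, file_names: List[str]) -> List[dict]:
    """Copy only the files explicitly named by the current load."""
    rows: List[dict] = []
    for name in file_names:
        rows.extend(staging[name])
    return rows


def load_by_prefix(staging: Staging, prefix: str) -> List[dict]:
    """Copy every staged file under a table prefix, including leftovers from earlier loads."""
    rows: List[dict] = []
    for name in sorted(staging):
        if name.startswith(prefix):
            rows.extend(staging[name])
    return rows


# Staging kept without truncation: the file from the first load is still present.
staging: Staging = {
    "items/load_1.jsonl": [{"id": 1}],
    "items/load_2.jsonl": [{"id": 2}],
}

second_load_safe = load_by_reference(staging, ["items/load_2.jsonl"])
second_load_risky = load_by_prefix(staging, "items/")
```

A test along these lines would assert that the second load contributes only its own rows, which holds for the reference-based copy and fails for the prefix-based one.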
LGTM, please run the grammar checker on the docs and then you can merge.
* Expose staging tables truncation to config
* Fix comments, add tests
* Fix tests
* Move implementation from mixing, add tests
* Fix docs grammar
No description provided.