-
Notifications
You must be signed in to change notification settings - Fork 92
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: delta plugin support write #284
Changes from all commits
51b6d0d
82e24fd
b3eb619
8257751
43fce24
09aa73f
952bd6f
1b559d8
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,9 @@ | ||
Something what i don't understand? | ||
|
||
- For example we want to push something with a sql plugin to a database. Why we first write it down to disk and then read in memory with pandas and then push it further? | ||
|
||
|
||
|
||
Future: | ||
|
||
- If we make an external materialization we want that upstream models can refer to it? How to do it we have to register an df and create an view to it? | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The current There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is a bit tricky because the underlying delta table can't be just created but it has to be registered first as df. We do that currently when the delta table is defined in the source but we have to invoke this process here too. Let me try some stuff here if not we can make it a limitation and provide as the last layer that can't be referenced There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. As of my understanding, if you use SQL/excel write plugin, you cant reference it afterward? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, you can reference any There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Just noting that I would support a custom materialization type for There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. i think that we can also put it in external but i have to refactor it a bit, but i am still not so far away, currently i just play around and try to make it work anyway |
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
@@ -1,10 +1,12 @@ | ||||||
import os | ||||||
import pyarrow.compute as pc | ||||||
from typing import Any | ||||||
from typing import Dict | ||||||
|
||||||
from deltalake import DeltaTable | ||||||
from deltalake import DeltaTable, write_deltalake | ||||||
|
||||||
from . import BasePlugin | ||||||
from ..utils import SourceConfig | ||||||
from ..utils import SourceConfig, TargetConfig | ||||||
|
||||||
|
||||||
class Plugin(BasePlugin): | ||||||
|
@@ -16,15 +18,14 @@ def configure_cursor(self, cursor): | |||||
|
||||||
def load(self, source_config: SourceConfig): | ||||||
if "delta_table_path" not in source_config: | ||||||
raise Exception("'delta_table_path' is a required argument for the delta table!") | ||||||
raise Exception( | ||||||
"'delta_table_path' is a required argument for the delta table!" | ||||||
) | ||||||
|
||||||
table_path = source_config["delta_table_path"] | ||||||
storage_options = source_config.get("storage_options", None) | ||||||
storage_options = source_config.get("storage_options", {}) | ||||||
|
||||||
if storage_options: | ||||||
dt = DeltaTable(table_path, storage_options=storage_options) | ||||||
else: | ||||||
dt = DeltaTable(table_path) | ||||||
dt = DeltaTable(table_path, storage_options=storage_options) | ||||||
|
||||||
# delta attributes | ||||||
as_of_version = source_config.get("as_of_version", None) | ||||||
|
@@ -43,6 +44,110 @@ def load(self, source_config: SourceConfig): | |||||
def default_materialization(self): | ||||||
return "view" | ||||||
|
||||||
def store(self, target_config: TargetConfig, df=None): | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. this part will be updated as integration with current There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
also we can instantiate
just leaving a comment to want to be of little help.. I've never contributed to open source, is there any way I can apply if I want to contribute? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Hi @ZergRocks, There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The pyarrow.fs won't be implemented, it was a placeholder for a possible implementation but we are not continuing with that anymore. |
||||||
mode = target_config.config.get("mode", "overwrite") | ||||||
table_path = target_config.location.path | ||||||
storage_options = target_config.config.get("storage_options", {}) | ||||||
|
||||||
if mode == "overwrite_partition": | ||||||
partition_key = target_config.config.get("partition_key", None) | ||||||
if not partition_key: | ||||||
raise Exception( | ||||||
"'partition_key' has to be defined when mode 'overwrite_partition'!" | ||||||
) | ||||||
|
||||||
if isinstance(partition_key, str): | ||||||
partition_key = [partition_key] | ||||||
|
||||||
partition_dict = [] | ||||||
for each_key in partition_key: | ||||||
unique_key_array = pc.unique(df[each_key]) | ||||||
|
||||||
if len(unique_key_array) == 1: | ||||||
partition_dict.append((each_key, str(unique_key_array[0]))) | ||||||
else: | ||||||
raise Exception( | ||||||
f"'{each_key}' column has not one unique value, values are: {str(unique_key_array)}" | ||||||
) | ||||||
create_insert_partition(table_path, df, partition_dict, storage_options) | ||||||
elif mode == "merge": | ||||||
# very slow -> https://github.com/delta-io/delta-rs/issues/1846 | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is not the case anymore :) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Hi @ion-elgreco, thank you very much for all the comments; I am happy that there is somebody from the delta-rs side with some insights. I very much appreciate your help. |
||||||
unique_key = target_config.config.get("unique_key", None) | ||||||
if not unique_key: | ||||||
raise Exception("'unique_key' has to be defined when mode 'merge'!") | ||||||
if isinstance(unique_key, str): | ||||||
unique_key = [unique_key] | ||||||
|
||||||
predicate_stm = " and ".join( | ||||||
[ | ||||||
f'source."{each_unique_key}" = target."{each_unique_key}"' | ||||||
for each_unique_key in unique_key | ||||||
] | ||||||
) | ||||||
|
||||||
try: | ||||||
target_dt = DeltaTable(table_path, storage_options=storage_options) | ||||||
except Exception: | ||||||
# TODO handle this better | ||||||
write_deltalake( | ||||||
table_or_uri=table_path, data=df, storage_options=storage_options | ||||||
) | ||||||
|
||||||
target_dt = DeltaTable(table_path, storage_options=storage_options) | ||||||
# TODO there is a problem if the column name is uppercase | ||||||
target_dt.merge( | ||||||
source=df, | ||||||
predicate=predicate_stm, | ||||||
source_alias="source", | ||||||
target_alias="target", | ||||||
).when_not_matched_insert_all().execute() | ||||||
else: | ||||||
write_deltalake( | ||||||
table_or_uri=table_path, | ||||||
data=df, | ||||||
mode=mode, | ||||||
storage_options=storage_options, | ||||||
) | ||||||
|
||||||
|
||||||
def table_exists(table_path, storage_options): | ||||||
# this is bad, i have to find the way to see if there is table behind path | ||||||
try: | ||||||
DeltaTable(table_path, storage_options=storage_options) | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. To do this operation a bit cheaper, set also |
||||||
except Exception: | ||||||
return False | ||||||
return True | ||||||
|
||||||
|
||||||
## TODO | ||||||
# add partition writing | ||||||
# add optimization, vacumm options to automatically run before each run ? | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The plan is to make these things configurable on the delta table itself which then should handle at which interval to vacuum or optimize, similar to spark-delta. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I am courious about that one is there some docu? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Not yet :) Was just sharing what I was planning to add |
||||||
# can deltars optimize if the data is bigger then memory? | ||||||
|
||||||
|
||||||
def create_insert_partition(table_path, data, partitions, storage_options): | ||||||
"""create a new delta table on the path or overwrite existing partition""" | ||||||
|
||||||
if table_exists(table_path, storage_options): | ||||||
partition_expr = [ | ||||||
(partition_name, "=", partition_value) | ||||||
for (partition_name, partition_value) in partitions | ||||||
] | ||||||
print( | ||||||
f"Overwriting delta table under: {table_path} \nwith partition expr: {partition_expr}" | ||||||
) | ||||||
write_deltalake( | ||||||
table_path, data, partition_filters=partition_expr, mode="overwrite" | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Partition_filters are going to be a thing of the past. We are moving to the rust engine slowly and soon there will be a predicate overwrite (in the next python 0.15.2) that is more flexible than the partition_filter overwrite that used by the pyarrow writer. |
||||||
) | ||||||
else: | ||||||
partitions = [ | ||||||
partition_name for (partition_name, partition_value) in partitions | ||||||
] | ||||||
print( | ||||||
f"Creating delta table under: {table_path} \nwith partitions: {partitions}" | ||||||
) | ||||||
write_deltalake(table_path, data, partition_by=partitions) | ||||||
|
||||||
|
||||||
# Future | ||||||
# TODO add databricks catalog |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,19 @@ | ||
{% materialization external_table, adapter="duckdb", supported_languages=['sql', 'python'] %} | ||
{{ log("External macro") }} | ||
|
||
{%- set target_relation = this.incorporate(type='view') %} | ||
|
||
{%- set plugin_name = config.get('plugin') -%} | ||
{%- set location = render(config.get('location', default=external_location(this, config))) -%}) | ||
{%- set format = config.get('format', 'parquet') -%} | ||
|
||
{% do store_relation(plugin_name, target_relation, location, format, config) %} | ||
|
||
{% call statement('main', language='sql') -%} | ||
|
||
{%- endcall %} | ||
|
||
-- we have to load this table as df and create target_relation view | ||
|
||
{{ return({'relations': [target_relation]}) }} | ||
{% endmaterialization %} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,94 @@ | ||
import pytest | ||
from pathlib import Path | ||
import pandas as pd | ||
import tempfile | ||
|
||
from dbt.tests.util import ( | ||
check_relations_equal, | ||
run_dbt, | ||
) | ||
from deltalake.writer import write_deltalake | ||
|
||
delta_schema_yml = """ | ||
|
||
""" | ||
|
||
ref1 = """ | ||
select 2 as a, 'test' as b | ||
""" | ||
|
||
delta1_sql = """ | ||
{{ config( | ||
materialized='external_table', | ||
plugin = 'delta', | ||
location = '/workspaces/dbt-duckdb/delta_test', | ||
storage_options = { | ||
'test' : 'test' | ||
} | ||
|
||
) }} | ||
select * from {{ref('ref1')}} | ||
""" | ||
|
||
delta2_sql = """ | ||
{{ config( | ||
materialized='external_table', | ||
plugin = 'delta', | ||
location = '/workspaces/dbt-duckdb/delta2_test', | ||
mode = 'merge', | ||
unique_key = 'a' | ||
|
||
) }} | ||
select * from {{ref('ref1')}} | ||
""" | ||
|
||
|
||
@pytest.mark.skip_profile("buenavista", "md") | ||
class TestPlugins: | ||
@pytest.fixture(scope="class") | ||
def delta_test_table1(self): | ||
td = tempfile.TemporaryDirectory() | ||
path = Path(td.name) | ||
table_path = path / "test_delta_table1" | ||
|
||
yield table_path | ||
|
||
td.cleanup() | ||
|
||
@pytest.fixture(scope="class") | ||
def profiles_config_update(self, dbt_profile_target): | ||
plugins = [{"module": "delta"}] | ||
return { | ||
"test": { | ||
"outputs": { | ||
"dev": { | ||
"type": "duckdb", | ||
"path": dbt_profile_target.get("path", ":memory:"), | ||
"plugins": plugins, | ||
} | ||
}, | ||
"target": "dev", | ||
} | ||
} | ||
|
||
@pytest.fixture(scope="class") | ||
def models(self, delta_test_table1): | ||
return { | ||
|
||
"delta_table2.sql": delta2_sql, | ||
"ref1.sql": ref1 | ||
} | ||
|
||
def test_plugins(self, project): | ||
results = run_dbt() | ||
#assert len(results) == 4 | ||
|
||
# check_relations_equal( | ||
# project.adapter, | ||
# [ | ||
# "delta_table3", | ||
# "delta_table3_expected", | ||
# ], | ||
# ) | ||
# res = project.run_sql("SELECT count(1) FROM 'delta_table3'", fetch="one") | ||
# assert res[0] == 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Primarily an artifact of the historical architecture here-- we built the plugins on top of the existing
external
materialization type. There isn't a reason we could not also support plugins on non-external materializations (tables/views/etc.), it's just work that hasn't come up yet.