feat: delta plugin support write #284
Conversation
Sorry it's taken me all week to get to this; I've been feeling under the weather.
@@ -0,0 +1,9 @@
Something I don't understand:
- For example, if we want to push something with a SQL plugin to a database, why do we first write it to disk, then read it back into memory with pandas, and only then push it onward?
Primarily an artifact of the historical architecture here: we built the plugins on top of the existing `external` materialization type. There isn't a reason we could not also support plugins on non-external materializations (tables/views/etc.); it's just work that hasn't come up yet.
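For readers following along, here is a rough sketch of what a write-capable plugin hooked into the `external` materialization might look like, based on the method signatures visible in this diff. The import paths and the toy bodies are my assumptions, not the actual implementation in this PR:

```python
import pandas as pd

# Import paths are an assumption based on this diff; check the repo for the
# actual module layout.
from dbt.adapters.duckdb.plugins import BasePlugin
from dbt.adapters.duckdb.utils import SourceConfig, TargetConfig


class Plugin(BasePlugin):
    def load(self, source_config: SourceConfig):
        # Return an in-memory object (DataFrame / Arrow table) that dbt-duckdb
        # registers so models can select from this source.
        return pd.DataFrame({"id": [1, 2, 3]})

    def default_materialization(self):
        return "view"

    def store(self, target_config: TargetConfig, df=None):
        # Called by the `external` materialization after the model is built;
        # this is where a write plugin pushes data to its target system.
        print(f"would write {len(df)} rows for {target_config}")
```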
Future:
- If we create an external materialization, we want downstream models to be able to refer to it. How do we do that: do we have to register a DataFrame and create a view over it?
The current `external` materialization does this already; dbt-duckdb creates a VIEW in the DuckDB database that points at the externally materialized file so that it can be used/queried like normal by any other models that we run.
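As an illustration of that pattern, a minimal sketch with the duckdb Python API (the path and model name are placeholders):

```python
import duckdb

con = duckdb.connect()

# The external materialization writes the model to a file, then exposes it
# back inside DuckDB as a VIEW so downstream models can query it normally.
external_path = "/tmp/external/my_model.parquet"  # hypothetical location
con.execute(
    f"CREATE OR REPLACE VIEW my_model AS SELECT * FROM read_parquet('{external_path}')"
)
print(con.sql("SELECT count(*) FROM my_model").fetchall())
```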
This is a bit tricky because the underlying delta table can't just be created; it has to be registered first as a DataFrame. We currently do that when the delta table is defined as a source, but we would have to invoke that process here too. Let me try some things here; if it doesn't work, we can make it a limitation and treat it as a final layer that can't be referenced.
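A sketch of the registration step being described, assuming the deltalake and duckdb Python packages (path and names are placeholders):

```python
import duckdb
from deltalake import DeltaTable

con = duckdb.connect()
delta_path = "/tmp/delta/my_table"  # hypothetical location of an existing Delta table

# A Delta table cannot be scanned as a plain file, so it is first exposed as an
# Arrow dataset, registered with the connection, and then wrapped in a VIEW.
dataset = DeltaTable(delta_path).to_pyarrow_dataset()
con.register("my_table_arrow", dataset)
con.execute("CREATE OR REPLACE VIEW my_table AS SELECT * FROM my_table_arrow")
```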
As far as I understand it, if you use the SQL/Excel write plugin, you can't reference it afterward?
Yes, you can reference any `external` materialization after it gets created in other, downstream models (sorry for the lag, I've been traveling in Asia and will be here for the next few weeks).
Just noting that I would support a custom materialization type for delta tables if that is what we really need to make the write-side work well here.
I think we could also put it in `external`, but I would have to refactor it a bit. I'm not that far along yet; currently I'm just playing around and trying to make it work at all.
@@ -43,6 +44,110 @@ def load(self, source_config: SourceConfig):
    def default_materialization(self):
        return "view"

    def store(self, target_config: TargetConfig, df=None):
This part will be updated to integrate with the current profiles.yml `filesystem` option after delta-io/delta-rs#570 is finished, right?
delta-rs's filesystem support seems like it will use pyarrow's `fs`, and that could be incorporated into this method with an if-else or match clause.
def initialize_plugins(cls, creds: DuckDBCredentials) -> Dict[str, BasePlugin]:
We could also instantiate `pyarrow.fs` following the existing pattern here, but I'm not familiar with this repository's conventions.
if creds.filesystems:
Just leaving a comment in the hope of being of some help. I've never contributed to open source; is there any way I can apply if I want to contribute?
Hi @ZergRocks,
I am not sure I follow here. As I understand it, you are asking whether we support custom filesystems that can be used with the delta-rs package:
https://delta-io.github.io/delta-rs/usage/loading-table/#custom-storage-backends?
It is currently not implemented, but it could be, at least to begin with on the read part, which is done on my side. If you want to work on it, I would be happy to support you. You can write to me in the dbt Slack channel and we can discuss it together.
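For anyone picking this up, a minimal sketch of how custom storage backends are usually configured through `storage_options` in deltalake today (bucket, region, and credentials are placeholders, not values from this PR):

```python
import pandas as pd
from deltalake import DeltaTable, write_deltalake

# Placeholder credentials; in dbt-duckdb these would come from profiles.yml / plugin config.
storage_options = {
    "AWS_REGION": "us-east-1",
    "AWS_ACCESS_KEY_ID": "<key>",
    "AWS_SECRET_ACCESS_KEY": "<secret>",
}

table_uri = "s3://my-bucket/my_table"  # hypothetical URI

write_deltalake(table_uri, pd.DataFrame({"id": [1, 2]}), storage_options=storage_options)
dt = DeltaTable(table_uri, storage_options=storage_options)
```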
The `pyarrow.fs` approach won't be implemented; it was a placeholder for a possible implementation, but we are not continuing with that anymore.
Hi all, we are really interested in this feature. Is there a rough estimate of whether and when it could be merged into the main branch? Thanks.
Hi @nfoerster, I also rebase on the main branch from time to time. If you want, you can use this branch to try things out and provide feedback/use cases that are missing, but be aware that other things may be broken.
I hope to write up a small proposal in the next few days to refactor the current implementation so we can implement this plugin and test it.
def table_exists(table_path, storage_options):
    # this is bad, i have to find the way to see if there is table behind path
    try:
        DeltaTable(table_path, storage_options=storage_options)
To make this operation a bit cheaper, also set `without_files=True`.
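A sketch of how the check could look with that flag (assuming the `TableNotFoundError` exception available in recent deltalake releases; older versions may need a broader except):

```python
from deltalake import DeltaTable
from deltalake.exceptions import TableNotFoundError


def table_exists(table_path, storage_options=None):
    try:
        # without_files=True skips loading the file list, making the check cheaper.
        DeltaTable(table_path, storage_options=storage_options, without_files=True)
    except TableNotFoundError:
        return False
    return True
```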
    check_relations_equal,
    run_dbt,
)
from deltalake.writer import write_deltalake
You can import it directly from `deltalake`.
## TODO
# add partition writing
# add optimization, vacuum options to automatically run before each run?
The plan is to make these things configurable on the delta table itself, which should then handle the interval at which to vacuum or optimize, similar to Spark Delta.
I am curious about that one; is there some documentation?
Not yet :) Was just sharing what I was planning to add
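Until that lands, a minimal sketch of running those maintenance steps manually with today's deltalake API (path and retention period are placeholders):

```python
from deltalake import DeltaTable

dt = DeltaTable("/tmp/delta/my_table")  # hypothetical table location

# Compact small files into larger ones.
dt.optimize.compact()

# Delete unreferenced files, keeping 7 days (168 hours) of history.
dt.vacuum(retention_hours=168, dry_run=False, enforce_retention_duration=True)
```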
f"Overwriting delta table under: {table_path} \nwith partition expr: {partition_expr}" | ||
) | ||
write_deltalake( | ||
table_path, data, partition_filters=partition_expr, mode="overwrite" |
partition_filters are going to be a thing of the past. We are slowly moving to the rust engine, and soon there will be a predicate overwrite (in the next Python release, 0.15.2) that is more flexible than the partition_filters overwrite used by the pyarrow writer.
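A sketch of what that predicate-based overwrite looks like (assuming deltalake >= 0.15.2 with the rust engine; the column and predicate are illustrative):

```python
import pyarrow as pa
from deltalake import write_deltalake

data = pa.table({"year": [2024, 2024], "value": [1, 2]})

# Replace only the rows matching the predicate, instead of whole partitions.
write_deltalake(
    "/tmp/delta/my_table",  # hypothetical table location
    data,
    mode="overwrite",
    predicate="year = 2024",
    engine="rust",
)
```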
)
create_insert_partition(table_path, df, partition_dict, storage_options)
elif mode == "merge":
    # very slow -> https://github.com/delta-io/delta-rs/issues/1846
This is not the case anymore :)
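For reference, a rough sketch of the current merge API in deltalake (table path, aliases, and predicate are made up):

```python
import pyarrow as pa
from deltalake import DeltaTable

dt = DeltaTable("/tmp/delta/my_table")  # hypothetical existing table
source = pa.table({"id": [1, 2], "value": [10, 20]})

(
    dt.merge(
        source=source,
        predicate="target.id = source.id",
        source_alias="source",
        target_alias="target",
    )
    .when_matched_update_all()
    .when_not_matched_insert_all()
    .execute()
)
```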
Hi @ion-elgreco, thank you very much for all the comments; I am happy that there is somebody from the delta-rs side with some insights. I very much appreciate your help.
I will recheck this when I finish the general plugin refactoring on the dbt side and start testing with the deltalake integration again.
Hi @jwills, the main problem is that when we use the plugin export, we have an extra step where we export the data to a Parquet file and then load it back into memory with pandas. I think we have to provide a direct reference to the Arrow format (ideally a dataset) to the underlying export library (pandas, or datafusion in the delta case), if this is somehow possible.
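A sketch of the kind of direct Arrow hand-off being proposed here, using DuckDB's record-batch reader instead of a Parquet round trip (query and path are placeholders):

```python
import duckdb
from deltalake import write_deltalake

con = duckdb.connect()

# Stream the model's result set out of DuckDB as Arrow record batches
# rather than writing a Parquet file and re-reading it with pandas.
reader = con.execute("SELECT * FROM range(1000) t(id)").fetch_record_batch()

write_deltalake("/tmp/delta/my_model", reader, mode="overwrite")
```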
Regarding other plugins, I took a look at the pandas <-> Arrow combination: https://arrow.apache.org/docs/python/pandas.html
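For plugins that insist on pandas, the conversion in and out of Arrow is straightforward; a tiny sketch:

```python
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"id": [1, 2, 3]})

# Convert between pandas and Arrow; see the linked Arrow docs for zero-copy caveats.
table = pa.Table.from_pandas(df)
df_again = table.to_pandas()
```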
In my opinion, we should ask around whether anybody references something that is exported with a plugin, try to understand the use case, and try to find a solution for it. I know that especially for the delta plugin this feature would be nice to have, so I will think about it. Nevertheless, I think a good architecture should not reference something that you export; if you want to use it in subsequent steps, it should be written again in the source definition.
Hey @milicevica23, apologies for the lag here; I missed this go by in my email. My limited understanding of the issue is that it sounds like the `external` materialization isn't a great fit here. If that's the case, then I think that we shouldn't keep trying to jam the iceberg and delta stuff into it.
Hi @jwills,
My understanding is that the first feature developed was 1., but with time the 2. case was built on top of it. This introduced a flawed process with one unnecessary step: we first export the data to Parquet and then reread it into memory instead of handing the data directly from DuckDB to the exporter over the Arrow format. I am still unsure what the best approach is, but we have to rethink and simplify the whole export part and combine the above 1. and 2. use cases into a unified one-plugin approach. It is not easy, but I think it is a needed step for the feature.