ibis support - hand over credentials to ibis backend for a number of destinations #2004
Conversation
dlt/pipeline/pipeline.py
Outdated
@@ -1729,3 +1734,11 @@ def _dataset(self, dataset_type: TDatasetType = "dbapi") -> SupportsReadableDataset:
            schema=(self.default_schema if self.default_schema_name else None),
            dataset_type=dataset_type,
        )

    def _ibis(self) -> IbisBackend:
I don't think our original idea of exposing an ibis dataset via the same methods as the dbapi one makes sense; the interface of ibis is completely different from ours.
but you can use overloads on Literal dataset_type

@overload
def _dataset(self, dataset_type: TDatasetType = "dbapi") -> SupportsReadableDataset:
    ...

@overload
def _dataset(self, dataset_type: TDatasetType = "ibis") -> IbisBackend:
    ...

to return different types. hmmm but maybe you are right. we should implement the readable dataset interface on ibis.
so maybe we add ibis() on the DBAPI dataset? that would return the ibis backend? I do not think adding it on a relation makes sense
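A minimal, self-contained sketch of how those Literal-based overloads could look; the class and type names here are simplified stand-ins, not dlt's actual code:

```python
from typing import Literal, Union, overload

# simplified stand-ins so the sketch runs on its own; dlt's real types differ
class SupportsReadableDataset: ...
class IbisBackend: ...

TDatasetType = Literal["dbapi", "ibis"]

class Pipeline:
    @overload
    def _dataset(self, dataset_type: Literal["dbapi"] = ...) -> SupportsReadableDataset: ...
    @overload
    def _dataset(self, dataset_type: Literal["ibis"]) -> IbisBackend: ...

    def _dataset(self, dataset_type: TDatasetType = "dbapi") -> Union[SupportsReadableDataset, IbisBackend]:
        # dispatch on the literal value so each caller gets a precise return type
        if dataset_type == "ibis":
            return IbisBackend()            # stand-in for building the ibis backend
        return SupportsReadableDataset()    # stand-in for the dbapi readable dataset
```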
I like the overload, I did not know this was possible...
dlt/destinations/dataset.py
Outdated
    raise NotImplementedError()

    ibis = ibis.connect(client.config.credentials.to_native_representation())
    # NOTE: there seems to be no standardized way to set the current dataset / schema in ibis
I'm not sure where we should put the stuff that converts our credentials into something ibis understands. client.config.credentials.to_native_representation() could work for many cases, but selecting the default dataset will probably be something destination-specific.
Also for the filesystem destination this will work quite differently, because we first need to populate the duckdb database and then attach ibis to it.
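A rough sketch of that filesystem flow, assuming dlt has already materialized the data into a local duckdb file and reusing the ibis.duckdb.from_connection call that appears later in this PR (the file name is made up):

```python
import duckdb
import ibis

# hypothetical duckdb file that the filesystem destination would populate first
duck = duckdb.connect("pipeline_data.duckdb")

# attach ibis to the already-open duckdb connection instead of handing over credentials
con = ibis.duckdb.from_connection(duck)
print(con.list_tables())
```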
- the part that creates the backend should go to dlt/helpers/ibis
- the part that converts credentials should go to common/libs/ibis

here we can improve a lot. we already pollute credentials in common with concrete converting functions (like fsspec, delta etc.); we should create some universal mechanism to convert them.
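One possible shape for such a universal mechanism, sketched with invented names (a small registry in common/libs/ibis mapping destination types to converter functions; this is not dlt's actual API):

```python
from typing import Any, Callable, Dict

# hypothetical registry: destination type -> function producing an ibis connection string
_IBIS_CREDENTIAL_CONVERTERS: Dict[str, Callable[[Any], str]] = {}

def register_ibis_converter(destination_type: str):
    def wrapper(func: Callable[[Any], str]) -> Callable[[Any], str]:
        _IBIS_CREDENTIAL_CONVERTERS[destination_type] = func
        return func
    return wrapper

@register_ibis_converter("postgres")
def _postgres_credentials(credentials: Any) -> str:
    # many sql destinations can reuse the native representation directly
    return credentials.to_native_representation()

def to_ibis_connection_string(destination_type: str, credentials: Any) -> str:
    try:
        return _IBIS_CREDENTIAL_CONVERTERS[destination_type](credentials)
    except KeyError:
        raise NotImplementedError(f"no ibis credential converter for {destination_type}")
```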
force-pushed from 2adb193 to 5befef9
@@ -165,13 +165,18 @@ def open_connection(self) -> duckdb.DuckDBPyConnection:
        # set up dataset
        if not self.has_dataset():
            self.create_dataset()
        print("CREATE")
hmmm
dlt/common/destination/reference.py
Outdated
        """fetch arrow table of first 'chunk_size' items"""
        ...

    def ibis(self, chunk_size: int = None) -> Optional[IbisTable]: ...
I'd add it on ReadableDBAPIDataset and return an implementation of it in _dataset of the pipeline
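A hedged sketch of that placement, assuming the dataset holds a sql client whose credentials expose to_native_representation() (attribute names are illustrative, not the PR's final code):

```python
import ibis

class ReadableDBAPIDataset:
    def __init__(self, sql_client) -> None:
        self._sql_client = sql_client

    def ibis(self):
        # hand the destination credentials over to ibis and return the live backend
        return ibis.connect(self._sql_client.credentials.to_native_representation())
```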
dlt/destinations/dataset.py
Outdated
@@ -42,18 +90,91 @@ def __init__(
        self.iter_arrow = self._wrap_iter("iter_arrow")  # type: ignore
        self.iter_fetch = self._wrap_iter("iter_fetch")  # type: ignore

    # TODO: where should this go, should cursors support "native" ibis or do we do a conversion somewhere
    def ibis(self, chunk_size: int = None) -> Optional[IbisTable]:
this does not make sense... we have the data materialized. way better to just do df(), which has a dataframe interface
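For contrast, a small sketch of the materialized dataframe path on the dbapi dataset (the duckdb destination, names, and item access pattern are assumptions for illustration):

```python
import dlt

# throwaway duckdb pipeline with a tiny "items" table
pipeline = dlt.pipeline("ibis_df_demo", destination="duckdb", dataset_name="demo")
pipeline.run([{"id": i} for i in range(3)], table_name="items")

dataset = pipeline._dataset()     # dbapi readable dataset
print(dataset["items"].df())      # materialized result as a pandas DataFrame
```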
I removed those
dlt/common/destination/reference.py
Outdated
        """iterate over arrow tables of 'chunk_size' items"""
        ...

    def iter_ibis(self, chunk_size: int) -> Generator[IbisTable, None, None]: ...
how are you going to implement that? you'd need to create an IbisTable on a chunk of data, which is lazily evaluated. otherwise it does not make sense
removed it
dlt/destinations/dataset.py
Outdated
    def head(self) -> "ReadableDBAPIRelation":
        return self.limit(5)


class ReadableDBAPIDataset(SupportsReadableDataset):
let's move ibis() here. then we hand over the connection and return the ibis backend
can we move schema property to the Protocol?
dlt/destinations/dataset.py
Outdated
    raise NotImplementedError()

    # NOTE: there seems to be no standardized way to set the current dataset / schema in ibis
    ibis.raw_sql(f"SET search_path TO {dataset_name};")
why? you just do:

return ibis[dataset_name]

to select the namespace.
https://duckdb.org/docs/guides/python/ibis.html
where does it say that on this page? In any case it does not work, and I also looked for a bit before and could not find any standard way to select the current database (as it is called in ibis). They call the whole thing (in the case of duckdb, the file) a catalog, and what we call a dataset is called a database there. You can list the databases but not set a default one... at least I have not found a method.
it seems like there is no way to consistently pre-select a database, and setting the schema on the connection in snowflake, for example, also does not work for some reason. So I am now opting not to set any default schema/database on the connection, and we need to tell the user to provide the database name when listing and retrieving tables through the ibis interface.
also here you wrote a helper: https://github.com/dlt-hub/dlt/pull/1491/files
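A sketch of what that looks like for a user, assuming current ibis backend methods where the database (dlt's dataset name) is passed explicitly on every lookup; the connection string and names are made up:

```python
import ibis

# connect to whatever the destination credentials translate to; a duckdb file is used as an example
con = ibis.connect("duckdb://pipeline.duckdb")

# no default database is pre-selected, so the dataset name has to be given explicitly
print(con.list_tables(database="my_dataset"))
items = con.table("items", database="my_dataset")
print(items.limit(5).to_pandas())
```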
force-pushed from ab62c59 to 500b5ff
force-pushed from 54af827 to cbf98d6
very good. I'm for keeping ibis there if we are going to implement an ibis-based dataset.
tests/load/test_read_interfaces.py
Outdated
@@ -131,7 +131,7 @@ def double_items():
     yield pipeline

     # NOTE: we need to drop pipeline data here since we are keeping the pipelines around for the whole module
-    drop_pipeline_data(pipeline)
+    # drop_pipeline_data(pipeline)
all ok here?
dlt/common/libs/ibis.py
Outdated
    con = ibis.duckdb.from_connection(duck)

    # NOTE: there seems to be no standardized way to set the current dataset / schema in ibis
    con.raw_sql(f"SET search_path TO {dataset_name};")
I do not need this here. ibis uses fully qualified names so con.dataset.table
@@ -228,6 +237,17 @@ def __init__(
        self._sql_client: SqlClientBase[Any] = None
        self._schema: Schema = None

    def ibis(self) -> IbisBackend:
I think this is OK. when we have full ibis support we can return a Dataset with an ibis implementation.
btw. maybe you should overload def _dataset(self, dataset_type: TDatasetType = "dbapi") -> SupportsReadableDataset: to return a different dataset implementation per engine, or just return the DBAPI one for now. otherwise people won't see ibis
I'm not quite sure what you mean. But I was thinking we should have dataset_type be "dbapi", "ibis", and "auto". If it is "auto", it will select dbapi when no ibis expressions are available and ibis when they are. That said, dbapi should have a different name, since the ibis-expression-based one also uses dbapi.
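A purely hypothetical sketch of that "auto" resolution (nothing like this is in the PR; the helper name is invented):

```python
from typing import Literal

TDatasetType = Literal["dbapi", "ibis", "auto"]

def resolve_dataset_type(dataset_type: TDatasetType) -> Literal["dbapi", "ibis"]:
    # "auto" falls back to dbapi unless the optional ibis dependency is importable
    if dataset_type == "auto":
        try:
            import ibis  # noqa: F401
            return "ibis"
        except ImportError:
            return "dbapi"
    return dataset_type
```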
dlt/destinations/dataset.py
Outdated
# helpers
def _get_client_for_destination(
we need to extract _get_destination_clients and _get_destination_client_initial_config from Pipeline and put them here. there are many corner cases they handle (e.g. we support a multi-dataset layout for pipelines with many schemas, and empty dataset names are allowed, e.g. on clickhouse).
you can do it now or we do it in the next ticket that will wrap up all work before we release the dataset functionality
I did this, let's see if the tests pass. Maybe we should also put the function for creating a dataset_name there?
force-pushed from c7f44d5 to bdaf034
force-pushed from aa8640f to cf954ca
LGTM!
Description
Adds support for handing over our credentials to an ibis backend. Supported destinations are:
Discussion at ibis about using ibis only as a query builder: ibis-project/ibis#10452. There is also a question there about setting a default database (they call what we call a dataset a database).
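A rough end-to-end usage sketch of the feature, assuming the ibis() accessor on the readable dataset added in this PR and a duckdb destination (the pipeline, dataset, and table names are illustrative):

```python
import dlt

pipeline = dlt.pipeline("ibis_handover_demo", destination="duckdb", dataset_name="demo_data")
pipeline.run([{"id": i, "value": i * 2} for i in range(10)], table_name="items")

# hand the destination credentials over to an ibis backend
ibis_backend = pipeline._dataset().ibis()

# no default database is selected on the connection, so pass the dataset name explicitly
items = ibis_backend.table("items", database="demo_data")
print(items.filter(items.value > 4).to_pandas())
```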