dataset factory #1945
Conversation
dlt/destinations/job_client_impl.py
Outdated
f" ORDER BY {c_inserted_at} DESC;" | ||
) | ||
return self._row_to_schema_info(query, self.schema.name) | ||
if any_schema_name: |
This can be compressed, and it also needs to be implemented for all destinations.
Also: is this signature OK, or do we want to add a new function for this? I'm also not sure about this "any_schema_name"... :)
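For reference, a minimal sketch of one way the two query branches could be compressed into a single builder method on the SQL job client. The column variables (c_schema_name, c_inserted_at) follow the snippet above; the helper name and the SELECT list are hypothetical, not the PR's actual code:

    from typing import Optional

    def _build_stored_schema_query(
        self, c_schema_name: str, c_inserted_at: str, schema_name: Optional[str] = None
    ) -> str:
        # hypothetical helper: build one query and append the name filter
        # only when a schema name is given, instead of duplicating the
        # whole SELECT for the any-schema case
        table = self.sql_client.make_qualified_table_name(self.schema.version_table_name)
        query = f"SELECT * FROM {table}"
        if schema_name is not None:
            query += f" WHERE {c_schema_name} = %s"
        query += f" ORDER BY {c_inserted_at} DESC;"
        return query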
tests/load/test_read_interfaces.py
Outdated
@@ -212,6 +212,47 @@ def double_items():
loads_table = pipeline._dataset()[pipeline.default_schema.loads_table_name]
loads_table.fetchall()

# check dataset factory
We need proper tests to ensure that the newest schema is always selected; this is just basic test code to make sure it generally works.
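A hypothetical sketch of such a test, assuming the pipeline and destination_for_dataset names from the surrounding test code, and assuming the dataset exposes its resolved schema as .schema:

    import dlt

    def test_newest_schema_is_selected(pipeline: dlt.Pipeline, destination_for_dataset) -> None:
        # no schema argument: the factory should autodiscover the newest
        # stored schema, i.e. the one the pipeline wrote last
        dataset = dlt.dataset(
            destination=destination_for_dataset,
            dataset_name=pipeline.dataset_name,
        )
        assert dataset.schema.version_hash == pipeline.default_schema.version_hash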
def dataset(
    destination: TDestinationReferenceArg,
    dataset_name: str,
    schema: Union[Schema, str, None] = None,
We allow a given Schema instance, or alternatively a schema name, which will be loaded from the destination, or no schema at all, which will do the autodiscovery as discussed.
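A minimal usage sketch of the three modes; the destination and dataset names are illustrative:

    import dlt
    from dlt.common.schema import Schema

    # 1. pass an explicit Schema instance
    ds = dlt.dataset("duckdb", "my_dataset", schema=Schema("my_schema"))

    # 2. pass a schema name: the newest stored schema with that name is
    #    loaded from the destination
    ds = dlt.dataset("duckdb", "my_dataset", schema="my_schema")

    # 3. pass no schema: the newest schema found on the destination is
    #    autodiscovered
    ds = dlt.dataset("duckdb", "my_dataset")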
OK, cool. But as discussed, we'll need to implement the dataset to be compatible with the pipeline dataset (many schemas, different database layouts: we support schema separation, but it is rarely used).
This is good! I think there's a big overlap between dataset and the part of pipeline that does the same:
- keeping destinations (also staging - IMO you should include it as optional; sometimes we'll use e.g. Athena to do Iceberg but actually open the staging filesystem as a data lake)
- keeping a list of schemas on the dataset
- initializing configs, exposing various clients
Do you think it makes sense to refactor pipeline right now? Do you think we could keep a dataset instance in the pipeline and just expose some methods from it?
The standalone part looks good. I'm not sure if we should go for a single-schema or a many-schemas dataset.
I'd change the WithStateSync interface to be more explicit and also add schema tests for the filesystem.
dlt/common/destination/reference.py
Outdated
@@ -657,8 +659,8 @@ def __exit__(

class WithStateSync(ABC):
    @abstractmethod
-   def get_stored_schema(self) -> Optional[StorageSchemaInfo]:
+   def get_stored_schema(self, any_schema_name: bool = False) -> Optional[StorageSchemaInfo]:
        """Retrieves newest schema from destination storage"""
I'd rather add a new method, but it really does not fit here: this interface assumes that there's a known schema name. My take would be to change the signature to

get_stored_schema(self, schema_name: str = None)

If None is specified, we load the newest schema; if a name is provided, we load the newest schema with the given name.
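A sketch of the proposed interface under that convention, assuming the existing StorageSchemaInfo type from this module; the docstring wording is illustrative:

    from abc import ABC, abstractmethod
    from typing import Optional

    class WithStateSync(ABC):
        @abstractmethod
        def get_stored_schema(self, schema_name: Optional[str] = None) -> Optional[StorageSchemaInfo]:
            """Retrieves the newest schema from destination storage.

            If schema_name is None, the newest stored schema regardless of
            its name is returned; otherwise, the newest stored schema with
            the given name is returned.
            """
            ...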
Sounds good. I had the same idea but thought it might not be good to change the default behavior of this method. I have changed it now and updated all the places in the code and tests where it is used.
tests/load/test_read_interfaces.py
Outdated
dataset = dlt.dataset(
    destination=destination_for_dataset,
    dataset_name=pipeline.dataset_name,
    schema="wrong_schema_name",
We allow a new schema to be added, right? That's why we do not raise when the schema is not known? We'll need a better method of adding schemas to a dataset, and also to sync schemas, etc.
What actually happens here is that an empty schema is created as a stand-in; it does not do anything and also does not get saved anywhere. I can change that if you like, but AFAIK it should be OK for now.
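A minimal sketch of that stand-in behavior, assuming a client implementing get_stored_schema and assuming Schema.from_dict for deserialization; the control flow is illustrative, not the exact PR code:

    import json

    from dlt.common.schema import Schema

    # try to load the named schema from the destination; fall back to an
    # empty in-memory Schema that is never persisted
    stored_schema = client.get_stored_schema(schema_name)
    if stored_schema is not None:
        schema = Schema.from_dict(json.loads(stored_schema.schema))
    else:
        schema = Schema(schema_name)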
No, that is OK. I just think that all the code that interacts with the destination and now lives in the pipeline (i.e. schema lists, schema storage, etc.) could probably go to Dataset at some point.
LGTM! ready for merge!
Description
This PR is an example implementation of a dataset factory to build datasets without a pipeline object.
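A minimal end-to-end sketch of what the factory enables; the destination and table names are illustrative:

    import dlt

    # build a dataset directly against a destination, without a pipeline;
    # with no schema argument the newest stored schema is autodiscovered
    dataset = dlt.dataset(destination="duckdb", dataset_name="my_dataset")

    # read from it just like the pipeline's dataset
    items_table = dataset["items"]
    print(items_table.fetchall())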