[pull] devel from dlt-hub:devel #2

Open
wants to merge 543 commits into
base: devel

pull[bot]

@pull pull bot commented May 4, 2024

See Commits and Changes for more details.


Created by pull[bot]

Can you help keep this open source service alive? 💖 Please sponsor : )

akelad and others added 28 commits August 28, 2024 12:06
* Expose staging tables truncation to config

* Fix comments, add tests

* Fix tests

* Move implementation from mixin, add tests

* Fix docs grammar
* allows configuring external location and named credential for databricks

* fixes #1703

* normalizes 'value' when wrapping simple objects in relational, fixes #1754

* simplifies fsspec globbing and allows various url formats that are preserved when reconstituting full url, allows abfss databricks format

* adds info on partially loaded packages to docs

* renames remote_uri to remote_url in traces

* fixes delta for abfss

* adds nested tables dlt columns collision test
* always truncates staging tables on athena + replace without iceberg

* adds athena staging configs to all staging configs

* updates athena tests for staging destination
master merge for 0.5.4 release
* feat: add timezone flag to configure timestamp data

* fix: delete timezone init

* test: add duckdb timestamps with timezone

* test: fix resource hints for timestamp

* test: correct duckdb timestamps

* test: timezone tests for parquet files

* exp: add notebook with timestamp exploration

* test: refactor timestamp tests

* test: simplified tests and extended experiments

* exp: timestamp exp for duckdb and parquet

* fix: add pyarrow reflection for timezone flag

* fix lint errors

* fix: CI/CD move tests pyarrow module

* fix: pyarrow timezone defaults true

* refactor: typemapper signatures

* fix: duckdb timestamp config

* docs: updated duckdb.md timestamps

* fix: revert duckdb timestamp defaults

* fix: restore duckdb timestamp default

* fix: duckdb timestamp mapper

* fix: delete notebook

* docs: added timestamp and timezone section

* refactor: duckdb precision exception message

* feat: postgres timestamp timezone config

* fix: postgres timestamp precision

* fix: postgres timezone false case

* feat: add snowflake timezone and precision flag

* test: postgres invalid timestamp precision

* test: unified timestamp invalid precision

* test: unified column flag timezone

* chore: add warn log for unsupported timezone or precision flag

* docs: timezone and precision flags for timestamps

* fix: none case error

* docs: add duckdb default precision

* fix: typing errors

* rebase: formatted files from upstream devel

* fix: warning message and reference TODO

* test: delete duplicated input_data array

* docs: moved timestamp config to data types section

* fix: lint and format

* fix: lint local errors
… with cursor_path missing or None value (#1576)

* allows specification of what happens on cursor_path missing or cursor_path having the value None: raise differentiated exceptions, exclude row, or include row.

* Documents handling None values at the incremental cursor

* fixes incremental extract crashing if one record has cursor_path = None

* test that add_map can be used to transform items before the incremental function is called

* Unifies treatment of None values for Python objects (including pydantic), pandas, and arrow

---------

Co-authored-by: Marcin Rudolf <rudolfix@rudolfix.org>
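
The three behaviours named in the commit messages above (raise differentiated exceptions, exclude row, include row) can be illustrated with a small standalone sketch. It does not use dlt and is not the implementation from this PR: `apply_cursor_policy`, its `policy` argument, and both exception classes are hypothetical names chosen for illustration only.

```python
from typing import Any, Dict, Iterable, List


class CursorPathMissing(Exception):
    """Hypothetical: the row has no field at cursor_path."""


class CursorValueNone(Exception):
    """Hypothetical: the field exists but its value is None."""


def apply_cursor_policy(
    rows: Iterable[Dict[str, Any]], cursor_path: str, policy: str = "raise"
) -> List[Dict[str, Any]]:
    """Return the rows to keep; policy is 'raise', 'exclude' or 'include'."""
    kept = []
    for row in rows:
        missing = cursor_path not in row
        is_none = not missing and row[cursor_path] is None
        if not (missing or is_none):
            kept.append(row)  # normal row with a usable cursor value
        elif policy == "include":
            kept.append(row)  # keep the row even without a cursor value
        elif policy == "raise":
            # Differentiated exceptions: missing key vs. explicit None.
            raise (CursorPathMissing if missing else CursorValueNone)(cursor_path)
        # policy == "exclude": silently drop the row
    return kept


rows = [{"id": 1, "updated_at": 5}, {"id": 2, "updated_at": None}, {"id": 3}]
excluded = apply_cursor_policy(rows, "updated_at", "exclude")
included = apply_cursor_policy(rows, "updated_at", "include")
```

With `"exclude"` only the first row survives, while `"include"` passes all three rows through unchanged.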
* - Change default vector column name to "vector" to conform with lancedb standard
- Add search tests with tantivy as search engine

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Format and fix linting

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Add custom embedding function registration test

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Spawn process in test to make sure registry can be deserialized from arrow files

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Simplify null string handling

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Change NULL string replacement with random string, doc clarification

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Update default vector column name in docs

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

---------

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>
(cherry picked from commit 8995f70)
(cherry picked from commit 071135b)
* Update deploy-with-dagster.md

* Update deploy-with-dagster.md

* Update deploy-with-dagster.md

* Update deploy-with-dagster.md

* Update deploy-with-dagster.md

* Update deploy-with-dagster.md

* Update deploy-with-dagster.md

* Update deploy-with-dagster.md

* Update deploy-with-dagster.md

* small improvements

* fix layout

* added faqs

* fix code blocks

* fixing linting error

* minor fixes

* Update deploy-with-dagster.md

* Update docs/website/docs/walkthroughs/deploy-a-pipeline/deploy-with-dagster.md

Co-authored-by: Alena Astrakhantseva <alena@dlthub.com>

* Update docs/website/docs/walkthroughs/deploy-a-pipeline/deploy-with-dagster.md

Co-authored-by: Alena Astrakhantseva <alena@dlthub.com>

* Update docs/website/docs/walkthroughs/deploy-a-pipeline/deploy-with-dagster.md

Co-authored-by: Alena Astrakhantseva <alena@dlthub.com>

* Update docs/website/docs/walkthroughs/deploy-a-pipeline/deploy-with-dagster.md

Co-authored-by: Alena Astrakhantseva <alena@dlthub.com>

* Update deploy-with-dagster.md

* Update deploy-with-dagster.md

---------

Co-authored-by: Alena <alena@dlthub.com>
* updated the documentation

* Updated
the `initial_value` is a parameter of `dlt.sources.incremental`.
* small improvements

* Updated lancedb title
…t nullable (#1791)

* regression test & fix for arrow table with non-nullable cursor column

* regression test for arrow Table and arrow RecordBatch

* formats code
* copies rest_api source code and test suite, adjusts imports

* integrates rest_client/conftest.py into rest_api/conftest.py. Fixes incompatibilities except for POST request (/search/posts)

* integrates POST search test

* no longer skips test with typed dict config

* reuses tests/sources/helpers/rest_client/conftest.py in tests/sources/rest_api

* checks off TODO

* formats rest_api code according to dlt-core rules

* fixes typing errors and graphlib import error

* moves latest changes from rest_api into core (687e7ddab3a95fa621584741af543e561147ebe3). Formats and lints entire rest API
starts to reorganize test suite

* modularizes rest_api test suite

* formats code and imports

* updates signature of Paginator.update_state()

* moves source test suite after duckdb is installed

* end-to-end test rest_api_source on all destinations. Removes redundant helpers from test/utils.py

* adds example rest_api_pipeline.py, corrects sample rest_api_pipeline docs on secrets

* loads latest 30 days of issues instead of fixed date

* refactors types

* tests example rest_api pipelines, adds filesystem configs to load tests

* fix inheritance of incremental args, make typed_dict detection work with typing extensions dicts

* type incremental cursor_path as str

* refactors intersection of TResourceHints and ResourceBase into TResourceHintsBase

* uses str instead of generic TCursorValue

* configures github access token for CI

* copies sql source and tests

* adjusts import paths

* workaround for UUID type missing in sqlalchemy < 2.0

* extracts load tests to tests/load. Adds necessary test utility functions

* formats code

* corrects example postgres credentials for the test suite

* formats imports, removes duplicate definition

* conditionally skips test for range type detection

* fixes side effects of tests modifying os.environ.

* fixes lint errors

* moves tests to right places, runs on all destinations where applicable
moves filesystem source with tests and examples
rearranges old sources.filesystem
adds copy sig for transformers
fixes Windows tests
moves source test suite after duckdb is installed
Revert "attempt to make duckdb a minimal dependency by removing it from extras"

This reverts commit 6b7e670.
attempt to make duckdb a minimal dependency by removing it from extras
formats code
updates signature of Paginator.update_state()
formats imports
modularizes rest_api test suite
adds new files from 687e7ddab3a95fa621584741af543e561147ebe3, starts to reorganize test suite
moves latest changes from rest_api into core (687e7ddab3a95fa621584741af543e561147ebe3). Formats and lints entire rest API
fixes last type errors
fixes more type errors and formats code
fixes graphlib import error
fixes more type errors
fixes type errors except for test_configurations.py
fixes typing errors where optional field was required
formats rest_api code according to dlt-core rules
checks off TODO
reuses tests/sources/helpers/rest_client/conftest.py in tests/sources/rest_api
no longer skips test with typed dict config
integrates POST search test
integrates rest_client/conftest.py into rest_api/conftest.py. Fixes incompatibilities except for POST request (/search/posts)
copies rest_api source code and test suite, adjusts imports

* post rebase fixes and formatting

* first simple version of init command that can use core sources

* update tests for core sources

* improve tests a bit more

* move init / generic source to core

* detect explicit repo url in init command

* update output and clean up structure in init command a bit

* fix tests

* add option for omitting core sources and reverting to the old behavior

* add core sources to the dlt init -l list

* add init template files to build

* remove one unneeded file

* revert common tests file

* move sources tests to dedicated file

* remove destination tests for now, revert later

* upgrade sqlalchemy for local source tests

* create sql_database extra

* fix bug in transform

* set up timezone fixtures properly, still does not work right

* fallback to timezone on duckdb with timestamp

* separate common from load tests properly

* update duckdb timezone test

* add sql_alchemy dependency to last part of common tests

* updates imports

* add sql_database_pipeline file, update dlt init commands, add basic tests for sql_database_pipeline

* only import sqlalchemy in tests if present

* fix linter errors

* bump connectorx for python 3.12 support

* move sql_alchemy shims to shims file and use the original file for the same dependency system as with other libs

* Fix linter errors (reverts to Willi's version from a few commits ago)

* exclude connectorx from python 3.8

* make rest api example pipeline also work without a token

* remove secrets from local sources tests

* change test setup to work with both sqlalchemy versions

* adds secrets to a part of common tests

* make sql database pipeline tests succeed on both sqlalchemy versions

* add excel dependencies to common tests

* fix bug in schema inference of sql_alchemy backed sources

* fix tests running for sql alchemy 1.4

* add concept of single file templates in the core

* update tests and fix some

* add some example pipelines

* fixes some issues

* sort source names

* fix unsupported columns

* fix all sql database tests for sqlalchemy 2.0

* fix some tests for sqlalchemy 1.4

* deselect connectorx incremental tests on sqlalchemy 1.4

* fixes some more tests

* some cleanup

* fix bug in init script

* Revert "remove destination tests for now, revert later"

This reverts commit 47e1933.

* exclude sources load tests from destination workflows

* fix openpyxl install

* disable requests tests for now

* fix common tests

* add dataframe example pipeline
clean up other examples a bit

* add intro examples

* update cleaning scripts for athena and redshift

* make timezone tests slightly more strict

* reorders sql_database import to get user friendly dependency error

---------

Co-authored-by: dave <shrps@posteo.net>
Co-authored-by: Marcin Rudolf <rudolfix@rudolfix.org>
* concats tables and record batches before being written to control row group size

* flushes the item buffer for empty tables

* Update dlt/common/data_writers/buffered.py

Co-authored-by: Willi Müller <willi.mueller@posteo.de>

* Update docs/website/docs/dlt-ecosystem/file-formats/parquet.md

Co-authored-by: Willi Müller <willi.mueller@posteo.de>

* Update docs/website/docs/dlt-ecosystem/file-formats/parquet.md

Co-authored-by: Willi Müller <willi.mueller@posteo.de>

* Update docs/website/docs/dlt-ecosystem/file-formats/parquet.md

Co-authored-by: Willi Müller <willi.mueller@posteo.de>

* Update docs/website/docs/dlt-ecosystem/file-formats/parquet.md

Co-authored-by: Willi Müller <willi.mueller@posteo.de>

* refactors writers and buffered code, improves docs

---------

Co-authored-by: Willi Müller <willi.mueller@posteo.de>
* adds methods to detect nested and root tables via parent hint

* skips linking in relational when no parent hint, removes linking skip for primary keys

* moves schema config and normalizer importers to schema module, breaks cyclic deps with dest capabilities

* adds table_format override to pipeline run

* resolves merge strategy using adapter, uses default for a destination if strategy not explicit

* removes force_iceberg flag from athena, requires explicit table_format

* adds PreparedTableSchema to indicate TTableSchemas that are prepared for loading, makes verify_schema explicit method to be called by load, simplifies methods to prepare tables

* applies table and file format to run methods in all pipeline tests

* shortens temp table names in sql jobs

* adds filesystem to drop command tests

* fixes tests

* adds method to update table from diff into extract

* athena iceberg does not create dlt pipeline state as iceberg by default

* other test fixes

* deprecates force_iceberg, adds hive table format to opt out

* merges column props and hints, categorizes column props

* moves type mappers into destination capabilities

* fixes tests

* fixes cap data types verification errors not being raised

* adds missing deps

* fixes more tests

* allows precision and scale to be 0

* fixes more tests

* corrects connectorx for 3.12
…n jobs (#1781)

* defaults `raise_on_failed_jobs = True`. Adapts test_dummy_client.py

* updates docs on terminal exceptions on failed jobs

* undoes change of test assertion, changes test setup instead

* removes calls to raise_on_failed_jobs() in docs

* Enables setting of raise_on_failed_jobs in airflow_helper, removes fail_task_if_any_job_failed

* removes setting of os.environ["LOAD__RAISE_ON_FAILED_JOBS"] = "true" and calls to raise_on_failed_jobs()

* Removes redundant calls to raise_on_failed_jobs() in entire test suite. Refactors tests where necessary.

* fixes default arg overwriting config value in load of Pipeline

* fixes some test cases that started to abort

* requests errors set to transient for databricks

* fixes even more tests

---------

Co-authored-by: Marcin Rudolf <rudolfix@rudolfix.org>
* adds fallback to complex variant column if it exists

* adds migrations for complex data type and preferred dt

* renames complex in docs

* renames complex

* fixes bug with dynamic columns in make_hints

* adds v10 schema engine fixture

* finalizes complex -> json rename, adds more tests

* adds row_key and parent_key, drops foreign_key, adds migrations and updates test schemas

* test fixes

* deprecates skip_complex_types Pydantic config, updates trace contract
burnash and others added 30 commits December 4, 2024 17:37
* Add open/closed range arguments for incremental

* Docs for incremental range args

* Docstring

* Typo

* Ensure deduplication is disabled when range_start=='open'

* Cache transformer settings
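
As a rough illustration of what open and closed range boundaries mean for an incremental cursor, here is a standalone sketch. It is not dlt code: `in_range` and its defaults are hypothetical, and only the `range_start` name and the `'open'` value are taken from the commit messages above. A closed start keeps rows equal to the start value (so rows already seen at that value can reappear and need deduplication), whereas an open start drops them, which is presumably why deduplication is disabled in that case.

```python
from typing import Any, Optional


def in_range(
    value: Any,
    start: Optional[Any],
    end: Optional[Any],
    range_start: str = "closed",  # "closed": value >= start, "open": value > start
    range_end: str = "open",      # "open": value < end, "closed": value <= end
) -> bool:
    """Hypothetical helper: is `value` inside the configured cursor range?"""
    if start is not None:
        if range_start == "closed" and value < start:
            return False
        if range_start == "open" and value <= start:
            return False
    if end is not None:
        if range_end == "open" and value >= end:
            return False
        if range_end == "closed" and value > end:
            return False
    return True


cursor_values = [4, 5, 6, 10]
closed_start = [v for v in cursor_values if in_range(v, 5, 10)]
open_start = [v for v in cursor_values if in_range(v, 5, 10, range_start="open")]
```

Here `closed_start` keeps the boundary value 5, while `open_start` begins strictly after it.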
* add ibis dataset in own class for now

* make error clearer

* fix some linting and fix broken test

* make most destinations work with selecting the right db and catalog, transpiling sql via postgres in some cases and selecting the right dialect in others

* add missing motherduck and sqlalchemy mappings

* casefold identifiers for ibis wrapper class

* re-organize existing dataset code to prepare ibis relation integration

* integrate ibis relation into existing code

* re-order tests

* fall back to default dataset if table not in schema

* make dataset type selectable

* add dataset type selection test and fix bug in tests

* update docs for ibis expressions use

* ensure a bunch of ibis operations continue working

* add some more tests and typings

* fix typing (with brute force get_attr typing..)

* move ibis to dependency group

* move ibis stuff to helpers

* post devel merge, put in change from dataset, update lockfile

* add ibis to sqlalchemy tests

* improve docs a bit

* fix ibis dep group

* fix dataset snippets

* fix ibis version

* add support for column schema in certain query cases

---------

Co-authored-by: Marcin Rudolf <rudolfix@rudolfix.org>
* add pyiceberg dependency and upgrade mypy

- mypy upgrade needed to solve this issue: apache/iceberg-python#768
- uses <1.13.0 requirement on mypy because 1.13.0 gives error
- new lint errors arising due to version upgrade are simply ignored

* extend pyiceberg dependencies

* remove redundant delta annotation

* add basic local filesystem iceberg support

* add active table format setting

* disable merge tests for iceberg table format

* restore non-redundant extra info

* refactor to in-memory iceberg catalog

* add s3 support for iceberg table format

* add schema evolution support for iceberg table format

* extract _register_table function

* add partition support for iceberg table format

* update docstring

* enable child table test for iceberg table format

* enable empty source test for iceberg table format

* make iceberg catalog namespace configurable and default to dataset name

* add optional typing

* fix typo

* improve typing

* extract logic into dedicated function

* add iceberg read support to filesystem sql client

* remove unused import

* add todo

* extract logic into separate functions

* add azure support for iceberg table format

* generalize delta table format tests

* enable get tables function test for iceberg table format

* remove ignores

* undo table directory management change

* enable test_read_interfaces tests for iceberg

* fix active table format filter

* use mixin for object store rs credentials

* generalize catalog typing

* extract pyiceberg scheme mapping into separate function

* generalize credentials mixin test setup

* remove unused import

* add centralized fallback to append when merge is not supported

* Revert "add centralized fallback to append when merge is not supported"

This reverts commit 54cd0bc.

* fall back to append if merge is not supported on filesystem

* fix test for s3-compatible storage

* remove obsolete code path

* exclude gcs read interface tests for iceberg

* add gcs support for iceberg table format

* switch to UnsupportedAuthenticationMethodException

* add iceberg table format docs

* use shorter pipeline name to prevent too long sql identifiers

* add iceberg catalog note to docs

* black format

* use shorter pipeline name to prevent too long sql identifiers

* correct max id length for sqlalchemy mysql dialect

* Revert "use shorter pipeline name to prevent too long sql identifiers"

This reverts commit 6cce03b.

* Revert "use shorter pipeline name to prevent too long sql identifiers"

This reverts commit ef29aa7.

* replace show with execute to prevent useless print output

* add abfss scheme to test

* remove az support for iceberg table format

* remove iceberg bucket test exclusion

* add note to docs on azure scheme support for iceberg table format

* exclude iceberg from duckdb s3-compatibility test

* disable pyiceberg info logs for tests

* extend table format docs and move into own page

* upgrade adlfs to enable account_host attribute

* Merge branch 'devel' of https://github.com/dlt-hub/dlt into feat/1996-iceberg-filesystem

* fix lint errors

* re-add pyiceberg dependency

* enabled iceberg in dbt-duckdb

* upgrade pyiceberg version

* remove pyiceberg mypy errors across python versions

* does not install airflow group for dev

* fixes gcp oauth iceberg credentials handling

* fixes ca cert bundle duckdb azure on ci

* allow for airflow dep to be present during type check

---------

Co-authored-by: Marcin Rudolf <rudolfix@rudolfix.org>
* explicitly adding docs for destination item size control

* alena's feedback

* revised for explicit note

* Update docs/website/docs/reference/performance.md

---------

Co-authored-by: hulmanaseer00 <163604758+hulmanaseer00@users.noreply.github.com>
Co-authored-by: Alena Astrakhantseva <alena@dlthub.com>
* add databricks oauth authentication

* improve auth databricks test

* force token-based auth for azure external location tests
* make duckdb handle iceberg table with nested types

* replace duckdb views for iceberg tables

* remove unnecessary context closing and opening

* replace duckdb views for abfss protocol

* restore original destination for write path

* use dev_mode to work around leftover data from previous tests

leftover data caused by #2148
* drops tables from schema and relational

* documents custom sections for sql_database and source rename

* clones schema without data tables when resources without source are extracted, adds tests

* skips airflow tests if not installed

* adds doc on setting up FUSE on bucket

* adds doc on setting up FUSE on bucket

* adds row key propagation for table when its nested table requires it

* fixes tests
* remove standalone dataset from exports

* make pipeline dataset factory public

* rework transformation section

* fix some linting errors

* add row counts feature for readabledataset

* add dataset access example to getting started scripts

* add notes about row_counts special query to datasets docs

* fix internal docusaurus links

* Update docs/website/docs/intro.md

* Update docs/website/docs/tutorial/load-data-from-an-api.md

* Update docs/website/docs/tutorial/load-data-from-an-api.md

* Update docs/website/docs/tutorial/load-data-from-an-api.md

* Update docs/website/docs/general-usage/dataset-access/dataset.md

* Update docs/website/docs/general-usage/dataset-access/dataset.md

* Update docs/website/docs/dlt-ecosystem/transformations/index.md

* Update docs/website/docs/dlt-ecosystem/transformations/index.md

* Update docs/website/docs/dlt-ecosystem/transformations/index.md

* Update docs/website/docs/dlt-ecosystem/transformations/index.md

* Update docs/website/docs/dlt-ecosystem/destinations/duckdb.md

* Update docs/website/docs/dlt-ecosystem/transformations/index.md

* Update docs/website/docs/dlt-ecosystem/transformations/index.md

* Update docs/website/docs/dlt-ecosystem/transformations/python.md

* Update docs/website/docs/dlt-ecosystem/transformations/python.md

* Update docs/website/docs/dlt-ecosystem/transformations/python.md

* Update docs/website/docs/dlt-ecosystem/transformations/python.md

* Update docs/website/docs/dlt-ecosystem/transformations/python.md

* Update docs/website/docs/dlt-ecosystem/transformations/python.md

* Update docs/website/docs/dlt-ecosystem/transformations/python.md

* Update docs/website/docs/dlt-ecosystem/transformations/python.md

* Update docs/website/docs/dlt-ecosystem/transformations/sql.md

* Update docs/website/docs/dlt-ecosystem/transformations/sql.md

* Update docs/website/docs/dlt-ecosystem/transformations/sql.md

* Update docs/website/docs/dlt-ecosystem/transformations/sql.md

* Update docs/website/docs/dlt-ecosystem/transformations/sql.md

* Update docs/website/docs/general-usage/dataset-access/dataset.md

---------

Co-authored-by: Alena Astrakhantseva <alena@dlthub.com>
* try to fix ibis az problems on linux

* remove duckdb certs fix

* test explicitly setting transport options

* sets the ssl curl on a correct connection clone

---------

Co-authored-by: Marcin Rudolf <rudolfix@rudolfix.org>
* allows data type diff and ensures valid migration separately

* removes dlt init flag to skip core sources, adds flag to eject core source
* convert add_limit to step based limiting

* prevent late-arriving items from being forwarded from limit
add some convenience methods for pipe step management

* added a few more tests for limit

* add more limit functions from branch

* remove rate-limiting

* fix limiting bug and update docs

* revert back to inserting validator step at the same position if replaced

* make time limit tests more lenient for mac os tests

* tmp

* add test for testing incremental with limit

* improve limit tests with parallelized case

* add backfill example with sql_database

* fix linting

* remove extra file

* only wrap iterators on demand

* move items transform steps into extra file
* first draft

* updates

* add image

* add bottom list
* fix links (#1977)

* correctly converts dict arrow types into dlt types

* drops dbt compat code for versions below 1.5

* ignores encoding errors when reading from process pipe

---------

Co-authored-by: adrianbr <adrian.brudaru@gmail.com>
* fix: unravel Optional to inner generic arg from instance

* test: remove dependency on Incremental in common

* refactor: use extract_inner_type

* refactor: remove redundant conditional
- Add `ensure_pendulum_datetime_non_utc` to parse datetime strings into non-UTC datetime objects.
- Add `_datetime_obj_to_str` to preserve the colon in the timezone when converting datetime objects back to strings.
- Skip writing back state if no valid rows are found for `last_value` in the transformer, which may otherwise cause incorrect behavior.
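
The point about preserving the colon in the timezone offset can be seen with the Python standard library alone. The sketch below is not the `_datetime_obj_to_str` helper from this commit; it only shows, under that caveat, how two common formatting routes differ for a non-UTC datetime.

```python
from datetime import datetime, timedelta, timezone

# A timezone-aware datetime with a +02:00 offset (non-UTC).
dt = datetime(2024, 12, 4, 17, 37, tzinfo=timezone(timedelta(hours=2)))

# strftime's %z emits the offset without a colon.
without_colon = dt.strftime("%Y-%m-%dT%H:%M:%S%z")

# isoformat() keeps the colon in the offset.
with_colon = dt.isoformat()

# The colon form round-trips back to an equal, offset-preserving datetime.
round_tripped = datetime.fromisoformat(with_colon)

print(without_colon)  # 2024-12-04T17:37:00+0200
print(with_colon)     # 2024-12-04T17:37:00+02:00
```

If a cursor value is compared or sent back to a source as a string, the two forms are not interchangeable, which is one plausible reason a commit would take care to keep the colon.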
…2095)

* Updated SQL server documentation

* Updated as per comments
…2184)

* do not normalize dates and datetimes in from / to arrow scalar

* uses the same logic for naive datetimes in arrow and object incrementals
…ple disable python 3.8 (#2185)

* import correct typeddict version for use in pydantic, disallow use of usual python typeddict imports

* add test

* update import for examples

* fixed some imports

* remove python 3.8 lint and test for now

* always use typeddict from typing_extensions
pin poetry in tests to 1.8.5