master merge for 1.0.0 release #1816

Merged
merged 55 commits into from
Sep 16, 2024

Conversation

rudolfix
Collaborator

Description

master merge for 1.0.0 release

donotpush and others added 30 commits August 30, 2024 11:40
* feat: add timezone flag to configure timestamp data

* fix: delete timezone init

* test: add duckdb timestamps with timezone

* test: fix resource hints for timestamp

* test: correct duckdb timestamps

* test: timezone tests for parquet files

* exp: add notebook with timestamp exploration

* test: refactor timestamp tests

* test: simplified tests and extended experiments

* exp: timestamp exp for duckdb and parquet

* fix: add pyarrow reflection for timezone flag

* fix lint errors

* fix: CI/CD move tests pyarrow module

* fix: pyarrow timezone defaults true

* refactor: typemapper signatures

* fix: duckdb timestamp config

* docs: updated duckdb.md timestamps

* fix: revert duckdb timestamp defaults

* fix: restore duckdb timestamp default

* fix: duckdb timestamp mapper

* fix: delete notebook

* docs: added timestamp and timezone section

* refactor: duckdb precision exception message

* feat: postgres timestamp timezone config

* fix: postgres timestamp precision

* fix: postgres timezone false case

* feat: add snowflake timezone and precision flag

* test: postgres invalid timestamp precision

* test: unified timestamp invalid precision

* test: unified column flag timezone

* chore: add warn log for unsupported timezone or precision flag

* docs: timezone and precision flags for timestamps

* fix: none case error

* docs: add duckdb default precision

* fix: typing errors

* rebase: formatted files from upstream devel

* fix: warning message and reference TODO

* test: delete duplicated input_data array

* docs: moved timestamp config to data types section

* fix: lint and format

* fix: lint local errors
… with cursor_path missing or None value (#1576)

* allows specification of what happens on cursor_path missing or cursor_path having the value None: raise differentiated exceptions, exclude row, or include row.

* Documents handling None values at the incremental cursor

* fixes incremental extract crashing if one record has cursor_path = None

* test that add_map can be used to transform items before the incremental function is called

* Unifies the treatment of None values for Python objects (including pydantic), pandas, and arrow

---------

Co-authored-by: Marcin Rudolf <rudolfix@rudolfix.org>
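
The three behaviours listed in this commit (raise, exclude row, include row) can be illustrated with a minimal sketch. It is plain Python and does not reproduce dlt's implementation: the function `apply_cursor_policy`, the two exception classes, and the policy strings are hypothetical names chosen for illustration only.

```python
from typing import Any, Dict, List


class CursorPathMissing(Exception):
    """Hypothetical error: the row has no field at cursor_path."""


class CursorValueNone(Exception):
    """Hypothetical error: the field at cursor_path exists but is None."""


def apply_cursor_policy(
    rows: List[Dict[str, Any]], cursor_path: str, policy: str = "raise"
) -> List[Dict[str, Any]]:
    """Return the rows kept under the given policy ("raise", "include", "exclude")."""
    if policy not in ("raise", "include", "exclude"):
        raise ValueError(f"unknown policy: {policy}")
    kept = []
    for row in rows:
        if cursor_path not in row:
            # Differentiated exception for a missing cursor field.
            if policy == "raise":
                raise CursorPathMissing(cursor_path)
        elif row[cursor_path] is None:
            # Differentiated exception for a None cursor value.
            if policy == "raise":
                raise CursorValueNone(cursor_path)
        else:
            # Row has a usable cursor value: always kept in this sketch.
            kept.append(row)
            continue
        # Only rows without a usable cursor value reach this point.
        if policy == "include":
            kept.append(row)
    return kept
```

With rows `{"id": 1, "updated_at": 5}`, `{"id": 2, "updated_at": None}` and `{"id": 3}`, the "include" policy keeps all three, "exclude" keeps only the first, and "raise" stops at the second row.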
* - Change default vector column name to "vector" to conform with lancedb standard
- Add search tests with tantivy as search engine

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Format and fix linting

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Add custom embedding function registration test

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Spawn process in test to make sure registry can be deserialized from arrow files

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Simplify null string handling

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Change NULL string replacement with random string, doc clarification

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Update default vector column name in docs

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

---------

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>
(cherry picked from commit 071135b)
* Update deploy-with-dagster.md

* Update deploy-with-dagster.md

* Update deploy-with-dagster.md

* Update deploy-with-dagster.md

* Update deploy-with-dagster.md

* Update deploy-with-dagster.md

* Update deploy-with-dagster.md

* Update deploy-with-dagster.md

* Update deploy-with-dagster.md

* small improvements

* fix layout

* added faqs

* fix code blocks

* fixing linting error

* minor fixes

* Update deploy-with-dagster.md

* Update docs/website/docs/walkthroughs/deploy-a-pipeline/deploy-with-dagster.md

Co-authored-by: Alena Astrakhantseva <alena@dlthub.com>

* Update docs/website/docs/walkthroughs/deploy-a-pipeline/deploy-with-dagster.md

Co-authored-by: Alena Astrakhantseva <alena@dlthub.com>

* Update docs/website/docs/walkthroughs/deploy-a-pipeline/deploy-with-dagster.md

Co-authored-by: Alena Astrakhantseva <alena@dlthub.com>

* Update docs/website/docs/walkthroughs/deploy-a-pipeline/deploy-with-dagster.md

Co-authored-by: Alena Astrakhantseva <alena@dlthub.com>

* Update deploy-with-dagster.md

* Update deploy-with-dagster.md

---------

Co-authored-by: Alena <alena@dlthub.com>
* updated the documentation

* Updated
the `initial_value` is a parameter of `dlt.sources.incremental`.
* small improvements

* Updated lancedb title
…t nullable (#1791)

* regression test & fix for arrow table with non-nullable cursor column

* regression test for arrow Table and arrow RecordBatch

* formats code
* copies rest_api source code and test suite, adjusts imports

* integrates rest_client/conftest.py into rest_api/conftest.py. Fixes incompatibilities except for POST request (/search/posts)

* integrates POST search test

* no longer skips test with typed dict config

* reuses tests/sources/helpers/rest_client/conftest.py in tests/sources/rest_api

* checks off TODO

* formats rest_api code according to dlt-core rules

* fixes typing errors and graphlib import error

* moves latest changes from rest_api into core (687e7ddab3a95fa621584741af543e561147ebe3). Formats and lints entire rest API
starts to reorganize test suite

* modularizes rest_api test suite

* formats code and imports

* updates signature of Paginator.update_state()

* moves source test suite after duckdb is installed

* end-to-end test rest_api_source on all destinations. Removes redundant helpers from test/utils.py

* adds example rest_api_pipeline.py, corrects sample rest_api_pipeline docs on secrets

* loads latest 30 days of issues instead of fixed date

* refactors types

* tests example rest_api pipelines, adds filesystem configs to load tests

* fix inheritance of incremental args, make typed_dict detection work with typing extensions dicts

* type incremental cursor_path as str

* refactors intersection of TResourceHints and ResourceBase into TResourceHintsBase

* uses str instead of generic TCursorValue

* configures github access token for CI

* copies sql source and tests

* adjusts import paths

* workaround for UUID type missing in sqlalchemy < 2.0

* extracts load tests to tests/load. Adds necessary test utility functions

* formats code

* corrects example postgres credentials for the test suite

* formats imports, removes duplicate definition

* conditionally skips test for range type detection

* fixes side effects of tests modifying os.environ.

* fixes lint errors

* moves tests to right places, runs on all destinations where applicable
moves filesystem source with tests and examples
rearranges old sources.filesystem
adds copy sig for transformers
fixes Windows tests
moves source test suite after duckdb is installed
Revert "attempt to make duckdb a minimal dependency by removing it from extras"

This reverts commit 6b7e670.
attempt to make duckdb a minimal dependency by removing it from extras
formats code
updates signature of Paginator.update_state()
formats imports
modularizes rest_api test suite
adds new files from 687e7ddab3a95fa621584741af543e561147ebe3, starts to reorganize test suite
moves latest changes from rest_api into core (687e7ddab3a95fa621584741af543e561147ebe3). Formats and lints entire rest API
fixes last type errors
fixes more type errors and formats code
fixes graphlib import error
fixes more type errors
fixes type errors except for test_configurations.py
fixes typing errors where optional field was required
formats rest_api code according to dlt-core rules
checks off TODO
reuses tests/sources/helpers/rest_client/conftest.py in tests/sources/rest_api
no longer skips test with typed dict config
integrates POST search test
integrates rest_client/conftest.py into rest_api/conftest.py. Fixes incompatibilities except for POST request (/search/posts)
copies rest_api source code and test suite, adjusts imports

* post rebase fixes and formatting

* first simple version of init command that can use core sources

* update tests for core sources

* improve tests a bit more

* move init / generic source to core

* detect explicit repo url in init command

* update output and clean up structure in init command a bit

* fix tests

* add option for omitting core sources and reverting to the old behavior

* add core sources to the dlt init -l list

* add init template files to build

* remove one unneeded file

* revert common tests file

* move sources tests to dedicated file

* remove destination tests for now, revert later

* upgrade sqlalchemy for local source tests

* create sql_database extra

* fix bug in transform

* set up timezone fixtures properly, still does not work right

* fallback to timezone on duckdb with timestamp

* separate common from load tests properly

* update duckdb timezone test

* add sql_alchemy dependency to last part of common tests

* updates imports

* add sql_database_pipeline file, update dlt init commands, add basic tests for sql_database_pipeline

* only import sqlalchemy in tests if present

* fix linter errors

* bump connectorx for python 3.12 support

* move sql_alchemy shims to shims file and use the original file for the same dependency system as with other libs

* Fix linter errors (reverts back to wilis version from a few commits ago)

* exclude connectorx from python 3.8

* make rest api example pipeline also work without a token

* remove secrets from local sources tests

* change test setup to work with both sqlalchemy versions

* adds secrets to a part of common tests

* make sql database pipeline tests succeed on both sqlalchemy versions

* add excel dependencies to common tests

* fix bug in schema inference of sql_alchemy backed sources

* fix tests running for sql alchemy 1.4

* add concept of single file templates in the core

* update tests and fix some

* add some example pipelines

* fixes some issues

* sort source names

* fix unsupported columns

* fix all sql database tests for sqlalchemy 2.0

* fix some tests for sqlalchemy 1.4

* deselect connectorx incremental tests on sqlalchemy 1.4

* fixes some more tests

* some cleanup

* fix bug in init script

* Revert "remove destination tests for now, revert later"

This reverts commit 47e1933.

* exclude sources load tests from destination workflows

* fix openpyxl install

* disable requests tests for now

* fix common tests

* add dataframe example pipeline
clean up other examples a bit

* add intro examples

* update cleaning scripts for athena and redshift

* make timezone tests slightly more strict

* reorders sql_database import to get user friendly dependency error

---------

Co-authored-by: dave <shrps@posteo.net>
Co-authored-by: Marcin Rudolf <rudolfix@rudolfix.org>
* concats tables and record batches before being written to control row group size

* flushes the item buffer for empty tables

* Update dlt/common/data_writers/buffered.py

Co-authored-by: Willi Müller <willi.mueller@posteo.de>

* Update docs/website/docs/dlt-ecosystem/file-formats/parquet.md

Co-authored-by: Willi Müller <willi.mueller@posteo.de>

* Update docs/website/docs/dlt-ecosystem/file-formats/parquet.md

Co-authored-by: Willi Müller <willi.mueller@posteo.de>

* Update docs/website/docs/dlt-ecosystem/file-formats/parquet.md

Co-authored-by: Willi Müller <willi.mueller@posteo.de>

* Update docs/website/docs/dlt-ecosystem/file-formats/parquet.md

Co-authored-by: Willi Müller <willi.mueller@posteo.de>

* refactors writers and buffered code, improves docs

---------

Co-authored-by: Willi Müller <willi.mueller@posteo.de>
* adds methods to detect nested and root tables via parent hint

* skips linking in relational when no parent hint, removes linking skip for primary keys

* moves schema config and normalizer importers to schema module, breaks cyclic deps with dest capabilities

* adds table_format override to pipeline run

* resolves merge strategy using adapter, uses default for a destination if strategy not explicit

* removes force_iceberg flag from athena, requires explicit table_format

* adds PreparedTableSchema to indicate TTableSchemas that are prepared for loading, makes verify_schema explicit method to be called by load, simplifies methods to prepare tables

* applies table and file format to run methods in all pipeline tests

* shortens temp table names in sql jobs

* adds filesystem to drop command tests

* fixes tests

* adds method to update table from diff into extract

* athena iceberg does not create dlt pipeline state as iceberg by default

* other test fixes

* deprecates force_iceberg, adds hive table format to opt out

* merges column props and hints, categorizes column props

* moves type mappers into destination capabilities

* fixes tests

* fixes cap data types verification errors not being raised

* adds missing deps

* fixes more tests

* allows precision and scale to be 0

* fixes more tests

* corrects connectorx for 3.12
…n jobs (#1781)

* defaults `raise_on_failed_jobs = True`. Adapts test_dummy_client.py

* updates docs on terminal exceptions on failed jobs

* undoes change of test assertion, changes test setup instead

* removes calls to raise_on_failed_jobs() in docs

* Enables setting of raise_on_failed_jobs in airflow_helper, removes fail_task_if_any_job_failed

* removes setting of os.environ["LOAD__RAISE_ON_FAILED_JOBS"] = "true" and calls to raise_on_failed_jobs()

* Removes redundant calls to raise_on_failed_jobs() in entire test suite. Refactors tests where necessary.

* fixes default arg overwriting config value in load of Pipeline

* fixes some test cases that started to abort

* requests errors set to transient for databricks

* fixes even more tests

---------

Co-authored-by: Marcin Rudolf <rudolfix@rudolfix.org>
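
Since this commit makes `raise_on_failed_jobs = True` the default, a pipeline that should keep running after failed jobs would have to opt out explicitly. A minimal sketch, reusing the environment variable named in the commit messages above; that dlt parses the string "false" as a boolean, and that the variable must be set before the pipeline loads, are assumptions here.

```python
import os

# Assumption: dlt reads the `raise_on_failed_jobs` option of the `load` section
# from this environment variable (the name is taken from the commit message).
# It is set before any pipeline is created so the load step can pick it up.
os.environ["LOAD__RAISE_ON_FAILED_JOBS"] = "false"
```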
* adds fallback to complex variant column if it exists

* adds migrations for complex data type and preferred dt

* renames complex in docs

* renames complex

* fixes bug with dynamic columns in make_hints

* adds v10 schema engine fixture

* finalizes complex -> json rename, adds more tests

* adds row_key and parent_key, drops foreign_key, adds migrations and updates test schemas

* test fixes

* deprecates skip_complex_types Pydantic config, updates trace contract
Move sources and destinations to the top level
ruudwelten and others added 24 commits September 11, 2024 22:24
* structural and content changes to the sql_database doc

* fixing language in code snippets

* fixing broken link

* updating content + structure based on feedback

* fixing formatting

* fixing code formatting

* fixing indentation

* modifying based on comments and splitting into multiple pages

* updating broken links

* removing problematic relative paths

* small formatting and language change + adding a line about column reflection

* fix outdated info

* fix description

---------

Co-authored-by: akelad <akela@dlthub.com>
* adding the sql_database tutorial

* fixing language snippets

* fixing broken link

* grammar and formatting fixes

* Update docs/website/docs/tutorial/sql_database.md

Co-authored-by: mariarice15 <123215798+mariarice15@users.noreply.github.com>

* Update docs/website/docs/tutorial/sql_database.md

* Update docs/website/docs/tutorial/sql_database.md

* Update docs/website/docs/tutorial/sql_database.md

* Update docs/website/docs/tutorial/sql_database.md

* Apply suggestions from code review

Co-authored-by: mariarice15 <123215798+mariarice15@users.noreply.github.com>

* Update docs/website/docs/tutorial/sql_database.md

* Update docs/website/docs/tutorial/sql_database.md

* Update docs/website/docs/tutorial/sql_database.md

* Update docs/website/docs/tutorial/sql_database.md

---------

Co-authored-by: Anton Burnashev <anton.burnashev@gmail.com>
Co-authored-by: mariarice15 <123215798+mariarice15@users.noreply.github.com>
* skips tables without jobs when creating table chain jobs, deletes delta table and arrow dataset instances

* adds tests for tables without jobs

* fixes merge key and primary key OR clause for clickhouse
…as list (#1535)

* creates a single source in extract for all resource instances passed as a list

* decomposes dicts of resources so names are split across many sources
…g hints (#1806)

* Add autodetect schema with hints test for BigQuery table builder

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Use SDK to set hints for autodetect_schema path

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Pass timestamp test

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Remove redundant test

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Extract BigQuery load job configuration into own method

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* moves pipeline tests to pipelines

---------

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>
Co-authored-by: Marcin Rudolf <rudolfix@rudolfix.org>
* Implement sqlalchemy loader

Begin implementing sqlalchemy loader

SQLA load job, factory, schema storage, POC

sqlalchemy tests attempt

Implement SqlJobClient interface

Parquet load, some tests running on mysql

update lockfile

Limit bulk insert chunk size, sqlite create/drop schema, fixes

Generate schema update

Get more tests running with mysql

More tests passing

Fix state, schema restore

* Support destination name in tests

* Some job client/sql client tests running on sqlite

* Fix more tests

* All sqlite tests passing

* Add sqlalchemy tests in ci

* Type errors

* Test sqlalchemy in own workflow

* Fix tests, type errors

* Fix config

* CI fix

* Add alembic to handle ALTER TABLE

* Fix workflow

* Install mysqlclient in venv

* Mysql service version

* Single fail

* mysql healthcheck

* No localhost

* Remove weaviate

* Change ubuntu version

* Debug sqlite version

* Revert

* Use py datetime in tests

* Test on sqlalchemy 1.4 and 2

remove secrets toml

remove secrets toml

Revert "remove secrets toml"

This reverts commit 7dd189c.

Fix default pipeline name test

* Lint, no cli tests

* Update lockfile

* Fix test, complex -> json

* Refactor type mapper

* Update tests destination config

* Fix tests

* Ignore sources tests

* Fix overriding destination in test pipeline

* Fix time precision in arrow test

* Lint

* Fix destination setup in test

* Fix

* Use nullpool, lazy create engine, close current connection
Co-authored-by: Akela Drissner-Schmid <32450038+akelad@users.noreply.github.com>
* chore: add paramiko dev dependency

* test: add container for sftp localhost

* chore: add tmp bash scripts

* exp: sftp client with fsspec

* chore: sftp timestamp metadata discovered

* fix: docs lint

* feat: add fsspec protocol sftp

* fix: lint errors from devel

* test: sftp server localhost

* fix: filesystem SFTP docker-compose tests

* fix: json import

* chore: clean tests and dockerfile

* refactor: ci test exec for sftp server

* feat: sftp file url parser

* test: sftp reading using file samples

* chore: extended SFTP credentials class

* docs: filesystem SFTP credentials and authentication

* chore: add bobby password protected key-based authentication

* docs: sftp correction for ssh-agent

* chore: add docker volume

* chore: revert ci changes

* test: refactor sftp with auth methods

* test: sftp skip test when agent not configured

* fix: poetry lock

* fix: github workflow

* fix: run only sftp tests

* fix: merge conflict regression

* fix: ssh-agent for tests

* fix: pytest executions excluding sftp

* fix: CI test execution

* test: sftp login with signed certificate

* fix: poetry lock regenerated

* refactor: filesystem sftp tests

* fix: filesystem tests for sftp

* refactor: reduce redundancy

* fix: lint and remove duplicated test

* chore: change ubuntu version

* fix: enforce test marker

* fix: ignore sftp tests

* fix: exclude sftp from filesystem tests

* adds sftp extra dep

---------

Co-authored-by: Marcin Rudolf <rudolfix@rudolfix.org>
* Move sources and destinations to the top level
* Update the css
* Update sidebars.js
* Adjust icons

---------

Co-authored-by: Violetta Mishechkina <sansiositres@gmail.com>
Co-authored-by: akelad <akela@dlthub.com>
Co-authored-by: Anton Burnashev <anton.burnashev@gmail.com>
* Masks secrets in traces.

* tests that secrets are masked in stringified trace

* generates secrets in deployments from dlt.secrets provider instead of pipeline trace

* corrects masking and looks up secret value in dlt.secrets

* removes secret masking and replaces credentials with None.

* fixes deploy help when deploy type missing

* fixes always_choose restore defaults in echo

* tests deploy command with and without secrets

* fixes dumping secret vals for toml in deploy

---------

Co-authored-by: Marcin Rudolf <rudolfix@rudolfix.org>
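
The final approach in this commit, replacing credentials with None instead of masking them, can be sketched on a nested dictionary. This is an illustration only, not dlt's trace code: the helper `drop_secrets` and the key names in `SECRET_KEYS` are hypothetical.

```python
from typing import Any

# Hypothetical set of keys whose values are treated as secret.
SECRET_KEYS = frozenset({"credentials", "password", "api_key"})


def drop_secrets(value: Any, secret_keys: frozenset = SECRET_KEYS) -> Any:
    """Return a copy of a nested structure with secret values replaced by None."""
    if isinstance(value, dict):
        return {
            key: None if key in secret_keys else drop_secrets(item, secret_keys)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [drop_secrets(item, secret_keys) for item in value]
    # Scalars under non-secret keys are returned unchanged.
    return value
```

Replacing the whole value with None, rather than a masked string, means no part of the secret (such as its length) survives in the serialized output.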
* removes blog files

* updates schema docs for nested references

* updates docs to use nested instead of parent child

* adds more migration tests

* bumps to 1.0.0

* adds scd2 tests

netlify bot commented Sep 16, 2024

Deploy Preview for dlt-hub-docs ready!

Name Link
🔨 Latest commit e48f641
🔍 Latest deploy log https://app.netlify.com/sites/dlt-hub-docs/deploys/66e830b99eee0b000830ef96
😎 Deploy Preview https://deploy-preview-1816--dlt-hub-docs.netlify.app

@rudolfix rudolfix force-pushed the devel branch 2 times, most recently from 346d7c9 to 2ee3eab Compare September 16, 2024 13:12
@rudolfix rudolfix merged commit 1750663 into master Sep 16, 2024
60 of 61 checks passed