master merge for 1.4.0 release #2063

rudolfix · 2024-11-14T12:10:05Z

Description

master merge for 1.4.0 release

* Added info to troubleshooting
* just to rerun tests
* Update docs/website/docs/dlt-ecosystem/verified-sources/sql_database/troubleshooting.md
---------
Co-authored-by: Alena Astrakhantseva <alena@dlthub.com>

…parents exception() method. (#1992)

* Makes zendesk tests resilient to data changes
* Use requests as a module

* feat: add incremental lag for datetime, int, and float cursors
* chore: reverve eng datetime format
* test: incremental lag datetime
* chore: add lag in merge function
* chore: extended ISO compliance datetime detect format
* chore: changed deduplication_disabled
* fix: native_representation and merge
* fix: _deduplication_disabled function
* test: changed expected results
* test: test incremental lag for datetime and int
* chore: incremental lag disabled with custom last_value_func
* test: lag incremental for min function
* chore: lag date tests and adjustments
* chore: edge case end_values
* chore: edge case initial_values
* test: incremental lag float
* supports lag in rest-api
* moves lag to separate module, simplifies applying initial_value to lag result
* fix: add test missing variable
* docs: lag incremental loading
---------
Co-authored-by: Marcin Rudolf <rudolfix@rudolfix.org>
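The lag behaviour listed in these commits can be illustrated with a minimal, self-contained sketch. It does not use dlt's code: `apply_lag` and its parameters are hypothetical names, and the unit convention (seconds for datetime cursors, plain numeric units for int and float) as well as the clamping to `initial_value` are assumptions made for illustration. The sketch shifts the stored cursor back (for `max`) or forward (for `min`) so that a trailing window is re-read, and skips lag entirely for a custom `last_value_func`, as the commit list states.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable, Optional, Union

CursorValue = Union[datetime, int, float]


def apply_lag(
    last_value: CursorValue,
    lag: float,
    initial_value: Optional[CursorValue] = None,
    last_value_func: Callable = max,
) -> CursorValue:
    """Hypothetical helper: shift the stored cursor by `lag` to re-read a trailing window."""
    # A custom last_value_func gives no ordering to shift along, so lag is skipped.
    if last_value_func not in (max, min):
        return last_value

    # Assumption: lag is in seconds for datetime cursors, plain units for int/float.
    step = timedelta(seconds=lag) if isinstance(last_value, datetime) else lag

    if last_value_func is max:
        lagged = last_value - step
        # Assumption: never move the window before the configured start.
        return lagged if initial_value is None else max(lagged, initial_value)

    # For min-ordered cursors the window extends in the opposite direction.
    lagged = last_value + step
    return lagged if initial_value is None else min(lagged, initial_value)
```

With these assumptions, an integer cursor at 100 with a lag of 10 restarts at 90, unless an `initial_value` of 95 is set, in which case it restarts at 95.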

* prepares tables before generating duckdb views in filesystem
* allows streamlit app to work with any pipelines-dir
* wakes load when job completes
* skips warning timestamp with precision=6 on duckdb
* drops staging dataset in drop_storage
* resets event before sleeping
* fixes committed but unsaved schema getitem in live schema storage
* loads toml files from colab userdata and integrates with streamlit
* adds pipeline to section context properly
* makes dropped pipeline defunct
* switches to tmp folder if home not writable
* adds initial cwd to pipeline and uses it to locate duckdb databases for pipelines
* allows for explicit values to set final and non resolvable fields
* toml fixes and pipeline drop airflow wip
* fixes drop command with and without rename, fixes airflow tests
* exposes staging and destination as props in pipeline

Co-authored-by: Alexander Fife <alexander@dlthub.com>

* add interface for limiting and selecting columns
* add implementation (tests pending)
* make read interfaces tests nices (no tests yet for limit and columns)
* add tests for limit, head and select
* escape column names in select clause
* add unit tests and basic exceptions for readable dbapi dataset
* fix limit on mssql/synapse fix column selection selection tests for snowflake
* make schema and client properly lazy loading
* disable cache
* adds pandas like column selector && fixes path norm
* fixes test
* improves compression detection
* fixes compression on jsonl in duckdb remote views
* fixes column_schemas
* fixes duckdb secrets drop in tests
---------
Co-authored-by: Marcin Rudolf <rudolfix@rudolfix.org>

…ng cursor value (#2016)
Adjusts the `update_state()` method to set `_next_reference` to None when the cursor value extracted from the JSON response is an empty string, preventing unintended pagination requests.
Fixed #2012
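The effect of this fix can be sketched with a small stand-alone class. This is not the library's paginator: `CursorPaginatorSketch`, `cursor_key`, and the flat-dictionary lookup are hypothetical simplifications, and only the names `update_state()` and `_next_reference` come from the description above. The point is that an empty-string cursor is treated the same as a missing one, so no further page is requested.

```python
from typing import Any, Dict, Optional


class CursorPaginatorSketch:
    """Hypothetical cursor paginator that reads the next cursor from a JSON body."""

    def __init__(self, cursor_key: str = "next_cursor") -> None:
        # Assumption: the cursor sits at a top-level key of the response body.
        self.cursor_key = cursor_key
        self._next_reference: Optional[Any] = None

    def update_state(self, response_json: Dict[str, Any]) -> None:
        cursor = response_json.get(self.cursor_key)
        # Treat both a missing cursor and an empty string as "no more pages".
        if cursor is None or cursor == "":
            self._next_reference = None
        else:
            self._next_reference = cursor

    @property
    def has_next_page(self) -> bool:
        return self._next_reference is not None
```

Without the empty-string check, a response such as `{"next_cursor": ""}` would leave a non-None reference behind and trigger one more request with an empty cursor.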

* bumps duckdb version in dbt tests
* defines type for factory of any source
* supports writing default values for specs

* add delta arrow load id partition handling
* handle None case
* handle delta arrow load id partition column merge disposition

* lint and type check all snippets at once
* start fixing snippets
* don't ignore missing imports
* small changes
* fixes many snippets
* fix a bunch of more stuff
* fix linter
* more small fixes
* make pendulum and datetime top level imports
* fixed timedelta occurrences
* fix linter

Fixes #2015

* Add tests for LanceDB chunking and merging functionality
* Add TSplitter type alias for LanceDB document splitting function
* Refine typing for chunks
* Add type definitions for chunk splitter function and related types
* Remove unused ChunkInputT, ChunkOutputT, and TSplitter type definitions
* Implement efficient update strategy for chunked documents in LanceDB
* Implement efficient update strategy for chunked documents in LanceDB
* Refactor LanceDB client and tests for improved readability and type safety
* Linting
* Add document_id parameter to lancedb_adapter and update merge logic
* Remove resolved comments
* Implement efficient orphan removal for chunked documents in LanceDB
* Implement efficient update strategy for chunked documents in LanceDB
* Add test for removing orphaned records in LanceDB
* Update LanceDB orphaned records removal test for chunked documents
* Set test pipeline as dev mode
* Fix write disposition check in LanceDBRemoveOrphansJob execute method
* Add FollowupJob trait to LoadLanceDBJob
* Fix file type
* Fix file typing
* Add test for removing orphaned records in LanceDB root table
* Enhance LanceDB test to cover nested child removal and update scenarios
* Use doc id hint for top level tables
* Only join on join columns for orphan removal job
* Add ollama to supported embedding providers and test orphaned record removal with embeddings
* Add merge_key to document resource for efficient updates in LanceDB
* Formatting
* Set default file size to 128MB
* Only use parquet loader file formats
* Import pyarrow.parquet
* Remove recommended file size from LanceDB destination capabilities
* Update LanceDB client to use more efficient batch processing methods on loading for Load Jobs
* Refactor unique identifier handling for LanceDB tables
* Optimize UUID column generation for LanceDB tables
* Refactor LanceDBClient to use string type hints for Table
* Minor refactor
* Implement efficient schema update with Nullability support
* Optimize orphaned chunks removal for large datasets
* Projection pushdown
* Format
* Prevent primary key and document ID hint conflict in merge disposition
* Add recommended file size for LanceDB destination
* Improve comment clarity for projection push-down in LanceDB
* Update to new load interface
* Remove unnecessary LanceDBLoadJob attributes
* Change instance attributes to `run` method as variables
* Schedule follow up reference job
* Add follow up lancedb remove orphan job skeleton
* Write empty follow up file
* Write parquet
* Add support for reference file format in LanceDB destination
* Handle parent table name resolution if it doesn't exist in Lance db remove orphan job
* Refactor specialised orphan follow up job back to reference job
* Refactor orphan removal for chunked documents
* Fix dlt system table check for name instead of object
* Implement staging methods
* Override staging client methods
* Docs
* Override staging client methods
* Delete with inserts
* Keep with batch reader
* Remove Lancedb client's staging implementation
* Insert in memory arrow table. This will be optimized
* Rename classes to the new job implementation classes
* Use namedtuple for table chain to improve readability
* Remove orphans by loading all ancestor IDs simultaneously
* Fix doc_id adapter
* Fix doc_id adapter
* Revert to previous
* Revert "Remove orphans by loading all ancestor IDs simultaneously" This reverts commit 06e04d9.
* Remove doc_id hint
* Infer merge key if not supplied from provided primary key
* Remove unused utility functions
* Remove LanceDB doc ID hints and use schema normalizer
* LanceDB writes strange code
* Minor Formatting
* Support compound primary and merge keys
* Remove old comment
* Change default vector column name to "vector" to conform with lancedb standard; add search tests with tantivy as search engine
* Format and fix linting
* Add custom embedding function registration test
* Spawn process in test to make sure registry can be deserialized from arrow files
* Simplify null string handling
* Change NULL string replacement with random string, doc clarification
* Update default vector column name in docs
* Set `remove_orphans` flag to False on tests that don't require it
* Implement starter arrow string placeholder function
* Add test for empty arrow string element vectorised replacement utility function
* Handle NULL values in addition to empty strings in arrow substitution method
* More efficient empty value replacement with canonical arrow usage
* Format
* Bump pyarrow version
* Use pa.nulls instead of [None]*len
* Update tests
* Invert remove orphans flag
* Implement root table orphan deletion, only integer doc_ids
* Cater for string ids as well in doc_id removal process
* Fix test with wrong primary key
* Just send list of ids as is. don't pc.compute on client end
* Extract schema matching into utils
* Add utils
* Pass all tests
* Minor format and cleanup
* Docs
* Amend replace test to test with large number of records to catch race conditions with replace disposition
* Fix replace race conditions by delegating truncation to dlt
* Update lock file
* Refactor type mapping and schema handling in LanceDB client
* Change 'complex' column type to 'json' in LanceDB client
* update lock file
* fixes generating lancedb literals
* verifies merge key early, fixes column override in adapters
* fixes linting errors
---------
Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>
Co-authored-by: Marcin Rudolf <rudolfix@rudolfix.org>

* added partial loading example
* Updated formatting
* Updated
* Updated
* Updated the logic according to comment

* Added deploy with modal.
* A few minor fixes
* updated links as per comment
* Updated as per the comments.
* Update docs/website/docs/walkthroughs/deploy-a-pipeline/deploy-with-modal.md
* Updated
* Updated as per comments
* Updated
* minor fix for relative link
* Incorporated comments and new script provided.
* Added the snippets
* Updated
* Updated
* updated poetry.lock
* Updated "poetry.lock"
* Added "__init__.py"
* Updated snippets.py
* Updated path in MAKEFILE
* Added __init__.py in walkthroughs
* Adjusted for black
* Modified mypy.ini added a pattern module_name_pattern = '[a-zA-Z0-9_\-]+'
* updated
* renamed deploy-a-pipeline with deploy_a_pipeline
* Updated for errors in linting
* small changes
* bring back deploy-a-pipeline
* bring back deploy-a-pipeline in sidebar
* fix path to snippet
* update lock file
* fix path to snippet in tags
* fix Duplicate module named "snippets"
* rename snippets to code, refactor article, fix mypy errors
* fix black errors
* rename code to deploy_snippets
* add pytest testing for modal function
* move example article to the bottom
* update lock file
---------
Co-authored-by: Anton Burnashev <anton.burnashev@gmail.com>
Co-authored-by: Alena <alena@dlthub.com>

#2026)
* fix max table nesting, updated tests to come
* completely rework tests
* calculate max nesting only once, and count nesting level backwards
* fix normalizer tests in common
* cache shorten fragments (saves about 20-25% of time)
* cache normalizing identifiers
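The "cache normalizing identifiers" item refers to memoizing a pure string transformation that is called repeatedly with the same inputs. A minimal sketch of that general technique follows. The function name `normalize_identifier` and its snake_case rule are hypothetical and stand in for whatever naming convention the normalizer actually applies; only the use of a cache reflects the commit list.

```python
import re
from functools import lru_cache


@lru_cache(maxsize=None)
def normalize_identifier(name: str) -> str:
    """Hypothetical snake_case rule, cached because the same names recur per row."""
    # Replace runs of non-alphanumeric characters with a single underscore.
    cleaned = re.sub(r"[^0-9A-Za-z]+", "_", name).strip("_")
    # Insert an underscore at lower-to-upper case boundaries (camelCase).
    cleaned = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", cleaned)
    return cleaned.lower()
```

Because column and table names repeat for every processed item, a cache of this kind turns most calls into dictionary lookups, which is consistent with the kind of time saving mentioned for the cached fragment shortening.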

* move default pipelines of cores sources into source folders
* move core source templates to own folder
* rename single file template folder
* update variables
* update test imports

…2036)
* add warning for large delta memory footprint
* fix wording

* fix indenting
* add serialized=True to pass tests

* Added docs on how to deploy a pipeline using google cloud run
* Update docs/website/docs/walkthroughs/deploy-a-pipeline/deploy-with-google-cloud-run.md
* Update docs/website/docs/walkthroughs/deploy-a-pipeline/deploy-with-google-cloud-run.md
---------
Co-authored-by: Alena Astrakhantseva <alena@dlthub.com>

* disable secrets clearing on CI (cherry picked from commit cc7fddf)
* first version of custom secrets directory
* add test and remove secret directory from create_authentication method
* revert changes to authentication creation for in mem secrets
* format
* fix warning message test secrets on all platforms
* don't actually load anything in secrets test
* run test with mock spy
* add s3 extra
* close connection after use
* ensure at least aiohttp 3.9 for python 3.12
* move secrets tests further down and fix deps
* update lockfile move secrets tests back to load tests remove extra deps from common tests

* keeps staging dataset from layout if dataset name empty
* allows for empty dataset in clickhouse
* sets staging destination dataset name in pipeline if destination dataset name empty
* creates default dataset name only when destination is known
* drop sentinel explicitly only when no dataset name
* improves tests
* adds docs
* fixes more tests
* fixes drop dataset so it is not dropped if no sentinel table

* add gcp default credential handling for delta table format
* mark object store credentials test essential
* make secrets dir if not exists
* reset failed default credentials

* fix delta tests
* update workflows

* part 1: only truncate existing tables
* some more changes
* only run "if exists truncating" on autodetect tables
* create final table from staging tables only for autodetect tables on bigquery
* add test
* fix linter

* logs warning if deduplication state is large
* tests for ALL_TEST_DATA_ITEM_FORMATS
* improves error message, refactors magic number
* Make threshold configurable, display the duplication warning only once, update the warning message, change the test to check for single warning
* Move `duplicate_cursor_warning_threshold` to a ClassVar
---------
Co-authored-by: Anton Burnashev <anton.burnashev@gmail.com>
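The warn-once behaviour with a class-level threshold described in these commits can be sketched as follows. This is an illustration only: `DedupStateSketch`, `track`, and the default value of 200 are hypothetical, and only the attribute name `duplicate_cursor_warning_threshold` and its `ClassVar` placement come from the commit list.

```python
import logging
from typing import ClassVar, Hashable, Set

logger = logging.getLogger("dedup_sketch")


class DedupStateSketch:
    """Hypothetical tracker for row hashes that share the latest cursor value."""

    # Class-level so it can be tuned once for all instances (value is a placeholder).
    duplicate_cursor_warning_threshold: ClassVar[int] = 200

    def __init__(self) -> None:
        self.unique_hashes: Set[Hashable] = set()
        self._warned = False

    def track(self, row_hash: Hashable) -> bool:
        """Record a hash; return True only on the call that emits the warning."""
        self.unique_hashes.add(row_hash)
        too_large = len(self.unique_hashes) > self.duplicate_cursor_warning_threshold
        if too_large and not self._warned:
            logger.warning(
                "Deduplication state holds %d hashes for one cursor value",
                len(self.unique_hashes),
            )
            # Remember that the warning was shown so it is not repeated.
            self._warned = True
            return True
        return False
```

A subclass or a direct assignment to the class attribute changes the threshold for every instance, which is the practical benefit of keeping it as a `ClassVar` rather than a per-instance argument.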

* Extends _get_row_key_col to enable using single primary key as a fallback in gen_merge_sql().
* Remove fallback from `_get_row_key_col`, always fallback to primary key
* Update the docstring in _get_row_key_col
* Move arrow merge to test_merge_disposition
* Adjust for delta tables
* Includes table format in resource definition

* re-enable delta tests for read access add rudimentary tests for destination configs fixes small problems in test setup
* fix cloudflare compat tests
* cleans up env variables set by delta-rs
* bumps to version 1.4.0
* fixes delta tests
---------
Co-authored-by: Marcin Rudolf <rudolfix@rudolfix.org>

netlify · 2024-11-14T12:10:21Z

✅ Deploy Preview for dlt-hub-docs canceled.

Name	Link
🔨 Latest commit	`73b79ee`
🔍 Latest deploy log	https://app.netlify.com/sites/dlt-hub-docs/deploys/673665799f284c000824a0ad

burnash and others added 30 commits October 22, 2024 14:16

Move exclude_keys() to dlt.common.utils (#1966)

9fdd1ee

Added info to troubleshooting (#1773)

058c09f

* Added info to troubleshooting
* just to rerun tests
* Update docs/website/docs/dlt-ecosystem/verified-sources/sql_database/troubleshooting.md
---------
Co-authored-by: Alena Astrakhantseva <alena@dlthub.com>

minor fix to athena.md documentation (#1983)

de7cd1f

Docs: Fix header to exclude (#1976)

11bfae5

Docs: updated databricks destination documentation (#1984)

774ad5c

Docs: fix capitalization of some terms, fix typos (#1988)

00f631d

Explicitly check that _bg_load_job is not None otherwise fallback to …

de43399

…parents exception() method. (#1992)

fix typo (#1995)

11a3ab8

Fix Zendesk example: make test resilient to data changes (#1999)

1933c3d

* Makes zendesk tests resilient to data changes
* Use requests as a module

Remove incorrect example (#1932)

74cb8a4

bumps to pre-release 1.3.1a0

4091eb6

Fix missing GitHub API access token on CI (#2002)

12a6d4a

added session token to duckdb s3 secret (#2007)

75ae23f

Add user agent for Databricks (#1987)

41299ae

Fix an incorrect missing dependency error (#2001)

21c6363

Excludes examples from mypy and flake8 checks (#1969)

d7bf4a8

fix s3 credentials environment variable names (#2010)

350fcb0

remove ga add tm (#2008)

4312332

Co-authored-by: Alexander Fife <alexander@dlthub.com>

fix: if name of distribution is None (#2024)

2c82939

allows to pass default values when writing specs (#2018)

e264755

* bumps duckdb version in dbt tests
* defines type for factory of any source
* supports writing default values for specs

enable delta partitioning on arrow normalizer load id (#2022)

775c41f

* add delta arrow load id partition handling
* handle None case
* handle delta arrow load id partition column merge disposition

Fix the deprecation warning in .common.configuration.container (#2025)

e6bd8ea

Fixes #2015

Added partial loading example (#1993)

17847f1

* added partial loading example
* Updated formatting
* Updated
* Updated
* Updated the logic according to comment

make title shorter (#2032)

0c6fd65

dat-a-man and others added 22 commits November 7, 2024 07:35

Updated google cloud function documentation (#2034)

9c14458

move default pipelines of cores sources into source folders (#1888)

25550bb

* move default pipelines of cores sources into source folders
* move core source templates to own folder
* rename single file template folder
* update variables
* update test imports

add warning for large delta memory footprint on filesystem docs page (#…

95ca6e6

…2036)
* add warning for large delta memory footprint
* fix wording

Deploy with Modal: fix indenting (#2040)

5bdf8a3

* fix indenting
* add serialized=True to pass tests

Fix the indent in the modal snippet in the docs (#2043)

e086bbc

simplify advanced section (#2037)

461a5e1

add GCP default credential handling for delta table format (#2048)

21432c7

* add gcp default credential handling for delta table format
* mark object store credentials test essential
* make secrets dir if not exists
* reset failed default credentials

disable sftp + delta test (#2052)

0192a73

* fix delta tests
* update workflows

change sanity check duration (#2053)

3ef1d9b

fix bigquery autodetect (#2035)

1346b3b

* part 1: only truncate existing tables
* some more changes
* only run "if exists truncating" on autodetect tables
* create final table from staging tables only for autodetect tables on bigquery
* add test
* fix linter

attempt to disable google discovery cache logging for tests (#2054)

675b309

Add core sources extras to requirements in dlt init (#2028)

0ea9de7

Formats Delta table section (#2057)

1387767

Docs: add 'Table formats' category (#2060)

23b5fd2

rudolfix added the `ci full` label (run the full load tests on PR) Nov 14, 2024

fixes clickhouse adapter test

73b79ee

rudolfix merged commit 0fce1c8 into master Nov 14, 2024
29 of 30 checks passed
