master merge for 1.4.0 release #2063

Merged
merged 53 commits into from
Nov 14, 2024

Conversation

rudolfix
Collaborator

Description

master merge for 1.4.0 release

burnash and others added 30 commits October 22, 2024 14:16
* Added info to troubleshooting

* just to rerun tests

* Update docs/website/docs/dlt-ecosystem/verified-sources/sql_database/troubleshooting.md

---------

Co-authored-by: Alena Astrakhantseva <alena@dlthub.com>
* Makes zendesk tests resilient to data changes

* Use requests as a module
* feat: add incremental lag for datetime, int, and float cursors

* chore: reverse eng datetime format

* test: incremental lag datetime

* chore: add lag in merge function

* chore: extended ISO compliance datetime detect format

* chore: changed deduplication_disabled

* fix: native_representation and merge

* fix: _deduplication_disabled function

* test: changed expected results

* test: test incremental lag for datetime and int

* chore: incremental lag disabled with custom last_value_func

* test: lag incremental for min function

* chore: lag date tests and adjustments

* chore: edge case end_values

* chore: edge case initial_values

* test: incremental lag float

* supports lag in rest-api

* moves lag to separate module, simplifies applying initial_value to lag result

* fix: add test missing variable

* docs: lag incremental loading

---------

Co-authored-by: Marcin Rudolf <rudolfix@rudolfix.org>
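
The lag commits above can be pictured with a small standalone sketch. This is not dlt's implementation: `apply_lag` and its parameters are hypothetical names, and the units (seconds for datetime cursors, a plain numeric offset for int and float cursors) as well as the clamping against `initial_value` are assumptions made for illustration.

```python
from datetime import datetime, timedelta
from typing import Callable, Optional, Union

Cursor = Union[datetime, int, float]


def apply_lag(
    last_value: Cursor,
    lag: float,
    last_value_func: Callable = max,
    initial_value: Optional[Cursor] = None,
) -> Cursor:
    """Shift a stored cursor back (max) or forward (min) by `lag`."""
    if last_value_func not in (max, min):
        # Assumption: lag is skipped when a custom last_value_func is used.
        return last_value
    if isinstance(last_value, datetime):
        delta = timedelta(seconds=lag)  # assumed unit: seconds
    elif isinstance(last_value, int):
        delta = int(lag)  # keep integer cursors integral
    else:
        delta = lag
    if last_value_func is max:
        lagged = last_value - delta
        # Assumption: never reach back before the configured initial value.
        if initial_value is not None:
            lagged = max(lagged, initial_value)
    else:
        lagged = last_value + delta
        if initial_value is not None:
            lagged = min(lagged, initial_value)
    return lagged
```

With a `max` cursor the window is re-opened backwards, so rows that arrived late within the lag are fetched again; deduplication or a merge step then has to absorb the repeats.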
* prepares tables before generating duckdb views in filesystem

* allows streamlit app to work with any pipelines-dir

* wakes load when job completes

* skips warning timestamp with precision=6 on duckdb

* drops staging dataset in drop_storage

* resets event before sleeping

* fixes committed but unsaved schema getitem in live schema storage

* loads toml files from colab userdata and integrates with streamlit

* adds pipeline to section context properly

* makes dropped pipeline defunct

* switches to tmp folder if home not writable

* adds initial cwd to pipeline and uses it to locate duckdb databases for pipelines

* allows for explicit values to set final and non resolvable fields

* toml fixes and pipeline drop airflow wip

* fixes drop command with and without rename, fixes airflow tests

* exposes staging and destination as props in pipeline
Co-authored-by: Alexander Fife <alexander@dlthub.com>
* add interface for limiting and selecting columns

* add implementation (tests pending)

* make read interfaces tests nicer (no tests yet for limit and columns)

* add tests for limit, head and select

* escape column names in select clause

* add unit tests and basic exceptions for readable dbapi dataset

* fix limit on mssql/synapse
fix column selection tests for snowflake

* make schema and client properly lazy loading

* disable cache

* adds pandas like column selector && fixes path norm

* fixes test

* improves compression detection

* fixes compression on jsonl in duckdb remote views

* fixes column_schemas

* fixes duckdb secrets drop in tests

---------

Co-authored-by: Marcin Rudolf <rudolfix@rudolfix.org>
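
The column selection, limit, and escaping commits above might look roughly like the following. It is a minimal sketch under stated assumptions, not the dataset code from this PR: `escape_identifier`, `build_select`, and the `top_syntax` flag are hypothetical names, and ANSI double-quote escaping is assumed. The `TOP n` branch reflects the fact that SQL Server dialects do not accept a trailing `LIMIT`.

```python
from typing import Optional, Sequence


def escape_identifier(name: str) -> str:
    # Double any embedded quote, then wrap in double quotes (ANSI SQL style).
    return '"' + name.replace('"', '""') + '"'


def build_select(
    table: str,
    columns: Optional[Sequence[str]] = None,
    limit: Optional[int] = None,
    top_syntax: bool = False,
) -> str:
    """Build a SELECT with escaped identifiers and an optional row limit."""
    cols = ", ".join(escape_identifier(c) for c in columns) if columns else "*"
    # Dialects such as mssql/synapse put the limit up front as TOP n.
    top = f"TOP {int(limit)} " if (limit is not None and top_syntax) else ""
    query = f"SELECT {top}{cols} FROM {escape_identifier(table)}"
    if limit is not None and not top_syntax:
        query += f" LIMIT {int(limit)}"
    return query
```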
…ng cursor value (#2016)

Adjusts the `update_state()` method to set `_next_reference` to None
when the cursor value extracted from the JSON response is an empty
string, preventing unintended pagination requests.

Fixed #2012
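
The behavior described above can be sketched as follows. Only `update_state()` and `_next_reference` come from the commit message; the class name, the `cursor_key` parameter, and the `has_next_page` attribute are hypothetical and used here for illustration only.

```python
from typing import Any, Dict, Optional


class CursorPaginatorSketch:
    """Hypothetical stand-in for a JSON cursor paginator."""

    def __init__(self, cursor_key: str = "next_cursor") -> None:
        self.cursor_key = cursor_key
        self._next_reference: Optional[str] = None
        self.has_next_page = True

    def update_state(self, response_json: Dict[str, Any]) -> None:
        cursor = response_json.get(self.cursor_key)
        # Treat both a missing cursor and an empty string as "no more pages".
        self._next_reference = cursor if cursor else None
        self.has_next_page = self._next_reference is not None
```

Without the empty-string check, a response such as `{"next_cursor": ""}` would leave a falsy but non-None reference behind, and the client could issue one more request with an empty cursor.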
* bumps duckdb version in dbt tests

* defines type for factory of any source

* supports writing default values for specs
* add delta arrow load id partition handling

* handle None case

* handle delta arrow load id partition column merge disposition
* lint and type check all snippets at once

* start fixing snippets

* don't ignore missing imports

* small changes

* fixes many snippets

* fix a bunch of more stuff

* fix linter

* more small fixes

* make pendulum and datetime top level imports

* fixed timedelta occurrences

* fix linter
* Add tests for LanceDB chunking and merging functionality

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Add TSplitter type alias for LanceDB document splitting function

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Refine typing for chunks

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Add type definitions for chunk splitter function and related types

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Remove unused ChunkInputT, ChunkOutputT, and TSplitter type definitions

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Implement efficient update strategy for chunked documents in LanceDB

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Implement efficient update strategy for chunked documents in LanceDB

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Refactor LanceDB client and tests for improved readability and type safety

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Linting

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Add document_id parameter to lancedb_adapter and update merge logic

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Remove resolved comments

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Implement efficient orphan removal for chunked documents in LanceDB

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Implement efficient update strategy for chunked documents in LanceDB

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Add test for removing orphaned records in LanceDB

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Update LanceDB orphaned records removal test for chunked documents

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Set test pipeline as dev mode

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Fix write disposition check in LanceDBRemoveOrphansJob execute method

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Add FollowupJob trait to LoadLanceDBJob

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Fix file type

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Fix file typing

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Add test for removing orphaned records in LanceDB root table

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Enhance LanceDB test to cover nested child removal and update scenarios

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Use doc id hint for top level tables

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Only join on join columns for orphan removal job

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Add ollama to supported embedding providers and test orphaned record removal with embeddings

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Add merge_key to document resource for efficient updates in LanceDB

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Formatting

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Set default file size to 128MB

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Only use parquet loader file formats

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Import pyarrow.parquet

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Remove recommended file size from LanceDB destination capabilities

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Update LanceDB client to use more efficient batch processing methods on loading for Load Jobs

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Refactor unique identifier handling for LanceDB tables

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Optimize UUID column generation for LanceDB tables

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Refactor LanceDBClient to use string type hints for Table

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Minor refactor

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Implement efficient schema update with Nullability support

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Optimize orphaned chunks removal for large datasets

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Projection pushdown

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Format

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Prevent primary key and document ID hint conflict in merge disposition

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Add recommended file size for LanceDB destination

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Improve comment clarity for projection push-down in LanceDB

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Update to new load interface

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Remove unnecessary LanceDBLoadJob attributes

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Change instance attributes to `run` method as variables

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Schedule follow up reference job

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Add follow up lancedb remove orphan job skeleton

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Write empty follow up file

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Write parquet

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Add support for reference file format in LanceDB destination

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Handle parent table name resolution if it doesn't exist in Lance db remove orphan job

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Refactor specialised orphan follow up job back to reference job

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Refactor orphan removal for chunked documents

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Fix dlt system table check for name instead of object

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Implement staging methods

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Override staging client methods

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Docs

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Override staging client methods

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Delete with inserts

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Keep with batch reader

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Remove Lancedb client's staging implementation

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Insert in memory arrow table. This will be optimized

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Rename classes to the new job implementation classes

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Use namedtuple for table chain to improve readability

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Remove orphans by loading all ancestor IDs simultaneously

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Fix doc_id adapter

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Fix doc_id adapter

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Revert to previous

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Revert "Remove orphans by loading all ancestor IDs simultaneously"

This reverts commit 06e04d9.

* Remove doc_id hint

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Infer merge key if not supplied from provided primary key

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Remove unused utility functions

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Remove LanceDB doc ID hints and use schema normalizer

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* LanceDB writes strange code

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Minor Formatting

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Support compound primary and merge keys

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Remove old comment

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* - Change default vector column name to "vector" to conform with lancedb standard
- Add search tests with tantivy as search engine

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Format and fix linting

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Add custom embedding function registration test

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Spawn process in test to make sure registry can be deserialized from arrow files

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Simplify null string handling

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Change NULL string replacement with random string, doc clarification

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Update default vector column name in docs

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Set `remove_orphans` flag to False on tests that don't require it

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Implement starter arrow string placeholder function

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Add test for empty arrow string element vectorised replacement utility function

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Handle NULL values in addition to empty strings in arrow substitution method

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* More efficient empty value replacement with canonical arrow usage

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Format

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Bump pyarrow version

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Use pa.nulls instead of [None]*len

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Update tests

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Invert remove orphans flag

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Implement root table orphan deletion, only integer doc_ids

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Cater for string ids as well in doc_id removal process

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Fix test with wrong primary key

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Just send list of ids as is. don't pc.compute on client end

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Extract schema matching into utils

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Add utils

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Pass all tests

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Minor format and cleanup

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Docs

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Amend replace test to test with large number of records to catch race conditions with replace disposition

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Fix replace race conditions by delegating truncation to dlt

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Update lock file

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Refactor type mapping and schema handling in LanceDB client

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* Change 'complex' column type to 'json' in LanceDB client

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* update lock file

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>

* fixes generating lancedb literals

* verifies merge key early, fixes column override in adapters

* fixes linting errors

---------

Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>
Co-authored-by: Marcin Rudolf <rudolfix@rudolfix.org>
* added partial loading example

* Updated formatting

* Updated

* Updated

* Updated the logic according to comment
dat-a-man and others added 22 commits November 7, 2024 07:35
* Added deploy with modal.

* A few minor fixes

* updated links as per comment

* Updated as per the comments.

* Update docs/website/docs/walkthroughs/deploy-a-pipeline/deploy-with-modal.md

* Updated

* Updated as per comments

* Updated

* minor fix for relative link

* Incorporated comments and new script provided.

* Added the snippets

* Updated

* Updated

* updated poetry.lock

* Updated "poetry.lock"

* Added "__init__.py"

* Updated snippets.py

* Updated path in MAKEFILE

* Added __init__.py in walkthroughs

* Adjusted for black

* Modified mypy.ini added a pattern module_name_pattern = '[a-zA-Z0-9_\-]+'

* updated

* renamed deploy-a-pipeline with deploy_a_pipeline

* Updated for errors in linting

* small changes

* bring back deploy-a-pipeline

* bring back deploy-a-pipeline in sidebar

* fix path to snippet

* update lock file

* fix path to snippet in tags

* fix Duplicate module named "snippets"

* rename snippets to code, refactor article, fix mypy errors

* fix black errors

* rename code to deploy_snippets

* add pytest testing for modal function

* move example article to the bottom

* update lock file

---------

Co-authored-by: Anton Burnashev <anton.burnashev@gmail.com>
Co-authored-by: Alena <alena@dlthub.com>
#2026)

* fix max table nesting, updated tests to come

* completely rework tests

* calculate max nesting only once, and count nesting level backwards

* fix normalizer tests in common

* cache shorten fragments (saves about 20-25% of time)

* cache normalizing identifiers
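
The caching commits above rely on the fact that the same identifiers are normalized over and over while rows are processed. A minimal sketch of that idea, assuming a simple snake_case rule: `normalize_identifier` is a hypothetical function, not the normalizer changed in this PR.

```python
import re
from functools import lru_cache


@lru_cache(maxsize=None)
def normalize_identifier(name: str) -> str:
    """Hypothetical snake_case normalizer; results are memoized per input."""
    # Insert an underscore at lower-to-upper boundaries, then replace
    # every run of non-alphanumeric characters with a single underscore.
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    return re.sub(r"[^A-Za-z0-9]+", "_", spaced).strip("_").lower()
```

Because the function is pure, repeated calls with the same name become dictionary lookups; `normalize_identifier.cache_info()` reports how often that happens.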
* move default pipelines of cores sources into source folders

* move core source templates to own folder

* rename single file template folder

* update variables

* update test imports
…2036)

* add warning for large delta memory footprint

* fix wording
* fix indenting

* add serialized=True to pass tests
* Added docs on how to deploy a pipeline using google cloud run

* Update docs/website/docs/walkthroughs/deploy-a-pipeline/deploy-with-google-cloud-run.md

* Update docs/website/docs/walkthroughs/deploy-a-pipeline/deploy-with-google-cloud-run.md

---------

Co-authored-by: Alena Astrakhantseva <alena@dlthub.com>
* disable secrets clearing on CI

(cherry picked from commit cc7fddf)

* first version of custom secrets directory

* add test and remove secret directory from create_authentication method

* revert changes to authentication creation for in mem secrets

* format

* fix warning message
test secrets on all platforms

* don't actually load anything in secrets test

* run test with mock spy

* add s3 extra

* close connection after use

* ensure at least aiohttp 3.9 for python 3.12

* move secrets tests further down and fix deps

* update lockfile
move secrets tests back to load tests
remove extra deps from common tests
* keeps staging dataset from layout if dataset name empty

* allows for empty dataset in clickhouse

* sets staging destination dataset name in pipeline if destination dataset name empty

* creates default dataset name only when destination is known

* drop sentinel explicitly only when no dataset name

* improves tests

* adds docs

* fixes more tests

* fixes drop dataset so it is not dropped if no sentinel table
* add gcp default credential handling for delta table format

* mark object store credentials test essential

* make secrets dir if not exists

* reset failed default credentials
* fix delta tests

* update workflows
* part 1: only truncate existing tables

* some more changes

* only run "if exists truncating" on autodetect tables

* create final table from staging tables only for autodetect tables on bigquery

* add test

* fix linter
* logs warning if deduplication state is large

* tests for ALL_TEST_DATA_ITEM_FORMATS

* improves error message, refactors magic number

* Make threshold configurable, display the duplication warning only once, update the warning message, change the test to check for single warning

* Move `duplicate_cursor_warning_threshold` to a ClassVar

---------

Co-authored-by: Anton Burnashev <anton.burnashev@gmail.com>
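
The warn-once behavior with a class-level threshold described above could be sketched like this. Only the name `duplicate_cursor_warning_threshold` appears in the commit messages; the class, its other attributes, the default of 200, and the message text are assumptions.

```python
import logging
from typing import ClassVar, List

logger = logging.getLogger(__name__)


class DedupStateSketch:
    """Hypothetical tracker of row hashes that share the last cursor value."""

    # Class-level default (assumed value); override on the class or a subclass.
    duplicate_cursor_warning_threshold: ClassVar[int] = 200

    def __init__(self) -> None:
        self.unique_hashes: List[str] = []
        self._warned = False

    def add_hash(self, row_hash: str) -> None:
        self.unique_hashes.append(row_hash)
        over = len(self.unique_hashes) > self.duplicate_cursor_warning_threshold
        if over and not self._warned:
            logger.warning(
                "Deduplication state holds %d entries for one cursor value",
                len(self.unique_hashes),
            )
            self._warned = True  # emit the warning only once
```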
* Extends _get_row_key_col to enable using single primary key as a fallback in gen_merge_sql().

* Remove fallback from `_get_row_key_col`, always fallback to primary key

* Update the docstring in _get_row_key_col

* Move arrow merge to test_merge_disposition

* Adjust for delta tables

* Includes table format in resource definition
* re-enable delta tests for read access
add rudimentary tests for destination configs
fixes small problems in test setup

* fix cloudflare compat tests

* cleans up env variables set by delta-rs

* bumps to version 1.4.0

* fixes delta tests

---------

Co-authored-by: Marcin Rudolf <rudolfix@rudolfix.org>
@rudolfix added the "ci full" label (run the full load tests on pr) on Nov 14, 2024

netlify bot commented Nov 14, 2024

Deploy Preview for dlt-hub-docs canceled.

Name Link
🔨 Latest commit 73b79ee
🔍 Latest deploy log https://app.netlify.com/sites/dlt-hub-docs/deploys/673665799f284c000824a0ad

@rudolfix rudolfix merged commit 0fce1c8 into master Nov 14, 2024
29 of 30 checks passed
Labels
ci full run the full load tests on pr