master merge for 1.4.0 release #2063
Merged

* Added info to troubleshooting
* just to rerun tests
* Update docs/website/docs/dlt-ecosystem/verified-sources/sql_database/troubleshooting.md
---------
Co-authored-by: Alena Astrakhantseva <alena@dlthub.com>

…parents exception() method. (#1992)

* Makes zendesk tests resilient to data changes
* Use requests as a module

* feat: add incremental lag for datetime, int, and float cursors
* chore: reverse eng datetime format
* test: incremental lag datetime
* chore: add lag in merge function
* chore: extended ISO compliance datetime detect format
* chore: changed deduplication_disabled
* fix: native_representation and merge
* fix: _deduplication_disabled function
* test: changed expected results
* test: test incremental lag for datetime and int
* chore: incremental lag disabled with custom last_value_func
* test: lag incremental for min function
* chore: lag date tests and adjustments
* chore: edge case end_values
* chore: edge case initial_values
* test: incremental lag float
* supports lag in rest-api
* moves lag to separate module, simplifies applying initial_value to lag result
* fix: add test missing variable
* docs: lag incremental loading
---------
Co-authored-by: Marcin Rudolf <rudolfix@rudolfix.org>
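
The incremental lag named in the commit above can be pictured with a small standalone sketch. This is not dlt's implementation: `apply_lag` is a hypothetical helper, and reading the lag as seconds for datetime cursors and as the cursor's own unit for int and float cursors is an assumption made for illustration. It only shows the idea of moving the stored cursor back by a window while never going earlier than the initial value.

```python
from datetime import datetime, timedelta
from typing import Optional, Union

Cursor = Union[datetime, int, float]


def apply_lag(last_value: Cursor, lag: float, initial_value: Optional[Cursor] = None) -> Cursor:
    """Hypothetical helper: shift a stored cursor value back by `lag`.

    Assumption: datetime cursors take the lag in seconds, numeric
    cursors take it in their own unit.
    """
    if isinstance(last_value, datetime):
        lagged = last_value - timedelta(seconds=lag)
    else:
        lagged = last_value - lag
    # Never start earlier than the configured initial value.
    if initial_value is not None and lagged < initial_value:
        return initial_value
    return lagged


if __name__ == "__main__":
    # Re-read the last hour of data on the next run.
    print(apply_lag(datetime(2024, 1, 1, 12, 0), 3600))
    # Numeric cursor: re-read the last 10 ids.
    print(apply_lag(100, 10))
```

With a lag applied, rows inside the window are requested again on the next run, so the sketch assumes the load deduplicates or merges them downstream.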

* prepares tables before generating duckdb views in filesystem
* allows streamlit app to work with any pipelines-dir
* wakes load when job completes
* skips warning timestamp with precision=6 on duckdb
* drops staging dataset in drop_storage
* resets event before sleeping
* fixes committed but unsaved schema getitem in live schema storage
* loads toml files from colab userdata and integrates with streamlit
* adds pipeline to section context properly
* makes dropped pipeline defunct
* switches to tmp folder if home not writable
* adds initial cwd to pipeline and uses it to locate duckdb databases for pipelines
* allows for explicit values to set final and non resolvable fields
* toml fixes and pipeline drop airflow wip
* fixes drop command with and without rename, fixes airflow tests
* exposes staging and destination as props in pipeline

Co-authored-by: Alexander Fife <alexander@dlthub.com>

* add interface for limiting and selecting columns
* add implementation (tests pending)
* make read interfaces tests nicer (no tests yet for limit and columns)
* add tests for limit, head and select
* escape column names in select clause
* add unit tests and basic exceptions for readable dbapi dataset
* fix limit on mssql/synapse, fix column selection tests for snowflake
* make schema and client properly lazy loading
* disable cache
* adds pandas like column selector && fixes path norm
* fixes test
* improves compression detection
* fixes compression on jsonl in duckdb remote views
* fixes column_schemas
* fixes duckdb secrets drop in tests
---------
Co-authored-by: Marcin Rudolf <rudolfix@rudolfix.org>

* bumps duckdb version in dbt tests
* defines type for factory of any source
* supports writing default values for specs

* lint and type check all snippets at once
* start fixing snippets
* don't ignore missing imports
* small changes
* fixes many snippets
* fix a bunch of more stuff
* fix linter
* more small fixes
* make pendulum and datetime top level imports
* fixed timedelta occurrences
* fix linter

* Add tests for LanceDB chunking and merging functionality
* Add TSplitter type alias for LanceDB document splitting function
* Refine typing for chunks
* Add type definitions for chunk splitter function and related types
* Remove unused ChunkInputT, ChunkOutputT, and TSplitter type definitions
* Implement efficient update strategy for chunked documents in LanceDB
* Implement efficient update strategy for chunked documents in LanceDB
* Refactor LanceDB client and tests for improved readability and type safety
* Linting
* Add document_id parameter to lancedb_adapter and update merge logic
* Remove resolved comments
* Implement efficient orphan removal for chunked documents in LanceDB
* Implement efficient update strategy for chunked documents in LanceDB
* Add test for removing orphaned records in LanceDB
* Update LanceDB orphaned records removal test for chunked documents
* Set test pipeline as dev mode
* Fix write disposition check in LanceDBRemoveOrphansJob execute method
* Add FollowupJob trait to LoadLanceDBJob
* Fix file type
* Fix file typing
* Add test for removing orphaned records in LanceDB root table
* Enhance LanceDB test to cover nested child removal and update scenarios
* Use doc id hint for top level tables
* Only join on join columns for orphan removal job
* Add ollama to supported embedding providers and test orphaned record removal with embeddings
* Add merge_key to document resource for efficient updates in LanceDB
* Formatting
* Set default file size to 128MB
* Only use parquet loader file formats
* Import pyarrow.parquet
* Remove recommended file size from LanceDB destination capabilities
* Update LanceDB client to use more efficient batch processing methods on loading for Load Jobs
* Refactor unique identifier handling for LanceDB tables
* Optimize UUID column generation for LanceDB tables
* Refactor LanceDBClient to use string type hints for Table
* Minor refactor
* Implement efficient schema update with Nullability support
* Optimize orphaned chunks removal for large datasets
* Projection pushdown
* Format
* Prevent primary key and document ID hint conflict in merge disposition
* Add recommended file size for LanceDB destination
* Improve comment clarity for projection push-down in LanceDB
* Update to new load interface
* Remove unnecessary LanceDBLoadJob attributes
* Change instance attributes to `run` method as variables
* Schedule follow up reference job
* Add follow up lancedb remove orphan job skeleton
* Write empty follow up file
* Write parquet
* Add support for reference file format in LanceDB destination
* Handle parent table name resolution if it doesn't exist in Lance db remove orphan job
* Refactor specialised orphan follow up job back to reference job
* Refactor orphan removal for chunked documents
* Fix dlt system table check for name instead of object
* Implement staging methods
* Override staging client methods
* Docs
* Override staging client methods
* Delete with inserts
* Keep with batch reader
* Remove Lancedb client's staging implementation
* Insert in memory arrow table. This will be optimized
* Rename classes to the new job implementation classes
* Use namedtuple for table chain to improve readability
* Remove orphans by loading all ancestor IDs simultaneously
* Fix doc_id adapter
* Fix doc_id adapter
* Revert to previous
* Revert "Remove orphans by loading all ancestor IDs simultaneously" This reverts commit 06e04d9.
* Remove doc_id hint
* Infer merge key if not supplied from provided primary key
* Remove unused utility functions
* Remove LanceDB doc ID hints and use schema normalizer
* LanceDB writes strange code
* Minor Formatting
* Support compound primary and merge keys
* Remove old comment
* Change default vector column name to "vector" to conform with lancedb standard; add search tests with tantivy as search engine
* Format and fix linting
* Add custom embedding function registration test
* Spawn process in test to make sure registry can be deserialized from arrow files
* Simplify null string handling
* Change NULL string replacement with random string, doc clarification
* Update default vector column name in docs
* Set `remove_orphans` flag to False on tests that don't require it
* Implement starter arrow string placeholder function
* Add test for empty arrow string element vectorised replacement utility function
* Handle NULL values in addition to empty strings in arrow substitution method
* More efficient empty value replacement with canonical arrow usage
* Format
* Bump pyarrow version
* Use pa.nulls instead of [None]*len
* Update tests
* Invert remove orphans flag
* Implement root table orphan deletion, only integer doc_ids
* Cater for string ids as well in doc_id removal process
* Fix test with wrong primary key
* Just send list of ids as is. don't pc.compute on client end
* Extract schema matching into utils
* Add utils
* Pass all tests
* Minor format and cleanup
* Docs
* Amend replace test to test with large number of records to catch race conditions with replace disposition
* Fix replace race conditions by delegating truncation to dlt
* Update lock file
* Refactor type mapping and schema handling in LanceDB client
* Change 'complex' column type to 'json' in LanceDB client
* update lock file
* fixes generating lancedb literals
* verifies merge key early, fixes column override in adapters
* fixes linting errors
---------
Signed-off-by: Marcel Coetzee <marcel@mooncoon.com>
Co-authored-by: Marcin Rudolf <rudolfix@rudolfix.org>

* added partial loading example
* Updated formatting
* Updated
* Updated
* Updated the logic according to comment

* Added deploy with modal.
* A few minor fixes
* updated links as per comment
* Updated as per the comments.
* Update docs/website/docs/walkthroughs/deploy-a-pipeline/deploy-with-modal.md
* Updated
* Updated as per comments
* Updated
* minor fix for relative link
* Incorporated comments and new script provided.
* Added the snippets
* Updated
* Updated
* updated poetry.lock
* Updated "poetry.lock"
* Added "__init__.py"
* Updated snippets.py
* Updated path in MAKEFILE
* Added __init__.py in walkthroughs
* Adjusted for black
* Modified mypy.ini added a pattern module_name_pattern = '[a-zA-Z0-9_\-]+'
* updated
* renamed deploy-a-pipeline with deploy_a_pipeline
* Updated for errors in linting
* small changes
* bring back deploy-a-pipeline
* bring back deploy-a-pipeline in sidebar
* fix path to snippet
* update lock file
* fix path to snippet in tags
* fix Duplicate module named "snippets"
* rename snippets to code, refactor article, fix mypy errors
* fix black errors
* rename code to deploy_snippets
* add pytest testing for modal function
* move example article to the bottom
* update lock file
---------
Co-authored-by: Anton Burnashev <anton.burnashev@gmail.com>
Co-authored-by: Alena <alena@dlthub.com>

#2026)
* fix max table nesting, updated tests to come
* completely rework tests
* calculate max nesting only once, and count nesting level backwards
* fix normalizer tests in common
* cache shorten fragments (saves about 20-25% of time)
* cache normalizing identifiers

* move default pipelines of core sources into source folders
* move core source templates to own folder
* rename single file template folder
* update variables
* update test imports

…2036)
* add warning for large delta memory footprint
* fix wording

* fix indenting
* add serialized=True to pass tests

* Added docs on how to deploy a pipeline using google cloud run
* Update docs/website/docs/walkthroughs/deploy-a-pipeline/deploy-with-google-cloud-run.md
* Update docs/website/docs/walkthroughs/deploy-a-pipeline/deploy-with-google-cloud-run.md
---------
Co-authored-by: Alena Astrakhantseva <alena@dlthub.com>

* disable secrets clearing on CI (cherry picked from commit cc7fddf)
* first version of custom secrets directory
* add test and remove secret directory from create_authentication method
* revert changes to authentication creation for in mem secrets
* format
* fix warning message test secrets on all platforms
* don't actually load anything in secrets test
* run test with mock spy
* add s3 extra
* close connection after use
* ensure at least aiohttp 3.9 for python 3.12
* move secrets tests further down and fix deps
* update lockfile move secrets tests back to load tests remove extra deps from common tests

* keeps staging dataset from layout if dataset name empty
* allows for empty dataset in clickhouse
* sets staging destination dataset name in pipeline if destination dataset name empty
* creates default dataset name only when destination is known
* drop sentinel explicitly only when no dataset name
* improves tests
* adds docs
* fixes more tests
* fixes drop dataset so it is not dropped if no sentinel table

* add gcp default credential handling for delta table format
* mark object store credentials test essential
* make secrets dir if not exists
* reset failed default credentials

* fix delta tests
* update workflows

* part 1: only truncate existing tables
* some more changes
* only run "if exists truncating" on autodetect tables
* create final table from staging tables only for autodetect tables on bigquery
* add test
* fix linter

* logs warning if deduplication state is large
* tests for ALL_TEST_DATA_ITEM_FORMATS
* improves error message, refactors magic number
* Make threshold configurable, display the duplication warning only once, update the warning message, change the test to check for single warning
* Move `duplicate_cursor_warning_threshold` to a ClassVar
---------
Co-authored-by: Anton Burnashev <anton.burnashev@gmail.com>
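
The warn-once behaviour described in the commit above (a configurable threshold held in a ClassVar, with the warning shown a single time) can be sketched as follows. This is a minimal illustration, not the code from the PR: `DedupTracker`, `track`, the logger name, the message text, and the default of 200 are all hypothetical. Only the attribute name `duplicate_cursor_warning_threshold` comes from the commit message.

```python
import logging
from typing import ClassVar, Hashable, Set

logger = logging.getLogger("dedup_sketch")  # hypothetical logger name


class DedupTracker:
    # Hypothetical default; the actual value is not stated in the PR text.
    duplicate_cursor_warning_threshold: ClassVar[int] = 200

    def __init__(self) -> None:
        self.unique_hashes: Set[Hashable] = set()
        self._warned = False

    def track(self, row_hash: Hashable) -> bool:
        """Record a row hash; return True only on the call that emits the warning."""
        self.unique_hashes.add(row_hash)
        over = len(self.unique_hashes) > self.duplicate_cursor_warning_threshold
        if over and not self._warned:
            logger.warning(
                "Deduplication state holds %d hashes for one cursor value",
                len(self.unique_hashes),
            )
            self._warned = True  # suppress repeats for this tracker
            return True
        return False
```

Keeping the threshold on the class means a subclass or a test can override it without touching instances, which is one plausible reason for the ClassVar move.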

* Extends _get_row_key_col to enable using single primary key as a fallback in gen_merge_sql().
* Remove fallback from `_get_row_key_col`, always fallback to primary key
* Update the docstring in _get_row_key_col
* Move arrow merge to test_merge_disposition
* Adjust for delta tables
* Includes table format in resource definition

* re-enable delta tests for read access add rudimentary tests for destination configs fixes small problems in test setup
* fix cloudflare compat tests
* cleans up env variables set by delta-rs
* bumps to version 1.4.0
* fixes delta tests
---------
Co-authored-by: Marcin Rudolf <rudolfix@rudolfix.org>

Description
master merge for 1.4.0 release