feat: Adds MERGE INTO transform #109
Conversation
Force-pushed from b7b996b to 3d2ce14
This commit adds the ability to convert Snowflake's [MERGE INTO](https://docs.snowflake.com/en/sql-reference/sql/merge) functionality into a functionally equivalent implementation in DuckDB. To do this we need to break apart the WHEN [NOT] MATCHED syntax into separate statements to be executed independently. This commit only adds the transform; there is more refactoring required in fakes.py in order to handle a transform that transforms a single expression into multiple expressions.
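As an illustration of the decomposition (a minimal sketch with invented tables, using the stdlib's sqlite3 as a stand-in engine since the same engine-agnostic SQL runs there; the project itself targets DuckDB), an upsert-style `MERGE INTO t USING s ON t.k = s.k WHEN MATCHED THEN UPDATE ... WHEN NOT MATCHED THEN INSERT ...` could become an UPDATE plus an INSERT:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE t (k INTEGER PRIMARY KEY, v TEXT);
    CREATE TABLE s (k INTEGER PRIMARY KEY, v TEXT);
    INSERT INTO t VALUES (1, 'a');
    INSERT INTO s VALUES (1, 'A'), (2, 'B');

    -- WHEN MATCHED THEN UPDATE SET v = s.v
    UPDATE t SET v = (SELECT s.v FROM s WHERE s.k = t.k)
    WHERE t.k IN (SELECT k FROM s);

    -- WHEN NOT MATCHED THEN INSERT (k, v) VALUES (s.k, s.v)
    INSERT INTO t SELECT s.k, s.v FROM s
    WHERE s.k NOT IN (SELECT k FROM t);
""")
print(con.execute("SELECT k, v FROM t ORDER BY k").fetchall())
# → [(1, 'A'), (2, 'B')]
```

Note the ordering: the UPDATE runs before the INSERT, so the matched/not-matched split is decided against the original contents of `t`.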
Force-pushed from 3d2ce14 to c842900
Thanks for starting this! It's a tricky feature. I hope you don't mind but I've pushed 40ada94 as a test case to cover. It comes from the Snowflake MERGE docs. It provides good coverage with a mix of delete, update and insert branches with some conditionals:

```sql
MERGE INTO t1 USING t2 ON t1.t1Key = t2.t2Key
WHEN MATCHED AND t2.marked = 1 THEN DELETE
WHEN MATCHED AND t2.isNewStatus = 1 THEN UPDATE SET val = t2.newVal, status = t2.newStatus
WHEN MATCHED THEN UPDATE SET val = t2.newVal
WHEN NOT MATCHED THEN INSERT (t1Key, val, status) VALUES (t2.t2Key, t2.newVal, t2.newStatus);
```

It also demonstrates that MERGE needs to:
One way to solve these is to calculate the merge operations up front using the original table and store them as a MERGE_OP column in a temporary table. This ensures the branch conditions work with the state of the original table, not the table after it has already been modified by a DML statement run for a previous branch. So something like this for the test case above:

```sql
CREATE OR REPLACE TEMPORARY TABLE merge_candidates AS
SELECT
    t1.t1Key AS t1_t1Key,
    t1.val AS t1_val,
    t1.status AS t1_status,
    t2.t2Key AS t2_t2Key,
    t2.marked AS t2_marked,
    t2.isNewStatus AS t2_isNewStatus,
    t2.newVal AS t2_newVal,
    t2.newStatus AS t2_newStatus,
    CASE
        WHEN t1.t1Key = t2.t2Key AND t2.marked = 1 THEN 1      -- DELETE
        WHEN t1.t1Key = t2.t2Key AND t2.isNewStatus = 1 THEN 2 -- UPDATE with conditional
        WHEN t1.t1Key = t2.t2Key THEN 3                        -- UPDATE the rest
        WHEN t1.t1Key IS NULL THEN 4                           -- INSERT remaining matches
        ELSE 0                                                 -- IGNORE
    END AS MERGE_OP
FROM t1
FULL OUTER JOIN t2 ON t1.t1Key = t2.t2Key
WHERE MERGE_OP > 0;
```

This can then be used to drive the DML statements for each branch, and provide the result set, like this:

```sql
DELETE FROM t1
USING merge_candidates AS mc
WHERE t1.t1Key = mc.t1_t1Key
  AND mc.merge_op = 1;

UPDATE t1
SET val = mc.t2_newVal,
    status = mc.t2_newStatus
FROM merge_candidates AS mc
WHERE t1.t1Key = mc.t1_t1Key
  AND mc.merge_op = 2;

UPDATE t1
SET val = mc.t2_newVal
FROM merge_candidates AS mc
WHERE t1.t1Key = mc.t1_t1Key
  AND mc.merge_op = 3;

INSERT INTO t1 (t1Key, val, status)
SELECT mc.t2_t2Key, mc.t2_newVal, mc.t2_newStatus
FROM merge_candidates AS mc
WHERE mc.merge_op = 4;

SELECT
    COUNT_IF(merge_op IN (4)) AS "number of rows inserted",
    COUNT_IF(merge_op IN (2, 3)) AS "number of rows updated",
    COUNT_IF(merge_op IN (1)) AS "number of rows deleted"
FROM merge_candidates;
```

Let me know what you think, and if this makes sense!
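The merge_candidates approach above can be sketched in pure Python (illustrative only; the sample data is invented and the real implementation is SQL). Every source row is classified against the original state of `t1` first, and only then are the delete/update/insert steps applied, driven by the precomputed ops:

```python
# Target table keyed by t1Key, and the source rows from t2.
t1 = {1: {"val": 10, "status": 0}, 2: {"val": 20, "status": 0}, 3: {"val": 30, "status": 0}}
t2 = [
    {"t2Key": 1, "marked": 1, "isNewStatus": 0, "newVal": 0, "newStatus": 0},
    {"t2Key": 2, "marked": 0, "isNewStatus": 1, "newVal": 21, "newStatus": 1},
    {"t2Key": 4, "marked": 0, "isNewStatus": 0, "newVal": 40, "newStatus": 4},
]

def merge_op(row: dict) -> int:
    """Mirror the CASE expression: the first matching WHEN branch wins."""
    matched = row["t2Key"] in t1  # join condition t1.t1Key = t2.t2Key
    if matched and row["marked"] == 1:
        return 1  # DELETE
    if matched and row["isNewStatus"] == 1:
        return 2  # UPDATE val and status
    if matched:
        return 3  # UPDATE val only
    return 4  # INSERT (no matching t1 row)

# The equivalent of the temporary merge_candidates table: ops are computed
# before any mutation, so every branch sees the original table state.
candidates = [(merge_op(r), r) for r in t2]

# DML driven by the precomputed ops, not by re-evaluating branch
# conditions against a partially mutated table.
for op, r in candidates:
    key = r["t2Key"]
    if op == 1:
        del t1[key]
    elif op == 2:
        t1[key] = {"val": r["newVal"], "status": r["newStatus"]}
    elif op == 3:
        t1[key]["val"] = r["newVal"]
    elif op == 4:
        t1[key] = {"val": r["newVal"], "status": r["newStatus"]}

# The result-set counts Snowflake reports after a MERGE.
counts = {
    "inserted": sum(op == 4 for op, _ in candidates),
    "updated": sum(op in (2, 3) for op, _ in candidates),
    "deleted": sum(op == 1 for op, _ in candidates),
}
print(t1)
print(counts)  # {'inserted': 1, 'updated': 1, 'deleted': 1}
```

Rows of `t1` with no match in `t2` (key 3 here) fall into the ELSE 0 branch and are simply left alone, matching the `WHERE MERGE_OP > 0` filter.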
Thanks for the detailed feedback; it seems merge is a lot more intricate than my use case of upsert 😂. I'm on holiday next week so it might be a while before you see more progress on this. Thanks for the test case.
* main:
  * chore(main): release 0.9.20 (tekumara#115)
  * fix: $$ not considered a variable
  * fix: concurrent connection write-write conflict
  * refactor: extract test_connect.py
  * feat: SHOW PRIMARY KEYS for table (tekumara#114)
  * chore(main): release 0.9.19 (tekumara#113)
  * feat: Implements basic snowflake session variables via SET/UNSET (tekumara#111)
  * chore(deps-dev): bump pyright from 1.1.366 to 1.1.369 (tekumara#112)
  * chore(main): release 0.9.18 (tekumara#110)
Got the basic case for merge working with a merge2 function; still needs to be cleaned up: delete the original merge and fix the result statement.
Marked PR as WIP while I work through getting this functionality fully running. I've got the new test case running with a second attempt, but I've got lots to clean up before it's ready for review. I'll mention you when it's ready for review again.
* main: fix: Only set variables for SetItem expressions (tekumara#116)
* main:
  * feat: alter table cluster by
  * chore: cruft update
  * chore: bump sqlglot 25.5.1
How did you go @jsibbison-square? Do you need any help with this?
Hi @tekumara, sorry, I've run short on time at the moment. This was working as far as I was concerned, but I haven't had time to refactor as it's a bit messy. Happy for it to be merged as is and refactored one day, or you can take over if you want the feature sooner.
* main: (38 commits)
  * feat(server): support time & timestamp types
  * ci(server): description - cover more types
  * feat(server): handle snowflake ProgrammingError
  * refactor: use rowtype for sf metadata in arrow schema
  * feat(server): support empty result set
  * chore(main): release 0.9.24 (tekumara#128)
  * fix: don't require pandas at import time
  * chore(main): release 0.9.23 (tekumara#126)
  * feat: support conn.is_closed()
  * refactor: extract conn, cursor, pandas_tools
  * refactor: describe_as_rowtype uses sf_type
  * refactor: describe_as_rowtype
  * refactor: extract describe_as_result_metadata
  * feat(server): support cur.description
  * feat(server): support bool, int, float types
  * chore(main): release 0.9.22 (tekumara#124)
  * wip(server): add /queries/v1/abort-request
  * fix: fetchmany supports irregular sizes
  * fix: column types for DESCRIBE
  * feat: describe view information_schema.*
  * ...
* main:
  * chore(deps): update ruff requirement from ~=0.5.1 to ~=0.6.3 (tekumara#130)
  * chore(deps-dev): bump pyright from 1.1.374 to 1.1.378 (tekumara#133)
@tekumara I've now got a passing build, isolated the merge logic into its own files, and added some documentation so I can hope to recall how it works in the future. Please review 🙏
This reverts commit dfa68e1 because it doesn't work when the alias is for a select expression, eg: tekumara#24 (comment)
Thanks so much @jsibbison-square for this significant piece of work. I've pushed some refactoring to use pure functions rather than a class, as there's no permanent state to capture, and turned your comment documenting the sql output into a test. I'm also considering whether it's worth refactoring this to have a single

The main thing blocking merging this, though, is that it doesn't yet solve for a
Thanks again @jsibbison-square - I've merged this and will address the above in a follow-up PR.
I did start with one table but then went with two; I don't recall exactly why, sadly. Thanks for merging it in its partial-support form; we have a lot of use cases that are the most basic of merges, so it'll be great to start using.
Glad to hear you'll be using this soon. I've raised #136.
…136) Supports MERGE with:

- multiple join columns, that may have the same name
- when the source is a subquery

To do this, I've refactored the way merge works to align with the [approach suggested here](#109 (comment)), ie:

- use a single `merge_candidates` temporary table to map rows to mutation operations (delete, update, insert)
- separate candidate table creation, mutation operations (delete, update, insert), and counts into separate functions, rather than co-mingling this functionality, for better readability
- use join columns rather than rowids, and avoid using a transaction, because duckdb doesn't support nested transactions so the current implementation will error if an existing transaction is active

resolves #24
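Matching on multiple join columns amounts to comparing a tuple of key values rather than a single column. A minimal sketch (with invented sample data, not the project's code):

```python
# Composite-key matching: the moral equivalent of joining on every key
# column at once. Column names and data here are hypothetical.
target = [{"region": "eu", "id": 1, "val": 10}]
source = [{"region": "eu", "id": 1, "val": 99}, {"region": "us", "id": 1, "val": 7}]
join_cols = ("region", "id")

def key(row: dict) -> tuple:
    """Build the composite join key for a row."""
    return tuple(row[c] for c in join_cols)

matched_keys = {key(t) for t in target}
updates = [s for s in source if key(s) in matched_keys]      # WHEN MATCHED
inserts = [s for s in source if key(s) not in matched_keys]  # WHEN NOT MATCHED
print([u["val"] for u in updates], [i["val"] for i in inserts])  # [99] [7]
```

Note that `("us", 1)` does not match even though `id` alone would, which is why all join columns must participate in the key.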
🤖 I have created a release *beep* *boop*

---

## [0.9.25](v0.9.24...v0.9.25) (2024-09-16)

### Features

* Adds MERGE INTO transform ([#109](#109)) ([d5e14a7](d5e14a7))
* close duckdb connection ([223f8e2](223f8e2))
* **server:** handle snowflake ProgrammingError ([9455a43](9455a43))
* **server:** support empty result set ([b967b69](b967b69))
* **server:** support FAKESNOW_DB_PATH ([af79f77](af79f77))
* **server:** support time & timestamp types ([1606a3e](1606a3e))
* support MERGE with multiple join columns and source subqueries ([#136](#136)) ([9b5a7a0](9b5a7a0)), closes [#24](#24)

### Chores

* **deps-dev:** bump pyright from 1.1.374 to 1.1.378 ([#133](#133)) ([593a420](593a420))
* **deps:** update ruff requirement from ~=0.5.1 to ~=0.6.3 ([#130](#130)) ([6b37d8b](6b37d8b))

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: potatobot-prime[bot] <132267321+potatobot-prime[bot]@users.noreply.github.com>
This commit adds the ability to convert Snowflake's MERGE INTO functionality into a functionally equivalent implementation in DuckDB. To do this we need to break apart the WHEN [NOT] MATCHED syntax into separate statements to be executed independently.

This commit only adds the transform; there is more refactoring required in `fakes.py` in order to handle a transform that transforms a single expression into multiple expressions, so it cannot be used yet. I have a change baking for adding it into `fakes` but wanted to separate PRs. Let me know if you need more tests or anything, as it's quite a complicated feature.