Merging with null matching causes extreme performance degradation. #2891
Comments
Hello Brendan, sorry to hear that my PR introduced such a large slowdown. As an alternative to the control check,
can you please try the following:
According to the documentation (https://trino.io/docs/current/functions/comparison.html#is-distinct-from-and-is-not-distinct-from), this is the proper way to do an equality check that treats NULL values as equal. Be aware, though, that it doesn't work if any of the columns involved contains only NULL values (possibly due to a bug in Trino; I'm honestly not sure). Please let me know how it goes.
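As a rough sketch of how a null-safe join condition based on `IS NOT DISTINCT FROM` could be generated, assuming a hypothetical helper named `build_merge_condition` (this is not the library's actual code, and the `null_safe` flag is an illustrative assumption):

```python
def build_merge_condition(columns, null_safe=False):
    """Build the ON clause of a MERGE statement for the given key columns.

    Hypothetical helper for illustration only; not part of AWS SDK for pandas.
    """
    parts = []
    for col in columns:
        if null_safe:
            # Treats two NULLs as equal, per the Trino comparison docs.
            parts.append(f'target."{col}" IS NOT DISTINCT FROM source."{col}"')
        else:
            # Plain equality: NULL never matches NULL.
            parts.append(f'target."{col}" = source."{col}"')
    return "(" + " AND ".join(parts) + ")"


print(build_merge_condition(["id"]))
print(build_merge_condition(["id"], null_safe=True))
```

Keeping null matching behind an opt-in flag like this would leave the plain equality join as the default for tables whose key columns never contain NULLs.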
I just ran a test using
@brendan-cook-87 The drastic increase in query execution time when adding the
Hope this helps, and let me know if you run into any further issues.
I agree merging on id = id works fine. That's exactly why I raised the issue.
Describe the bug
When we add the join clause
OR (target."id" IS NULL AND source."id" IS NULL)
to enable matching of nulls in the id columns, it causes some queries to take orders of magnitude longer to complete. #2872
How to Reproduce
I have a table with approximately 17 million rows, and an insert operation scans approximately 350 MB of data.
The previous behaviour of running:
MERGE INTO "production_mobile"."tracks" target USING "production_mobile"."temp_table_f30f8023428b4dab816840e62ba40699" source ON (target."id" = source."id")
to insert ~300 new rows would execute in around 5s.
Running it with this clause:
MERGE INTO "production_mobile"."tracks" target USING "production_mobile"."temp_table_f30f8023428b4dab816840e62ba40699" source ON (target."id" = source."id" OR (target."id" IS NULL AND source."id" IS NULL))
is taking 12+ minutes across several attempts.
This table has no nulls in the id column. It is processing event logs and merging on UUIDs.
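To illustrate why the extra clause changes nothing for a key column without nulls, here is a small sketch using Python's built-in `sqlite3` module as a stand-in engine (an assumption made for portability; it says nothing about Trino's or Athena's query planning or timings). The table contents and the `count_matches` helper are made up for the example:

```python
import sqlite3

PLAIN = "SELECT COUNT(*) FROM target t JOIN source s ON (t.id = s.id)"
NULL_MATCHING = (
    "SELECT COUNT(*) FROM target t JOIN source s "
    "ON (t.id = s.id OR (t.id IS NULL AND s.id IS NULL))"
)


def count_matches(target_ids, source_ids):
    """Return (plain_count, null_matching_count) for the two join conditions."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE target (id TEXT)")
    conn.execute("CREATE TABLE source (id TEXT)")
    conn.executemany("INSERT INTO target VALUES (?)", [(i,) for i in target_ids])
    conn.executemany("INSERT INTO source VALUES (?)", [(i,) for i in source_ids])
    plain = conn.execute(PLAIN).fetchone()[0]
    null_matching = conn.execute(NULL_MATCHING).fetchone()[0]
    conn.close()
    return plain, null_matching


# No nulls in the key column: both conditions match the same rows.
print(count_matches(["a", "b", "c"], ["b", "c", "d"]))  # (2, 2)

# With nulls on both sides, only the extended condition pairs them up.
print(count_matches(["a", "b", None], ["b", None, "d"]))  # (1, 2)
```

For a table like the one described here, the extended condition returns exactly the same matches as plain equality, so it only adds cost without changing the result.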
Expected behavior
This behaviour should not be the default given that it causes live production systems such a degradation in performance.
I am unable to update to the latest version of this layer at this time.
Your project
No response
Screenshots
No response
OS
Mac
Python version
3.11
AWS SDK for pandas version
3.9.0
Additional context
No response