Initialize lift + shift for cross-db macros #359
Conversation
This macro does compile SQL that is just about unreadably long. Perhaps it could be "minified"? :) FWIW, it's passing on all other methods, so we might just mark it with
```jinja
{% if order_by_clause %}
{{ exceptions.warn("order_by_clause is not supported for listagg on Spark/Databricks") }}
{% endif %}
```
The docs make it pretty clear that:

> The function is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle.

The only way I've found to do this requires a subquery to calculate `rank()` first, then passed into `collect_list` (with a `struct` and `array_sort` to boot, probably).
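For illustration, a minimal sketch of that approach in plain Spark SQL; the table and column names (`some_table`, `group_col`, `val`, `sort_col`) are hypothetical and not part of this PR:

```sql
-- Hypothetical sketch: emulate an ordered listagg on Spark/Databricks.
-- 1) rank rows within each group in a subquery,
-- 2) collect (rank, value) structs,
-- 3) array_sort orders the structs by their first field (the rank),
-- 4) project the value back out and concatenate.
select
    group_col,
    concat_ws(
        ', ',
        transform(
            array_sort(collect_list(struct(row_rank, val))),
            x -> x.val
        )
    ) as listagg_val
from (
    select
        group_col,
        val,
        rank() over (partition by group_col order by sort_col) as row_rank
    from some_table
) ranked
group by group_col
```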
Force-pushed from 4b5ec8a to 44fd7a9
Overall, I feel good about shipping this, since it's mainly moving code from one repo to another.

For long-term peace of mind, I'd prefer that we invest in finding or creating a robust collection of test cases for `dateadd` and `datediff`. That way, we can have more confidence that these macros will produce the same results across adapters for the tricky edge cases. See inline comment for slightly more detail.
```python
@pytest.mark.skip_profile('spark_session')
class TestDateDiff(BaseDateDiff):
    pass
```
Why are the tests for `datediff` skipped for `spark_session`?
Time logic is complicated, and we have a highly custom implementation, so it feels crucial to test it if at all possible.
On the other hand, I think this logic has been battle-tested for a couple years, which gives a nice vote of confidence.
I'd also like to re-review the `BaseDateDiff` implementation to see if it has coverage of each possible `datepart`.

Would be awesome if we could find a robust suite of test cases that cover well-known edge cases like timestamps with a non-00 UTC offset, daylight saving time boundaries, leap years, leap seconds, etc.
Spark changed its Julian vs. (Proleptic) Gregorian calendar handling between Spark 2.4 and 3.0, but I'm not sure we need to worry about that piece at all (talk, slides).
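As a flavor of what such a suite could assert, here's a hedged sketch of per-datepart spot-checks written against Spark's built-ins (not this PR's macros); the expected values in the comments are my own assumptions:

```sql
-- Hypothetical edge-case spot-checks using Spark SQL built-ins.
select
    -- leap year: 2020 has 366 days
    datediff(date '2021-01-01', date '2020-01-01')        as day_diff,   -- expect 366
    -- crossing the leap day itself
    datediff(date '2020-03-01', date '2020-02-29')        as leap_day,   -- expect 1
    -- month arithmetic across a leap-year February
    months_between(date '2020-03-29', date '2020-02-29')  as month_diff  -- expect 1.0
```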
> Would be awesome if we could find a robust suite of test cases that cover well-known edge cases like timestamps with a non-00 UTC offset, daylight saving time boundaries, leap years, leap seconds, etc.
Agreed. This feels important for our work around foundational data types + `current_timestamp` as well.
A bunch of tests have been failing for `spark_session`. I'm skipping them now for expediency, but these tests are running on the four other connection types for which we have testing. Those connection types are:
- `thrift` for local (self-hosted) Spark
- Databricks interactive cluster via HTTP
- Databricks interactive cluster via ODBC
- Databricks SQL endpoint via ODBC
I'll be the first to admit that the `spark_session` connection method is an advanced capability that I don't know lots about :) It's useful for advanced users / PySpark superusers when testing locally, but I would not recommend folks use it in production. It will never be supported in dbt Cloud. We've documented it as such.
It's true that the `datediff` (and `dateadd`) macros produce a LUDICROUS amount of compiled SQL. The alternative to tons of repeated code would be to run each snippet as an introspective query, store its result, and template it into the subsequent operation.
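A rough sketch of what that introspective alternative could look like, using dbt's `run_query`; the macro name and query shape here are hypothetical, not something this PR implements:

```jinja
{# Hypothetical: resolve one piece of date math eagerly at compile time,
   then splice the scalar result into the final SQL instead of inlining
   the whole expression tree. #}
{% macro eager_datediff(first_date, second_date) %}
    {% if execute %}
        {% set result = run_query(
            "select datediff(" ~ second_date ~ ", " ~ first_date ~ ")"
        ) %}
        {{ result.columns[0].values()[0] }}
    {% endif %}
{% endmacro %}
```

The trade-off is an extra round-trip to the warehouse per snippet at compile time, which is why inlining (however verbose) tends to win.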
That talk about calendar switching is ... cool!
I'm sure there are edge cases with these implementations. The work involved in reaching parity / consistency for just the integration test cases that we already have in place was immense. I think that's the feature, for now...?
🥳
### Description

Ports tests for lift + shift for cross-db macros from [dbt-labs/dbt-spark#359](dbt-labs/dbt-spark#359).

Follow-up to:
- dbt-labs/dbt-core#5265
- dbt-labs/dbt-core#5298

No more `spark-utils`??? Not quite, but close. I've opened a follow-on PR there to ensure backwards compatibility for those who celebrate: dbt-labs/spark-utils#25

### Checklist
- I have updated `CHANGELOG.md` and added information about my change to the "dbt-spark next" section.