-
Notifications
You must be signed in to change notification settings - Fork 178
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Restrict table alias length to avoid truncation #1097
Restrict table alias length to avoid truncation #1097
Conversation
Thanks for your PR! So far, this only solves parts of the issue. In case of more than two joins there could still be issues with the length of the table names, e.g. library(dplyr, warn.conflicts = FALSE)
library(dbplyr, warn.conflicts = FALSE)
lf <- lazy_frame(x = 1, .name = "who_uses_such_long_table_names_as_they_are_hard_to_write")
lf2 <- lazy_frame(x = 1, .name = "some_other_table")
left_join(lf, lf, by = "x") %>%
left_join(lf2, by = "x")
#> <SQL>
#> SELECT `who_uses_such_long_table_names_as_they_are_hard_to_write...1`.`x` AS `x`
#> FROM `who_uses_such_long_table_names_as_they_are_hard_to_write` AS `who_uses_such_long_table_names_as_they_are_hard_to_write...1`
#> LEFT JOIN `who_uses_such_long_table_names_as_they_are_hard_to_write` AS `who_uses_such_long_table_names_as_they_are_hard_to_write...2`
#> ON (`who_uses_such_long_table_names_as_they_are_hard_to_write...1`.`x` = `who_uses_such_long_table_names_as_they_are_hard_to_write...2`.`x`)
#> LEFT JOIN `some_other_table`
#> ON (`who_uses_such_long_table_names_as_they_are_hard_to_write...1`.`x` = `some_other_table`.`x`) Created on 2023-01-13 with reprex v2.0.2 If you want you can try to fix this (probably in |
@mgirlich thank you for the review! I've pushed what I think is a more robust approach based on your great feedback. let me know what you think!
Unfortunately, I'm dealing with tables already materialized by another team as |
We could consider creating shorter names by abbreviate but I think this should be done in a different PR. I also noticed that abbreviate("abbreviate_does_like_underscores_too_much")
#> abbreviate_does_like_underscores_too_much
#> "a___" Created on 2023-02-10 with reprex v2.0.2 |
R/lazy-join-query.R
Outdated
table_names_prepared, | ||
# arbitrarily floor at 63 (Postgres limit) to avoid unnecessarily truncating reasonable lengths | ||
minlength = max(table_name_length_max, 63), | ||
# (explicit, default) err on the side of unique table names instead of exact length |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand what this comment is saying. Can you clarify?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is seemingly a fundamental limitation to abbreviate()
which does not allow for both (a) max length to be binding and (b) names to be guaranteed unique.
From the documentation:
strict | logical: should minlength be observed strictly? Note that setting strict = TRUE may return non-unique strings.
I've opted for strict = FALSE
to preserve uniqueness (since dbplyr
will assert as much downstream, anyway), but there remains a non-zero risk that aliases still exceed the database limit. As one mitigation, I've updated the method
to both.sides
which tends to add a bit more scrambling (slightly less obvious alias names) but the resultant length is more likely to be aligned with the requested length.
I've spent a decent amount of time playing with combinations of args (including discussing with ChatGPT) but couldn't come up with a fully-error proof solution using abbreviate
or an alternate function.
tests/testthat/_snaps/verb-joins.md
Outdated
self_join3 %>% remote_query() | ||
Output | ||
<SQL> SELECT | ||
`a012345678901234567890123456789012345678901234567890123456789012...1`.`x` AS `x`, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That table name is 68 characters long?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After using method = "both.sides"
in abbreviate()
, this case is no longer an issue. This is a good example where method = "left.keep"
preserves too much of the LHS at the expense of having unnecessary length.
self_join2 %>% collect(), | ||
tibble(x = 1:3, y = "a") | ||
) | ||
expect_snapshot( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that you want to make specific assertions about the length of the generate table names, is it possible to avoid the snapshot tests in favour of something more specific?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good thinking! Added some to look at the actual values from generate_join_table_names()
; this catches the error observed above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this!
Fixes #1096
It was my first time working with snapshot testing (very cool!) but my approach was to review / accept the proposed changes per
devtools::test()
messaging only but noting that it updated some unrelated sections (e.g.,can explain
) in case that's undesired.