Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restrict table alias length to avoid truncation #1097

Merged
merged 27 commits into from
Feb 24, 2023

Conversation

fh-mthomson
Copy link
Contributor

@fh-mthomson fh-mthomson commented Jan 12, 2023

Fixes #1096

It was my first time working with snapshot testing (very cool!) but my approach was to review / accept the proposed changes per devtools::test() messaging only but noting that it updated some unrelated sections (e.g., can explain) in case that's undesired.

@mgirlich
Copy link
Collaborator

Thanks for your PR!
Do you actually have to use such long table names that you stumbled over this issue? 😱

So far, this only solves parts of the issue. In case of more than two joins there could still be issues with the length of the table names, e.g.

library(dplyr, warn.conflicts = FALSE)
library(dbplyr, warn.conflicts = FALSE)

lf <- lazy_frame(x = 1, .name = "who_uses_such_long_table_names_as_they_are_hard_to_write")
lf2 <- lazy_frame(x = 1, .name = "some_other_table")

left_join(lf, lf, by = "x") %>% 
  left_join(lf2, by = "x")
#> <SQL>
#> SELECT `who_uses_such_long_table_names_as_they_are_hard_to_write...1`.`x` AS `x`
#> FROM `who_uses_such_long_table_names_as_they_are_hard_to_write` AS `who_uses_such_long_table_names_as_they_are_hard_to_write...1`
#> LEFT JOIN `who_uses_such_long_table_names_as_they_are_hard_to_write` AS `who_uses_such_long_table_names_as_they_are_hard_to_write...2`
#>   ON (`who_uses_such_long_table_names_as_they_are_hard_to_write...1`.`x` = `who_uses_such_long_table_names_as_they_are_hard_to_write...2`.`x`)
#> LEFT JOIN `some_other_table`
#>   ON (`who_uses_such_long_table_names_as_they_are_hard_to_write...1`.`x` = `some_other_table`.`x`)

Created on 2023-01-13 with reprex v2.0.2

If you want you can try to fix this (probably in generate_join_table_names()).

R/verb-joins.R Outdated Show resolved Hide resolved
@fh-mthomson fh-mthomson changed the title Simplify alias to LHS and RHS for self joins Restrict table alias length to avoid truncation Jan 13, 2023
R/lazy-join-query.R Outdated Show resolved Hide resolved
R/verb-joins.R Outdated Show resolved Hide resolved
@fh-mthomson
Copy link
Contributor Author

@mgirlich thank you for the review! I've pushed what I think is a more robust approach based on your great feedback. let me know what you think!

Do you actually have to use such long table names that you stumbled over this issue? 😱

Unfortunately, I'm dealing with tables already materialized by another team as {long_project_id_[1, 2, 3, ...}_{very_descriptive_table_name}. Not ideal, but it works! 😅

@mgirlich mgirlich added this to the 2.3.1 milestone Feb 2, 2023
@mgirlich mgirlich requested a review from hadley February 10, 2023 10:17
@mgirlich
Copy link
Collaborator

We could consider creating shorter names by abbreviate but I think this should be done in a different PR. I also noticed that abbreviate() loves underscores a bit too much

abbreviate("abbreviate_does_like_underscores_too_much")
#> abbreviate_does_like_underscores_too_much 
#>                                    "a___"

Created on 2023-02-10 with reprex v2.0.2

@mgirlich mgirlich mentioned this pull request Feb 10, 2023
19 tasks
R/lazy-join-query.R Show resolved Hide resolved
R/lazy-join-query.R Outdated Show resolved Hide resolved
table_names_prepared,
# arbitrarily floor at 63 (Postgres limit) to avoid unnecessarily truncating reasonable lengths
minlength = max(table_name_length_max, 63),
# (explicit, default) err on the side of unique table names instead of exact length
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand what this comment is saying. Can you clarify?

Copy link
Contributor Author

@fh-mthomson fh-mthomson Feb 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is seemingly a fundamental limitation to abbreviate() which does not allow for both (a) max length to be binding and (b) names to be guaranteed unique.

From the documentation:

strict | logical: should minlength be observed strictly? Note that setting strict = TRUE may return non-unique strings.

I've opted for strict = FALSE to preserve uniqueness (since dbplyr will assert as much downstream, anyway), but there remains a non-zero risk that aliases still exceed the database limit. As one mitigation, I've updated the method to both.sides which tends to add a bit more scrambling (slightly less obvious alias names) but the resultant length is more likely to be aligned with the requested length.

I've spent a decent amount of time playing with combinations of args (including discussing with ChatGPT) but couldn't come up with a fully-error proof solution using abbreviate or an alternate function.

self_join3 %>% remote_query()
Output
<SQL> SELECT
`a012345678901234567890123456789012345678901234567890123456789012...1`.`x` AS `x`,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That table name is 68 characters long?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After using method = "both.sides" in abbreviate(), this case is no longer an issue. This is a good example where method = "left.keep" preserves too much of the LHS at the expense of having unnecessary length.

self_join2 %>% collect(),
tibble(x = 1:3, y = "a")
)
expect_snapshot(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that you want to make specific assertions about the length of the generate table names, is it possible to avoid the snapshot tests in favour of something more specific?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good thinking! Added some to look at the actual values from generate_join_table_names(); this catches the error observed above.

tests/testthat/test-verb-joins.R Show resolved Hide resolved
tests/testthat/test-verb-joins.R Outdated Show resolved Hide resolved
Copy link
Member

@hadley hadley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this!

@mgirlich mgirlich merged commit e7e3086 into tidyverse:main Feb 24, 2023
@fh-mthomson fh-mthomson deleted the mthomson/issue_1096_alias branch February 24, 2023 15:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Restrict table alias length
3 participants