Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restrict table alias length to avoid truncation #1097

Merged
merged 27 commits into from
Feb 24, 2023
Merged
Show file tree
Hide file tree
Changes from 17 commits
Commits
Show all changes
27 commits
Select commit Hold shift + click to select a range
d672e9c
update snaps and tests to match LHS and RHS aliases
fh-mthomson Jan 12, 2023
c44fa25
Merge branch 'tidyverse:main' into mthomson/issue_1096_alias
fh-mthomson Jan 13, 2023
0a6c130
revert snaps
fh-mthomson Jan 13, 2023
1dae6fb
add back context
fh-mthomson Jan 13, 2023
d949e2d
generalize truncation capping to be independent of number of joins
fh-mthomson Jan 13, 2023
3b1ff56
add floor for reasonable table lengths
fh-mthomson Jan 13, 2023
320fdb5
revert test-verb-joins also
fh-mthomson Jan 13, 2023
64c39af
account for NA table names
fh-mthomson Jan 13, 2023
c080d79
reset tests from main
fh-mthomson Jan 13, 2023
abad91d
remove stringr
fh-mthomson Jan 13, 2023
059b4c9
add test
fh-mthomson Jan 13, 2023
69375ff
replace postgres with msSQL and nits
fh-mthomson Jan 13, 2023
de2c0c0
add snaps
fh-mthomson Jan 13, 2023
ebb5c71
cleaner join
fh-mthomson Jan 13, 2023
712ae8d
Merged upstream/main into fh-mthomson-mthomson/issue_1096_alias
mgirlich Feb 10, 2023
c3bcbcb
Remove unnecessary pipe
mgirlich Feb 10, 2023
fa72746
Use names that are easier to read
mgirlich Feb 10, 2023
2cc909e
remove coalesce
fh-mthomson Feb 15, 2023
d98db49
clarify comment for Postgres flooring
fh-mthomson Feb 15, 2023
6cf5580
use postgres backend instead of SQLite
fh-mthomson Feb 15, 2023
5d64882
put back some things
fh-mthomson Feb 16, 2023
51f81c3
update tests
fh-mthomson Feb 16, 2023
65589ea
remove space
fh-mthomson Feb 16, 2023
b0d0ecf
Merge branch 'main' into mthomson/issue_1096_alias
fh-mthomson Feb 16, 2023
431d05b
nit formatting from merge conflict
fh-mthomson Feb 16, 2023
3cd5d2f
add NEWS
fh-mthomson Feb 23, 2023
9c20fde
change tense to past
fh-mthomson Feb 23, 2023
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
17 changes: 14 additions & 3 deletions R/lazy-join-query.R
Original file line number Diff line number Diff line change
Expand Up @@ -199,15 +199,26 @@ sql_build.lazy_multi_join_query <- function(op, con, ...) {
}

generate_join_table_names <- function(table_names) {
table_name_length_max <- dplyr::coalesce(max(nchar(table_names$name)), 0)
mgirlich marked this conversation as resolved.
Show resolved Hide resolved

if (length(table_names$name) != 2) {
table_names_repaired <- vctrs::vec_as_names(table_names$name, repair = "unique", quiet = TRUE)
auto_name <- table_names$from != "as"
table_names$name[auto_name] <- table_names_repaired[auto_name]

return(table_names$name)
table_names_prepared <- table_names$name
} else{
table_names_prepared <- join_two_table_alias(table_names$name, table_names$from)
}

join_two_table_alias(table_names$name, table_names$from)
# avoid database aliases exceeding the database-specific maximum length
abbreviate(
table_names_prepared,
# arbitrarily floor at 63 (Postgres limit) to avoid unnecessarily truncating reasonable lengths
fh-mthomson marked this conversation as resolved.
Show resolved Hide resolved
minlength = max(table_name_length_max, 63),
# (explicit, default) err on the side of unique table names instead of exact length
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand what this comment is saying. Can you clarify?

Copy link
Contributor Author

@fh-mthomson fh-mthomson Feb 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is seemingly a fundamental limitation to abbreviate() which does not allow for both (a) max length to be binding and (b) names to be guaranteed unique.

From the documentation:

strict | logical: should minlength be observed strictly? Note that setting strict = TRUE may return non-unique strings.

I've opted for strict = FALSE to preserve uniqueness (since dbplyr will assert as much downstream, anyway), but there remains a non-zero risk that aliases still exceed the database limit. As one mitigation, I've updated the method to both.sides which tends to add a bit more scrambling (slightly less obvious alias names) but the resultant length is more likely to be aligned with the requested length.

I've spent a decent amount of time playing with combinations of args (including discussing with ChatGPT) but couldn't come up with a fully-error proof solution using abbreviate or an alternate function.

strict = FALSE,
named = FALSE
)
}

#' @export
Expand Down
31 changes: 31 additions & 0 deletions tests/testthat/_snaps/verb-joins.md
Original file line number Diff line number Diff line change
Expand Up @@ -40,6 +40,37 @@
LEFT JOIN `foo2`.`df` AS `df_RHS`
ON (`df_LHS`.`x` = `df_RHS`.`x`)

# alias truncates long table names at database limit

Code
self_join2 %>% remote_query()
Output
<SQL> SELECT `a012345678901234567890123456789012345678901234567890123456789012_L`.*
FROM `a012345678901234567890123456789012345678901234567890123456789012` AS `a012345678901234567890123456789012345678901234567890123456789012_L`
LEFT JOIN `a012345678901234567890123456789012345678901234567890123456789012` AS `a012345678901234567890123456789012345678901234567890123456789012_R`
ON (
`a012345678901234567890123456789012345678901234567890123456789012_L`.`x` = `a012345678901234567890123456789012345678901234567890123456789012_R`.`x` AND
`a012345678901234567890123456789012345678901234567890123456789012_L`.`y` = `a012345678901234567890123456789012345678901234567890123456789012_R`.`y`
)

---

Code
self_join3 %>% remote_query()
Output
<SQL> SELECT
`a012345678901234567890123456789012345678901234567890123456789012...1`.`x` AS `x`,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That table name is 68 characters long?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After using method = "both.sides" in abbreviate(), this case is no longer an issue. This is a good example where method = "left.keep" preserves too much of the LHS at the expense of having unnecessary length.

`a012345678901234567890123456789012345678901234567890123456789012...1`.`y` AS `y.x`,
`b012345678901234567890123456789012345678901234567890123456789012`.`y` AS `y.y`
FROM `a012345678901234567890123456789012345678901234567890123456789012` AS `a012345678901234567890123456789012345678901234567890123456789012...1`
LEFT JOIN `a012345678901234567890123456789012345678901234567890123456789012` AS `a012345678901234567890123456789012345678901234567890123456789012...2`
ON (
`a012345678901234567890123456789012345678901234567890123456789012...1`.`x` = `a012345678901234567890123456789012345678901234567890123456789012...2`.`x` AND
`a012345678901234567890123456789012345678901234567890123456789012...1`.`y` = `a012345678901234567890123456789012345678901234567890123456789012...2`.`y`
)
INNER JOIN `b012345678901234567890123456789012345678901234567890123456789012`
ON (`a012345678901234567890123456789012345678901234567890123456789012...1`.`x` = `b012345678901234567890123456789012345678901234567890123456789012`.`x`)

# join check `x_as` and `y_as`

Code
Expand Down
36 changes: 36 additions & 0 deletions tests/testthat/test-verb-joins.R
Original file line number Diff line number Diff line change
Expand Up @@ -72,6 +72,42 @@ test_that("join works with in_schema", {
)
})

test_that("alias truncates long table names at database limit", {
# Postgres has max table name length of 63; ensure it's not exceeded when generating table alias
withr::local_db_connection(con <- dbConnect(RSQLite::SQLite(), ":memory:"))
fh-mthomson marked this conversation as resolved.
Show resolved Hide resolved

nm1 <- paste0("a", paste0(0:62 %% 10, collapse = ""))
DBI::dbWriteTable(con, nm1, tibble(x = 1:3, y = "a"))
mf1 <- tbl(con, nm1)

nm2 <- paste0("b", paste0(0:62 %% 10, collapse = ""))
DBI::dbWriteTable(con, nm2, tibble(x = 2:3, y = "b"))
mf2 <- tbl(con, nm2)

# 2 tables
self_join2 <- left_join(mf1, mf1, by = c("x", "y"))

expect_equal(
fh-mthomson marked this conversation as resolved.
Show resolved Hide resolved
self_join2 %>% collect(),
tibble(x = 1:3, y = "a")
)
expect_snapshot(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that you want to make specific assertions about the length of the generate table names, is it possible to avoid the snapshot tests in favour of something more specific?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good thinking! Added some to look at the actual values from generate_join_table_names(); this catches the error observed above.

self_join2 %>% remote_query()
)

# 3 tables
self_join3 <- left_join(mf1, mf1, by = c("x", "y")) %>%
inner_join(mf2, by = "x")

expect_equal(
self_join3 %>% collect(),
tibble(x = 2:3, y.x = "a", y.y = "b")
)
expect_snapshot(
self_join3 %>% remote_query()
)
})

test_that("joins with non by variables gives cross join", {
df1 <- memdb_frame(x = 1:5)
df2 <- memdb_frame(y = 1:5)
Expand Down