Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: dm_from_con() gains .names argument for pattern-based construction of table names in the dm object #1790

Merged
merged 18 commits into from
Jul 11, 2023

Conversation

owenjonesuob
Copy link
Contributor

N.B. This adds to #1789, and should probably be merged after that PR has been merged.


We could enable name repairing for a dm object, in a similar way to how column names can be repaired within tibble::tibble().

With SQL, we should be able to assume that:

  • Schema names are unique
  • Within a schema, table names are unique

So using a default repair strategy of "check_unique" shouldn't actually ever result in failure.

In more drastic cases, e.g. if the user passes a custom function to .name_repair, we might end up with name clashes - in which case, we fail gracefully with an informative error message.


See #1534 for further discussion. If we assume schema/table names do not contain ., then the "snakecase" approach for tidy_names as described there could now be achieved with:

dm_from_con(
  ...,
  .name_repair = ~ snakecase::to_snake_case(.x, sep_in = "[^[:alnum:].]")
)

@owenjonesuob
Copy link
Contributor Author

Following discussion in #1789, this feature now follows a different approach. Quoting from a comment in that PR:

Add a .names = NULL argument to dm_from_con(), where NULL is a "smart default" like the one used in dplyr::across(). So the user can specify their own pattern, taking advantage of special variables .schema and .table if they would like to - but by default:

  • When length(schema) == 1, use "{.table}", akin to the "single" default of "{.col}" in across()
  • Otherwise use "{.schema}.{.table}", akin to the "plural" default of "{.col}_{.fn}" in across(), but adhering more closely to how entities are usually specified in SQL

Regarding possible name clashes: for now, I've kept the warning-based approach implemented in #1789, but we could upgrade this to an error if we think that's more appropriate.

@owenjonesuob owenjonesuob changed the title Add name repair functionality Add pattern-based construction of local names Mar 22, 2023
@krlmlr krlmlr changed the title Add pattern-based construction of local names feat: Add pattern-based construction of local names Jun 13, 2023
@krlmlr
Copy link
Collaborator

krlmlr commented Jun 21, 2023

Thank you for your patience. Merging the other PR soon, and perhaps #1184 on top of that. Are you open to extending this PR to allow for user-defined resolution of clashes? Note that the internals have changed quite a bit with #1894.

@owenjonesuob
Copy link
Contributor Author

owenjonesuob commented Jun 22, 2023

Could I check what you mean by "user-defined resolution of clashes"? Would we want some kind of interactive process for that?

Fwiw, with the "smart default" as described above (implemented in this branch at present), we are guaranteed to avoid clashes, due to SQL naming constraints: table names must be unique within a schema, and schema names must be unique within a database. So it is only possible for a user to encounter clashes, if they have intentionally passed a non-default value to the .names parameter in dm_from_con() - in which case, can we assume that the current warning is sufficient to help them diagnose and fix their custom naming pattern?

Copy link
Collaborator

@krlmlr krlmlr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Another way to disambiguate is to offer name repair via vctrs::vec_as_names() . If the default is to fail in case of clashes, we could offer an error message (instead of a warning) with a clear remedy.

R/db-helpers.R Outdated
Comment on lines 161 to 165
names_pattern <- if (length(schema) == 1) {
names %||% "{.table}"
} else {
names %||% "{.schema}.{.table}"
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this be surprising for users of this function?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You know this package's audience far better than I do! So I'll ultimately take your advice here. But I'll share why I believe this default isn't too surprising for users with either a very little bit of SQL/database experience, or some slightly broader tidyverse experience:

With (a very little) SQL experience:

  • In SQL (and assuming we are working within a single database), entities are usually specified as schema.table, e.g. select * from schema.table
  • Although it is strongly discouraged when writing SQL directly, it's possible (for at least some versions of SQL) to omit the schema name - in which case, the table is searched for in the user's "default-schema" (or, in each schema in turn along their "search path" in some versions e.g. Postgres)
  • The default-schema has a default value (e.g. dbo for SQL Server, public for Postgres), but it is possible for a user to override it via an explicit command (e.g. StackOverflow instructions here for SQL Server)
  • So with this in mind, I think users fall into one of three categories:
    • "Huh, what's a schema?": no problem, they use the default default-schema for whichever database they connect to (i.e. they do not specify schema = in dm_from_con()), and can reach all the tables in that schema via a single-part name like table
    • "I only need things in one particular schema": they either intentionally use the default schema, or specify a single non-default schema in dm_from_con() - but either way, they are conscious that they are using a single schema of their choice, i.e. they have knowingly chosen their personal default-schema and can therefore use single-part names like table
    • "I need multiple schemas!": they specify multiple schemas in dm_from_con() - they are conscious of this decision, and are effectively setting their "search path" for tables. The default naming pattern is guaranteed to give unique two-part names like schema.table, but the user can, against best practice, choose to use a different pattern, e.g. {.table} to generate one-part names. Only the first table on the search path with each name will be kept, with a warning raised about tables not kept, thanks to feat: dm_from_con() can retrieve multiple schemas, pass a character vector to the schema argument #1789.

With some broader tidyverse experience:

There are already well-established examples of using glue patterns and "special variables" (for us, .schema and .table) for construction of object names elsewhere in the tidyverse - for example:

  • .names in dplyr::across() - which has a near-identical "smart default" to ours here, i.e. switches from one-part to two-part names in order to avoid ambiguity when multiple "sources" are involved (schemas in our case, functions in across()'s case)
  • names_glue in tidyr::pivot_wider()

@owenjonesuob
Copy link
Contributor Author

Thanks. Another way to disambiguate is to offer name repair via vctrs::vec_as_names() . If the default is to fail in case of clashes, we could offer an error message (instead of a warning) with a clear remedy.

The first attempt to implement this feature did actually use vctrs::vec_as_names() - see 9d92ebe and preceding commits on this branch if you like! But I think pattern-based name construction is a more intuitive, and more powerful alternative to name repair - unless you think both could coexist somehow?

Re warning/error for name clashes: we could upgrade to an error, but I think this comment from #1789 still applies:

... I think the only advice we can provide is for the user to rename the one of the tables in the database itself, which is probably not feasible. So an error might completely block this functionality for users, whereas an unambiguous warning would allow them to carefully sidestep the problem by making sure schemas is ordered correctly for their use case.

Perhaps there is one additional piece of advice we could provide now: if there is a name clash, it MUST be because the user has overridden the "smart default" for .names in dm_from_con(), as reasoned in a previous comment. So rather than assuming that the user will figure that out, perhaps we could provide a tidyverse-esque hint that they might want to check their .names pattern and adjust it to avoid clashes.

@krlmlr krlmlr changed the title feat: Add pattern-based construction of local names feat: dm_from_con() gains .names argument for pattern-based construction of table names in the dm object Jul 11, 2023
@krlmlr krlmlr merged commit 7f2dc65 into cynkra:main Jul 11, 2023
16 checks passed
@krlmlr
Copy link
Collaborator

krlmlr commented Jul 11, 2023

Thanks! I'm not in love with the automagic, but your arguments all make sense. Now there's a message when the slightly surprising option is chosen, and the argument and behavior is marked experimental.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Development

Successfully merging this pull request may close these issues.

2 participants