-
Notifications
You must be signed in to change notification settings - Fork 228
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implicit vs Explicit column arguments to join
#1168
Comments
For the actual syntax, if we do need it to be explicit: one option to consider is |
This is a great write-up on the issue. On scale from -2 to +2, my stance on implicit behavior is +1. I'm not concerned with the ambiguity, because using a bool column in join's filter does not make sense. I also agree that having an operator just for joins adds unneeded complexity to the syntax of the language. The problem with implicit behavior is the implementation -- I cannot figure out how to set the rules on when to apply the conversion from Imagine we have to write down the implementation of def join(left_table: Table, right_table: Table, join_filter: Column[bool]):
cross = cartesian_product(left_table, right_table)
return filter(cross, join_filter) (Let's imagine that How could you check that You essentially cannot, because from within the function, you only see the value of the param and not its expression. You could check the type of the column, but in our case, this information will not be available at-compile time in most cases. If we add arguments to join, we solved the issue partially: def join(left_table: Table, right_table: Table, join_filter: Column[bool], using: Column):
cross = cartesian_product(left_table, right_table)
join_filter.append(
left_table[using] == right_table[using]
)
return filter(cross, join_filter) We can differentiate between boolean filters and using columns. But the problem now is that argument For such approach to work, we would need to have lookup of the column ( def join(left_table: Table, right_table: Table, using_left: Column, using_right: Column):
cross = cartesian_product(left_table, right_table)
join_filter = [using_left == using_right]
return filter(cross, join_filter)
join(employees, salaries, employees.emp_id, salaries.emp_id) But this defeats the whole purpose of common operations being concise. I hope I made the problem clear, please do ask if somethings needs more explanation. I really want someone else than me to see the problem, because it may be that I'm looking at it from the wrong angle. (I'm using these examples to highlight how semantics of PRQL behave with the new merge - they are much closer to consistent semantics of general programming languages like Python or Rust and further from SQL which defines different semantics for different constructs like USING.) |
I like this very much! It may clash with assigns, and |
I don't understand the issue here well enough to make an informed comment. Instead I'm going to advise you to consider avoiding the
But... (This is the apocryphal part...) They realized that they had screwed up the command structure, but they had "dozens of users" so they were reluctant to change it. So it inflicted that pain on generations of users. It seems we're in the "dozens of users" stage. Break things if we must... Thanks. |
As @snth suggested, there is an option of having arguments that don't get resolved at all (either via s-strings or This is actually a viable option: # std declaration
- func join<table> `default_db.with`<table> filter `noresolve.side`:inner tbl<table> -> null
+ func join<table> `default_db.with`<table> filter `noresolve.using`:null `noresolve.side`:inner tbl<table> -> null from e=employees
join salaries using:id
join l=locations [e.office_address == l.address] The downside is that we give up name resolution and table inference, which may cause errors in some cases, where CTEs are created. |
This also violates the current rules on function parameters, because positional params are required: https://prql-lang.org/book/queries/functions.html |
I think this is a key point. For the implicit mode to work, we need to be able to inspect the contents of the expression, and given we don't materialize the expression, that's not possible. It's a corollary of the "semantic simplicity" point above — the model of encapsulated expressions is in the compiler as well as the language. So that does weigh me towards going Explicit. I like the |
This is a fast moving discussion! (Sorry, I've been putting my comments on Discord and forgot about this issue.) What I was suggesting is generalising the I was asking @aljazerzen if the So the syntax would still be as originally be proposed by @aljazerzen but it would have slightly more general semantics. So from employees
join salaries [~id] would still produce SELECT *
FROM employees
JOIN salaries USING (id) while from user_emails
join marketing_campaigns [hasnt_unsubscribed] would produce SELECT *
FROM user_emails
JOIN marketing_campaigns
ON hasnt_unsubscribed |
With the func json_get json_col key `noresolve.type` -> s"({json_col}->>{key})::{type}"
from sometable
derive [
id = (json_get "abc" "id" text),
] could then be rewritten as func json_get json_col key ~type -> s"({json_col}->>{key})::{type}"
from sometable
derive [
id = (json_get "abc" "id" text),
] |
This use of https://www.datacamp.com/tutorial/r-formula-tutorial
|
Another thing that came up on the dev call yesterday, which I think we should bring into this discussion, is how strict the name resolution should be. The two viewpoints as I understood them were:
It was mentioned that the burden of the strict name resolution option could be lessened through tooling, which at authoring time could suggest the ambiguous parent object names as possible autocompletions. |
I am in favour of option 1, strict name resolution, especially in light of the possibility of a The from employees
join salaries [~id] would be the strict mode equivalent of from employees
join salaries [employees.id==salaries.id] In the first case the user takes responsibility that |
I think this should be a separate issue.
Unfortunately, no. For the same reasons that we cannot change behavior of join upon detecting if column has boolean or some other type. If I understand correctly, this noresolve I'm familiar with ~ in R, but I never understood how does it work. My current understanding is that functions in R take arguments by expression and not by value (as they do in all other languages I know). This allows R to unpack the input expression and treat it as a formula. For me, this is one of magic things that R does that I really don't want in PRQL. That's because in simple cases, magic helps a little and in complex cases there is no way to guess what the magic will do. Invoking DoOcracy, I'll change "self equality operator" syntax to But let's keep this issue open for a few weeks to get more feedback. |
Agree re separate issue for strictness. And great re DoOcracy (I hadn't heard of that but very much subscribe!) And, not to agree with everything in that comment :) , but also re |
Ok, I thought that one difference might be that the column type would depend on the database and would only be knowable at runtime whereas a distinction between ident and "unresolved ident" if specified in the PRQL code could be known at compile time and could therefore be used to branch the behaviour. I was just operating on the reply to my comment on Discord that
but I see from your description of my comment above that we had different understandings of what I meant.
Probably, yes. I was proposing it as essentially syntactic sugar for the
However rereading that comment again, my understanding now is that that only referred to params and I was kinda taking it out of context and applying it more generally. So to clarify, Thanks for the name-resolving PRQL Book link. |
Yes, exactly. For example But again, this is implementation detail and I don't want it to become a part of the language, because we may once declare |
I think we can close this? If there are any docs to write, feel free to assign them to me @aljazerzen |
I've already added a section in the PR: https://prql-lang.org/book/transforms/join.html Take a look and expand if needed. |
Ah yes, I already did a couple of tweaks in 39f2fb2, I forgot |
I was planning to release 0.3 as discussed, with the change in
join
such that supplying columns to an equi-join requires an additional character to make explicit columns vs. conditions:This is because a bare
id
is treated as a bool condition, likee.office_address == l.address
is treated. (and in theory,id
could be a bool column)I'm still fine to do the release and assess feedback, as discussed. But I wanted to raise whether we should:
main
, rather than big user-facing changes in a version number.On the specific language change, I see it as a tradeoff between syntactic simplicity and semantic simplicity:
USING
/ an equi-join / a join between identically named columns] with~
"? (Maybe it feels bigger to us because it's a change and it's so prominent in our docs and examples?)[id]
mean something different from[id==true]
in ajoin
breaks the encapsulation of the expression. The compiler needs to understand what's inside the expression; uncorrelated concepts become correlated, the language becomes less general and less orthogonal.For example, an unlikely but possible example — is
bar
a condition or a column in the join? I guess it's a condition because we know it'sbaz==bax
. But ifbar
were materialized in a column in the DB, then the behavior suddendly changes.I've been supportive of Approach for expressions in
from
#919, which increases generality, this would go against that themeIf anyone has ideas for an alternative representation rather than
~
, then feel free to suggest! Though I actually think that~
is pretty good.One alternative would be to have a different parameter; e.g.
using:[id]
, but then given the conditions parameter would still be required, we'd have an awkwardjoin locations using:[id] []
.If we do go the explicit route, is there something we can do to make this clearer for users? I would find this quite confusing if I weren't watching the releases and all of a sudden this compiles to something completely different:
PRQL has a higher ratio of expectations&excitement vs. users than most projects, so it's fine to make breaking changes atm. But this is potentially quite severe. Assuming we go the explicit route, should we raise an error for a bare column name for a few versions so it's at least obvious when people do this?
Without wanting to zoom out too far, possibly it's worth considering this in the context of overall joins; e.g. #716 & #723
Where do folks end up? As I said prior, @aljazerzen has full rights to respond with 😫, and I'll do the release.
semantic
was really Herculean, and we're still young enough that we probably underrate velocity.For transparency, if we do decide to make the change, I'm flat-out with non-PRQL stuff until mid-week, after which I have more time and would be happy to work on this. I'm quite excited to get into working with the new compiler!
Footnotes
"small" here means both in character-count and syntactic complexity, in this case
~
is small in character-count but adds to syntactic complexity. For theory around compression, check out source-coding, and I can find better references if folks are interested ↩The text was updated successfully, but these errors were encountered: