Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Join should not include columns twice #1138

Closed
cottrell opened this issue Nov 17, 2022 · 9 comments
Closed

Join should not include columns twice #1138

cottrell opened this issue Nov 17, 2022 · 9 comments
Labels
bug Invalid compiler output or panic compiler language-design Changes to PRQL-the-language

Comments

@cottrell
Copy link

Consider

from first_table
join side:left second_table [some_index]

which currently outputs

SELECT
  first_table.*,
  second_table.*,
  some_index
FROM
  first_table
  LEFT JOIN second_table USING(some_index)

but probably should output

SELECT
  *
FROM
  first_table
  LEFT JOIN second_table USING(some_index)
@max-sixty max-sixty added bug Invalid compiler output or panic compiler semantic labels Nov 17, 2022
@max-sixty
Copy link
Member

I'm going to start tagging issues that we're relying on the semantic branch with the semantic label.

If anyone needs a fix urgently to unblock them, please post and I can try and rush something. Some will be possible, some less so...

(CC @aljazerzen, hope that's OK your end)

@aljazerzen
Copy link
Member

aljazerzen commented Nov 21, 2022

(I'm ok with tagging semantic)

Unfortunately, this is not possible in general, at least not without knowledge of table DDL. A counter example is this:

from first_table
select [some_index, foo]
join second_table [some_index]
join third_table [some_index]

which would, in your proposal, probably translate to:

SELECT
  *
FROM
  first_table
  JOIN second_table USING(some_index)
  JOIN third_table USING(some_index)

But this includes all columns from first_table too, even though we wanted only some_index and foo.

I know that current behavior is not ergonomic at all, because results contains basically three copies of some_index and because it (maybe) doesn't even compile in all DB engines, due to ambiguity of some_index in SELECT.

Fortunately, upcomming language changes on semantic branch include how USING is handled. I haven't written about it anywhere yet, so I'll use this issue to open the discussion:

@aljazerzen
Copy link
Member

The change

I added "self equality" operator ~, which does exactly what plain ident did before in join:

from first_table
join second_table [~some_index]

which translates to:

SELECT
  first_table.*
  second_table.*
FROM first_table
JOIN second_table ON first_table.some_index = second_table.some_index

It does not translate to USING, but to ON left = right, which means that the result also contains both columns from left and right.

The reason

We currently have custom behavior on join where plain idents are interpreted as USING and resolved in each of the tables separately. This is not possible anymore, because transforms are normal functions that cannot have custom resolving behavior.

I may be able to get around this limitation by extending rules on functions, but I'm not sure if this is possible.


It's a small language change, but I want your opinions on it anyway, @max-sixty @snth .

I think it's unfortunate consequence that reduces ergonomics of the language a little, but is tolerable compared to upsides that it brings.

@snth
Copy link
Member

snth commented Nov 21, 2022

I will need to ponder this a bit as I don't know enough about the resolving behaviour to understand what you mean by

because transforms are normal functions that cannot have custom resolving behavior.


In the example

from first_table
select [some_index, foo]
join second_table [some_index]
join third_table [some_index]

, can it not turn the first two rows into a CTE and then use the SELECT * FROM ... form?

What if in first_table the column was called first_index so that it has to be renamed with a derive, would it generate a CTE then as in?

from first_table
derive [some_index=first_index, foo]
join second_table [some_index]
join third_table [some_index]

The current playground version translates this to:

SELECT
  first_table.*,
  first_table.foo,
  second_table.*,
  third_table.*,
  some_index
FROM
  first_table
  JOIN second_table USING(some_index)
  JOIN third_table USING(some_index)

which seems wrong to me since in my hypothetical example first_table has no some_index column.


I'd also like to highlight that for a LEFT JOIN, first_table.some_index and second_table.some_index are not equivalent since the latter will contain NULLs when no match is found.


I don't know if this is much help but that's all that I could think of for now.

@aljazerzen
Copy link
Member

aljazerzen commented Nov 22, 2022

With "normal functions" I mean that we resolve their arguments the same way as we resolve arguments to any other function. Which allows you to define things like:

func left_join_by_name right_tbl filter left_tbl -> (
  join side:left right_tbl [left_tbl.name == right_tbl.name] left_tbl
)

from a
left_join_by_name b

More about that is written here.


can it not turn the first two rows into a CTE?

Like so?

WITH _table_0 AS (
  SELECT *
  FROM first_table
  JOIN second_table USING(some_index)
)
SELECT *
FROM _table_0
JOIN third_table USING(some_index)

This still has the problem in _table_0 CTE: we have included all colums of first_table, where we only wanted some_index and foo.

(and disregard current implementation, it has many problem like this)


are not equivalent since the latter will contain NULLs when no match is found

Good point. That's another reason to include both columns in the table after join.

(which does not happen if we compile to USING)

@snth
Copy link
Member

snth commented Nov 22, 2022

Regarding the CTE example, what I envisioned was

from first_table
select [some_index, foo]
join second_table [some_index]
join third_table [some_index]

turning into

WITH _table_0 AS (
  SELECT
    some_index,
    foo
  FROM first_table
)
SELECT *
FROM _table_0
JOIN second_table USING(some_index)
JOIN third_table USING(some_index)

and

from first_table
derive [some_index=first_index, foo]
join second_table [some_index]
join third_table [some_index]

turning into

WITH _table_0 AS (
  SELECT
    first_index AS some_index,
    foo
  FROM first_table
)
SELECT *
FROM _table_0
JOIN second_table USING(some_index)
JOIN third_table USING(some_index)

I tend to write the joins out manually and haven't used USING much so I'd have to refresh my memory on how most RDBMs treat SELECT * FROM t1 LEFT JOIN t2 USING (index). My guess would be that they only include t1.index and leave out t2.index.

@aljazerzen
Copy link
Member

Ah, yeah that does make sense and would work!

It's not the easiest to implement, though. In any case, it's not needed if we stop compiling to USING.


I too haven't used USING before @max-sixty mentioned he uses it a lot. It's convenient, but only if you have columns with the same names in both tables, so I'm guessing there is a lot of people that are not familiar with its behavior, who would want to have something similar in PRQL.

@aljazerzen
Copy link
Member

Update: original issue has been circumvented with semantic branch, which now compiles:

from first_table
join side:left second_table [~some_index]

... to ...

SELECT
  first_table.*,
  second_table.*
FROM
  first_table
  LEFT JOIN second_table ON first_table.some_index = second_table.some_index

We have discussed new behavior of join and self equality operator on the call and we will be adopting this new behavior in 0.3 release and possibly amend it in patch releases.

@aljazerzen
Copy link
Member

Self equality is now discussed in #1168

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Invalid compiler output or panic compiler language-design Changes to PRQL-the-language
Projects
None yet
Development

No branches or pull requests

4 participants