Ambigious wildcard references #1193

aljazerzen · 2022-11-30T08:23:41Z

I'm opening this issue as a followup on a recent call and @snth's comment here.

First I suggest to read current name resolution algorithm.

This issue is about this paragraph:

In the case that there are multiple namespaces with a wildcard, we don’t match with neither of the namespaces, but match as . name. This is a special feature that allows a column name which may reside in two different tables not to be associated with any of them. Instead, it is translated into the column name only, so database can determine which table it belongs to. Note that this may lead to PRQL passing on ambiguous queries to the database, instead of resulting error early.

In practice this means that such queries:

from employees
join salaries []
select first_name

... are translated to:

SELECT 
  first_name  -- no table name, even though there are multiple tables in the query
FROM employees JOIN salaries ON ...;

Note that this happens only when both tables names still contain wildcards.

In this case, first_name is know to be in employees and possibility that it may be in salaries is not considered:

from employees
select first_name
join salaries []
select first_name

In this case, first_name is known to be in both, so ambiguity error is raised:

table sal = (from salaries | select first_name)

from employees
select first_name
join sal []
select first_name # error here

As @snth put it, there are two options:

Name references resolving to multiple wildcards should throw an error straight away at compile time, forcing the user to specify which parent table the column references.

Ambiguity is ok and we simply produce the SQL and we let the underlying database/RDBMs resolve the ambiguity at runtime.

It was mentioned that the burden of the strict name resolution option could be lessened through tooling, which at authoring time could suggest the ambiguous parent object names as possible autocompletions.

With semantic branch, this option 2 became a little worse, because ambiguous idents are converted to s-strings earlier which means that anchoring does not know about them and may cause problems.

The text was updated successfully, but these errors were encountered:

aljazerzen · 2022-11-30T08:31:58Z

Originally, I was for option 2, because it's convenient similar to how SQL operates.

But I vote for option 1.

This basically compiler saying "you have to provide more information" so it can create SQL queries that are not ambiguous. Recently I find that many of problem I need to fix in my professional life are caused because we allow certain expressions, commands or protocols to be ambiguous. In my experience it would actually save time if we would report that at compile-time.

On scale from -2 to +2, I feel +1 for changing this behavior from current option 2 to option 1.

snth · 2022-12-01T12:58:34Z

Looking through the code changes in #1205, it appears to be mostly examples and tests.

I like how the new examples look because it makes it much easier for the reader to see what source table a column comes from.

The litmus test will be to see whether it will be annoying when writing queries. However, similarly to the experience with Rust, I suspect that after an initial adjustment period one will actually come to appreciate the concreteness this brings.

We have quite a few changes with this 0.3.x line of releases and it looks like we'll plan to bring quite a few new user facing syntax features in so I think this is a good time to try this.

It's a +1 from me (to go with option 1).

mklopets · 2022-12-01T15:01:29Z

+1 for option 1 from me as well!

max-sixty · 2022-12-02T09:15:21Z

(FYI I have seen this and am reflecting! I was quite ardent on allowing the ambiguity, given how nice PRQL is for exploratory use as well as robust use. I do hear the points above and want to think through some cases; I will be "on" on Saturday, so will respond then at the latest)

max-sixty · 2022-12-06T02:33:30Z

Broadly: I'm open to trying this and iterating on feedback. I'm still a -1, assuming we're taking averages rather than minimums

Context

PRQL has many benefits over SQL. Two of them are:

Better, faster errors
Friendly & efficient data exploration, because of the pipeline and the general language

On this issue, those two priorities are in opposition.

It is genuinely quite useful to be able to do

from employees
join salaries [==id]
group [city] (
  aggregate avg_salary = average salary
  )
]

...and not have to add in the table names, which are kinda obvious (to the user, not the compiler...) given the tables:

from employees
join salary [==id]
-group [city] (
-  aggregate avg_salary = average salary
+group [employees.city] (
+  aggregate avg_salary = average salaries.salary
  )
]

While it's ironic to say "trust SQL", I do place some weight in there being no SQL engine that forces these —

Alternatives

I think that this could be a lint, similar to clippy, which people can then opt-into. That enables the query to run, but also enables folks to spot potential future issues (e.g. if a city column gets added to the salaries table above later, it won't break)

When we build the DB backend, we could also use that to provide hints (I think this was @mklopets 's point), though I'm not sure whether that means it's worth forcing it now ("the help is coming") or leaving it until later until the help arrives.

I think rust does this quite well! Or the python typing ecosystem is getting good at this.

Compiler simplicity

I would place a decent amount of weight on keeping the compiler simple. I have not been involved enough here — a regret for me that I promise to fix — so I defer to @aljazerzen . Keeping the compiler simple will mean less work to build, less work to change, and easier for new contributors to join (the last of which is something we're not doing so well on...)

So to the extent this makes the compiler simpler, great. To the extent it's just "convert to an s-string", then it's less of an issue. (And to the extent this means we need to deal with more cases because we can't fall back to ambiguity, then that would detract).

Meta

On a meta level, I worry a little bit that as the language builders, we're likely overweighing Robustness and underweighing Exploration.

At the extreme, if folks are using PRQL as a just a target for other tools, then there's no real Exploration priority.

Thanks and forgive the delay. I'm fully back online now.

aljazerzen · 2022-12-06T08:01:30Z

Context

As I wrote, it seems to me that stricter name checking may even improve exploration, because it eliminates debugging banal bugs that stem from compiler not understating exactly what you mean.

... no SQL engine that forces these --

This is true, but PRQL is able to operate with much less knowledge of schema compared to all RDBMS.

Compiler simplicity

The change is minimal - about 10 lines of code total. So this is not the concern.

Given all opinions expressed, I conclude to change the behavior to option 1.

For anyone in the future: feel free to add your opinion - at least at time of writing, reverting or adding an option to revert this behavior is not hard to do compiler-wise.

aljazerzen added the language-design Changes to PRQL-the-language label Nov 30, 2022

This was referenced Dec 1, 2022

Missing select with join #1198

Closed

feat: prohibit ambigious wildcards #1205

Merged

aljazerzen mentioned this issue Dec 2, 2022

Join three tables on same column name #1206

Open

aljazerzen closed this as completed in #1205 Dec 6, 2022

snth added this to the 0.4 milestone Dec 21, 2022

max-sixty mentioned this issue Jan 14, 2023

Strictness over supplying namespaces #1503

Closed

This was referenced Jan 21, 2023

Dealing with nested derived columns with s-strings #1594

Closed

Selecting join columns post join appears to be ambiguous #1595

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ambigious wildcard references #1193

Ambigious wildcard references #1193

aljazerzen commented Nov 30, 2022

aljazerzen commented Nov 30, 2022

snth commented Dec 1, 2022

mklopets commented Dec 1, 2022

max-sixty commented Dec 2, 2022

max-sixty commented Dec 6, 2022

aljazerzen commented Dec 6, 2022 •

edited

Loading

Ambigious wildcard references #1193

Ambigious wildcard references #1193

Comments

aljazerzen commented Nov 30, 2022

aljazerzen commented Nov 30, 2022

snth commented Dec 1, 2022

mklopets commented Dec 1, 2022

max-sixty commented Dec 2, 2022

max-sixty commented Dec 6, 2022

aljazerzen commented Dec 6, 2022 • edited Loading

aljazerzen commented Dec 6, 2022 •

edited

Loading