Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ambigious wildcard references #1193

Closed
aljazerzen opened this issue Nov 30, 2022 · 6 comments · Fixed by #1205
Closed

Ambigious wildcard references #1193

aljazerzen opened this issue Nov 30, 2022 · 6 comments · Fixed by #1205
Labels
language-design Changes to PRQL-the-language
Milestone

Comments

@aljazerzen
Copy link
Member

I'm opening this issue as a followup on a recent call and @snth's comment here.

First I suggest to read current name resolution algorithm.

This issue is about this paragraph:

In the case that there are multiple namespaces with a wildcard, we don’t match with neither of the namespaces, but match as . name. This is a special feature that allows a column name which may reside in two different tables not to be associated with any of them. Instead, it is translated into the column name only, so database can determine which table it belongs to. Note that this may lead to PRQL passing on ambiguous queries to the database, instead of resulting error early.

In practice this means that such queries:

from employees
join salaries []
select first_name

... are translated to:

SELECT 
  first_name  -- no table name, even though there are multiple tables in the query
FROM employees JOIN salaries ON ...;

Note that this happens only when both tables names still contain wildcards.

In this case, first_name is know to be in employees and possibility that it may be in salaries is not considered:

from employees
select first_name
join salaries []
select first_name

In this case, first_name is known to be in both, so ambiguity error is raised:

table sal = (from salaries | select first_name)

from employees
select first_name
join sal []
select first_name # error here

As @snth put it, there are two options:

  1. Name references resolving to multiple wildcards should throw an error straight away at compile time, forcing the user to specify which parent table the column references.
  2. Ambiguity is ok and we simply produce the SQL and we let the underlying database/RDBMs resolve the ambiguity at runtime.

It was mentioned that the burden of the strict name resolution option could be lessened through tooling, which at authoring time could suggest the ambiguous parent object names as possible autocompletions.

With semantic branch, this option 2 became a little worse, because ambiguous idents are converted to s-strings earlier which means that anchoring does not know about them and may cause problems.

@aljazerzen aljazerzen added the language-design Changes to PRQL-the-language label Nov 30, 2022
@aljazerzen
Copy link
Member Author

Originally, I was for option 2, because it's convenient similar to how SQL operates.

But I vote for option 1.

This basically compiler saying "you have to provide more information" so it can create SQL queries that are not ambiguous. Recently I find that many of problem I need to fix in my professional life are caused because we allow certain expressions, commands or protocols to be ambiguous. In my experience it would actually save time if we would report that at compile-time.

On scale from -2 to +2, I feel +1 for changing this behavior from current option 2 to option 1.

@snth
Copy link
Member

snth commented Dec 1, 2022

Looking through the code changes in #1205, it appears to be mostly examples and tests.

I like how the new examples look because it makes it much easier for the reader to see what source table a column comes from.

The litmus test will be to see whether it will be annoying when writing queries. However, similarly to the experience with Rust, I suspect that after an initial adjustment period one will actually come to appreciate the concreteness this brings.

We have quite a few changes with this 0.3.x line of releases and it looks like we'll plan to bring quite a few new user facing syntax features in so I think this is a good time to try this.

It's a +1 from me (to go with option 1).

@mklopets
Copy link
Collaborator

mklopets commented Dec 1, 2022

+1 for option 1 from me as well!

@max-sixty
Copy link
Member

(FYI I have seen this and am reflecting! I was quite ardent on allowing the ambiguity, given how nice PRQL is for exploratory use as well as robust use. I do hear the points above and want to think through some cases; I will be "on" on Saturday, so will respond then at the latest)

@max-sixty
Copy link
Member

Broadly: I'm open to trying this and iterating on feedback. I'm still a -1, assuming we're taking averages rather than minimums

Context

PRQL has many benefits over SQL. Two of them are:

  • Better, faster errors
  • Friendly & efficient data exploration, because of the pipeline and the general language

On this issue, those two priorities are in opposition.

It is genuinely quite useful to be able to do

from employees
join salaries [==id]
group [city] (
  aggregate avg_salary = average salary
  )
]

...and not have to add in the table names, which are kinda obvious (to the user, not the compiler...) given the tables:

from employees
join salary [==id]
-group [city] (
-  aggregate avg_salary = average salary
+group [employees.city] (
+  aggregate avg_salary = average salaries.salary
  )
]

While it's ironic to say "trust SQL", I do place some weight in there being no SQL engine that forces these —

Alternatives

I think that this could be a lint, similar to clippy, which people can then opt-into. That enables the query to run, but also enables folks to spot potential future issues (e.g. if a city column gets added to the salaries table above later, it won't break)

When we build the DB backend, we could also use that to provide hints (I think this was @mklopets 's point), though I'm not sure whether that means it's worth forcing it now ("the help is coming") or leaving it until later until the help arrives.

I think rust does this quite well! Or the python typing ecosystem is getting good at this.

Compiler simplicity

I would place a decent amount of weight on keeping the compiler simple. I have not been involved enough here — a regret for me that I promise to fix — so I defer to @aljazerzen . Keeping the compiler simple will mean less work to build, less work to change, and easier for new contributors to join (the last of which is something we're not doing so well on...)

So to the extent this makes the compiler simpler, great. To the extent it's just "convert to an s-string", then it's less of an issue. (And to the extent this means we need to deal with more cases because we can't fall back to ambiguity, then that would detract).

Meta

On a meta level, I worry a little bit that as the language builders, we're likely overweighing Robustness and underweighing Exploration.

At the extreme, if folks are using PRQL as a just a target for other tools, then there's no real Exploration priority.


Thanks and forgive the delay. I'm fully back online now.

@aljazerzen
Copy link
Member Author

aljazerzen commented Dec 6, 2022

Context

As I wrote, it seems to me that stricter name checking may even improve exploration, because it eliminates debugging banal bugs that stem from compiler not understating exactly what you mean.

... no SQL engine that forces these --

This is true, but PRQL is able to operate with much less knowledge of schema compared to all RDBMS.

Compiler simplicity

The change is minimal - about 10 lines of code total. So this is not the concern.


Given all opinions expressed, I conclude to change the behavior to option 1.

For anyone in the future: feel free to add your opinion - at least at time of writing, reverting or adding an option to revert this behavior is not hard to do compiler-wise.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
language-design Changes to PRQL-the-language
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants