-
Notifications
You must be signed in to change notification settings - Fork 218
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Aliases on relations #4751
Comments
Would there be a different way of creating aliases (e.g. |
I would tend to agree here, but for different reasons. Global dependencies between transforms of the pipeline make the queries non-composable, but relational aliases don't really expose these aliases as global "query variables". By this I mean that the following is not valid PRQL: from x = employees
select {age}
select {age, x.name} # error: unknown name x.name Relational aliases are instead encoded into the resulting type of the pipeline. It has the same effect as wrapping the tuple in the relation into another tuple. from employees # type: [{id = int, name = text}]
select {x = this} # type: [{x = {id = int, name = text}}]
select {x.name} # <-- this is a tuple field lookup In this understanding of PRQL, it is hard to imagine how So in this view, relation aliases are very weirdly specific behavior, which I would prefer PRQL not to have. |
@snth and @aljazerzen write:
I saw manifestations of this recently - I was joining several tables, some of which had the same column names, and I had trouble knowing whether to use a relation alias, and/or which one to use in subsequent select statements. I ultimately just started "trying stuff" and got it to work, but that's a sign of a less-than-perfect language design. I also saw a comment about BUT... If relation aliases were to go away, how would this be expressed?
Thanks |
That's a good question and one that we have been circling around for some time now. Here is what I propose: join construct a tuple with two unnamed fields, containing left and right tuples:
Edit: I've messed-up the last type, cantaloupe_id was not enclosed in a tuple |
I'm not sure that more use of unnamed fields is going to be easy for users to follow - it puts more weight on the structure of the query rather than less, which feels very un-PRQL-like. Could the fields inherit the names from their source relations?
Note that I also flattened the types a bit and changed the bias of the "original" relation from the right to the left. With this, |
+1... I can see the appeal of some ability to have a nested column structure — Footnotes
|
I was thinking along the same lines. I need to take some time to think through my ideas (and my naivety), and will respond by this weekend. |
Thank you for opening the issue @aljazerzen . My main bugbear is the "global dependencies" or, as you nicely put it, the breaking of composability of pipelines. That together with the fact that I really want to implement some backends other than SQL for PRQL - some of these have eager execution models where backreferences to previous pipeline steps won't work (think Pandas or stream processing) as that requires lazy evaluation (or fusion of steps). My mental (or "logical") model of PRQL is very simple and is really based on stream processing, for each pipeline step or transform it's "stream of tuples in, stream of tuples out". If we look at the signatures for the major transforms, we have (I started with
So We could look at some prior art for other pipeline based query builders as a start.
I'll address the previous two proposals in a separate comment. |
I often come back to M Lang (Power Query) because it's actually very similar to PRQL semantically (main difference is that each pipeline step is named) and in many cases generates SQL under the hood (you have to know where to look for it -> it's under "Native Query"). I think they deal with both the proposals by @aljazerzen and @kgutwin above. Table.Join (Docs)
```m
let
customers = Table.FromRecords({
[TenantID = 1, CustomerID = 1, Name = "Bob", Phone = "123-4567"],
[TenantID = 1, CustomerID = 2, Name = "Jim", Phone = "987-6543"]
}),
orders = Table.FromRecords({
[TenantID = 1, OrderID = 1, CustomerID = 1, Name = "Fishing rod", Price = 100.0],
[TenantID = 1, OrderID = 2, CustomerID = 1, Name = "1 lb. worms", Price = 5.0],
[TenantID = 1, OrderID = 3, CustomerID = 2, Name = "Fishing net", Price = 25.0]
})
in
Table.Join(
customers,
{"TenantID", "CustomerID"},
Table.PrefixColumns(orders, "Order"),
{"Order.TenantID", "Order.CustomerID"}
)
```
**Output:**
```m
Table.FromRecords({
[TenantID = 1, CustomerID = 1, Name = "Bob", Phone = "123-4567", Order.TenantID = 1, Order.OrderID = 1, Order.CustomerID = 1, Order.Name = "Fishing rod", Order.Price = 100],
[TenantID = 1, CustomerID = 1, Name = "Bob", Phone = "123-4567", Order.TenantID = 1, Order.OrderID = 2, Order.CustomerID = 1, Order.Name = "1 lb. worms", Order.Price = 5],
[TenantID = 1, CustomerID = 2, Name = "Jim", Phone = "987-6543", Order.TenantID = 1, Order.OrderID = 3, Order.CustomerID = 2, Order.Name = "Fishing net", Order.Price = 25]
})
```
I think this is roughly equivalent to from apple # type: [ {apple_id} ]
join banana (id) # type: [ {apple_id, banana = {banana_id}} ]
select {
apple_id,
banana.banana_id,
} Notably the ResultSet has already been flattened and Table.NestedJoin (Docs)
```m
Table.NestedJoin(
Table.FromRecords({
[CustomerToCall = 1],
[CustomerToCall = 3]
}),
{"CustomerToCall"},
Table.FromRecords({
[CustomerID = 1, Name = "Bob", Phone = "123-4567"],
[CustomerID = 2, Name = "Jim", Phone = "987-6543"],
[CustomerID = 3, Name = "Paul", Phone = "543-7890"],
[CustomerID = 4, Name = "Ringo", Phone = "232-1550"]
}),
{"CustomerID"},
"CustomerDetails"
)
```
The big difference here is that the ResultSet has the same number of rows as the LHS table and the matching rows from the RHS are grouped into a nested table. This is very nice from a UX perspective (and I find myself reaching for this quite often actually) but it doesn't really translate cleanly into SQL or other backends. However it's nice from a conceptual perspective.
Output:
I think the equivalent here would be: from apple # type: [ { apple_id } ]
join banana (id) # type: [ { apple_id, banana=[{banana_id}] } ] It's not clear how we would do the subsequent flattening or select transforms afterwards. My conclusionWhile I find the Therefore, while it's a bit of a pain to deal with the flattening in the I'll try to illustrate some ideas for this by means of a Python example in the next comment. |
Here's a Python example I hacked together because I thought it might make it easier to demonstrate things. Full code is at the end, I'll just focus on the examples for now: X = [dict(a=1, b=10), dict(a=2, b=20), dict(a=3, b=30)]
Xs = from_(X) | []
print(f"{Xs=}")
# Xs=[{'a': 1, 'b': 10}, {'a': 2, 'b': 20}, {'a': 3, 'b': 30}] The pipelines produce iterators so the My main idea was that the operand for Y = from_(X) | select(c='a+b') | filter_('c<=b') | select(b='b', c='c')
Ys = list(Y.eval(context={'b': 25}))
print(f"{Ys=}")
# Ys=[{'b': 25, 'c': 11}, {'b': 25, 'c': 22}] We can now use this Z1 = from_(X) | join_parameterised(Y) | []
print(f"{Z1=}")
# Z1=[{'b': 20, 'c': 11}, {'b': 30, 'c': 11}, {'b': 30, 'c': 22}] In order to do the self-joins, we need to make the input tuples available in the parameterised query. For my examples I used X2 = from_(X) | filter_('a==_.a+1') | select(_a='_.a', _b='_.b', a='a', b='b')
Z2 = from_(X) | join_parameterised(X2) | []
print(f"{Z2=}")
# Z2=[{'_a': 1, '_b': 10, 'a': 2, 'b': 20}, {'_a': 2, '_b': 20, 'a': 3, 'b': 30}] In general, it seems to me that the operand for a join always needs the following three parts:
We can therefore alternatively pass these as explicit arguments to Fs = from_(X) | join_explicit(X, on=['a==_.a+1'], fields=dict(_a='_.a', _b='_.b', a='a', b='b')) | []
print(f"{Fs=}")
# Fs=[{'_a': 1, '_b': 10, 'a': 2, 'b': 20}, {'_a': 2, '_b': 20, 'a': 3, 'b': 30}] My sense is that this would make the lines with joins too long so is probably a less good UX and it would also be less general than the As an aside, in this setup, repeated print('X x Ys:', from_(X) | from_(Ys) | [])
# X x Ys: [{'a': 1, 'b': 25, 'c': 11}, {'a': 1, 'b': 25, 'c': 22}, {'a': 2, 'b': 25, 'c': 11}, {'a': 2, 'b': 25, 'c': 22}, {'a': 3, 'b': 25, 'c': 11}, {'a': 3, 'b': 25, 'c': 22}] EDIT: What I'm trying to express with this example is that I believe this model is rich enough to express the required logic. How you efficiently parse and extract the required parameters from this for the compiler, I don't know. I've got more to say about some theoretical considerations but that will have to wait for another time. Appendix: Full codefrom collections import namedtuple
class Pipeline:
def __init__(self):
self.steps = []
def __or__(self, step):
if isinstance(step, list):
step.extend(self)
return step
self.steps.append(step)
return self
def __iter__(self):
return self.eval()
def eval(self, steps=None, context={}):
if steps is None:
steps = self.steps
i = len(steps)
if i>1 and isinstance(steps[i-2], let):
context = context.copy()
context.update(steps[i-2]._bindings)
if i>0:
transform = steps[i-1]
relation = self.eval(steps[:i-1], context)
for result in transform.eval(relation, context):
yield result
else:
yield None
class Transform:
def eval(self, relation, context={}):
raise NotImplemented()
class let(Transform):
def __init__(self, **bindings):
self._bindings = bindings
def eval(self, relation, context={}):
return relation
class from_(Transform):
def __init__(self, relation: str|list):
self._relation = relation._relation if isinstance(relation, Transform) else relation
def __or__(self, transform):
pipeline = Pipeline() | self
return pipeline | transform
def eval(self, relation, context={}):
for this in relation:
if this is None:
this = {}
for that in self._relation:
result = this.copy()
result.update(that)
yield result
class select(Transform):
def __init__(self, **fields) -> None:
self._fields = fields
def eval(self, relation, context={}):
for this in relation:
result = {}
for alias, expr in self._fields.items():
result[alias] = eval(expr, context, this)
yield result
class filter_(Transform):
def __init__(self, *conditions) -> None:
self._conditions = conditions
def eval(self, relation, context={}):
for this in relation:
if all(eval(cond, context, this) for cond in self._conditions):
yield this
class join_parameterised(Transform):
def __init__(self, pipeline) -> None:
self._pipeline = pipeline
def eval(self, relation, context={}):
for this in relation:
context = dict(_=namedtuple('_', this)(**this), **this)
for that in self._pipeline.eval(context=context):
yield that
class join_explicit(Transform):
def __init__(self, pipeline, on, fields, kind='inner') -> None:
# NOTE: Can't do right joins
self._pipeline = pipeline
self._on = [on] if isinstance(on, str) else on
self._fields = fields
self._kind = kind
def eval(self, relation, context={}):
_select = select(**self._fields)
_search = self._pipeline | filter_(*self._on) | _select
for this in relation:
context = dict(_=namedtuple('_', this)(**this), **this)
j = 0
for that in _search.eval(context=context):
yield that
j += 1
if j==0 and self._kind in {'full', 'left'}:
_missing = self._fields.keys() - this.keys()
# FIXME: Need schema info to fill the NULLs correctly for left joins.
yield next(_select.eval([{fld:None for fld in _missing}], context=context))
if __name__ == "__main__":
data = [dict(a=1, b=10), dict(a=2, b=20), dict(a=3, b=30)]
X = from_(data)
Xs = from_(X) | []
print(f"{Xs=}")
Y = from_(X) | select(c='a+b') | filter_('c<=b') | select(b='b', c='c')
Ys = list(Y.eval(context={'b': 25}))
print(f"{Ys=}")
Z1 = from_(X) | join_parameterised(Y) | []
print(f"{Z1=}")
X2 = from_(X) | filter_('a==_.a+1') | select(_a='_.a', _b='_.b', a='a', b='b')
Z2 = from_(X) | join_parameterised(X2) | []
print(f"{Z2=}")
Fs = from_(X) | join_explicit(X, on=['a==_.a+1'], fields=dict(_a='_.a', _b='_.b', a='a', b='b')) | []
print(f"{Fs=}")
# In this setup, repeated `from` transforms produce cross-products.
# Think of them as nested for loops in list iterator context.
print('X x Ys:', from_(X) | from_(Ys) | []) |
There is a method called val adam = Person("Adam").apply {
age = 32
city = "London"
}
println(adam) Similarly, nix has let
a = {
x = 1;
y = 2;
z = 3;
};
in
with a; [ x y z ] In prql this might look like: from apple # type: [ {apple = {apple_id, color}} ]
join banana (id) # type: [ {apple = {apple_id, color}, banana = {banana_id, color}} ]
select (with banana) {
apple.apple_id,
banana_id,
color,
} So aliases look more logical now: from apple # type: [ {apple = {apple_id, color}} ]
join banana = apple (id) # type: [ {apple = {apple_id, color}, banana = {apple_id, color}} ]
select (with banana) {
apple.apple_id,
banana_id,
color,
} wdyt? |
(I quite like that principle @grihabor! I think that's what we're going for throughout, need to think about the exact semantics / when they get triggered...) |
[Please forgive my verbosity - it helps me understand issues to write things out explicitly so I can follow this conversation.] My (loosey-goosey) understanding of Here's my understanding of the notation we're using. This query:
would result in the following successive outcomes [*], where all three tables have a column named
To carry my naive understanding further, it's fair to think of the result having six columns: If each column had a unique name (say, The problem comes when identical column names appear in multiple tables. [*] So...
Do I have this right? Many thanks. |
@snth recently expressed an opinion that relation aliases in PRQL are mistake and should be removed
By relational aliases we mean the name
x
in following queries:Currently, relational aliases have effect of assigning a name to a relational variable within the query:
@snth argues that relational aliases are a carryover from SQL and should be removed from PRQL because
a) it is not a fundamental part of Relational Algebra, and
b) breaks the local nature of PRQL transforms and puts in global dependencies between different pipeline steps.
The text was updated successfully, but these errors were encountered: