Notation for referencing previous pipeline steps/relations #2668

Open
snth opened this issue May 31, 2023 · 8 comments

@snth
Member

snth commented May 31, 2023

The discussion in #2655 gave me an idea for something that had been bothering me for a while but I didn't have a palatable solution/suggestion for.

To me the bigger problem in #2655 is that each step in the pipeline is a new relation and should really be explicitly referenceable. Power Query M Language takes the approach of explicitly requiring a name for each step. That is obviously quite verbose and not really desirable. Perhaps we could have a convenient syntax for this. My suggestion would be ^ like in git (HEAD^). This seems quite intuitive to me because it points upward.

So to expand on that, unqualified column references refer to columns in the current frame at that step in the pipeline. The ^ would allow you to reference parent frames and multiple ^ reference n-many frames up the stack.

Say you wanted to write pathological code like

from tbl
derive [x = x*x]        # squared
derive [x = x*tbl.x]    # cubed
derive [x = x*tbl.x]    # hypercubed / tesseracted / biquadrate
select [x=tbl.x, x_squared=^3.x, x_cubed=^^.x, x_hypercubed=^.x]

this would allow you to do it. There are obviously better ways of writing this code but I seem to recall there being cases where this was desirable.
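
To make the frame-stack reading concrete, here is a minimal Python sketch (not PRQL, and not a description of how the compiler works). It assumes the semantics of the example above, where a single ^ is the current frame and each extra ^ (or ^N) goes one frame further up the stack. The names frames and resolve are made up for illustration.

```python
# Toy model: every pipeline step pushes a new frame (a dict of column -> values).
# `frames` and `resolve` are hypothetical names, not part of PRQL.

def resolve(frames, ref):
    """Resolve '^.x', '^^.x' or '^3.x'; a single caret means the current frame."""
    carets, column = ref.split(".", 1)
    if not carets.startswith("^"):
        raise ValueError(f"not a caret reference: {ref!r}")
    suffix = carets[1:]
    # '^3' counts three frames from the top, '^^' counts one per caret.
    depth = int(suffix) if suffix.isdigit() else len(carets)
    return frames[-depth][column]


tbl = {"x": [2, 3]}
frames = [tbl]                                                            # from tbl
frames.append({"x": [v * v for v in frames[-1]["x"]]})                    # squared
frames.append({"x": [v * t for v, t in zip(frames[-1]["x"], tbl["x"])]})  # cubed
frames.append({"x": [v * t for v, t in zip(frames[-1]["x"], tbl["x"])]})  # hypercubed

x_squared = resolve(frames, "^3.x")     # [4, 9]
x_cubed = resolve(frames, "^^.x")       # [8, 27]
x_hypercubed = resolve(frames, "^.x")   # [16, 81]
```

Because the lookup counts frames from the top of the stack, appending one more step shifts what every existing reference points at.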

One case that would need special consideration is joins. HEAD^2 in git doesn't actually mean HEAD^^ as I was suggesting above (git uses HEAD~2 for that); rather, HEAD^2 in git means the second merge head in a merge. I think that would be equivalent to the right relation in a join for us. See "What's the difference between HEAD^ and HEAD~ in Git?" for more details. I actually wasn't aware of that until I researched it right now. In any case, git isn't really known for simplicity or great UX, so if we were to adopt this we should probably come up with our own consistent design.

@aljazerzen
Member

My mental model when writing a pipeline is the current relation. Reading your pathological code example, it would go like this:

from tbl
# a relation with possibly many unknown columns

derive [x = x * x]
# relation with column x in addition to the columns I had before

derive [x = x * tbl.x]
# relation with:
# - a new column x,
# - previous column x which is now unnamed, and 
# - all the columns from tbl

derive [x = x*tbl.x]
# relation with:
# - a new column x,
# - two columns previously named x, but now both unnamed, and
# - all the columns from tbl

With this model, it does not make sense to want columns from "two steps back", you always use only the last step. The fact that I was using the same name x for all new columns prevents me from referencing the squared or cubed x. But I think this is fine: I was explicitly using the same name, overriding the name in the relation.
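
A rough Python sketch of this "current relation" model, under the assumption that a relation is just an ordered list of (name, values) pairs and that deriving an existing name leaves the old column in place but unnamed. It ignores tbl-qualified references, and derive and lookup are illustrative names only.

```python
# Toy model of the "current relation" view; not PRQL's actual implementation.

def derive(relation, name, values):
    """Append column `name`; any older column with that name becomes unnamed."""
    unnamed = [(None if col == name else col, vals) for col, vals in relation]
    return unnamed + [(name, values)]


def lookup(relation, name):
    """Find a column by name in the current relation only."""
    matches = [vals for col, vals in relation if col == name]
    if len(matches) != 1:
        raise KeyError(name)
    return matches[0]


rel = [("x", [2, 3]), ("y", [10, 20])]                  # from tbl
rel = derive(rel, "x", [v * v for v in lookup(rel, "x")])  # derive [x = x * x]

names = [col for col, _ in rel]   # [None, 'y', 'x']
current_x = lookup(rel, "x")      # [4, 9]
```

The original x is still carried along, but nothing in the current relation can name it any more.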

@snth
Member Author

snth commented May 31, 2023

Agreed, that was a terrible example.

Here's something from my working life which perhaps makes more sense:

from fund_holdings
filter value_date >= @2023-01-01
group [value_date, fund_code] (
  aggregate [
    fund_market_value = sum market_value
    ]
  )
join ^ (==value_date && ==fund_code)
derive [weight = market_value / fund_market_value]

Here I changed the meaning of ^ slightly from my first example because there is no need to reference the current frame/relation so ^ should refer to the parent of the current relation.

Of course that query could be written in a more literate and self-documenting style, as below, but it would be cool to be able to write the query above when working interactively.

let fund_holdings_2023 = (
  from fund_holdings
  filter value_date >= @2023-01-01
  )

let fund_market_values = (
  from fund_holdings_2023
  group [value_date, fund_code] (
    aggregate [
      fund_market_value = sum market_value
      ]
    )
  )

from fund_holdings_2023
join fund_market_values (==value_date && ==fund_code)
derive [weight = market_value / fund_market_value]

@aljazerzen
Member

What about this:

from fund_holdings
filter value_date >= @2023-01-01
into fund_holdings_2023

from fund_holdings_2023
group [value_date, fund_code] (
  aggregate [
    fund_market_value = sum market_value
  ]
)
join fund_holdings_2023 (==value_date && ==fund_code)
derive [weight = market_value / fund_market_value]

Quite ergonomic.

@snth
Member Author

snth commented May 31, 2023

Ah yes, very good! I forgot about the "new" into transform.

One thing I disliked about my ^ suggestion is that it is quite brittle, i.e. if you insert another transform step then you are potentially suddenly referencing the wrong relation. What I really wanted was to "tag"/name a particular step in the pipeline and into does pretty much exactly that.

The only remaining question I have is whether into can function like tee in Linux when it is not the last step in a pipeline. Did you discuss that in #2427? With that, my example would be only one line longer than my original ^ example:

from fund_holdings
filter value_date >= @2023-01-01
into fund_holdings_2023
group [value_date, fund_code] (
  aggregate [
    fund_market_value = sum market_value
  ]
)
join fund_holdings_2023 (==value_date && ==fund_code)
derive [weight = market_value / fund_market_value]

I can also see reasons for why you might want to disallow that and want to force into to always be the last step in a pipeline because otherwise it obscures the origin of the fund_holdings_2023 relation.

One compromise to avoid that could be a two-pronged approach:

  1. Within a single pipeline, into functions like tee and creates a named relation that can be referenced within the same pipeline. However, references to that name are disallowed outside of that pipeline.
  2. For a relation created by into to be referenceable outside of the pipeline it is declared in, it has to be the last step of that pipeline.

I guess another way to phrase that would be that into transforms within a pipeline create local variables while an into at the end of a pipeline creates a global variable.
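
The two rules can be sketched in Python as a small classifier. This is only an illustration of the proposed scoping, assuming a pipeline is given as a list with one top-level step per element; classify_into_names is a made-up helper, not anything in PRQL.

```python
# Sketch of the proposed rule: a mid-pipeline `into` binds a local name,
# a final `into` binds a name visible outside the pipeline.

def classify_into_names(steps):
    """Return (local_names, exported_names) for one pipeline."""
    local, exported = [], []
    for i, step in enumerate(steps):
        parts = step.split()
        if len(parts) == 2 and parts[0] == "into":
            is_last = i == len(steps) - 1
            (exported if is_last else local).append(parts[1])
    return local, exported


pipeline = [
    "from fund_holdings",
    "filter value_date >= @2023-01-01",
    "into fund_holdings_2023",   # not the last step: local only
    "group [value_date, fund_code] (aggregate [fund_market_value = sum market_value])",
    "join fund_holdings_2023 (==value_date && ==fund_code)",
    "derive [weight = market_value / fund_market_value]",
    "into fund_weights",         # last step: visible outside the pipeline
]

local_names, exported_names = classify_into_names(pipeline)
```

The final into fund_weights step is added here only to show both cases in one pipeline.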

WDYT?

@max-sixty
Member

I like the tee idea! It's basically:

into foo
from foo

What's a good name? IIRC there's a pipeline lang that uses inspect for debugging; that's a similar idea.

(was not a fan of the ^ concept... agree too brittle)

(I also don't think it's high priority, given it's syntactic sugar for the into / from)
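
The sugar can be spelled out as a small rewrite, sketched here in Python under the assumption that a pipeline is a list with one top-level step per element. desugar_tee is a hypothetical helper for illustration, not an existing compiler pass.

```python
# Expand a mid-pipeline `into NAME` into the existing `into NAME` / `from NAME` pair.

def desugar_tee(steps):
    """Return the steps with every non-final `into NAME` followed by a new pipeline."""
    out = []
    for i, step in enumerate(steps):
        out.append(step)
        parts = step.split()
        is_into = len(parts) == 2 and parts[0] == "into"
        if is_into and i < len(steps) - 1:
            out.append("")                   # blank line ends the first pipeline
            out.append(f"from {parts[1]}")   # the next pipeline starts from the name
    return out


pipeline = [
    "from fund_holdings",
    "filter value_date >= @2023-01-01",
    "into fund_holdings_2023",
    "group [value_date, fund_code] (aggregate [fund_market_value = sum market_value])",
    "join fund_holdings_2023 (==value_date && ==fund_code)",
    "derive [weight = market_value / fund_market_value]",
]

expanded = desugar_tee(pipeline)
print("\n".join(expanded))
```

An into that is already the last step is left alone, so the rewrite only adds lines where the tee reading applies.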

@aljazerzen
Member

When we added into the major concern was that it is not obvious where a variable is defined. If we add tee, this is going to get worse.

Also, it would look like tee is part of a pipeline, like it's a function. But its semantics do not fit into our definition of a function. We'd be mixing stuff that looks the same, but is not the same.

@snth
Member Author

snth commented Jun 1, 2023

> When we added into the major concern was that it is not obvious where a variable is defined.

Agreed and I tried to address that with the local vs global name separation. I will try to write out some more examples to get more of a feel for what it would be like (to be added later - no time right now).

I wasn't thinking of introducing a new keyword, rather just overriding the semantics of into when not the last step in a pipeline.

However I think this could be quite a nice QOL feature in line with the Language Pragmatics Engineering article @max-sixty posted on Discord earlier.

We're adding a lot of features at the moment that will make PRQL a solid language; I'm thinking of the module system, types, etc. The current into semantics make a lot of sense in the context of a module where you want to prevent sprawl and spaghetti code. However, for me the primary use case for PRQL remains writing ad hoc queries in a Jupyter Notebook, DBeaver, a shell or some BI tool. Having the two extra lines of "<blank line>\nfrom" is not a deal breaker, but at the same time it seems like the kind of tedium that makes you ask "why do I have to do this?".


Another angle to consider, which just occurred to me when comparing @aljazerzen's and my fund_holdings_2023 examples above, is that with the current semantics of into, fund_holdings_2023 becomes a name in the current module when maybe I didn't want that. I just needed to reference that particular relation/line in the one pipeline I was working on and didn't want to make it a name that's exported to the module level. Currently there is no way for me to prevent it "polluting" the namespace.

@max-sixty
Member

I would vote to put the tee idea onto our conceptual "list of ideas we should keep brewing and come back to soon":

  • I kinda like it
  • It would be useful in the context of debugging a pipeline — i.e. you want a tool to show you intermediate values too — but that's not something we can do (and it's not close)
  • I'm skeptical of "just overriding the semantics of into when not the last step in a pipeline": that's not really orthogonal, and it could have confusing semantics when there's a missing from below
  • It has an easy replacement — into foo; from foo

@aljazerzen aljazerzen added the language-design Changes to PRQL-the-language label Jun 2, 2023

3 participants