A very-WIP implementation of the PRQL plugin, for discussion #5982

max-sixty · 2022-10-02T08:27:47Z

As discussed with some folks in #5796 & offline

This is a very WIP implementation of the PRQL plugin. I'm not overjoyed about the code I've written, but I figured it was better to get it out in the open.

There are lots of notes in the code; to summarize:

I was hoping to write something not overly generic, but at least modular.
This gets some of the way there — it takes the "API" rather than "Jinja" path for most of it, as discussed in dbt & Python: language agnosticism #5796.
But then eventually passes the compiled model off to ModelParser as a SQL file.
I added some notes on a better design, but as discussed, I'd like to socialize this to ensure we're on the same page before investing further. I found the current model parser code quite coupled, and SQL is high in a very long inheritance tree, so making it elegant is not trivial. The python parser is impressive, but def bolted on!
This code has unit tests but not an integration test (I also haven't hand-tested it end-to-end; I suspect there may be parts that don't work).
Note that it includes a file _dbt_prql.py that would be in dbt-prql, but I'm including here for the PR so we can iterate without coordinating dependencies. (That file also does a hack to emulate prqlc, the rust compiler that's bundled with prql-python / dbt-prql)

Thanks in advance for folks' feedback and guidance.

Description

Checklist

I have read the contributing guide and understand what's expected of me
I have signed the CLA
I have run this code in development and it appears to resolve the stated issue
This PR includes tests, or tests are not required/relevant for this PR
I have opened an issue to add/update docs, or docs changes are not required/relevant for this PR
I have run changie new to create a changelog entry

max-sixty · 2022-10-03T07:49:39Z

It's in a state where it would be useful to:

discuss the current proposed implementation
discuss the best path between here and a nice generic approach, balancing elegance with practicality
confirm we're all aligned on having PRQL including in some form

On balancing elegance with practicality — my biggest concern is making this very generic would involve quite some refactoring & inversion, if I understand the code correctly (very possibly I don't though, please correct me!)

Specifically, the ModelParser seems to be the most congruent place to implement different languages — so each language would define a SQLModelParser / PythonModelParser / etc, with listing references, replacing them, etc (as discussed in #5796). That would likely involve splitting the parts of ModelParser that are SQL-specific from those that are generic. But ModelParser's mro is:

dbt.parser.models.ModelParser
dbt.parser.base.SimpleSQLParser
dbt.parser.base.SQLParser
dbt.parser.base.ConfiguredParser
dbt.parser.base.Parser
dbt.parser.base.BaseParser

...so it's long, and SQL is quite deep. So hence the heavy refactor that I think would be required to do this in an ideal way.

Comparing that to the current python parsing — python parsing is implemented with an additional method on the ModelParser (thanks @jtcohen6 for the original pointer, that was v helpful in getting me started). The current prql implementation takes a slightly more modular approach, still branching within ModelParser, but then calls out to a PrqlLanguageProvider, before passing the SQL back to ModelParser. In the more generic proposal above, the PrqlLanguageProvider would become a PrqlModelParser, and be composed with the parts of ModelParser that were relevant.

Is that right? Is there an easier approach here? What would be the best path there, assuming we don't want to wait until it's perfect before merging anything?

I'm open minded on the way forward. As discussed with folks, I'm very keen to have people write PRQL in dbt — I think it would be great for PRQL to have the reach of dbt & its users, and it would be great for dbt & its users to have an ergonomic language option. Given PRQL's focus is the data transformation, rather than project management & DAG management, there is a uniquely good fit between the two. So I'm happy to drive this and iterate to get this in.

(tests still fail b/c typing_extensions is not installed)

gshank · 2022-10-05T19:51:40Z

Lots of interesting things here!

We did look at creating a different node type for Python nodes, but in the end decided not to do it because there a fair amount of complications. Was there some reason you decided to go that direction? It also feels a bit wrong to have references to PrqlNode, if the actual prql code doesn't exist in the repo. So I'm thinking we might need something a bit more pluggable, such that if somebody install a language provider, a new value for the 'language' attribute becomes available, and the code to compile it is found by the plugin architecture.

max-sixty · 2022-10-06T01:58:21Z

Thanks for responding @gshank !

I very much agree with your assessment — there's a tradeoff between the "fair amount of complications" :) vs. having something pluggable.

In the currently proposed code, I avoided many of the complications, and so wasn't using the PrqlNode etc. I've just removed those for clarity.

At what level would you like to engage here? I realize these kinds of quite broad proposals can be difficult to connect with the lines of code. I've tried to summarize the current state & salient questions in prose above. Is that a reasonable starting point or do you prefer an alternative approach? Also happy to chat on Slack, or do another Zoom with the broader team if that lets us iterate faster.

jtcohen6

@max-sixty Thank you so much for the initial work to spec this out! This has prompted a lot of conversation for us over the past few weeks (including several mentions in casual conversation at Coalesce). There's definite interest in a future where dbt can support plugins for modeling languages, and PRQL would be an excellent early entrant in the category.

The question now is around how to get from here to there. Based on the code you've written thus far, the two biggest challenges I see are:

Decoupling our parsers (ModelParser et al) from Jinja-SQL, so that it's easier to provide a new language parser that works in parallel to the Jinja-SQL ones — with the awareness that we'd also be turning a very internal thing into a real public interface
Determining a plugin pattern/framework that we're happy with ([CT-1453] [Spike] A pattern for plugins #6184), since we're feeling the limits of the pattern we've been using for adapter plugins

My goal over the next few weeks is to identify an engineer on the team who's motivated by these challenges — and by what better interfaces here could help us unlock — to assist in the areas of technical design where I should certainly not have final (or even first) say :)

core/dbt/parser/_dbt_prql.py

jtcohen6 · 2022-11-01T10:57:05Z

test/unit/test_graph.py

@@ -60,6 +60,9 @@ def setUp(self):
        # Create file filesystem searcher
        self.filesystem_search = patch('dbt.parser.read_files.filesystem_search')
        def mock_filesystem_search(project, relative_dirs, extension, ignore_spec):
+            # Adding in `and "prql" not in extension` will cause a bunch of tests to
+            # fail; need to understand more on how these are constructed to debug.
+            # Possibly `sql not in extension` is a way of having it only run once.


There's a good chance this is some magic from years ago, which we'll need to reconsider/refactor. Good example of a place where we've hardcoded .sql to date, and need to instead be thinking about using a common provider to tell us the valid model file extensions

test/unit/test_graph_selector_methods.py

jtcohen6 · 2022-11-01T11:16:39Z

core/dbt/parser/base.py

+# TOOD: Currently `ModelParser` inherits from `SimpleSQLParser`, which inherits from
+# SQLParser. So possibly `ModelParser` should instead be `SQLModelParser`, and in
+# `ManifestLoader.load`, we should resolve which to use? (I think that's how it works;
+# though then why all the inheritance and generics if every model is parsed with `ModelParser`?)
+# class PRQLParser(
+#     ConfiguredParser[FileBlock, IntermediateNode, FinalNode], Generic[IntermediateNode, FinalNode]
+# ):


It sounds like this piece is the biggest current challenge, and the thing that would require the most refactoring to make pluggable.

Currently, SimpleSQLParser is used in all of these places:

ModelParser (even for Python models)

AnalysisParser

SingularTestParser

SqlBlockParser (for semantic layer queries)

SeedParser (!), even though render_with_context is a total no-op

We shouldn't be looking to create new node types for each language — these should still be models, analyses, tests, etc — but it's clear that we need some way to allow a language plugin to define its own parser, and then provide it to resources matching that language input.

It also looks like the real Jinja-SQL entrypoint is ConfiguredParser.render_with_context. We'd probably want to split that out, to live on a dedicated JinjaSQLParser instead!

Caveats:

There's some additional Jinja business in update_parsed_node_config_dict and update_parsed_node_name, to respect rules around hooks and database/schema/alias — those have more to do with model configuration than model language, and the fact that we've stretched Jinja into a "rules engine" for configuration that goes beyond its more-straightforwardly understood role as a SQL template.

I'd like to punt on snapshots, which are defined as Jinja blocks (and mostly config, anyway)

I don't know if we should even think/talk about macros, at this early point, but we know we'll want some way to define and reuse functions

max-sixty · 2022-11-01T21:56:27Z

Thanks for the review @jtcohen6, very much agree with your comments. As discussed offline, let's touch base in a few weeks on the best path forward.

jtcohen6 · 2022-11-22T15:59:25Z

core/dbt/parser/models.py

+            try:
+                context = self._context_for(node, config)
+                references = provider.list_references(node.raw_code)
+                sql = provider.compile(node.raw_code, references)


After taking another look here — with plans to do some hacking on/around this soon! — I think we'd want the actual .compile() call (PRQL → SQL transpilation) to happen within compilation, rather than during parsing. Here's where we do that for Python.

I think the goal here, within the parser, should just be to provide the set of references, and then pick up again at compile time. Something to also think about: Certain modeling languages will want a database connection at compile time (e.g. Jinja-SQL, Cody's prototype work on Ibis); some won't need one (e.g. Snowpark/PySpark Python, PRQL), and so shouldn't require one.

OK great, that seems very reasonable.

Would you have it as an if in Compiler like the current python implementation, or inherit a class for each lang; i.e. CompilerPython / CompilerPrql?

binste · 2023-09-07T15:16:08Z

It's great to see all the work that already went into making prql available in dbt in a convenient way! I'm currently developing dbt-ibis so that users can write dbt models as Ibis expressions and I've learned a lot from looking at dbt-prql and this PR. Thanks! Curious to see how this continues :)

wearpants · 2024-02-19T18:18:58Z

Is this dead?

max-sixty · 2024-02-26T06:48:26Z

Is this dead?

I would love to revive it! There are 22 👍s, so there is a decent amount of community interest... If there were interest from the dbt team I'd be happy to make a significant time investment to make something that works.

My read is that the dbt team has other priorities at the moment — understandable — and so I'm not sure there is a viable path without that changing.

But if there is a path forward, let me know!

For any future travelers, the precursor to this was https://github.com/PRQL/dbt-prql. But that doesn't work with some dialects, and the implementation is extremely hacky, so we're not supporting it, unfortunately. There's a note stating this on the repo.

jtcohen6 · 2024-02-26T15:10:57Z

@max-sixty Long time no talk!

You're right that we're focused on other priorities at the moment. The primary ones are adding unit testing (for SQL models) and stabilizing dbt-core's external interfaces (especially for adapter maintainers). There's more detail about this in our last roadmap post.

I still find this work fun & interesting. A number of the internals have changed since we sketched out the rough initial implementation over a year ago. I don't think we can do anything here before the v1.8 release (planned for April), but I would be up for taking another look after, once the dust has settled.

randypitcherii · 2024-02-29T14:12:14Z

I’m here for it!

github-actions · 2024-03-22T23:21:53Z

Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the contributing guide.

max-sixty requested a review from a team October 2, 2022 08:27

max-sixty requested review from a team as code owners October 2, 2022 08:27

max-sixty requested review from stu-k and emmyoop October 2, 2022 08:27

cla-bot bot added the cla:yes label Oct 2, 2022

max-sixty force-pushed the prql branch 5 times, most recently from 1a864fe to 7c5919d Compare October 3, 2022 05:44

max-sixty force-pushed the prql branch from 6462be7 to 92e0c92 Compare October 5, 2022 18:26

max-sixty added 8 commits October 5, 2022 11:34

A very-WIP implementation of the PRQL parser

506f2c9

Add a mock return from prql_python

fa3f172

Ignore types in the import hacks

5a8fd1e

(tests still fail b/c typing_extensions is not installed)

Revert to importing builtins from typing

ebff2ce

Always use the mock method to align the snapshot tests

c9572c3

flake

8eece38

Add test to test_graph.py

86eb68f

Add language on error nodes

bc8b650

max-sixty force-pushed the prql branch from 92e0c92 to bc8b650 Compare October 5, 2022 18:34

max-sixty added 2 commits October 5, 2022 18:02

Merge branch 'main' into prql

dddb0bf

Remove unused PrqlNode & friends

4729404

max-sixty added 2 commits October 6, 2022 13:00

Merge branch 'main' into prql

90223ed

Merge branch 'main' into prql

e0c32f4

jtcohen6 reviewed Nov 1, 2022

View reviewed changes

lostmygithubaccount mentioned this pull request Nov 1, 2022

[CT-1227] [Spike] Ibis on Snowflake dbt-labs/dbt-snowflake#272

Closed

jtcohen6 added discussion Team:Language labels Nov 17, 2022

lostmygithubaccount mentioned this pull request Nov 21, 2022

#5982 but for Ibis #6296

Closed

6 tasks

jtcohen6 reviewed Nov 22, 2022

View reviewed changes

aljazerzen mentioned this pull request Dec 1, 2022

Allowing dbt-prql to work with Postgres PRQL/prql#918

Closed

emmyoop removed their request for review March 14, 2023 20:25

jtcohen6 removed the Team:Language label Jul 19, 2023

dataders mentioned this pull request Sep 26, 2023

Submit scala jobs beta #8701

Closed

5 tasks

dbeatty10 added the community This PR is from a community member label Mar 22, 2024

jtcohen6 self-assigned this Apr 10, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

A very-WIP implementation of the PRQL plugin, for discussion #5982

A very-WIP implementation of the PRQL plugin, for discussion #5982

max-sixty commented Oct 2, 2022 •

edited

Loading

max-sixty commented Oct 3, 2022

gshank commented Oct 5, 2022 •

edited

Loading

max-sixty commented Oct 6, 2022

jtcohen6 left a comment

jtcohen6 Nov 1, 2022

jtcohen6 Nov 1, 2022

max-sixty commented Nov 1, 2022

jtcohen6 Nov 22, 2022

max-sixty Nov 25, 2022

binste commented Sep 7, 2023

wearpants commented Feb 19, 2024

max-sixty commented Feb 26, 2024

jtcohen6 commented Feb 26, 2024

randypitcherii commented Feb 29, 2024

github-actions bot commented Mar 22, 2024

A very-WIP implementation of the PRQL plugin, for discussion #5982

Are you sure you want to change the base?

A very-WIP implementation of the PRQL plugin, for discussion #5982

Conversation

max-sixty commented Oct 2, 2022 • edited Loading

Description

Checklist

max-sixty commented Oct 3, 2022

gshank commented Oct 5, 2022 • edited Loading

max-sixty commented Oct 6, 2022

jtcohen6 left a comment

Choose a reason for hiding this comment

jtcohen6 Nov 1, 2022

Choose a reason for hiding this comment

jtcohen6 Nov 1, 2022

Choose a reason for hiding this comment

max-sixty commented Nov 1, 2022

jtcohen6 Nov 22, 2022

Choose a reason for hiding this comment

max-sixty Nov 25, 2022

Choose a reason for hiding this comment

binste commented Sep 7, 2023

wearpants commented Feb 19, 2024

max-sixty commented Feb 26, 2024

jtcohen6 commented Feb 26, 2024

randypitcherii commented Feb 29, 2024

github-actions bot commented Mar 22, 2024

max-sixty commented Oct 2, 2022 •

edited

Loading

gshank commented Oct 5, 2022 •

edited

Loading