Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DAGnabit #9

Merged
merged 26 commits into from
Apr 3, 2016
Merged

DAGnabit #9

merged 26 commits into from
Apr 3, 2016

Conversation

drewbanin
Copy link
Contributor

Walk the DAG inferred form SQL select statements. Run CREATE statements in the appropriate order.

I tested this with the models repo, but unit tests are definitely a good idea

Additionally, models can be individually configured. The defaults are

materalized: false,
enabled: true

These configs can be overridden in the dbt_project.yml file.

models:
  base:
    materialized: false,
    enabled: true
  pardot:
    enabled: false
    pardot_visitoractivity:
      enabled: true
      materialized: true

This will result in all models outside of Pardot being created as views. Only the pardot_visitoractivity model will be created (as a table) within the Pardot directory.

@drewbanin drewbanin changed the title first cut DAGnabit Mar 23, 2016
@drewbanin drewbanin modified the milestone: v0.1.0 release Mar 23, 2016
@drewbanin
Copy link
Contributor Author

cc @jthandy

@jthandy
Copy link
Member

jthandy commented Mar 30, 2016

I'm a big fan of this approach. Agree that it could use refinement but this will significantly help my workflow; I had anticipated config for view/table to come later so that's great.

@cmerrick we're hoping to have the launch completed by the time you head to SF next week. do you anticipate any challenges working through code review within the next couple of days? if so, no prob but might try to push it to someone else.

@jthandy jthandy mentioned this pull request Mar 30, 2016
@@ -0,0 +1,93 @@
import sqlparse

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the choice of namespace here is confusing - it's under the compiler namespace, but it's used by the run task? I recommend just moving this into run.py for now

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cmerrick I imagine we're going to want to do much more in the compilation step than we're currently doing. For instance, sort/dist keys, column encodings, etc can all be interpolated at compile time. I think it makes sense to leave the actual sql parsing code in a separate dir for now. I did change the name from compile to build

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds like speculative design. We can always break it off into its own file later, if we decide we need to.

@drewbanin
Copy link
Contributor Author

@cmerrick is consolidating the linker.py code into the run.py file the only thing standing in the way of getting this merged?

@drewbanin
Copy link
Contributor Author

Cool, will revise asap

Sent from my iPhone

On Apr 1, 2016, at 6:14 PM, Christopher Merrick notifications@github.com wrote:

I agree it's not pretty, but I think this approach has structural issues

On Fri, Apr 1, 2016 at 6:09 PM, Drew Banin notifications@github.com wrote:

The alternative is putting a big config blob in the dbt_project.yml file.
Do you think that's preferable?

Sent from my iPhone

On Apr 1, 2016, at 5:57 PM, Christopher Merrick <
notifications@github.com> wrote:

I don't think we should mix configuration like the enabled and
materialized flags into the models directories. The models are source code,
those parameters are essentially compilation flags.

In a world where models can be packaged up and utilized by another
party, we don't want that user to have to modify anything in the source
directory. It would be like having to configure ruby on rails by modifying
its source code directly.


You are receiving this because you authored the thread.

Reply to this email directly or view it on GitHub


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub
#9 (comment)


You are receiving this because you authored the thread.
Reply to this email directly or view it on GitHub

def link(self, sql):
sql = sql.strip()
for statement in sqlparse.parse(sql):
if statement.get_type().startswith('CREATE'):
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so this is expecting the model to be specified as create view as... or create table as... and then re-writes that create statement if necessary? I see that's a convenient way to allow the user to specify the name of the thing, but I think it will either break or produce surprising results if anything other than the most generic forms for create table/view as... are used. For example, what's going to happen if someone writes create temporary table as ...?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code runs after the compilation step. Input to the system should be a SELECT statement. This is compiled to a CREATE statement, which is then run. I suppose this code is redundant, but it was originally intended to avoid running destructive SQL unintentionally

There is a single select statement per file, and the table/view name is inferred from the file name containing the SELECT statement

Sent from my iPhone

On Apr 1, 2016, at 6:29 PM, Christopher Merrick notifications@github.com wrote:

In dbt/task/run.py:

  •    order = nx.topological_sort(self.graph, reverse=True)
    
  •    for node in order:
    
  •        if node in self.node_sql_map: # TODO :
    
  •            yield (node, self.node_sql_map[node])
    
  •        else:
    
  •            pass
    
  • def register(self, node, sql):
  •    if node in self.node_sql_map:
    
  •        raise RuntimeError("multiple declarations of node: {}".format(node))
    
  •    self.node_sql_map[node] = sql
    
  • def link(self, sql):
  •    sql = sql.strip()
    
  •    for statement in sqlparse.parse(sql):
    
  •        if statement.get_type().startswith('CREATE'):
    
    so this is expecting the model to be specified as create view as... or create table as... and then re-writes that create statement if necessary? I see that's a convenient way to allow the user to specify the name of the thing, but I think it will either break or produce surprising results if anything other than the most generic forms for create table/view as... are used. For example, what's going to happen if someone writes create temporary table as ...?


You are receiving this because you authored the thread.
Reply to this email directly or view it on GitHub

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah ok, the simultaneous introduction of dependency-graphing and create-wrapping got me mixed up. That sounds right. In the future it would be good to provide that blurb when you create the PR, it's useful for understanding the approach.

@cmerrick cmerrick assigned drewbanin and unassigned cmerrick Apr 1, 2016
@cmerrick
Copy link

cmerrick commented Apr 1, 2016

back to you

@drewbanin
Copy link
Contributor Author

@cmerrick model configs are now be specified like this:

models:
  base:
    enabled: true
    materialized: false

  pardot:
    materialized: true
    pardot_visitoractivity:
      materialized: false

The precedence goes model, model group, base

By default, all models are enabled and are not materialized

@drewbanin drewbanin assigned cmerrick and unassigned drewbanin Apr 3, 2016

return model_config
config = model_configs['base'].copy()
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like that "base" is a magic word here, it means a model/model group can't be called that. Some alternative proposals:

  • Don't allow this to be set hierarchically. Either only allow a global setting, or only allow per-model settings, but not both.
  • Use regexps to specify the models to apply the settings to. In that case, "base" would become "*".
  • Use two different keys in the parent map - e.g. "default_model_settings" that is equivalent to "base", and "model_settings" that is an array of per-model settings

@cmerrick
Copy link

cmerrick commented Apr 3, 2016

The precedence goes model, model group, base

What's a model group? Could there be a conflict between a model group name and the model name?

@cmerrick cmerrick assigned drewbanin and unassigned cmerrick Apr 3, 2016
@drewbanin
Copy link
Contributor Author

@cmerrick we don't have a good name for this concept -- A model group is something like "pardot". The actual model is "pardot_visitoractivity", for instance. I think we threw around the name of "module" or "service", but I don't think either of those is descriptive.

Name conflicts should be avoided for clarity (and it might break something else...?) but I think this config isn't affected.

For example

models:
  pardot:
    enabled: false
    pardot:
      enabled: true

Should work fine. models/pardot/*.sql will be disabled, but models/pardot/pardot.sql will be enabled (because it overwrites it's parent).

I think it's a moot point, because I can't foresee a situation in which the directory (model group) name is the the same as the model name. We should clarify by convention (or in code) that you shouldn't do this

@cmerrick
Copy link

cmerrick commented Apr 3, 2016

So a model group is essentially a namespace? Could you have more deeply nested namespaces e.g. pardot.email.opens or something like that?

@drewbanin
Copy link
Contributor Author

@cmerrick yes indeed, a model group is a namespace. Right now, the namespace can't contain any subdirectories. I can foresee that being useful, but it's not implemented in a way that supports that currently

@drewbanin drewbanin assigned cmerrick and unassigned drewbanin Apr 3, 2016
@cmerrick
Copy link

cmerrick commented Apr 3, 2016

approved

@cmerrick cmerrick assigned drewbanin and unassigned cmerrick Apr 3, 2016
@drewbanin drewbanin merged commit 58a06c7 into master Apr 3, 2016
@jthandy jthandy deleted the dagnabit branch August 7, 2016 20:57
@jthandy jthandy restored the dagnabit branch August 7, 2016 20:57
@jthandy jthandy deleted the dagnabit branch August 7, 2016 20:57
yu-iskw pushed a commit to yu-iskw/dbt that referenced this pull request Aug 17, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants