Model groups & access #6730

jtcohen6 (Maintainer) · 2023-01-25T17:01:19Z

Part of the larger initiative for Multi-project collaboration (#6725)

A law of nature: DAGs get messier as they get bigger. We'd like to provide constructs that make it easier for dbt project developers and maintainers to manage, and reason about, large DAGs.

There are two big ideas at play here:

Groups of models, each representing a coherent business domain or sequence of transformations, should be demarcated explicitly.
Access: Some models ought to be "public," and other models "private." Public models bring with them an intentional set of guarantees; someone has consciously marked them final & ready to use. Private models are all the intermediate, modular steps along the way, often materialized ephemerally or as views.

By developing constructs for groups and access within one project, we aim to provide mechanisms for scaling monoliths more gracefully. At the same time, we believe these same capabilities will extend to deployments that span multiple projects—whether monorepo or polyrepo.

Proposals

dbt developers can mark a model as "public."

This is an enum attribute: access: public | private. (In the future, there could be additional options. We're taking inspiration from access modifiers in object-oriented programming languages. If this is your first time hearing these terms, that's totally okay: inspiration != prerequisite.)

This should be a model-level attribute, not a configuration set & inherited for many models at once in dbt_project.yml. Every public model must be consciously and individually marked as such. This adds a teensy bit of friction, with the aim of ensuring intentionality.

Our initial & primary intent for access control is models. I expect readers of this discussion, and dbt developers, to spend 90% of their time thinking about access control for models. As with every new feature, though, we need to ask: What about all the other resource types?

Can seeds + snapshots be marked “public”? Or should we require that they be wrapped in a public model?
Exposures? (What would a downstream developer do with a public exposure, except know that a given upstream model is being exposed?)
Metrics (& entities)? In general, I’d expect these to be public. Still, it should probably be possible to define “private” metrics that aren’t directly accessible to the semantic layer, especially intermediate metrics that serve as building blocks for derived metrics.

dbt developers can define groups.

A group may contain models ("public" & "private"), seeds, snapshots, tests, analyses, ~~exposures~~, entities, metrics.

Each resource (each model) may only belong to one group.
Source tables are already implicitly grouped by the source they belong to.
Macros aren't grouped. Their namespace is the project.

Update: In the first cut, exposures cannot belong to groups, and can't reference private models at all. They can define an owner, which could be the same as the owner of a resource group.

A model's group should be configured explicitly.

dbt should not implicitly infer group membership on the basis of other model properties or file path.
[TBD] Should it be possible to configure group in dbt_project.yml, e.g. for all models in a subdirectory?

Groups can be selected (group:), similar to the tag: or fqn: selection methods.

Groups must define an owner (dict<name: str, email: str, …>), which then applies to objects within the group.

If you want to use someone else's public model, you should know who they are! If you want to use their private model, you should know whom to ask.
This is analogous to the loader field in sources, and equivalent to the owner property already defined for exposures.
When defined, we should show this owner metadata in auto-generated project documentation, rather than what we show currently for models (the database user that created the model’s physical table).
The owner field should appear in node_info, for relevant events & structured logs, which could eventually enable more granular notifications.

This could look like:

# models/marts/github/github.yml

groups:
  - name: github
    owner:
      # only 'name' or 'email' is required
      name: Jeremy
      email: data@jer.co
      # anything else is optional metadata
      slack: talk-jerco-memes
      github: jtcohen6
      whatever_else: you want

models:
  - name: int__github_issue_label_history
    group: github  # explicit opt-in
    access: private
  - name: fct_github_issues
    group: github
    access: public
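
To make the group definition above concrete, here is a minimal Python sketch (not dbt's implementation) of two behaviors described in this proposal: checking that a group's owner carries a name or an email, and picking out the members of a group the way a group: selector would. The `Group` and `Resource` classes and both helper functions are hypothetical names invented for illustration, and the "name or email" rule is my reading of the comment in the YAML above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Group:
    name: str
    owner: dict = field(default_factory=dict)


@dataclass
class Resource:
    name: str
    group: Optional[str] = None  # each resource belongs to at most one group


def validate_owner(group: Group) -> None:
    """Assumed rule: an owner needs at least a 'name' or an 'email'."""
    if not (group.owner.get("name") or group.owner.get("email")):
        raise ValueError(f"Group '{group.name}' must define an owner name or email")


def select_by_group(resources: list, group_name: str) -> list:
    """Mimic a 'group:<name>' selector by returning the names of member resources."""
    return [r.name for r in resources if r.group == group_name]


github = Group(name="github", owner={"name": "Jeremy", "email": "data@jer.co"})
validate_owner(github)  # raises nothing: the owner has a name and an email

resources = [
    Resource("int__github_issue_label_history", group="github"),
    Resource("fct_github_issues", group="github"),
    Resource("stg_payments"),  # hypothetical model that is not in any group
]
selected = select_by_group(resources, "github")
```

Extra owner keys such as slack or github pass through untouched here, matching the idea that anything beyond name and email is optional metadata.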

Groups are not intended as a mechanism for model namespacing. Resource names must still be globally unique within one project. (In a future with multi-project deployments, we should support multiple models with the same name, so long as they're defined in separate projects. That will mean finally tackling a longtime limitation: #1269.)

Only public models can be referenced outside their group.

In other words, private models can be ref'd within the same group, but they cannot be ref'd by resources outside of their group. This enables cleaner dependency chains, with fewer interwoven arrows.

What about models not in a group? My sense is, they should be neither public nor private. They can be ref'd elsewhere, and they also aren't held to the minimum standard for all public models in a project (see below). Motivation: Preserve status quo, and avoid creating lots of tech debt for existing projects.

As soon as a model is added to a group, it becomes private, until explicitly marked public.

More devilish details:

A ref call to a private model in a different group should raise an error: Model 'model.my_project.my_model' depends on a node named 'private_model', which is private in a different group.
Specific instances of generic tests defined on a "grouped" model should also be members of that model's group. E.g. If a private model int_payments_aggregated belongs to the group "finance," a unique test on that private model also belongs to the "finance" group, and is allowed to ref the model it is testing. However, to define a relationships test between two private models, they do actually need to be in the same group!
Exposures can depend on a private model, if (and only if) both the exposure and the model are in the same group. That said, exposing a private model isn't recommended!
[Future] When we add support for cross-project ref, only models explicitly marked public can be "exported" and referenced from other projects. A ref call to a private model in a different project should raise an error: Model 'model.my_project.my_model' depends on a node named 'other_project.private_model' which was not found
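
The referencing rules in this section can be sketched in a few lines of Python. This is only an illustration of the proposal as written here (ungrouped models are neither public nor private; grouped models are private until marked public), not how dbt implements it. `Model`, `PrivateRefError`, and `check_ref` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


class PrivateRefError(Exception):
    """Raised when a model refs a private model that lives in another group."""


@dataclass
class Model:
    name: str
    group: Optional[str] = None
    access: Optional[str] = None  # "public", "private", or None if unset

    def effective_access(self) -> Optional[str]:
        if self.group is None:
            return None  # ungrouped: neither public nor private
        return self.access or "private"  # grouped models default to private


def check_ref(source: Model, target: Model) -> None:
    """Raise if `source` is not allowed to ref `target` under the proposed rules."""
    if target.effective_access() == "private" and source.group != target.group:
        raise PrivateRefError(
            f"Model '{source.name}' depends on a node named '{target.name}', "
            "which is private in a different group."
        )


int_history = Model("int__github_issue_label_history", group="github")
fct_issues = Model("fct_github_issues", group="github", access="public")
finance_report = Model("finance_report", group="finance")  # hypothetical model
legacy_model = Model("legacy_model")  # hypothetical ungrouped model

check_ref(fct_issues, int_history)      # same group: allowed
check_ref(finance_report, fct_issues)   # public model: allowed
check_ref(finance_report, legacy_model) # ungrouped target: allowed
```

Calling `check_ref(finance_report, int_history)` raises `PrivateRefError`, because the target defaults to private and sits in a different group.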

Public models ought to be "contracted," with a reliable (minimum) set of guarantees.

For more on model contracts, see #6726

We should implement a sensible & opinionated default. In my opinion:

Every public model should be "contracted," whereby all columns are explicitly named & typed.
The public model and all its columns should have a description.
The public model should have at least one unique test (= validated primary key)

Users may optionally define their own set of expectations, overriding the default, that would be checked against every public model in the project.

These expectations should be defined in a separate file. (Teams can take advantage of CODEOWNERS rules, e.g. to require reviews from repository maintainers any time these expectations are updated.)
These rules would be validated during parsing. The idea is not for public models to magically inherit these configurations, but simply to make sure that they match up.
For example, a data team may want to enforce that every public model has persist_docs enabled (for integration with an external data catalog), is materialized as a view (on top of an underlying private table), and has at least a certain number of data quality tests. Imagine something like:

# public_models.yml
description: true  # every public model must be described
config:  # every public model must match these configs
  constraints_enabled: true
  persist_docs:
    relation: true
    columns: true
  materialized: view
columns:
  description: true  # every column must be described
tests:  # matches 'test_name', with optional package prefix
  unique: 1  # at least one unique test, on any column
  installed_package.totally_custom_test: 3  # at least 3 of whatever this is
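
As a rough illustration of "validate, don't inherit," the Python sketch below checks one model against a set of expectations shaped like the pseudo-config above. Everything here is an assumption made for the example: the `check_public_model` function, the dictionary shapes for `rules` and `model`, and the idea that a model's tests are available as a flat list of test names.

```python
def check_public_model(model: dict, rules: dict) -> list:
    """Return a list of human-readable violations (empty if the model conforms)."""
    problems = []

    # Model-level description
    if rules.get("description") and not model.get("description"):
        problems.append("model is missing a description")

    # Configs must match exactly; nothing is inherited or filled in
    for key, expected in rules.get("config", {}).items():
        if model.get("config", {}).get(key) != expected:
            problems.append(f"config '{key}' must be {expected!r}")

    # Column-level descriptions
    if rules.get("columns", {}).get("description"):
        for col_name, props in model.get("columns", {}).items():
            if not props.get("description"):
                problems.append(f"column '{col_name}' is missing a description")

    # Minimum number of tests per test name
    for test_name, minimum in rules.get("tests", {}).items():
        count = model.get("tests", []).count(test_name)
        if count < minimum:
            problems.append(
                f"needs at least {minimum} '{test_name}' test(s), found {count}"
            )
    return problems


rules = {
    "description": True,
    "config": {"materialized": "view"},
    "columns": {"description": True},
    "tests": {"unique": 1},
}
model = {
    "name": "fct_github_issues",
    "description": "One row per GitHub issue",
    "config": {"materialized": "table"},
    "columns": {"issue_id": {"description": "Primary key"}, "title": {}},
    "tests": ["not_null"],
}
problems = check_public_model(model, rules)
```

For this sample model, `problems` holds three entries: the materialization mismatch, the undescribed `title` column, and the missing unique test.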

For totally custom & complex validation logic (e.g. "every column named email should have a BigQuery policy tag, a dbt pii tag, and a description containing the word 'pseudonymized'"), these rules could, as they can today, be written in:

Jinja macros, enforced at compile/runtime via hook (a la dbt_project_evaluator and dbt_meta_testing)
Custom scripts that parse dbt metadata artifacts (manifest.json)
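
A custom script of the second kind might look like the Python sketch below. It covers only part of the example rule (the pii tag and the word "pseudonymized" in the description) and leaves out the BigQuery policy tag. It assumes that the manifest has a top-level "nodes" mapping whose model entries carry "resource_type" and a "columns" mapping with "description" and "tags"; check those keys against the manifest your dbt version actually writes. The function name and the sample data are made up.

```python
import json


def find_unprotected_email_columns(manifest: dict) -> list:
    """List 'unique_id.column' for email columns missing the pii tag or wording."""
    findings = []
    for unique_id, node in manifest.get("nodes", {}).items():
        if node.get("resource_type") != "model":
            continue
        for col_name, col in node.get("columns", {}).items():
            if col_name.lower() != "email":
                continue
            has_tag = "pii" in col.get("tags", [])
            has_wording = "pseudonymized" in col.get("description", "").lower()
            if not (has_tag and has_wording):
                findings.append(f"{unique_id}.{col_name}")
    return findings


# In practice the manifest would be read from disk, for example:
#   with open("target/manifest.json") as f:
#       manifest = json.load(f)
# A tiny hand-written stand-in keeps this example self-contained.
sample_manifest = json.loads("""
{
  "nodes": {
    "model.my_project.dim_users": {
      "resource_type": "model",
      "columns": {
        "email": {"description": "Pseudonymized email", "tags": ["pii"]}
      }
    },
    "model.my_project.stg_users": {
      "resource_type": "model",
      "columns": {
        "email": {"description": "Raw email", "tags": []}
      }
    }
  }
}
""")
findings = find_unprotected_email_columns(sample_manifest)
```

A CI job could fail the build whenever `findings` is non-empty.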

Groups can be visualized

One of the biggest eventual benefits of sorting models into groups, slowly but surely, is enabling users to make visual sense of large & complex DAGs.

Our team has not been able to meaningfully invest in new features for dbt-docs. While I don't anticipate that changing, there are a few low-effort & high-value additions we may want to shoot for:

DAG viz selection by group. (Even if we can't "roll up" to groups, or demarcate multiple groups simultaneously, just viewing all resources in one group is a good start.)
Within model view, show its access rule (public or private).
Within the DAG viz, an option to visually distinguish between public and private models.
A view of all public models with the same owner (grouped by name).

[Bonus content]

[Future] More ergonomic configuration?

(This "paper cut" issue has a lot of upvotes, and is never far from my mind!)

We might want to add groups as a new rung in the configuration ladder. If users could set some group-level config that cascades down to member models, they would be able to move some of that config out of dbt_project.yml. This does risk additional confusion, though, in trying to figure out where a model's configuration is coming from. We could always add this later.

# dbt_project.yml
models:
  marts:
    some_config: some_default_value

# models/marts/github/github.yml

groups:
  - name: github
    config:
      +some_config: override_default_value

models:
  - name: int__github_issue_label_history
    group: github
    # no model-level config -- uses group-level 'override' defined above
  - name: fct_github_issues
    group: github
    config:
      some_config: override_again  # this one wins
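
The precedence implied by this example (project default, then group, then model) can be sketched as a simple layered merge. This is an illustration of the proposed ladder only, not existing dbt behavior. `resolve_config` is a hypothetical helper, and the group-level key is written without the leading + for simplicity.

```python
def resolve_config(project_config: dict, group_config: dict, model_config: dict) -> dict:
    """Merge config layers; later (more specific) layers win on conflicts."""
    resolved = {}
    for layer in (project_config, group_config, model_config):
        resolved.update(layer)
    return resolved


project_config = {"some_config": "some_default_value"}
group_config = {"some_config": "override_default_value"}

# int__github_issue_label_history sets nothing itself, so the group value applies
int_model = resolve_config(project_config, group_config, {})

# fct_github_issues sets its own value, which wins over both other layers
fct_model = resolve_config(
    project_config, group_config, {"some_config": "override_again"}
)
```

The risk mentioned above shows up even in this toy version: to explain a resolved value, a developer now has three places to look instead of two.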

[Aside] What's the distinction between a "public model" and an "entity"?

If you've been following the discussion about adding more semantic information to dbt (#6644), and the proposal for entities as a new node type, you might be wondering: What makes a model worthy to be "public," and what makes it worthy to power an entity?

This is a subtle distinction! I expect many entities to be built on top of public models, and to leverage those public models' contracted metadata (column names + data types) as a way to provide richer dimensions for semantic queries. We had several conversations about whether these ought to be one & the same, but ultimately decided that they deserve to be two separate constructs. Why?

Entities are the canonical representation of a business concept (customers, orders) for downstream querying ("semantic"). Public models represent a logical dataset with a set of guarantees and clear ownership. If you'll indulge me in a metaphor: Database tables are raw materials; models are the means of logical production; public models are finished goods; entities are the packaging (declared interface) for those goods; metrics their clearly defined directions for intended use.

jtcohen6 (Maintainer, Author) · 2023-03-26T20:26:41Z

Updates after a few months of implementation work!

Docs (beta!): https://docs.getdbt.com/docs/collaborate/publish/model-access

`protected` models

We added a third access modifier, protected, which will be the default for all models. A model with this access level can be referenced by another model in the same project, even if they're in different groups. When we roll out support for cross-project ref, referencing a protected model from another project will be prohibited; only public models will be allowed. Our motivation: Make it easy for folks with existing (tangled & thorny) DAGs to start adopting groups, even if lots of other models are reaching in and ref'ing should-be-private models that are members of that group. Slowly but surely, that group's owner can define better interface boundaries, and convert those protected models into truly private ones.
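
With three access levels, the rule for "who may ref whom" can be summarized in a small Python sketch. It reflects the behavior described in this update rather than dbt's actual code. `Model` and `can_ref` are hypothetical names, and the cross-project branch describes planned behavior.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Model:
    name: str
    project: str
    group: Optional[str] = None
    access: str = "protected"  # the default for all models


def can_ref(source: Model, target: Model) -> bool:
    """Return True if `source` may ref `target` under the three access levels."""
    if target.access == "public":
        return True  # public models are available everywhere
    if source.project != target.project:
        return False  # only public models cross project boundaries
    if target.access == "protected":
        return True  # same project, any group
    # private: only members of the same group may ref it
    return target.group is not None and source.group == target.group


# All model and project names below are made up for the example.
stg = Model("stg_github_issues", project="core", group="github", access="private")
dim = Model("dim_repos", project="core", group="github")  # protected by default
fct = Model("fct_github_issues", project="core", group="github", access="public")
finance = Model("finance_report", project="core", group="finance")
external = Model("partner_dashboard", project="partner")

same_project_other_group = can_ref(finance, dim)   # protected target
other_project_protected = can_ref(external, dim)   # blocked across projects
other_project_public = can_ref(external, fct)      # public target
other_group_private = can_ref(finance, stg)        # private target
```

This is what makes incremental adoption possible: a model added to a group stays protected, so existing refs from other groups keep working until the owner tightens it to private.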

Kicked out of scope

We aren't going to implement "publication standards" (shown above in pseudo-code as public_models.yml) in the near future. This is a big topic, and we wouldn't be doing it justice if we were to sneak it into v1.5 as a micro-feature. More on this:

[CT-2194] publication standards for public models #7062

As signalled above, we also aren't going to use groups as another rung on the hatrack off which to hang hierarchical configuration. We are, however, going to revisit the ways in which dbt's current approach to configs/properties/etc feels confusing:

[CT-2296] [Spike] Remove the distinction between configs and non-configs #7157

Finally - stay tuned for more to come soon on our updated plans for entities, and how all these new constructs (model ownership, et al) will integrate with the revamped dbt Semantic Layer :)

0 replies

yu-iskw · 2023-05-08T08:37:48Z

@jtcohen6 I would like to allow only certain groups to access some models. For me, the protected type is too broad to control data access in a large project. There are a couple of ways to achieve that. First, we could support composite groups: a group can contain subgroups, and models in a subgroup can access private models in another subgroup of the same parent group. Second, we could declare which other groups can access private models in a group.

The following image shows what I want to do. Assume we have four groups in a dbt project. Some models managed by group1 can be accessed from models managed by group4. Other models, managed by group3, can't refer to models in group1.

If multi-project deployments are supported in the near future, we could also split a large dbt project to take advantage of the protected type. This might also be related to introducing namespaces for dbt models.

2 replies

jtcohen6 May 8, 2023
Maintainer Author

> If multi-project deployments are supported in the near future, we could also split a large dbt project to take advantage of the protected type. This might also be related to #1269

This is just what I was thinking! That works so long as it's appropriate to split Group 3 out into a project separate from Groups 1, 2, & 4, which should indeed remain groups in the same project. Groups 1 + 2 would feed protected models into Group 4, while restricting access to Group 3.

yu-iskw May 10, 2023

I got it. That would work, but I am a bit anxious about a couple of points. I would like to discuss them.

First, because an agile organization tends to change its structure frequently, it might be difficult to migrate models across dbt projects to keep up with each change. Ownership of dbt models can change, too. If we separate dbt projects, it might be hard to move models safely between them.

Second, we would like to group models from several perspectives. Ownership is the obvious one: we want to clarify who manages which models. Another is data segregation. Even if an organization runs a single service, regulations (for instance, on data privacy) can require several segregated sets of data. We would like to group dbt models by data segregation in addition to ownership, so that we can strictly prohibit data exchange between segregated sets.

dlaplante75 · 2023-05-10T23:27:54Z

It seems to me that groups can only be declared at the root of a models.yml file. If we can assign a group to a folder (for all models inside it) from the project file, wouldn't it make more sense to be able to declare groups in the project file as well?

Or am I missing something?

0 replies

gnilrets · 2023-07-17T18:58:07Z

We tried to implement groups and access in our single dbt project. I was originally thinking we'd use them to distinguish between models that are meant to be consumed by data mart owners vs models that are just intermediates in the data pipeline. For example, I wanted to define one "main" group and multiple "mart" groups. The "main" group would contain all of our staging models and the final Kimball-lite dimensional models, which are maintained by the core analytics engineering team. The "mart" models contain various aggregations and joins of the "main" models but shouldn't usually reference staging models, and are maintained by specific teams. Therefore, I wanted to make all of my staging models private and all of the main models public. I was hoping to write a dbt_project.yml like this:

models:
  mainspring:
    staging:
      +schema: staging
      +group: main
      +access: private
    main:
      +group: main
      +access: public
    finance_mart:
      +group: finance
      +access: private # then tag specific mart models as public
    hr_mart:
      +group: finance
      +access: private

Unfortunately, this doesn't work because I can't define access in dbt_project.yml. I would have to ensure that each staging model had access: private, which is too easy to forget or copy-paste incorrectly.

I get the motivation that defining any model as public should be deliberate (although maybe we should be able to make that mistake). However, having the default access as protected instead of private makes this feature not particularly useful for single-project setups.
Our teams aren't big enough to justify splitting it out into multiple dbt projects, but I was hoping this groups/access feature could help to enforce some conventions we're trying to follow. With its current limitations, this feature doesn't quite work for us.

1 reply

dbeatty10 Aug 1, 2023
Maintainer

Thanks for this feedback @gnilrets!

You might want to follow along with #7619 (comment)
