feat: bigquery support #35

Merged: 8 commits, Dec 27, 2022
1 change: 1 addition & 0 deletions .gitignore
@@ -5,3 +5,4 @@ logs/
# CI CONFIG
integration_tests/ci_venv/
integration_tests/config/
integration_tests/ci_profiles/.user.yml
44 changes: 36 additions & 8 deletions README.md
@@ -3,7 +3,41 @@
[![](https://img.shields.io/static/v1?label=dbt-core&message=1.0.0&logo=dbt&logoColor=FF694B&labelColor=5c5c5c&color=047377&style=for-the-badge)](https://github.com/dbt-labs/dbt-core)
[![](https://img.shields.io/static/v1?label=dbt-utils&message=0.8.0&logo=dbt&logoColor=FF694B&labelColor=5c5c5c&color=047377&style=for-the-badge)](https://github.com/dbt-labs/dbt-utils/)

A DBT package designed to help SQL based analysis of graphs. This package currently only supports snowflake and postgres versions >= 10.
A dbt package designed to help with SQL-based analysis of graphs.

Supported adapters are:
- `dbt-snowflake`
- `dbt-postgres` (note: Postgres version 10 or later is required)
- `dbt-bigquery` (see the important note below)

Adapter contributions are welcome! Generally, new adapters require additions to the `macros/utils` folder, assuming the given database / engine supports recursive CTEs elegantly. In some cases (namely BigQuery), specific array handling was required.

It's recommended to use the unit test suite to develop new adapters. Please get in touch if you need assistance!
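For orientation, adapter-specific utilities in `macros/utils` typically follow dbt's `adapter.dispatch` pattern. The sketch below is illustrative only — the macro bodies are assumptions about how null-safe array aggregation could be handled per adapter, not this package's exact code:

```sql
-- Illustrative shape of an adapter-specific macro in macros/utils/.
-- Bodies are examples, not the package's actual implementation.
{% macro array_agg(field) %}
    {{ return(adapter.dispatch('array_agg', 'dbt_graph_theory')(field)) }}
{% endmacro %}

{# Postgres: array_agg keeps nulls, so filter them out explicitly #}
{% macro postgres__array_agg(field) %}
    array_agg({{ field }}) filter (where {{ field }} is not null)
{% endmacro %}

{# BigQuery: nulls must be ignored explicitly, or array_agg raises an error #}
{% macro bigquery__array_agg(field) %}
    array_agg({{ field }} ignore nulls)
{% endmacro %}
```

A new adapter would then only need to add its own `<adapter>__array_agg` implementation.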

**IMPORTANT NOTE**:
BigQuery is untested in the wild, and is quite brittle regarding the `recursive` keyword. Ensure you use __only__ macros without CTE nesting - for example, to use `largest_connected_subgraphs`, write SQL like:

```sql
-- model.sql
{{
largest_connected_subgraphs(...)
}}
```

rather than

```sql
-- model.sql
with recursive foo as (
{{
largest_connected_subgraphs(...)
}}
)
...
```

This is to ensure that the `recursive` handling works. A feature request to improve this behaviour has been submitted to Google; please upvote:
https://issuetracker.google.com/u/1/issues/263510050

----
## Introduction
@@ -473,7 +507,7 @@ Arguments:

This is an adapter specific macro for aggregating a column into an array.

This macro excludes nulls, and supports snowflake and postgres.
This macro excludes nulls.

**Usage:**
```sql
@@ -501,8 +535,6 @@ Arguments:

This is an adapter specific macro for appending a new value into an array.

This macro supports snowflake and postgres.

**Usage:**
```sql
select
@@ -524,8 +556,6 @@ Arguments:

This is an adapter specific macro for constructing an array from a list of values.

This macro supports snowflake and postgres.

**Usage:**
```sql
{% set list = ['field_one', 'field_two', "'hardcoded_string'"] %}
@@ -549,8 +579,6 @@ Arguments:

This is an adapter specific macro to test whether a value is contained within an array.

This macro supports snowflake and postgres.

**Usage:**
```sql
select
1 change: 0 additions & 1 deletion integration_tests/ci_profiles/.user.yml

This file was deleted.

18 changes: 11 additions & 7 deletions integration_tests/macros/cte_difference.sql
@@ -1,11 +1,15 @@
{% macro cte_difference(cte_1, cte_2, fields=[]) %}
(
select '{{cte_1}}' as _data_location, {{ fields|join(', ') }} from {{ cte_1 }}
except
select '{{cte_1}}' as _data_location, {{ fields|join(', ') }} from {{ cte_2 }}
union
select '{{cte_2}}' as _data_location, {{ fields|join(', ') }} from {{ cte_2 }}
except
select '{{cte_2}}' as _data_location, {{ fields|join(', ') }} from {{ cte_1 }}
(
select '{{cte_1}}' as _data_location, {{ fields|join(', ') }} from {{ cte_1 }}
{{ dbt_graph_theory.set_except() }}
select '{{cte_1}}' as _data_location, {{ fields|join(', ') }} from {{ cte_2 }}
)
{{ dbt_graph_theory.set_union(distinct=true) }}
(
select '{{cte_2}}' as _data_location, {{ fields|join(', ') }} from {{ cte_2 }}
{{ dbt_graph_theory.set_except() }}
select '{{cte_2}}' as _data_location, {{ fields|join(', ') }} from {{ cte_1 }}
)
) as diff
{% endmacro %}
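The `set_except` / `set_union` helpers used above keep this test macro portable, since BigQuery requires an explicit `DISTINCT` (or `ALL`) qualifier on set operators, while plain `EXCEPT` / `UNION` suffice elsewhere. A minimal sketch of how such a helper could dispatch — the bodies are assumptions, not the package's exact code:

```sql
-- Illustrative dispatch for a cross-adapter set operator keyword.
{% macro set_except() %}
    {{ return(adapter.dispatch('set_except', 'dbt_graph_theory')()) }}
{% endmacro %}

{# Postgres / Snowflake: plain EXCEPT is distinct by default #}
{% macro default__set_except() %}
    except
{% endmacro %}

{# BigQuery: the DISTINCT qualifier is mandatory on set operators #}
{% macro bigquery__set_except() %}
    except distinct
{% endmacro %}
```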
@@ -3,7 +3,7 @@ with recast as (
id,
vertex_1,
vertex_2,
order_date::date as order_date
cast(order_date as date) as order_date
from {{ ref('test_connect_ordered_graph_2_sg_date_data') }}
),

@@ -20,12 +20,12 @@ computed as (
required as (
select v.* from (
values
('1', 'A', 'B', '2022-01-01'::date),
('2', 'B', 'C', '2022-01-03'::date),
('3', 'C', 'D', '2022-01-05'::date),
('4', 'E', 'F', '2022-01-08'::date),
('5', 'F', 'G', '2022-01-16'::date),
('inserted_edge_1', 'D', 'E', '2022-01-07'::date)
('1', 'A', 'B', cast('2022-01-01' as date)),
('2', 'B', 'C', cast('2022-01-03' as date)),
('3', 'C', 'D', cast('2022-01-05' as date)),
('4', 'E', 'F', cast('2022-01-08' as date)),
('5', 'F', 'G', cast('2022-01-16' as date)),
('inserted_edge_1', 'D', 'E', cast('2022-01-07' as date))
) as v (id, vertex_1, vertex_2, order_date)
)

@@ -3,7 +3,7 @@ with recast as (
id,
vertex_1,
vertex_2,
order_date::date as order_date
cast(order_date as date) as order_date
from {{ ref('test_connect_ordered_graph_3_sg_date_data') }}
),

@@ -20,14 +20,14 @@ computed as (
required as (
select v.* from (
values
('1', 'A', 'B', '2022-01-01'::date),
('2', 'B', 'C', '2022-01-03'::date),
('3', 'C', 'D', '2022-01-05'::date),
('4', 'E', 'F', '2022-01-08'::date),
('5', 'F', 'G', '2022-01-16'::date),
('6', 'H', 'I', '2022-01-27'::date),
('inserted_edge_1', 'D', 'E', '2022-01-07'::date),
('inserted_edge_2', 'G', 'H', '2022-01-26'::date)
('1', 'A', 'B', cast('2022-01-01' as date)),
('2', 'B', 'C', cast('2022-01-03' as date)),
('3', 'C', 'D', cast('2022-01-05' as date)),
('4', 'E', 'F', cast('2022-01-08' as date)),
('5', 'F', 'G', cast('2022-01-16' as date)),
('6', 'H', 'I', cast('2022-01-27' as date)),
('inserted_edge_1', 'D', 'E', cast('2022-01-07' as date)),
('inserted_edge_2', 'G', 'H', cast('2022-01-26' as date))
) as v (id, vertex_1, vertex_2, order_date)
)

@@ -3,7 +3,7 @@ with recast as (
id,
vertex_1,
vertex_2,
order_time::timestamp as order_time
cast(order_time as timestamp) as order_time
from {{ ref('test_connect_ordered_graph_4_sg_timestamp_data') }}
),

@@ -7,7 +7,7 @@ with computed as (
-- recast because vertex_2 is all null in seed data, interpreted as int dtype
recast_computed as (
select
vertex::text as vertex,
cast(vertex as {{ type_string() }}) as vertex,
subgraph_id,
subgraph_members
from
@@ -8,7 +8,7 @@ with computed as (
required as (
select v.* from (
values
(null::text, null::text, null::text, array[null])
(cast(null as {{ type_string() }}), cast(null as {{ type_string() }}), cast(null as {{ type_string() }}), array[null])
) as v (graph_id, vertex, subgraph_id, subgraph_members)
where false
)
@@ -1,4 +1,4 @@
with computed as (
with recursive computed as (
{{ dbt_graph_theory.largest_connected_subgraphs(
input=ref('test_largest_connected_subgraphs_nd_data')
) }}
@@ -7,7 +7,7 @@ with computed as (
required as (
select v.* from (
values
(null::text, null::text, array[null])
(cast(null as {{ type_string() }}), cast(null as {{ type_string() }}), array[null])
) as v (vertex, subgraph_id, subgraph_members)
where false
)
44 changes: 22 additions & 22 deletions macros/connect_ordered_graph.sql
@@ -11,43 +11,43 @@
Additional fields are dropped - if these are required, they should be joined back in.

Required [minimal] table structure:
graph_id (Optional, text):
graph_id (Optional, string):
An identifier at the graph level (ie. if the table in question represents multiple graphs).
When this is not defined, it is assumed that the table represents the one graph.
edge_id (text):
edge_id (string):
An identifier of the edge (from vertex_1 to vertex_2). This field should be unique at the graph level.
vertex_1 (text):
vertex_1 (string):
The alias for the first (origin, for directed graphs) vertex of the given edge_id.
Nulls are allowed, and correspond to the given vertex_2 not being connected to any other vertices.
vertex_2 (text):
vertex_2 (string):
The alias for the second (destination, for directed graphs) vertex of the given edge_id.
Nulls are allowed, and correspond to the given vertex_1 not being connected to any other vertices.
ordering (timestamp, date or numeric):
The field corresponding to the order of the edges of the given graph. This is used to connect sensible nodes to each other
(ie. in order from one subgraph to the other).

It returns a query giving a vertex / graph level table with the following fields:
graph_id (text):
graph_id (string):
Identifies the graph based on the input table. If graph_id was not present in the input table, this field is always '1'.
vertex (text):
vertex (string):
Identifies the vertex that the given subgraph and subgraph_members corresponds to. This (as well as graph_id) defines the level of the table.
subgraph_id (text):
subgraph_id (string):
An identifier of the (connected) subgraph for the given vertices for the given edge.
This is unique at the graph level.
subgraph_members (array[Any]):
An array of the vertices that constitute the given subgraph. The data type of the array is that of the vertex_1 and vertex_2 fields.

Parameters:
input (text or a ref / source): The input model or CTE that follows the structure above.
edge_id (text): The field corresponding to the edge_id field described above.
vertex_1 (text): The field corresponding to the vertex_1 field described above.
vertex_2 (text): The field corresponding to the vertex_2 field described above.
ordering (dict[text, text]):
input (string or a ref / source): The input model or CTE that follows the structure above.
edge_id (string): The field corresponding to the edge_id field described above.
vertex_1 (string): The field corresponding to the vertex_1 field described above.
vertex_2 (string): The field corresponding to the vertex_2 field described above.
ordering (dict[string, string]):
A dict with key being the field corresponding to the ordering as descripted above,
and the value being the data type of the given field.
For example, { 'event_time' : 'timestamp' } corresponds to a field named event_time of type timestamp.
The data type must be one of: 'timestamp', 'date', 'numeric'.
graph_id (text, Optional, default = None): The field corresponding to the graph_id field described above.
graph_id (string, Optional, default = None): The field corresponding to the graph_id field described above.
#}
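Following the structure documented above, a minimal usage sketch of this macro (model and column names are hypothetical) — note it is invoked bare, per the README's advice against wrapping these macros in a nested CTE:

```sql
-- model: connected_orders.sql (illustrative example)
{{ dbt_graph_theory.connect_ordered_graph(
    input=ref('order_edges'),
    edge_id='id',
    vertex_1='vertex_1',
    vertex_2='vertex_2',
    ordering={'order_date': 'date'},
    graph_id='customer_id'
) }}
```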

{% set supported_ordering_types = ['numeric', 'timestamp', 'date'] %}
@@ -71,14 +71,14 @@ with subgraphs as (

enforce_graph_types as (
select
{{ graph_id if graph_id else '1::text'}}::text as graph_id,
{{ edge_id }}::text as edge_id,
{{ vertex_1 }}::text as vertex_1,
{{ vertex_2 }}::text as vertex_2,
cast({{ graph_id if graph_id else '1'}} as {{ type_string() }}) as graph_id,
cast({{ edge_id }} as {{ type_string() }}) as edge_id,
cast({{ vertex_1 }} as {{ type_string() }}) as vertex_1,
cast({{ vertex_2 }} as {{ type_string() }}) as vertex_2,
{% if ordering_type == 'timestamp' %}
{{ dbt_graph_theory.cast_timestamp(ordering_field) }} as ordering
{% else %}
{{ ordering_field }}::{{ordering_type}} as ordering
cast({{ ordering_field }} as {{ordering_type}}) as ordering
{% endif %}
from
{{ input }}
@@ -114,7 +114,7 @@ vertex_ordering as (
vertex,
ordering
from from_vertices
union
{{ dbt_graph_theory.set_union(distinct=true) }}
select
graph_id,
vertex,
@@ -201,13 +201,13 @@ include_new_edges as (
vertex_1 as {{ vertex_1 }},
vertex_2 as {{ vertex_2 }}
from enforce_graph_types
union all
{{ dbt_graph_theory.set_union(distinct=false) }}
select
{{ 'graph_id as ' ~ graph_id ~ ',' if graph_id }}
concat(
{{ "graph_id::text, '_'," if graph_id }}
{{ "cast(graph_id as " ~ type_string() ~ "), '_'," if graph_id }}
'inserted_edge_',
row_number() over (order by graph_id, vertex_1)::text
cast(row_number() over (order by graph_id, vertex_1) as {{ type_string() }})
) as {{ edge_id }},
{% if ordering_type == 'timestamp' %}
case
18 changes: 9 additions & 9 deletions macros/enforce_graph_structure.sql
@@ -9,18 +9,18 @@
This macro takes a table and enforces that it follows the graph table structure.

Parameters:
input (text or a ref / source): The input model or CTE that follows the structure above.
edge_id (text): The edge_id field of the given input.
vertex_1 (text): The vertex_1 field of the given input.
vertex_2 (text): The vertex_2 field of the given input.
graph_id (text, Optional, default = None): The (optional) graph_di field of the given input.
input (string or a ref / source): The input model or CTE that follows the structure above.
edge_id (string): The edge_id field of the given input.
vertex_1 (string): The vertex_1 field of the given input.
vertex_2 (string): The vertex_2 field of the given input.
graph_id (string, Optional, default = None): The (optional) graph_id field of the given input.
#}

select
{{ edge_id }}::text as {{ edge_id }},
{{ graph_id ~ '::text as ' ~ graph_id ~ ',' if graph_id }}
{{ vertex_1 }}::text as {{ vertex_1 }},
{{ vertex_2 }}::text as {{ vertex_2 }}
cast({{ edge_id }} as {{ type_string() }}) as {{ edge_id }},
{{ 'cast(' ~ graph_id ~ ' as ' ~ type_string() ~ ') as ' ~ graph_id ~ ',' if graph_id }}
cast({{ vertex_1 }} as {{ type_string() }}) as {{ vertex_1 }},
cast({{ vertex_2 }} as {{ type_string() }}) as {{ vertex_2 }}
from
{{ input }}
{% endmacro %}
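For illustration, the macro above could be invoked from a model or test like this (table and column names are hypothetical):

```sql
-- model: clean_edges.sql (illustrative example)
{{ dbt_graph_theory.enforce_graph_structure(
    input=ref('raw_edges'),
    edge_id='edge_id',
    vertex_1='vertex_1',
    vertex_2='vertex_2',
    graph_id='graph_id'
) }}
```

The result is the input relation with every identifier column cast to the adapter's string type via `type_string()`.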
3 changes: 2 additions & 1 deletion macros/exceptions/adapter_missing_exception.sql
@@ -2,7 +2,8 @@

{% set supported_adapters = [
"dbt_postgres",
"dbt_snowflake"
"dbt_snowflake",
"dbt_bigquery"
] %}

{{- exceptions.raise_compiler_error(