Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable dbt python models via serverless Dataproc and create example model #2346

Merged
merged 4 commits into from
Jun 1, 2023

Conversation

atvaccaro
Copy link
Contributor

@atvaccaro atvaccaro commented Mar 3, 2023

Description

I had to enable Google Private Access on our default VPC, and also grant access in the prod project (BigQuery and GCS) to the staging default compute service account since local dbt runs use the staging project.

Still waiting on dbt-labs/dbt-bigquery#636 but we can fork if necessary
Blocked on dbt-labs/dbt-bigquery#609

We had a meeting today (2023-05-31) to review the tech spec, so I'd like to merge this ahead of a to-be-scheduled pairing session to test an initial model.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation
  • agencies.yml

How has this been tested?

I had created and tested an example model, but we don't actually want/need it. Pasting the source code here for posterity; it ran on our custom image provided to Dataproc.

Screenshots (optional)

import geopandas as gpd  # noqa: F401
import pandas as pd  # noqa: F401
import pyspark.sql.functions as F
import pyspark.sql.types as T
from shapely import wkt
from shapely.geometry import LineString


@F.udf(returnType=T.FloatType())
def point_array_to_linestring(arr):
    if len(arr) == 1:
        return 0
    return LineString([wkt.loads(p) for p in arr]).length


def model(dbt, session):
    dbt.config(materialized="incremental", packages=["pandas", "geopandas"])

    df = dbt.ref("dim_shapes_arrays_wkt")

    if dbt.is_incremental and session.catalog.tableExists(repr(dbt.this)):
        # only new rows compared to max in current table
        max_from_this = f"select max(_feed_valid_from) from `{dbt.this}`"
        df = df.filter(df._feed_valid_from > session.sql(max_from_this).collect()[0][0])

    df = df.withColumn(
        "length_meters", point_array_to_linestring(F.col("pt_array_wkt"))
    )

    return df

@atvaccaro atvaccaro self-assigned this Mar 3, 2023
@atvaccaro atvaccaro force-pushed the dbt-python-models branch 2 times, most recently from a07e455 to e817268 Compare March 28, 2023 20:59
@github-actions
Copy link

github-actions bot commented Mar 28, 2023

Warehouse report 📦

New models 🌱

mart.gtfs.dim_shapes_arrays_enriched
mart.gtfs.dim_shapes_arrays_wkt

Changed models 🔀

mart.gtfs_quality.fct_daily_organization_combined_guideline_checks
mart.gtfs_quality.fct_daily_service_combined_guideline_checks
intermediate.gtfs_quality.guidelines_checks.int_gtfs_quality__modification_date_present_rt

DAG

@atvaccaro atvaccaro changed the title test python model! test Python models via serverless Dataproc Mar 31, 2023
@atvaccaro atvaccaro force-pushed the dbt-python-models branch from 1c52ba8 to 802ea8d Compare May 31, 2023 20:57
@atvaccaro atvaccaro changed the title test Python models via serverless Dataproc Enable dbt python models via serverless Dataproc and create example model May 31, 2023
@atvaccaro atvaccaro force-pushed the dbt-python-models branch from 802ea8d to 892f2f1 Compare May 31, 2023 21:18
@atvaccaro atvaccaro added do-not-merge Do not merge, even if approved and removed blocked labels May 31, 2023
@atvaccaro atvaccaro force-pushed the dbt-python-models branch from 7c575da to ef4047d Compare May 31, 2023 21:23
@atvaccaro atvaccaro removed the do-not-merge Do not merge, even if approved label May 31, 2023
@atvaccaro atvaccaro assigned atvaccaro and unassigned atvaccaro May 31, 2023
@atvaccaro atvaccaro marked this pull request as ready for review May 31, 2023 21:27
Copy link
Contributor

@lauriemerrell lauriemerrell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still waiting on dbt-labs/dbt-bigquery#636 but we can fork if necessary

Is this actually a blocker to merge?

Would it be useful to add some documentation about this functionality to the warehouse README? Are there any links we could add about where to check if you need to troubleshoot any of this setup?

warehouse/.gitignore Show resolved Hide resolved
@atvaccaro
Copy link
Contributor Author

Is this actually a blocker to merge?

No, but developer experience will be poor; I gave this caveat to tiffany and eric in our tech spec meeting yesterday.

I've added some other documentation!

@atvaccaro atvaccaro requested a review from lauriemerrell June 1, 2023 15:23
warehouse/README.md Outdated Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants