- Source config extension for metadata about external file structure
- Adapter macros to create external tables and refresh external table partitions
- Snowflake-specific macros to create, backfill, and refresh snowpipes
- Azure Synapse macros to create/recreate external table
```bash
# iterate through all source nodes, create if missing + refresh if appropriate
$ dbt run-operation stage_external_sources

# iterate through all source nodes, create or replace + refresh if appropriate
$ dbt run-operation stage_external_sources --vars 'ext_full_refresh: true'
```
The macros assume that you have already created an external stage (Snowflake) or external schema (Redshift/Spectrum), and that you have permissions to select from it and create tables in it.
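If you are creating those objects from scratch, the DDL looks roughly like the sketch below. This is illustrative only, not part of the package: the database, schema, stage, bucket, and IAM role names are placeholders to replace with your own, and a private bucket would also need credentials or a storage integration on the Snowflake stage.

```sql
-- Snowflake: external stage that the source's `location` can point to
create stage analytics.snowplow.snowplow_stage   -- placeholder names
    url = 's3://my-bucket/snowplow/'
    file_format = (type = 'json');

-- Redshift/Spectrum: external schema that the external tables will live in
create external schema spectrum
from data catalog
database 'spectrum_db'
iam_role 'arn:aws:iam::123456789012:role/my_spectrum_role'  -- placeholder ARN
create external database if not exists;
```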
The `stage_external_sources` macro accepts a similar node selection syntax to snapshotting source freshness.
```bash
# Stage all Snowplow and Logs external sources:
$ dbt run-operation stage_external_sources --args 'select: snowplow logs'

# Stage a particular external source table:
$ dbt run-operation stage_external_sources --args 'select: snowplow.event'
```
Maybe someday:

```bash
$ dbt source stage-external
$ dbt source stage-external --full-refresh
$ dbt source stage-external --select snowplow.event logs
```
```yml
version: 2

sources:
  - name: snowplow
    tables:
      - name: event
        # NEW: "external" property of source node
        external:
          location:         # S3 file path or Snowflake stage
          file_format:      # Hive specification or Snowflake named format / specification
          row_format:       # Hive specification
          table_properties: # Hive specification

          # Azure Synapse
          data_source: azure_data_source_name
          file_format: azure_file_format
          reject_type: value
          reject_value: 0

          # Snowflake: create an empty table + pipe instead of an external table
          snowpipe:
            auto_ingest:   true
            aws_sns_topic: # AWS
            integration:   # Azure
            copy_options:  "on_error = continue, enforce_length = false" # e.g.

          # Specify a list of file-path partitions.

          # ------ SNOWFLAKE ------
          partitions:
            - name: collector_date
              data_type: date
              expression: to_date(substr(metadata$filename, 8, 10), 'YYYY/MM/DD')

          # ------ REDSHIFT -------
          partitions:
            - name: appId
              data_type: varchar(255)
              vals:         # list of values
                - dev
                - prod
              path_macro: dbt_external_tables.key_value
              # Macro to convert partition value to file path specification.
              # This "helper" macro is defined in the package, but you can use
              # any custom macro that takes keyword arguments 'name' + 'value'
              # and returns the path as a string (see the sketch after this spec)

            # If multiple partitions, order matters for compiling S3 path
            - name: collector_date
              data_type: date
              vals:         # macro w/ keyword args to generate list of values
                macro: dbt.dates_in_range
                args:
                  start_date_str: '2019-08-01'
                  end_date_str: '{{modules.datetime.date.today().strftime("%Y-%m-%d")}}'
                  in_fmt: "%Y-%m-%d"
                  out_fmt: "%Y-%m-%d"
              path_macro: dbt_external_tables.year_month_day

        # Specify ALL column names + datatypes. Column order matters for CSVs.
        # Other file formats require column names to exactly match.
        columns:
          - name: app_id
            data_type: varchar(255)
            description: "Application ID"
          - name: platform
            data_type: varchar(255)
            description: "Platform"
          ...
```
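As the `path_macro` comments above note, any macro that accepts `name` and `value` keyword arguments and returns a string can stand in for the package's helpers. A minimal sketch of such a custom macro, assuming you want Hive-style `name=value` path segments (the file path and macro name here are hypothetical, not part of the package):

```sql
-- macros/hive_style_path.sql  (hypothetical file in your own dbt project)
{% macro hive_style_path(name, value) %}
    {# Render one partition as a 'name=value' path segment, e.g. appId=dev #}
    {{ return(name ~ '=' ~ value) }}
{% endmacro %}
```

Following the `package_name.macro_name` pattern shown above for `dbt_external_tables.key_value`, this would presumably be referenced as `path_macro: your_project_name.hive_style_path`.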
See `sample_sources` for full valid YML config that establishes Snowplow events as a dbt source and stage-ready external table in Snowflake and Spectrum.

See `sample_analysis` for a "dry run" version of the DDL/DML that `stage_external_sources` will run as an operation.
- Redshift (Spectrum)
- Snowflake
- TK: Spark
- Azure Synapse
  - Note: Azure Synapse does not support partitions or table refresh