🎉 Refactor Normalization docker images and upgrade to use dbt 0.21.0 (airbytehq#6959)

* Split normalization docker images for some connectors with specific dependencies

* Regenerate (airbytehq#7003)
ChristopheDuong authored and schlattk committed Jan 4, 2022
1 parent ad5f098 commit d67fae9
Showing 1,225 changed files with 8,729 additions and 8,273 deletions.
3 changes: 3 additions & 0 deletions airbyte-integrations/bases/base-normalization/.dockerignore
@@ -5,3 +5,6 @@
 !setup.py
 !normalization
 !dbt-project-template
+!dbt-project-template-mssql
+!dbt-project-template-mysql
+!dbt-project-template-oracle
39 changes: 1 addition & 38 deletions airbyte-integrations/bases/base-normalization/Dockerfile
@@ -1,36 +1,4 @@
-FROM fishtownanalytics/dbt:0.19.0
-
-USER root
-WORKDIR /tmp
-RUN apt-get update && apt-get install -y \
-    wget \
-    curl \
-    unzip \
-    libaio-dev \
-    libaio1 \
-    gnupg \
-    gnupg1 \
-    gnupg2
-
-# Install MS SQL Server dependencies
-RUN curl https://packages.microsoft.com/keys/microsoft.asc | apt-key add -
-RUN curl https://packages.microsoft.com/config/debian/10/prod.list > /etc/apt/sources.list.d/mssql-release.list
-RUN apt-get update && ACCEPT_EULA=Y apt-get install -y \
-    libgssapi-krb5-2 \
-    unixodbc-dev \
-    msodbcsql17 \
-    mssql-tools
-ENV PATH=$PATH:/opt/mssql-tools/bin
-
-# Install Oracle dependencies
-RUN mkdir -p /opt/oracle
-RUN wget https://download.oracle.com/otn_software/linux/instantclient/19600/instantclient-basic-linux.x64-19.6.0.0.0dbru.zip
-RUN unzip instantclient-basic-linux.x64-19.6.0.0.0dbru.zip -d /opt/oracle
-ENV ORACLE_HOME /opt/oracle/instantclient_19_6
-ENV LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$ORACLE_HOME
-ENV TNS_ADMIN /opt/oracle/instantclient_19_6/network/admin
-RUN pip install cx_Oracle
-
+FROM fishtownanalytics/dbt:0.21.0
 COPY --from=airbyte/base-airbyte-protocol-python:0.1.1 /airbyte /airbyte
 
 # Install SSH Tunneling dependencies
@@ -50,10 +18,6 @@ RUN pip install .

 WORKDIR /airbyte/normalization_code
 RUN pip install .
-RUN pip install dbt-oracle==0.4.3
-RUN pip install git+https://github.com/dbeatty10/dbt-mysql@96655ea9f7fca7be90c9112ce8ffbb5aac1d3716#egg=dbt-mysql
-RUN pip install dbt-sqlserver==0.19.3
-
 
 WORKDIR /airbyte/normalization_code/dbt-template/
 # Download external dbt dependencies
@@ -63,5 +27,4 @@ WORKDIR /airbyte
 ENV AIRBYTE_ENTRYPOINT "/airbyte/entrypoint.sh"
 ENTRYPOINT ["/airbyte/entrypoint.sh"]
-
 LABEL io.airbyte.version=0.1.52
 LABEL io.airbyte.name=airbyte/normalization
14 changes: 14 additions & 0 deletions airbyte-integrations/bases/base-normalization/README.md
@@ -108,13 +108,27 @@ or can also be invoked on github, thanks to the slash commands posted as comments

/test connector=bases/base-normalization

You can restrict the tests to a subset of destinations by specifying a comma-separated list of destinations.
For example, if you are working on a change to normalization for Postgres, run with Gradle:

NORMALIZATION_TEST_TARGET=postgres ./gradlew :airbyte-integrations:bases:base-normalization:integrationTest

or directly with pytest:

NORMALIZATION_TEST_TARGET=postgres pytest airbyte-integrations/bases/base-normalization/integration_tests
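
To test several destinations at once, pass a comma-separated list instead; a sketch assuming valid credentials for both destinations are present in `secrets/`:

NORMALIZATION_TEST_TARGET=postgres,snowflake pytest airbyte-integrations/bases/base-normalization/integration_tests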

Note that these tests connect to and process data on real data warehouse destinations.
Valid credentials files must therefore be placed in the `secrets/` folder (not included in the
git repository) for them to run.

This is usually done automatically by the CI via the `tools/bin/ci_credentials.sh` script, or you can
reuse the `destination_config.json` passed to destination connectors.
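
For instance, injecting credentials manually could look like the following sketch (the `config.json` file name is an assumption; check what the integration tests expect for your destination):

mkdir -p airbyte-integrations/bases/base-normalization/secrets   # folder is not tracked by git
cp /path/to/destination_config.json airbyte-integrations/bases/base-normalization/secrets/config.json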

As normalization supports more and more destinations, the tests rely on a growing number of destination
docker images. As a result, Docker's garbage collector may be triggered to wipe "unused" docker images while
the normalization integration tests are running. If you encounter errors about a connector's docker image not
being present locally (even though it was built beforehand), increase the image storage size of your docker
engine ("defaultKeepStorage" on Mac, for example).
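
On Docker Desktop for Mac, for example, this setting lives under Preferences > Docker Engine; a sketch of the relevant excerpt of that JSON configuration, where the 48GB figure is only an illustrative value, not a recommendation:

"builder": { "gc": { "defaultKeepStorage": "48GB", "enabled": true } }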

### Integration Tests Definitions for test_ephemeral.py:
The tests here focus on benchmarking the "ephemeral" materialization mode of dbt. Depending on the number of
columns in a catalog, this may throw exceptions and fail. These tests ensure that we support a reasonable number of columns in destination tables.
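
These benchmarks can be run in isolation with the usual pytest file selection, assuming credentials are set up as described above:

pytest airbyte-integrations/bases/base-normalization/integration_tests/test_ephemeral.py
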
48 changes: 44 additions & 4 deletions airbyte-integrations/bases/base-normalization/build.gradle
@@ -1,3 +1,5 @@
+import java.nio.file.Paths
+
 plugins {
     id 'airbyte-docker'
     id 'airbyte-python'
@@ -27,13 +29,52 @@ task checkSshScriptCopy(type: Task, dependsOn: copySshScript) {
   }
 }
 
-test.dependsOn checkSshScriptCopy
-assemble.dependsOn checkSshScriptCopy
+airbyteDocker.dependsOn(checkSshScriptCopy)
+assemble.dependsOn(checkSshScriptCopy)
+test.dependsOn(checkSshScriptCopy)
 
 installReqs.dependsOn(":airbyte-integrations:bases:airbyte-protocol:installReqs")
 integrationTest.dependsOn(build)
 
-task("customIntegrationTestPython", type: PythonTask, dependsOn: installTestReqs){
+
+static def getDockerfile(String customConnector) {
+    return "${customConnector}.Dockerfile"
+}
+
+static def getDockerImageName(String customConnector) {
+    return "airbyte/normalization-${customConnector}"
+}
+
+static def getImageNameWithTag(String customConnector) {
+    return "${getDockerImageName(customConnector)}:dev"
+}
+
+
+def buildAirbyteDocker(String customConnector) {
+    def baseCommand = ['docker', 'build', '.', '-f', getDockerfile(customConnector), '-t', getImageNameWithTag(customConnector)]
+    return {
+        commandLine baseCommand
+    }
+}
+
+task airbyteDockerMSSql(type: Exec, dependsOn: checkSshScriptCopy) {
+    configure buildAirbyteDocker('mssql')
+    dependsOn assemble
+}
+task airbyteDockerMySql(type: Exec, dependsOn: checkSshScriptCopy) {
+    configure buildAirbyteDocker('mysql')
+    dependsOn assemble
+}
+task airbyteDockerOracle(type: Exec, dependsOn: checkSshScriptCopy) {
+    configure buildAirbyteDocker('oracle')
+    dependsOn assemble
+}
+
+airbyteDocker.dependsOn(airbyteDockerMSSql)
+airbyteDocker.dependsOn(airbyteDockerMySql)
+airbyteDocker.dependsOn(airbyteDockerOracle)
+
+task("customIntegrationTestPython", type: PythonTask, dependsOn: installTestReqs) {
     module = "pytest"
     command = "-s integration_tests"
 
@@ -45,7 +86,6 @@ task("customIntegrationTestPython", type: PythonTask, dependsOn: installTestReqs
     dependsOn ':airbyte-integrations:connectors:destination-snowflake:airbyteDocker'
     dependsOn ':airbyte-integrations:connectors:destination-oracle:airbyteDocker'
     dependsOn ':airbyte-integrations:connectors:destination-mssql:airbyteDocker'
-
 }
 
 integrationTest.dependsOn("customIntegrationTestPython")
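
For reference, each of these new Exec tasks runs the plain `docker build` command assembled by `buildAirbyteDocker`, so building the MSSQL-specific image by hand from `airbyte-integrations/bases/base-normalization/` amounts to the following sketch derived from the task definitions above:

docker build . -f mssql.Dockerfile -t airbyte/normalization-mssql:dev

or, through Gradle:

./gradlew :airbyte-integrations:bases:base-normalization:airbyteDockerMSSql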
58 changes: 58 additions & 0 deletions airbyte-integrations/bases/base-normalization/dbt-project-template-mssql/dbt_project.yml
@@ -0,0 +1,58 @@
# This file is necessary to install dbt-utils with dbt deps
# the content will be overwritten by the transform function

# Name your package! Package names should contain only lowercase characters
# and underscores. A good package name should reflect your organization's
# name or the intended use of these models
name: 'airbyte_utils'
version: '1.0'
config-version: 2

# This setting configures which "profile" dbt uses for this project. Profiles contain
# database connection information, and should be configured in the ~/.dbt/profiles.yml file
profile: 'normalize'

# These configurations specify where dbt should look for different types of files.
# The `source-paths` config, for example, states that source models can be found
# in the "models/" directory. You probably won't need to change these!
source-paths: ["models"]
docs-paths: ["docs"]
analysis-paths: ["analysis"]
test-paths: ["tests"]
data-paths: ["data"]
macro-paths: ["macros"]

target-path: "../build"  # directory which will store compiled SQL files
log-path: "../logs"  # directory which will store DBT logs
modules-path: "/tmp/dbt_modules"  # directory which will store external DBT dependencies

clean-targets:  # directories to be removed by `dbt clean`
  - "build"
  - "dbt_modules"

quoting:
  database: true
  # Temporarily disabling the behavior of the ExtendedNameTransformer on table/schema names, see (issue #1785)
  # all schemas should be unquoted
  schema: false
  identifier: true

# You can define configurations for models in the `source-paths` directory here.
# Using these configurations, you can enable or disable models, change how they
# are materialized, and more!
models:
  airbyte_utils:
    generated:
      airbyte_ctes:
        +tags: airbyte_internal_cte
        +materialized: ephemeral
      airbyte_views:
        +tags: airbyte_internal_views
        +materialized: view
      airbyte_tables:
        +tags: normalized_tables
        +materialized: table
        +materialized: table

vars:
  dbt_utils_dispatch_list: ['airbyte_utils']
5 changes: 5 additions & 0 deletions airbyte-integrations/bases/base-normalization/dbt-project-template-mssql/packages.yml
@@ -0,0 +1,5 @@
# add dependencies. these will get pulled during the `dbt deps` process.

packages:
- git: "https://github.com/fishtown-analytics/dbt-utils.git"
revision: 0.6.4
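
These pinned dependencies are pulled by `dbt deps`; to fetch them manually into one of the new template directories, something like the following should work (`--project-dir` is a standard dbt flag, and packages land in the `modules-path` configured above):

dbt deps --project-dir airbyte-integrations/bases/base-normalization/dbt-project-template-mssql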
58 changes: 58 additions & 0 deletions airbyte-integrations/bases/base-normalization/dbt-project-template-mysql/dbt_project.yml
@@ -0,0 +1,58 @@
# This file is necessary to install dbt-utils with dbt deps
# the content will be overwritten by the transform function

# Name your package! Package names should contain only lowercase characters
# and underscores. A good package name should reflect your organization's
# name or the intended use of these models
name: 'airbyte_utils'
version: '1.0'
config-version: 2

# This setting configures which "profile" dbt uses for this project. Profiles contain
# database connection information, and should be configured in the ~/.dbt/profiles.yml file
profile: 'normalize'

# These configurations specify where dbt should look for different types of files.
# The `source-paths` config, for example, states that source models can be found
# in the "models/" directory. You probably won't need to change these!
source-paths: ["models"]
docs-paths: ["docs"]
analysis-paths: ["analysis"]
test-paths: ["tests"]
data-paths: ["data"]
macro-paths: ["macros"]

target-path: "../build"  # directory which will store compiled SQL files
log-path: "../logs"  # directory which will store DBT logs
modules-path: "/tmp/dbt_modules"  # directory which will store external DBT dependencies

clean-targets:  # directories to be removed by `dbt clean`
  - "build"
  - "dbt_modules"

quoting:
  database: true
  # Temporarily disabling the behavior of the ExtendedNameTransformer on table/schema names, see (issue #1785)
  # all schemas should be unquoted
  schema: false
  identifier: true

# You can define configurations for models in the `source-paths` directory here.
# Using these configurations, you can enable or disable models, change how they
# are materialized, and more!
models:
  airbyte_utils:
    generated:
      airbyte_ctes:
        +tags: airbyte_internal_cte
        +materialized: ephemeral
      airbyte_views:
        +tags: airbyte_internal_views
        +materialized: view
      airbyte_tables:
        +tags: normalized_tables
        +materialized: table
        +materialized: table

vars:
  dbt_utils_dispatch_list: ['airbyte_utils']
5 changes: 5 additions & 0 deletions airbyte-integrations/bases/base-normalization/dbt-project-template-mysql/packages.yml
@@ -0,0 +1,5 @@
# add dependencies. these will get pulled during the `dbt deps` process.

packages:
- git: "https://github.com/fishtown-analytics/dbt-utils.git"
revision: 0.6.4
56 changes: 56 additions & 0 deletions airbyte-integrations/bases/base-normalization/dbt-project-template-oracle/dbt_project.yml
@@ -0,0 +1,56 @@
# This file is necessary to install dbt-utils with dbt deps
# the content will be overwritten by the transform function

# Name your package! Package names should contain only lowercase characters
# and underscores. A good package name should reflect your organization's
# name or the intended use of these models
name: 'airbyte_utils'
version: '1.0'
config-version: 2

# This setting configures which "profile" dbt uses for this project. Profiles contain
# database connection information, and should be configured in the ~/.dbt/profiles.yml file
profile: 'normalize'

# These configurations specify where dbt should look for different types of files.
# The `source-paths` config, for example, states that source models can be found
# in the "models/" directory. You probably won't need to change these!
source-paths: ["models"]
docs-paths: ["docs"]
analysis-paths: ["analysis"]
test-paths: ["tests"]
data-paths: ["data"]
macro-paths: ["macros"]

target-path: "../build"  # directory which will store compiled SQL files
log-path: "../logs"  # directory which will store DBT logs
modules-path: "/tmp/dbt_modules"  # directory which will store external DBT dependencies

clean-targets:  # directories to be removed by `dbt clean`
  - "build"
  - "dbt_modules"

quoting:
  database: false
  schema: false
  identifier: false

# You can define configurations for models in the `source-paths` directory here.
# Using these configurations, you can enable or disable models, change how they
# are materialized, and more!
models:
  airbyte_utils:
    generated:
      airbyte_ctes:
        +tags: airbyte_internal_cte
        +materialized: ephemeral
      airbyte_views:
        +tags: airbyte_internal_views
        +materialized: view
      airbyte_tables:
        +tags: normalized_tables
        +materialized: table
        +materialized: table

vars:
  dbt_utils_dispatch_list: ['airbyte_utils']
5 changes: 5 additions & 0 deletions airbyte-integrations/bases/base-normalization/dbt-project-template-oracle/packages.yml
@@ -0,0 +1,5 @@
# add dependencies. these will get pulled during the `dbt deps` process.

packages:
- git: "https://github.com/fishtown-analytics/dbt-utils.git"
revision: 0.6.4
5 changes: 3 additions & 2 deletions airbyte-integrations/bases/base-normalization/dbt-project-template/dbt_project.yml
@@ -54,5 +54,6 @@ models:
   +materialized: table
   +materialized: table
 
-vars:
-  dbt_utils_dispatch_list: ['airbyte_utils']
+dispatch:
+  - macro_namespace: dbt_utils
+    search_order: ['airbyte_utils', 'dbt_utils']
airbyte-integrations/bases/base-normalization/dbt-project-template/macros/cross_db_utils/concat.sql
@@ -1,19 +1,16 @@
 {#
     Overriding the following macro from dbt-utils:
     https://github.com/fishtown-analytics/dbt-utils/blob/0.6.2/macros/cross_db_utils/concat.sql
-    To implement our own version of concat
-    Because on postgres, we cannot pass more than 100 arguments to a function
-    This is necessary until: https://github.com/fishtown-analytics/dbt-utils/blob/dev/0.7.0/macros/cross_db_utils/concat.sql
-    is released.
+    concat in dbt-utils 0.6.4 used to work fine for BigQuery, but the new implementation in 0.7.3 is less scalable (it cannot handle too many columns).
+    Therefore, we revert the implementation here and add versions for missing destinations.
 #}
 
 {% macro concat(fields) -%}
-    {{ adapter.dispatch('concat', packages = ['airbyte_utils', 'dbt_utils'])(fields) }}
+    {{ adapter.dispatch('concat')(fields) }}
 {%- endmacro %}
 
-{% macro postgres__concat(fields) %}
-    {{ dbt_utils.alternative_concat(fields) }}
-{% endmacro %}
+{% macro bigquery__concat(fields) -%}
+    {#-- concat() in BigQuery SQL scales better with the number of columns than the '||' operator --#}
+    concat({{ fields|join(', ') }})
+{%- endmacro %}
 
 {% macro sqlserver__concat(fields) -%}
     {#-- CONCAT() in SQL SERVER accepts from 2 to 254 arguments, we use batches for the main concat, to overcome the limit. --#}
@@ -0,0 +1,8 @@
{#
    Drop schemas to clean up the destination database
#}
{% macro drop_schemas(schemas) %}
    {% for schema in schemas %}
        drop schema if exists {{ schema }} cascade;
    {% endfor %}
{% endmacro %}
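
A macro like this is typically invoked through dbt's `run-operation` command; a hedged usage sketch, with placeholder schema names:

dbt run-operation drop_schemas --args '{schemas: [test_schema_one, test_schema_two]}'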
