🎉 Refactor Normalization docker images and upgrade to use dbt 0.21.0 #6959

Merged
merged 39 commits into from
Oct 14, 2021
Changes from 29 commits
9709f14
Split normalization docker images for some connectors with specifics …
ChristopheDuong Oct 8, 2021
bd90eee
Clean up schemas
ChristopheDuong Oct 11, 2021
6b7ff77
Use global version for normalization
ChristopheDuong Oct 11, 2021
3aa02dc
Trying to solve oracle tests
ChristopheDuong Oct 11, 2021
fb7c12f
Use EnvConfigs in main()
ChristopheDuong Oct 11, 2021
16addd9
test_ephemeral should use get_test_targets
ChristopheDuong Oct 11, 2021
f07bb4b
Postgres always in test_ephemeral targets
ChristopheDuong Oct 11, 2021
f5bcc85
format code
ChristopheDuong Oct 11, 2021
c7d3d1a
Tweak gradle dependencies
ChristopheDuong Oct 11, 2021
8d4a9bc
Tweak settings.gradle
ChristopheDuong Oct 12, 2021
6dd5c48
Fix test oracle
ChristopheDuong Oct 12, 2021
e4090ea
tweak settings.gradle
ChristopheDuong Oct 12, 2021
b1b7ffe
Merge remote-tracking branch 'origin/master' into split-normalization
ChristopheDuong Oct 12, 2021
672ace9
format code
ChristopheDuong Oct 12, 2021
b2c7f70
Fix bigquery ephemeral test
ChristopheDuong Oct 12, 2021
f4f0cd2
Merge remote-tracking branch 'origin/master' into split-normalization
ChristopheDuong Oct 12, 2021
c6d3de3
Format code
ChristopheDuong Oct 12, 2021
f13a404
Tweak comments
ChristopheDuong Oct 12, 2021
be595cc
Fix tests
ChristopheDuong Oct 12, 2021
ae0a511
Fix integration tests
ChristopheDuong Oct 13, 2021
259afe9
Merge remote-tracking branch 'origin/master' into split-normalization
ChristopheDuong Oct 13, 2021
ccad8c3
tweak build
ChristopheDuong Oct 13, 2021
e912947
Re-enable test_check_row_count
ChristopheDuong Oct 13, 2021
1ad9ca4
add missing folder
ChristopheDuong Oct 13, 2021
84084c8
rename test file
ChristopheDuong Oct 13, 2021
5e9ded1
Spotless settings
ChristopheDuong Oct 13, 2021
ec677cc
Fix snowflake uppercse test
ChristopheDuong Oct 13, 2021
96e1333
Fix oracle tests
ChristopheDuong Oct 13, 2021
05c12e1
Apply suggestions from code review
ChristopheDuong Oct 13, 2021
e028ed7
Merge remote-tracking branch 'origin/master' into split-normalization
ChristopheDuong Oct 14, 2021
c461a8f
Add env variables to test for using external db
ChristopheDuong Oct 14, 2021
e48f17a
Split integration tests between simple and nested
ChristopheDuong Oct 14, 2021
70afb15
Merge remote-tracking branch 'origin/master' into split-normalization
ChristopheDuong Oct 14, 2021
c281f1f
code format
ChristopheDuong Oct 14, 2021
9f0bcc8
Add column with quotes in simple streams
ChristopheDuong Oct 14, 2021
18fb503
format code
ChristopheDuong Oct 14, 2021
cb9889c
Fix tests
ChristopheDuong Oct 14, 2021
881cb68
Cleanup dir before running tests
ChristopheDuong Oct 14, 2021
2129ac8
Regenerate (#7003)
ChristopheDuong Oct 14, 2021
3 changes: 3 additions & 0 deletions airbyte-integrations/bases/base-normalization/.dockerignore
Original file line number Diff line number Diff line change
@@ -5,3 +5,6 @@
!setup.py
!normalization
!dbt-project-template
!dbt-project-template-mssql
!dbt-project-template-mysql
!dbt-project-template-oracle
39 changes: 1 addition & 38 deletions airbyte-integrations/bases/base-normalization/Dockerfile
Original file line number Diff line number Diff line change
@@ -1,36 +1,4 @@
FROM fishtownanalytics/dbt:0.19.0

USER root
WORKDIR /tmp
RUN apt-get update && apt-get install -y \
wget \
curl \
unzip \
libaio-dev \
libaio1 \
gnupg \
gnupg1 \
gnupg2

# Install MS SQL Server dependencies
RUN curl https://packages.microsoft.com/keys/microsoft.asc | apt-key add -
RUN curl https://packages.microsoft.com/config/debian/10/prod.list > /etc/apt/sources.list.d/mssql-release.list
RUN apt-get update && ACCEPT_EULA=Y apt-get install -y \
libgssapi-krb5-2 \
unixodbc-dev \
msodbcsql17 \
mssql-tools
ENV PATH=$PATH:/opt/mssql-tools/bin

# Install Oracle dependencies
RUN mkdir -p /opt/oracle
RUN wget https://download.oracle.com/otn_software/linux/instantclient/19600/instantclient-basic-linux.x64-19.6.0.0.0dbru.zip
RUN unzip instantclient-basic-linux.x64-19.6.0.0.0dbru.zip -d /opt/oracle
ENV ORACLE_HOME /opt/oracle/instantclient_19_6
ENV LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$ORACLE_HOME
ENV TNS_ADMIN /opt/oracle/instantclient_19_6/network/admin
RUN pip install cx_Oracle

FROM fishtownanalytics/dbt:0.21.0
COPY --from=airbyte/base-airbyte-protocol-python:0.1.1 /airbyte /airbyte

# Install SSH Tunneling dependencies
@@ -50,10 +18,6 @@ RUN pip install .

WORKDIR /airbyte/normalization_code
RUN pip install .
RUN pip install dbt-oracle==0.4.3
RUN pip install git+https://github.com/dbeatty10/dbt-mysql@96655ea9f7fca7be90c9112ce8ffbb5aac1d3716#egg=dbt-mysql
RUN pip install dbt-sqlserver==0.19.3


WORKDIR /airbyte/normalization_code/dbt-template/
# Download external dbt dependencies
@@ -63,5 +27,4 @@ WORKDIR /airbyte
ENV AIRBYTE_ENTRYPOINT "/airbyte/entrypoint.sh"
ENTRYPOINT ["/airbyte/entrypoint.sh"]

LABEL io.airbyte.version=0.1.52
LABEL io.airbyte.name=airbyte/normalization
14 changes: 14 additions & 0 deletions airbyte-integrations/bases/base-normalization/README.md
Original file line number Diff line number Diff line change
@@ -108,13 +108,27 @@ or can also be invoked on github, thanks to the slash commands posted as comment

/test connector=bases/base-normalization

You can restrict the tests to a subset of destinations by specifying a comma-separated list of destinations.
For example, if you are working on a change to normalization for Postgres, with Gradle:

NORMALIZATION_TEST_TARGET=postgres ./gradlew :airbyte-integrations:bases:base-normalization:integrationTest

or directly with pytest:

NORMALIZATION_TEST_TARGET=postgres pytest airbyte-integrations/bases/base-normalization/integration_tests
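
A minimal sketch of how a test helper might turn `NORMALIZATION_TEST_TARGET` into a list of destinations. The function name `parse_test_targets` and the `DEFAULT_TARGETS` list are hypothetical and are not taken from the repository; the only thing assumed from the text above is that the variable holds a comma-separated list.

```python
import os
from typing import List, Optional

# Hypothetical fallback used when the variable is unset or empty.
DEFAULT_TARGETS = ["postgres", "bigquery", "snowflake"]


def parse_test_targets(raw: Optional[str] = None) -> List[str]:
    """Return the destinations to test, parsed from a comma-separated string."""
    if raw is None:
        raw = os.environ.get("NORMALIZATION_TEST_TARGET", "")
    # Ignore surrounding whitespace, letter case, and empty entries.
    targets = [item.strip().lower() for item in raw.split(",") if item.strip()]
    return targets or list(DEFAULT_TARGETS)
```

Passing the string explicitly keeps the helper easy to unit test; calling it without arguments reads the environment instead.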

Note that these tests connect to real data warehouse destinations and process data on them.
Therefore, valid credentials files must be present in the `secrets/` folder for the tests to run
(they are not included in the git repository).

CI usually does this automatically through the `tools/bin/ci_credentials.sh` script; alternatively, you can
reuse the `destination_config.json` passed to destination connectors.

As normalization supports more and more destinations, the tests rely on an increasing number of destination images.
As a result, the Docker garbage collector may be triggered and wipe "unused" Docker images while the
integration tests for normalization are running. If you encounter errors about a connector's Docker image not being
present locally (even though it was built beforehand), increase the image storage size of your Docker engine (for example, "defaultKeepStorage" on macOS).

### Integration Tests Definitions for test_ephemeral.py:
The tests here focus on benchmarking the "ephemeral" materialization mode of dbt. Depending on the number of
columns in a catalog, this may throw exceptions and fail. These tests ensure that we support a reasonable number of columns in destination tables.
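
A rough sketch of how a wide schema could be generated to exercise the column-count limit described above. The helper `generate_wide_schema`, the column naming, and the nullable string type are assumptions made for illustration; the actual `test_ephemeral.py` may build its catalog differently.

```python
from typing import Any, Dict


def generate_wide_schema(num_columns: int) -> Dict[str, Any]:
    """Build a JSON-schema-like object with `num_columns` nullable string columns."""
    if num_columns < 1:
        raise ValueError("num_columns must be at least 1")
    properties = {
        f"column_{index}": {"type": ["null", "string"]}
        for index in range(num_columns)
    }
    return {"type": "object", "properties": properties}
```

Running the same test with increasing values of `num_columns` is one way to find the point at which a destination starts rejecting the generated SQL.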
48 changes: 44 additions & 4 deletions airbyte-integrations/bases/base-normalization/build.gradle
Original file line number Diff line number Diff line change
@@ -1,3 +1,5 @@
import java.nio.file.Paths

plugins {
id 'airbyte-docker'
id 'airbyte-python'
@@ -27,13 +29,52 @@ task checkSshScriptCopy(type: Task, dependsOn: copySshScript) {
}
}

test.dependsOn checkSshScriptCopy
assemble.dependsOn checkSshScriptCopy
airbyteDocker.dependsOn(checkSshScriptCopy)
assemble.dependsOn(checkSshScriptCopy)
test.dependsOn(checkSshScriptCopy)

installReqs.dependsOn(":airbyte-integrations:bases:airbyte-protocol:installReqs")
integrationTest.dependsOn(build)

task("customIntegrationTestPython", type: PythonTask, dependsOn: installTestReqs){

static def getDockerfile(String customConnector) {
return "${customConnector}.Dockerfile"
}

static def getDockerImageName(String customConnector) {
return "airbyte/normalization-${customConnector}"
}

static def getImageNameWithTag(String customConnector) {
return "${getDockerImageName(customConnector)}:dev"
}


def buildAirbyteDocker(String customConnector) {
def baseCommand = ['docker', 'build', '.', '-f', getDockerfile(customConnector), '-t', getImageNameWithTag(customConnector)]
return {
commandLine baseCommand
}
}

task airbyteDockerMSSql(type: Exec, dependsOn: checkSshScriptCopy) {
configure buildAirbyteDocker('mssql')
dependsOn assemble
}
task airbyteDockerMySql(type: Exec, dependsOn: checkSshScriptCopy) {
configure buildAirbyteDocker('mysql')
dependsOn assemble
}
task airbyteDockerOracle(type: Exec, dependsOn: checkSshScriptCopy) {
configure buildAirbyteDocker('oracle')
dependsOn assemble
}

airbyteDocker.dependsOn(airbyteDockerMSSql)
airbyteDocker.dependsOn(airbyteDockerMySql)
airbyteDocker.dependsOn(airbyteDockerOracle)

task("customIntegrationTestPython", type: PythonTask, dependsOn: installTestReqs) {
module = "pytest"
command = "-s integration_tests"

@@ -45,7 +86,6 @@ task("customIntegrationTestPython", type: PythonTask, dependsOn: installTestReqs
dependsOn ':airbyte-integrations:connectors:destination-snowflake:airbyteDocker'
dependsOn ':airbyte-integrations:connectors:destination-oracle:airbyteDocker'
dependsOn ':airbyte-integrations:connectors:destination-mssql:airbyteDocker'

}

integrationTest.dependsOn("customIntegrationTestPython")
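
For readers less familiar with Gradle, the naming helpers in this build file can be mirrored in a few lines of Python. This is only an illustration of the convention visible in the diff (`<connector>.Dockerfile` and `airbyte/normalization-<connector>:dev`); the Python function names are hypothetical and nothing here is part of the build.

```python
from typing import List


def get_dockerfile(custom_connector: str) -> str:
    """Dockerfile used for a connector-specific normalization image."""
    return f"{custom_connector}.Dockerfile"


def get_image_name_with_tag(custom_connector: str) -> str:
    """Image name with the `dev` tag used for local builds."""
    return f"airbyte/normalization-{custom_connector}:dev"


def docker_build_command(custom_connector: str) -> List[str]:
    """Argument list equivalent to the `commandLine` of the Exec tasks."""
    return [
        "docker", "build", ".",
        "-f", get_dockerfile(custom_connector),
        "-t", get_image_name_with_tag(custom_connector),
    ]
```

For `mssql`, `mysql`, and `oracle`, this yields the three extra images that `airbyteDocker` now depends on.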
Original file line number Diff line number Diff line change
@@ -0,0 +1,58 @@
# This file is necessary to install dbt-utils with dbt deps
Review comment (Contributor): out of curiosity, is there no way to DRY the dbt_project.yml files?

# the content will be overwritten by the transform function

# Name your package! Package names should contain only lowercase characters
# and underscores. A good package name should reflect your organization's
# name or the intended use of these models
name: 'airbyte_utils'
version: '1.0'
config-version: 2

# This setting configures which "profile" dbt uses for this project. Profiles contain
# database connection information, and should be configured in the ~/.dbt/profiles.yml file
profile: 'normalize'

# These configurations specify where dbt should look for different types of files.
# The `source-paths` config, for example, states that source models can be found
# in the "models/" directory. You probably won't need to change these!
source-paths: ["models"]
docs-paths: ["docs"]
analysis-paths: ["analysis"]
test-paths: ["tests"]
data-paths: ["data"]
macro-paths: ["macros"]

target-path: "../build" # directory which will store compiled SQL files
log-path: "../logs" # directory which will store DBT logs
modules-path: "/tmp/dbt_modules" # directory which will store external DBT dependencies

clean-targets: # directories to be removed by `dbt clean`
- "build"
- "dbt_modules"

quoting:
database: true
# Temporarily disabling the behavior of the ExtendedNameTransformer on table/schema names, see (issue #1785)
# all schemas should be unquoted
schema: false
identifier: true

# You can define configurations for models in the `source-paths` directory here.
# Using these configurations, you can enable or disable models, change how they
# are materialized, and more!
models:
airbyte_utils:
generated:
airbyte_ctes:
+tags: airbyte_internal_cte
+materialized: ephemeral
airbyte_views:
+tags: airbyte_internal_views
+materialized: view
airbyte_tables:
+tags: normalized_tables
+materialized: table
+materialized: table

vars:
dbt_utils_dispatch_list: ['airbyte_utils']
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
# add dependencies. these will get pulled during the `dbt deps` process.

packages:
- git: "https://github.com/fishtown-analytics/dbt-utils.git"
revision: 0.6.4
Original file line number Diff line number Diff line change
@@ -0,0 +1,58 @@
# This file is necessary to install dbt-utils with dbt deps
# the content will be overwritten by the transform function

# Name your package! Package names should contain only lowercase characters
# and underscores. A good package name should reflect your organization's
# name or the intended use of these models
name: 'airbyte_utils'
version: '1.0'
config-version: 2

# This setting configures which "profile" dbt uses for this project. Profiles contain
# database connection information, and should be configured in the ~/.dbt/profiles.yml file
profile: 'normalize'

# These configurations specify where dbt should look for different types of files.
# The `source-paths` config, for example, states that source models can be found
# in the "models/" directory. You probably won't need to change these!
source-paths: ["models"]
docs-paths: ["docs"]
analysis-paths: ["analysis"]
test-paths: ["tests"]
data-paths: ["data"]
macro-paths: ["macros"]

target-path: "../build" # directory which will store compiled SQL files
log-path: "../logs" # directory which will store DBT logs
modules-path: "/tmp/dbt_modules" # directory which will store external DBT dependencies

clean-targets: # directories to be removed by `dbt clean`
- "build"
- "dbt_modules"

quoting:
database: true
# Temporarily disabling the behavior of the ExtendedNameTransformer on table/schema names, see (issue #1785)
# all schemas should be unquoted
schema: false
identifier: true

# You can define configurations for models in the `source-paths` directory here.
# Using these configurations, you can enable or disable models, change how they
# are materialized, and more!
models:
airbyte_utils:
generated:
airbyte_ctes:
+tags: airbyte_internal_cte
+materialized: ephemeral
airbyte_views:
+tags: airbyte_internal_views
+materialized: view
airbyte_tables:
+tags: normalized_tables
+materialized: table
+materialized: table

vars:
dbt_utils_dispatch_list: ['airbyte_utils']
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
# add dependencies. these will get pulled during the `dbt deps` process.

packages:
- git: "https://github.com/fishtown-analytics/dbt-utils.git"
revision: 0.6.4
Original file line number Diff line number Diff line change
@@ -0,0 +1,56 @@
# This file is necessary to install dbt-utils with dbt deps
# the content will be overwritten by the transform function

# Name your package! Package names should contain only lowercase characters
# and underscores. A good package name should reflect your organization's
# name or the intended use of these models
name: 'airbyte_utils'
version: '1.0'
config-version: 2

# This setting configures which "profile" dbt uses for this project. Profiles contain
# database connection information, and should be configured in the ~/.dbt/profiles.yml file
profile: 'normalize'

# These configurations specify where dbt should look for different types of files.
# The `source-paths` config, for example, states that source models can be found
# in the "models/" directory. You probably won't need to change these!
source-paths: ["models"]
docs-paths: ["docs"]
analysis-paths: ["analysis"]
test-paths: ["tests"]
data-paths: ["data"]
macro-paths: ["macros"]

target-path: "../build" # directory which will store compiled SQL files
log-path: "../logs" # directory which will store DBT logs
modules-path: "/tmp/dbt_modules" # directory which will store external DBT dependencies

clean-targets: # directories to be removed by `dbt clean`
- "build"
- "dbt_modules"

quoting:
database: false
schema: false
identifier: false

# You can define configurations for models in the `source-paths` directory here.
# Using these configurations, you can enable or disable models, change how they
# are materialized, and more!
models:
airbyte_utils:
generated:
airbyte_ctes:
+tags: airbyte_internal_cte
+materialized: ephemeral
airbyte_views:
+tags: airbyte_internal_views
+materialized: view
airbyte_tables:
+tags: normalized_tables
+materialized: table
+materialized: table

vars:
dbt_utils_dispatch_list: ['airbyte_utils']
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
# add dependencies. these will get pulled during the `dbt deps` process.

packages:
- git: "https://github.com/fishtown-analytics/dbt-utils.git"
revision: 0.6.4
Original file line number Diff line number Diff line change
@@ -54,5 +54,6 @@ models:
+materialized: table
+materialized: table

vars:
dbt_utils_dispatch_list: ['airbyte_utils']
dispatch:
- macro_namespace: dbt_utils
search_order: ['airbyte_utils', 'dbt_utils']
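
The `dispatch` block replaces the `dbt_utils_dispatch_list` variable. The sketch below is a simplified model of how such a search order can be resolved: packages are tried in the configured order, and within each package an adapter-specific macro is preferred over the `default__` one. It is not dbt's code; the function `resolve_dispatch` and the `available` mapping are hypothetical.

```python
from typing import Dict, List, Optional, Set


def resolve_dispatch(
    macro_name: str,
    adapter_prefixes: List[str],
    search_order: List[str],
    available: Dict[str, Set[str]],
) -> Optional[str]:
    """Return 'package.macro' for the first matching implementation, or None."""
    for package in search_order:
        # Adapter-specific implementations win over the default within a package.
        for prefix in adapter_prefixes + ["default"]:
            candidate = f"{prefix}__{macro_name}"
            if candidate in available.get(package, set()):
                return f"{package}.{candidate}"
    return None
```

With `search_order: ['airbyte_utils', 'dbt_utils']`, an override defined in `airbyte_utils` takes precedence, and destinations without an override fall through to the `dbt_utils` implementation.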
Original file line number Diff line number Diff line change
@@ -1,19 +1,16 @@
{#
Overriding the following macro from dbt-utils:
https://github.com/fishtown-analytics/dbt-utils/blob/0.6.2/macros/cross_db_utils/concat.sql
To implement our own version of concat
Because on postgres, we cannot pass more than 100 arguments to a function
This is necessary until: https://github.com/fishtown-analytics/dbt-utils/blob/dev/0.7.0/macros/cross_db_utils/concat.sql
is released.
concat in dbt 0.6.4 used to work fine for bigquery but the new implementation in 0.7.3 is less scalable (cannot handle too many columns)
Therefore, we revert the implementation here and add versions for missing destinations
#}

{% macro concat(fields) -%}
{{ adapter.dispatch('concat', packages = ['airbyte_utils', 'dbt_utils'])(fields) }}
{{ adapter.dispatch('concat')(fields) }}
{%- endmacro %}

{% macro postgres__concat(fields) %}
{{ dbt_utils.alternative_concat(fields) }}
{% endmacro %}
{% macro bigquery__concat(fields) -%}
{#-- concat() in SQL bigquery scales better with number of columns than using the '||' operator --#}
concat({{ fields|join(', ') }})
{%- endmacro %}

{% macro sqlserver__concat(fields) -%}
{#-- CONCAT() in SQL SERVER accepts from 2 to 254 arguments, we use batches for the main concat, to overcome the limit. --#}
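
The body of the SQL Server macro is not shown in this view. As a hedged illustration of the batching idea named in its comment, the Python sketch below groups fields into chunks of at most 254 arguments and concatenates the chunk results, repeating until one call remains. The names `batched_concat` and `_concat_call` are hypothetical, and padding a single argument with an empty string is an assumption made to respect the two-argument minimum.

```python
from typing import List

MAX_CONCAT_ARGS = 254  # upper limit stated in the macro comment


def _concat_call(args: List[str]) -> str:
    """Render one concat(...) call, padding to the two-argument minimum."""
    if len(args) == 1:
        args = args + ["''"]
    return "concat(" + ", ".join(args) + ")"


def batched_concat(fields: List[str], batch_size: int = MAX_CONCAT_ARGS) -> str:
    """Nest concat() calls so that no call receives more than `batch_size` arguments."""
    if batch_size < 2:
        raise ValueError("batch_size must be at least 2")
    if not fields:
        raise ValueError("at least one field is required")
    expressions = list(fields)
    # Collapse the list batch by batch until it fits in a single call.
    while len(expressions) > batch_size:
        expressions = [
            _concat_call(expressions[start:start + batch_size])
            for start in range(0, len(expressions), batch_size)
        ]
    return _concat_call(expressions)
```

For 300 columns this produces two inner calls (254 and 46 arguments) wrapped in one outer call.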
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
{#
Drop schema to clean up the destination database
#}
{% macro drop_schemas(schemas) %}
{% for schema in schemas %}
drop schema if exists {{ schema }} cascade;
{% endfor %}
{% endmacro %}
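
To make the macro's output concrete, here is a small Python equivalent of what it renders. The function name `render_drop_schemas` is hypothetical; like the macro, it interpolates schema names without quoting, so it should only be given trusted names.

```python
from typing import Iterable


def render_drop_schemas(schemas: Iterable[str]) -> str:
    """Render one `drop schema` statement per schema, mirroring the macro's loop."""
    return "\n".join(
        f"drop schema if exists {schema} cascade;" for schema in schemas
    )
```

An empty list renders an empty string, so the cleanup step becomes a no-op when there is nothing to drop.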