Refactor Normalization to handle nested Streams in new Catalog API #2044

Merged · 11 commits · Feb 15, 2021

Conversation

@ChristopheDuong (Contributor) commented Feb 11, 2021

What

Closes #886 and related issues

Closes #1426

How

Create tables for nested columns

Pre-merge Checklist

  • Run integration tests
  • Publish Docker images

Recommended reading order

  1. airbyte-integrations/bases/base-normalization/normalization/transform_catalog/transform.py
  2. the rest

@ChristopheDuong changed the base branch from master to chris/api_catalog on February 11, 2021, 17:49.
@ChristopheDuong changed the title from "Chris/nested normalization" to "Refactor Normalization to handle nested Streams in new Catalog API" on Feb 11, 2021.
@@ -57,9 +59,10 @@
SNOWFLAKE
}

- public DefaultNormalizationRunner(final DestinationType destinationType, final ProcessBuilderFactory pbf) {
+ public DefaultNormalizationRunner(final DestinationType destinationType, final ProcessBuilderFactory pbf, boolean useDevVersion) {
Contributor:
do we need to keep this or was it just useful while you were doing manual testing?

Contributor (Author):

It's open for discussion; I'm thinking we can keep it...

The intended behavior is:

  • automatically use normalization:dev images when the destination is used with a dev tag;
  • if the destination uses a numbered version, normalization also uses the declared numbered version.

It's just to avoid switching between numbered/dev image tags while working on a normalization PR (I noticed @sherifnada does that often too) and having to remember to switch it back before merging — roughly the rule sketched below.
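For illustration, a minimal sketch of that tag-resolution rule. This is not the actual runner code (which is Java, in DefaultNormalizationRunner); the function and image names here are hypothetical.

```python
# Hypothetical sketch of the dev-tag pairing rule described above; the real
# logic lives in DefaultNormalizationRunner (Java).
def resolve_normalization_tag(destination_tag: str, declared_version: str) -> str:
    """Pick the normalization image tag to pair with a destination image."""
    if destination_tag == "dev":
        # Working against a locally built destination: use the locally
        # built normalization image as well.
        return "airbyte/normalization:dev"
    # Otherwise pin normalization to its declared, numbered version.
    return f"airbyte/normalization:{declared_version}"
```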

WDYT?

Contributor:

cool. let's work through this feature in a separate PR.


    child_str = child[0:55]  # truncate long names to fit identifier length limits
else:
    child_str = child
for i in range(1, 100):  # probe numeric suffixes to resolve up to 99 name collisions
Contributor:

What is this doing? It's handling naming collisions of child tables, I think? I believe there's a real problem here, but can you help me understand exactly what it is and how it is being solved?

Contributor (Author) @ChristopheDuong commented Feb 12, 2021:

Yes, it's handling name collisions within the currently processed catalog.

In the Stripe connector, for example, the charges stream contains the concept of card multiple times, and depending on where it comes from, it has a different schema:

  • charges/card
  • charges/source/card
  • charges/payment_method_details/card

(Screenshot 2021-02-12 at 11:15:21)

The "exploded" tables are:

  • named after the nested column name (in my example, it's card)
  • If I name them after the "path" to the nested column, it's making super long names that easily overflows the number of character limits allowed and I hit even more naming problems (and potential collisions if truncating)
  • if the name was already used by this stream or another stream in the same catalog, then it just adds an integer i counting allowing up to 100 collisions.
  • Top-level stream names have priority over nested children so they will retain the non-suffixed versions (for example I have charges stream and a nested customer.charges in the customer stream)

Sometimes the colliding tables have the same schema (below, with address) and could be unioned/merged into a single table, but I am leaving that extra step to the user afterward... So if someone wants to merge the different card tables (or the address ones) into a single one, keeping only a subset of common fields or filling the gaps with blanks, then it's up to them.

With this naming, it's also easy to retrieve all the card tables, since they are grouped next to each other alphabetically.

(Screenshot 2021-02-12 at 11:13:16)

Of course, I haven't documented all this yet; a sketch of the suffixing scheme follows.
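For reference, a minimal sketch of the suffixing scheme just described. The 55-character truncation and the 1..99 loop come from the diff above; the helper name and the set-based bookkeeping are hypothetical.

```python
def resolve_table_name(child: str, used_names: set) -> str:
    """Truncate a nested column name and suffix it with an integer on collision."""
    # Truncate long names so identifiers stay under destination length limits.
    child_str = child[0:55] if len(child) > 55 else child
    if child_str not in used_names:
        used_names.add(child_str)
        return child_str
    # Name already taken by a top-level stream or another nested column:
    # probe child_1, child_2, ... allowing up to 99 suffixed collisions.
    for i in range(1, 100):
        candidate = f"{child_str}_{i}"
        if candidate not in used_names:
            used_names.add(candidate)
            return candidate
    raise ValueError(f"too many name collisions for {child}")
```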

Contributor:

Discussed with Chris: he will create an issue for this. We can live with this for now but need to fix it quickly.

Contributor (Author):

Issue created here: #2055

@@ -46,7 +46,7 @@ public void createTableIfNotExists(JdbcDatabase database, String schemaName, Str
final String createTableQuery = String.format(
"CREATE TABLE IF NOT EXISTS %s.%s ( \n"
+ "%s VARCHAR PRIMARY KEY,\n"
+ "\"%s\" VARIANT,\n"
+ "%s VARIANT,\n"
Contributor:

why is this changing?

Contributor (Author) @ChristopheDuong commented Feb 12, 2021:

All destinations now produce the _airbyte_data column without quotes, i.e. as a case-insensitive column.

@@ -29,14 +29,20 @@ clean-targets: # directories to be removed by `dbt clean`

quoting:
  database: true
-  schema: true
+  schema: false
Contributor:

why?

Contributor (Author) @ChristopheDuong commented Feb 12, 2021:

I will add the same comment here:

Temporarily disabling the behavior of the ExtendedNameTransformer, see issue #1785.

Since we don't use extended names for tables and schemas anymore but instead replace all special characters with _, dbt shouldn't use quoting when querying them; otherwise we run into case-sensitivity and quoting issues.

The destination may produce schema/table names that are case-insensitive, while dbt with quoting would query those schemas/tables case-sensitively, resulting in conflicts and exceptions. A sketch of the character replacement is below.
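For illustration, a minimal sketch of the special-character replacement mentioned above; the helper name is hypothetical, and the real logic lives in base-normalization's name transformer classes.

```python
import re

def sanitize_identifier(name: str) -> str:
    """Replace every character outside [a-zA-Z0-9_] with '_', so the
    identifier can be used unquoted (hence case-insensitively) by dbt."""
    return re.sub(r"[^a-zA-Z0-9_]", "_", name)

# e.g. sanitize_identifier("payment_method_details/card")
#   -> "payment_method_details_card"
```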

@cgardens (Contributor):

Exciting that we're going to support nesting!!!!

@sherifnada (Contributor) left a comment:

Left some comments but will take another look tomorrow. I think my biggest takeaway right now is that this class is really complex/inaccessible and we should try to make it simpler. It doesn't necessarily have to happen in this PR since I know we are trying to push this out, but I do think it needs to be one of the top things to work on in this area

@@ -139,6 +101,280 @@ def extract_schema(profiles_yml: dict) -> str:
return profiles_yml["schema"]


def generate_dbt_model(schema: str, output: str, integration_type: str, catalog: dict, json_col: str) -> Dict[str, Set[str]]:
Contributor:

  1. Shouldn't integration type be an enum instead of a string?
  2. Can you add a docstring explaining what each of these params is? At first look, it's really not obvious what output or schema or json_col are supposed to be (maybe the last one is just json_column_name?). A sketch of the suggestion follows below.
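For illustration, a hedged sketch of what this suggestion could look like. The enum members mirror the DestinationType enum shown earlier in the diff; the parameter descriptions and the meaning of the return value are assumptions, not the actual API.

```python
from enum import Enum
from typing import Dict, Set

class DestinationType(Enum):
    BIGQUERY = "bigquery"
    POSTGRES = "postgres"
    REDSHIFT = "redshift"
    SNOWFLAKE = "snowflake"

def generate_dbt_model(
    schema: str,                        # destination schema the models target
    output: str,                        # directory where generated .sql files are written
    integration_type: DestinationType,  # enum instead of a raw string
    catalog: dict,                      # configured Airbyte catalog (streams + JSON schemas)
    json_column_name: str,              # column holding the raw JSON blob, e.g. _airbyte_data
) -> Dict[str, Set[str]]:
    """Generate dbt models for every stream (and nested column) in the catalog,
    returning a mapping from schema name to the set of generated table names."""
    ...
```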

sql += "\nfrom numbered where row_num = 1\n"
if len(path) > 1:
sql += "-- {} from {}\n ".format(name, "/".join(path))
output_sql_table(output, schema, sql_file_name, sql, path)
Contributor:

It seems like the output of this method is basically a side effect, i.e. writing to the file system. Can we push that concern outside of this method? It would make it easier to write unit tests (see the sketch below).
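A hedged sketch of the refactor being suggested, with illustrative names: the generator becomes a pure function returning rendered SQL, and a separate writer handles the file-system side effect.

```python
import os
from typing import Dict

def render_stream_sql(name: str, path: list, sql: str) -> Dict[str, str]:
    """Pure function: returns {model_name: sql}, easy to assert on in unit tests."""
    if len(path) > 1:
        sql += "-- {} from {}\n".format(name, "/".join(path))
    return {name: sql}

def write_sql_files(output_dir: str, models: Dict[str, str]) -> None:
    """The only place that touches the file system."""
    for file_name, sql in models.items():
        with open(os.path.join(output_dir, file_name + ".sql"), "w") as f:
            f.write(sql)
```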

name,
table,
parent_hash_id: str,
inject_sql_prefix: str,
Contributor:

It's hard for me to conceptually understand what's happening in this method without a lot of mental gymnastics and poring over code. For example, it would really help me to have a mental model like:

first we create an intermediate view containing XYZ, then another containing ABC, then we derive the final normalized view. Nested tables are handled via etc...

As a side note, I've generally felt that the cost of "reloading" context for this class whenever I'm working on it or reading it is fairly large. I don't have a silver-bullet solution; it's probably a combination of comments or docs, plus potentially refactors to make it more obvious, but I think it's important to point this concern out.

Contributor (Author) @ChristopheDuong commented Feb 12, 2021:

Slowly getting there... The previous version was even worse; at least here I managed to separate it into successive parts in different SQL files, yes.

The cool thing is that it now creates more intermediate views per table (ab1_, ab2_, ab3_, and the final table) than before, but it's easy to switch back to a single SQL file with dbt's materialization option.

The steps are chained together in a transformation pipeline, and I will be able to describe what each "unit" does and choose to include or exclude each one (with its own testing, etc.), but I haven't gotten there yet; see the sketch after this comment.
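A hedged sketch of the pipeline shape being described; the ab1_/ab2_/ab3_ prefixes come from the comment above, while the stage descriptions and the helper are illustrative assumptions.

```python
# Each stream is normalized through a chain of intermediate models,
# each materialized as its own SQL file / view.
PIPELINE_STAGES = [
    ("ab1_", "extract typed columns from the raw JSON blob"),
    ("ab2_", "cast columns to destination-specific SQL types"),
    ("ab3_", "compute hash ids linking nested tables to their parents"),
    ("",     "final deduplicated, normalized table"),
]

def model_names(stream: str) -> list:
    """Return the generated model names for one stream, in pipeline order."""
    return [prefix + stream for prefix, _ in PIPELINE_STAGES]

# e.g. model_names("charges") -> ["ab1_charges", "ab2_charges", "ab3_charges", "charges"]
```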

@ChristopheDuong (Contributor, Author) commented Feb 12, 2021:

> Left some comments but will take another look tomorrow. I think my biggest takeaway right now is that this class is really complex/inaccessible and we should try to make it simpler. It doesn't necessarily have to happen in this PR since I know we are trying to push this out, but I do think it needs to be one of the top things to work on in this area

Yes, I definitely agree here.

I need some time to deeply refactor and split the code properly, but because of time constraints every time I handle normalization projects, I haven't been able to revamp it more.

It kind of grew organically, sprouting way too many arguments in the functions, etc.

@ChristopheDuong marked this pull request as ready for review on February 12, 2021, 17:43.
@ChristopheDuong (Contributor, Author) commented Feb 15, 2021:

/test connector=destination-postgres

✅ destination-postgres https://github.com/airbytehq/airbyte/actions/runs/568935911

@ChristopheDuong (Contributor, Author) commented Feb 15, 2021:

/test connector=destination-bigquery

❌ destination-bigquery https://github.com/airbytehq/airbyte/actions/runs/568937030

@ChristopheDuong (Contributor, Author) commented Feb 15, 2021:

/test connector=destination-snowflake

✅ destination-snowflake https://github.com/airbytehq/airbyte/actions/runs/568937181

@ChristopheDuong (Contributor, Author) commented Feb 15, 2021:

/test connector=destination-redshift

✅ destination-redshift https://github.com/airbytehq/airbyte/actions/runs/568937799

@ChristopheDuong (Contributor, Author) commented Feb 15, 2021:

/test connector=destination-bigquery

✅ destination-bigquery https://github.com/airbytehq/airbyte/actions/runs/569034511

Base automatically changed from chris/api_catalog to master on February 15, 2021, 21:40.
@cgardens force-pushed the chris/nested-normalization branch from f122e31 to 51c39fd on February 15, 2021, 21:49.