"Basic" Normalization #782

Closed
cgardens opened this issue Nov 2, 2020 · 6 comments

@cgardens
Contributor

cgardens commented Nov 2, 2020

Tell us about the problem you're trying to solve

  • Currently each destination takes data from a source and dumps it into a single column in the destination as a JSON blob.
  • The blob is not very user-friendly; in particular, it makes the data hard to query in the destination. This might be fine for users who are comfortable running normalization transformations in their data warehouses to map the blob to a more traditional relational schema, but for those who are not comfortable doing that, or whose data is relatively simple, having to handle this formatting is a major point of friction.

Describe the solution you’d like

  • Normalization should be optional: it should always be possible to get just the raw JSON blob in your destination without running a subsequent normalization step.
  • If normalization is turned on, the destination database should contain both the table with the raw JSON blob and the normalized tables.
  • We'd like to provide "basic" normalization. We're aiming for the 80/20 rule here; we know we cannot come up with a normalization scheme that will solve every use case.
    We want to come up with rules that are:
    • transparent
    • easy to reason about
    • sane for data that is not heavily nested
  • For example, we want to be able to map the user object { "name": "vera", "age": 36 } to a table in, say, Postgres, named users, with columns ab_id (Airbyte's UUID for tracking the record), name (varchar), and age (integer).
  • For nesting, we will assume that we want to expand data. E.g. { "name": "vera", "age": 36, "jobs": ["journalist", "revolutionary"] } would be normalized to the following: a table named users with columns ab_id, name, and age, and a second table named jobs with columns ab_id (its own UUID), parent_ab_id (the Airbyte ID of the parent record, i.e. the foreign key relation), and value (e.g. "journalist"). See the sketch after this list.
    • Alternatively, we could never expand data. We need to pick one or the other; each has big tradeoffs and will be a point of friction depending on the shape of the data. E.g. always expanding data would be awful in the Stripe case, where deeply nested objects would lead to the creation of 50 database tables.
  • The solution we come up with should be reusable for all of our existing destinations and require minimal, if any, custom work when adding new destinations. We want to write this once for all (or most) destinations, not once per destination as we do with other integration code.
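
A minimal sketch of the expansion rule described above, in Python. The column names (ab_id, parent_ab_id, value) follow the example; the helper itself is hypothetical, not Airbyte code:

```python
import uuid

def normalize_record(record: dict, table: str = "users") -> dict:
    """Flatten one JSON record into parent/child rows per the expansion rule above."""
    ab_id = str(uuid.uuid4())  # Airbyte's UUID for tracking the record
    parent_row = {"ab_id": ab_id}
    child_tables = {}  # table name -> list of rows
    for key, value in record.items():
        if isinstance(value, list):
            # Arrays expand into a child table with a foreign key back to the parent.
            child_tables[key] = [
                {"ab_id": str(uuid.uuid4()), "parent_ab_id": ab_id, "value": item}
                for item in value
            ]
        else:
            parent_row[key] = value  # Scalars become columns on the parent table.
    return {table: [parent_row], **child_tables}

tables = normalize_record({"name": "vera", "age": 36, "jobs": ["journalist", "revolutionary"]})
# tables["users"] -> one row: {ab_id, name, age}
# tables["jobs"]  -> two rows: {ab_id, parent_ab_id, value}
```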
@cgardens cgardens added the type/enhancement New feature or request label Nov 2, 2020
@cgardens cgardens added this to the v0.5.0 milestone Nov 2, 2020
@ChristopheDuong
Contributor

For "basic normalization", since we already have json objects, what comes to my mind is using the basic tools of data scientists for dealing with this type of operations:
https://stackoverflow.com/questions/40588852/pandas-read-nested-json

However this would be in python, would that be incompatible with the current destinations written in java?
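
For reference, a minimal pandas sketch of the approach in the linked answer, using the user/jobs example from this issue (the nested array is adapted to dicts so json_normalize can explode it):

```python
import pandas as pd

data = [{"name": "vera", "age": 36,
         "jobs": [{"title": "journalist"}, {"title": "revolutionary"}]}]

# Top-level scalars become columns; json_normalize leaves nested lists as-is, so drop the column.
users = pd.json_normalize(data).drop(columns=["jobs"])

# Explode the nested array into its own frame, carrying the parent's name along as a key.
jobs = pd.json_normalize(data, record_path="jobs", meta=["name"])
```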

@cgardens
Contributor Author

cgardens commented Nov 2, 2020

Discussed offline. Going to move forward with trying to create a container that uses dbt to run normalization. Here's how we're splitting up the work; we plan to have it ready by Wednesday morning.
@ChristopheDuong working on:

  • Prototype of getting a JSON schema into a dbt model (see the sketch at the end of this comment).
  • Prototype of running that dbt model against BigQuery

@cgardens working on:

  • Hooking it up to the sync worker
  • Base DBT image including file mounts
  • Switch statement to convert configs into something dbt can consume
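
As a rough illustration of the first bullet (getting a JSON schema into a dbt model), here is a hypothetical sketch that compiles a JSON schema into a Postgres-flavored SELECT over the raw blob column; the type map, table name, and ->> extraction syntax are assumptions, not the actual prototype:

```python
# Hypothetical sketch: compile a JSON schema into a SELECT over the raw blob column.
TYPE_MAP = {"string": "varchar", "integer": "bigint", "number": "float", "boolean": "boolean"}

def json_schema_to_model(schema: dict, raw_table: str = "users_raw") -> str:
    columns = [
        f"cast(data ->> '{name}' as {TYPE_MAP.get(prop['type'], 'varchar')}) as {name}"
        for name, prop in schema["properties"].items()
    ]
    return "select\n  ab_id,\n  " + ",\n  ".join(columns) + f"\nfrom {raw_table}"

schema = {"properties": {"name": {"type": "string"}, "age": {"type": "integer"}}}
print(json_schema_to_model(schema))
# select
#   ab_id,
#   cast(data ->> 'name' as varchar) as name,
#   cast(data ->> 'age' as bigint) as age
# from users_raw
```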

@ChristopheDuong
Contributor

FYI @cgardens this is the file that needs to be generated per destination, with the proper credentials:
https://docs.getdbt.com/reference/profiles.yml

By default, dbt expects the profiles.yml file to be located in the ~/.dbt/ directory.
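
For illustration, a sketch of generating that file for a Postgres destination; the output keys follow dbt's documented Postgres profile, while the Airbyte config keys and the profile name are assumptions:

```python
import yaml  # PyYAML

def make_profiles_yml(config: dict) -> str:
    """Render a dbt profiles.yml for a Postgres destination from a (hypothetical) Airbyte config."""
    profile = {
        "normalize": {  # profile name is an assumption
            "target": "prod",
            "outputs": {
                "prod": {
                    "type": "postgres",
                    "host": config["host"],
                    "port": config["port"],
                    "user": config["username"],
                    "password": config["password"],
                    "dbname": config["database"],
                    "schema": config["schema"],
                }
            },
        }
    }
    return yaml.safe_dump(profile)

# dbt reads this from ~/.dbt/profiles.yml by default (or from wherever --profiles-dir points).
```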

@sherifnada
Contributor

  1. Is this saying that there is going to be a separate worker or process(?) that runs after the initial JSON blob sync process and converts the raw table to a normalized view? Why is this preferable to writing the data once in normalized form?
  2. What is the rationale for dbt vs. our own normalization?

My questions aren't statements about any decisions that were made; I'm just trying to get the full context here.

@cgardens
Contributor Author

cgardens commented Nov 3, 2020

Sorry for the lack of detail above. Hopefully this clarifies!

  1. Is this saying that there is going to be a separate worker or process(?) that runs after the initial JSON blob sync process and converts the raw table to a normalized view? Why is this preferable to writing the data once in normalized form?

Right now we are planning on executing the normalization in its own Docker container, but it will still happen as part of the sync worker, as an optional step at the end of the sync. We wanted to maintain the encapsulation of the normalization logic so that 1. we can move it later if we want, and 2. we can use dbt (which is our tool of choice for this). In the future we can consider adding another worker to do the normalization, but for now we don't lose anything from a customer point of view by putting it in the sync worker, and it's far easier than adding a new worker.
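
As a rough picture of that flow (the image names and docker invocation below are illustrative, not the actual worker code):

```python
import subprocess

def sync_then_normalize(config_path: str, normalize: bool) -> None:
    # Step 1: the destination container writes raw JSON blobs into the warehouse.
    subprocess.run(
        ["docker", "run", "--rm", "-v", f"{config_path}:/config",
         "airbyte/destination-example"],  # hypothetical image name
        check=True,
    )
    # Step 2 (optional): a dbt container normalizes the raw table in place;
    # the data never leaves the warehouse.
    if normalize:
        subprocess.run(
            ["docker", "run", "--rm", "-v", f"{config_path}:/config",
             "airbyte/normalization-example"],  # hypothetical image name
            check=True,
        )
```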

  2. What is the rationale for dbt vs. our own normalization?

There are 2 main reasons to go with dbt here:

  1. It is the only solution we have right now (other than writing our own) that allows the data to stay in the data store. Our goal is to transform the JSON blob that we wrote into a "normalized" table without pulling the data back out of the database. This means the data is transferred over the network once, into the database, and stays internal from there.
  2. This requirement: "The solution we come up with should be reusable for all of our existing destinations and require minimal, if any, custom work for adding new destinations." dbt has built-in support for writing a generic model in a single language that it can then compile into the correct queries for other databases. That's at least mostly true; we are running up against some limitations there, though you could argue something like jOOQ could fill a similar role.

Writing our own normalization for n different databases is a lot of work. I'm skeptical we'd do it better than dbt, and we'd spend a lot of time doing it.

@cgardens
Contributor Author

cgardens commented Nov 9, 2020

Remaining todos (as of 2020/11/09)

  • Manual testing with nested objects / arrays. UPDATE: going to complete this as a separate issue: Basic Normalization for Nested Data #886
  • Add documentation on what basic normalization does (and which destinations implement it)
  • Add documentation on how to add basic normalization to a destination (? - not sure if we want to do this in its current state or do one more iteration before we encourage people to include it in their destinations). UPDATE: going to complete this separately once this feature is more stable. Issue: Document how to add basic normalization to new destinations #887
  • Integration test for the base-normalization module. Ran out of time; will be handled in a separate issue: Integration Test for Basic Normalization Module #965
  • Integration test (or standard test) for normalization functionality for each destination that implements basic normalization
  • Smooth out rough edges in naming (e.g. which table is called raw, which table is called normalized, etc.) (issue)
  • Use versioned image (issue)
