"Basic" Normalization #782

Closed
cgardens opened this issue Nov 2, 2020 · 6 comments

@cgardens
Contributor

cgardens commented Nov 2, 2020

Tell us about the problem you're trying to solve

  • Currently each destination takes data from a source and dumps it into a single column in the destination as a JSON blob.
  • The blob is not very user-friendly; in particular, it makes the data hard to query in the destination. This might be fine for users who are comfortable running normalization transformations in their data warehouses to map the blob to a more traditional relational schema, but for those who are not comfortable doing that, or whose data is relatively simple, having to handle this formatting is a major point of friction.

Describe the solution you’d like

  • Normalization should be optional: it should always be possible to get just the raw JSON blob in your destination without running a subsequent normalization step.
  • If normalization is turned on, the destination database should contain both the table with the raw JSON blob and the normalized tables.
  • We'd like to provide "basic" normalization. We're aiming for the 80/20 rule here; we know we cannot come up with a normalization scheme that will solve every use case.
    We want to come up with rules that are:
    • transparent
    • easy to reason about
    • sane for data that is not heavily nested
  • For example, we want to be able to map the user object { "name": "vera", "age": 36 } to a table in, say, Postgres, named users, with columns ab_id (Airbyte's UUID for tracking the record), name (varchar), and age (integer).
  • For nesting, we will assume that we want to expand data. E.g. { "name": "vera", "age": 36, "jobs": ["journalist", "revolutionary"] } would be normalized to the following: a table named users with columns ab_id, name, and age, and a second table named jobs with columns ab_id (its own UUID), parent_ab_id (the Airbyte ID of the parent record, i.e. the foreign key relation), and value (e.g. "journalist"). See the sketch after this list.
    • Alternatively, we could never expand data. We need to pick one or the other; each has big tradeoffs and will be a point of friction depending on the shape of the data. E.g. always expanding data would be awful in the Stripe case, where deeply nested objects would lead to the creation of 50 database tables.
  • The solution we come up with should be reusable for all of our existing destinations and require minimal, if any, custom work when adding new destinations. We want to write this once for all (or most) destinations, not once per destination as we do with other integration code.
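
A minimal sketch of the expansion rule described above, in Python. The column names (ab_id, parent_ab_id, value) follow the example; the helper itself is hypothetical, not Airbyte code:

```python
import uuid

def normalize_record(record: dict, table: str = "users") -> dict:
    """Flatten one JSON record into parent/child rows per the expansion rule above."""
    ab_id = str(uuid.uuid4())  # Airbyte's UUID for tracking the record
    parent_row = {"ab_id": ab_id}
    child_tables = {}  # table name -> list of rows
    for key, value in record.items():
        if isinstance(value, list):
            # Arrays expand into a child table with a foreign key back to the parent.
            child_tables[key] = [
                {"ab_id": str(uuid.uuid4()), "parent_ab_id": ab_id, "value": item}
                for item in value
            ]
        else:
            parent_row[key] = value  # Scalars become columns on the parent table.
    return {table: [parent_row], **child_tables}

tables = normalize_record({"name": "vera", "age": 36, "jobs": ["journalist", "revolutionary"]})
# tables["users"] -> one row: {ab_id, name, age}
# tables["jobs"]  -> two rows: {ab_id, parent_ab_id, value}
```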
@cgardens cgardens added the type/enhancement New feature or request label Nov 2, 2020
@cgardens cgardens added this to the v0.5.0 milestone Nov 2, 2020
@ChristopheDuong
Contributor

For "basic normalization", since we already have json objects, what comes to my mind is using the basic tools of data scientists for dealing with this type of operations:
https://stackoverflow.com/questions/40588852/pandas-read-nested-json

However this would be in python, would that be incompatible with the current destinations written in java?
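
For reference, a minimal pandas sketch of the approach in the linked answer, using the user/jobs example from this issue (the nested array is adapted to dicts so json_normalize can explode it):

```python
import pandas as pd

data = [{"name": "vera", "age": 36,
         "jobs": [{"title": "journalist"}, {"title": "revolutionary"}]}]

# Top-level scalars become columns; json_normalize leaves nested lists as-is, so drop the column.
users = pd.json_normalize(data).drop(columns=["jobs"])

# Explode the nested array into its own frame, carrying the parent's name along as a key.
jobs = pd.json_normalize(data, record_path="jobs", meta=["name"])
```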

@cgardens
Contributor Author

cgardens commented Nov 2, 2020

Discussed offline. Going to move forward with trying to create a container that uses dbt to run normalization. Here's how we're splitting up the work; we plan to have it ready by Wednesday morning.
@ChristopheDuong working on:

  • Prototype of getting a JSON schema into a dbt model (see the sketch at the end of this comment).
  • Prototype of running that dbt model against BigQuery

@cgardens working on:

  • Hooking it up to the sync worker
  • Base DBT image including file mounts
  • Switch statement to convert configs into something dbt can consume
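
As a rough illustration of the first bullet (getting a JSON schema into a dbt model), here is a hypothetical sketch that compiles a JSON schema into a Postgres-flavored SELECT over the raw blob column; the type map, table name, and ->> extraction syntax are assumptions, not the actual prototype:

```python
# Hypothetical sketch: compile a JSON schema into a SELECT over the raw blob column.
TYPE_MAP = {"string": "varchar", "integer": "bigint", "number": "float", "boolean": "boolean"}

def json_schema_to_model(schema: dict, raw_table: str = "users_raw") -> str:
    columns = [
        f"cast(data ->> '{name}' as {TYPE_MAP.get(prop['type'], 'varchar')}) as {name}"
        for name, prop in schema["properties"].items()
    ]
    return "select\n  ab_id,\n  " + ",\n  ".join(columns) + f"\nfrom {raw_table}"

schema = {"properties": {"name": {"type": "string"}, "age": {"type": "integer"}}}
print(json_schema_to_model(schema))
# select
#   ab_id,
#   cast(data ->> 'name' as varchar) as name,
#   cast(data ->> 'age' as bigint) as age
# from users_raw
```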

@ChristopheDuong
Contributor

FYI @cgardens this is the file that needs to be generated per destination, with the proper credentials:
https://docs.getdbt.com/reference/profiles.yml

By default, dbt expects the profiles.yml file to be located in the ~/.dbt/ directory.
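
For illustration, a sketch of generating that file for a Postgres destination; the output keys follow dbt's documented Postgres profile, while the Airbyte config keys and the profile name are assumptions:

```python
import yaml  # PyYAML

def make_profiles_yml(config: dict) -> str:
    """Render a dbt profiles.yml for a Postgres destination from a (hypothetical) Airbyte config."""
    profile = {
        "normalize": {  # profile name is an assumption
            "target": "prod",
            "outputs": {
                "prod": {
                    "type": "postgres",
                    "host": config["host"],
                    "port": config["port"],
                    "user": config["username"],
                    "password": config["password"],
                    "dbname": config["database"],
                    "schema": config["schema"],
                }
            },
        }
    }
    return yaml.safe_dump(profile)

# dbt reads this from ~/.dbt/profiles.yml by default (or from wherever --profiles-dir points).
```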

@sherifnada
Contributor

  1. Is this saying that there is going to be a separate worker or process(?) that runs after the initial JSON blob sync process and converts the raw table to a normalized view? Why is this preferable to writing the data once in normalized form?
  2. What is the rationale for dbt vs. our own normalization?

My questions aren't statements about any decisions that were made; I'm just trying to get the full context here.

@cgardens
Contributor Author

cgardens commented Nov 3, 2020

Sorry for the lack of detail above. Hopefully this clarifies!

  1. Is this saying that there is going to be a separate worker or process(?) that runs after the initial JSON blob sync process and converts the raw table to a normalized view? Why is this preferable to writing the data once in normalized form?

Right now we are planning on executing the normalization in its own Docker container, but it will still happen as part of the sync worker, as an optional step at the end of the sync. We wanted to maintain the encapsulation of the normalization logic so that 1. we can move it later if we want, and 2. we can use dbt (which is our tool of choice for this). In the future we can consider adding another worker to do the normalization, but for now we don't lose anything from a customer point of view by putting it in the sync worker, and it's far easier than adding a new worker.
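
As a rough picture of that flow (the image names and docker invocation below are illustrative, not the actual worker code):

```python
import subprocess

def sync_then_normalize(config_path: str, normalize: bool) -> None:
    # Step 1: the destination container writes raw JSON blobs into the warehouse.
    subprocess.run(
        ["docker", "run", "--rm", "-v", f"{config_path}:/config",
         "airbyte/destination-example"],  # hypothetical image name
        check=True,
    )
    # Step 2 (optional): a dbt container normalizes the raw table in place;
    # the data never leaves the warehouse.
    if normalize:
        subprocess.run(
            ["docker", "run", "--rm", "-v", f"{config_path}:/config",
             "airbyte/normalization-example"],  # hypothetical image name
            check=True,
        )
```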

  2. What is the rationale for dbt vs. our own normalization?

There are 2 main reasons to go with dbt here:

  1. It is the only solution we have right now (other than writing our own) that allows the data to stay in the data store. Our goal is to transform the JSON blob that we wrote into a "normalized" table without pulling the data back out of the database. This means the data is transferred over the network once, into the database, and stays internal from there.
  2. This requirement: "The solution we come up with should be reusable for all of our existing destinations and require minimal, if any, custom work for adding new destinations." dbt has built-in support for writing a generic model in a single language that it can then compile into the correct queries for other databases. That's at least mostly true; we are running up against some limitations there, though you could argue something like jOOQ could fill a similar role.

Writing our own normalization for n different databases is a lot of work. I'm skeptical we'd do it better than dbt, and we'd spend a lot of time doing it.

@cgardens
Contributor Author

cgardens commented Nov 9, 2020

Remaining todos (as of 2020/11/09)

  • Manual testing with nested objects / arrays. UPDATE: going to complete this as a separate issue: Basic Normalization for Nested Data #886
  • Add documentation on what basic normalization does (and which destinations implement it)
  • Add documentation on how to add basic normalization to a destination (? - not sure if we want to do this in its current state or do one more iteration before we encourage people to include it in their destinations). UPDATE: going to complete this separately once this feature is more stable. Issue: Document how to add basic normalization to new destinations #887
  • Integration test for the base-normalization module. Ran out of time; will be handled in a separate issue: Integration Test for Basic Normalization Module #965
  • Integration test (or standard test) for normalization functionality for each destination that implements basic normalization
  • Smooth out rough edges in naming (e.g. which table is called raw, which table is called normalized, etc.) (issue)
  • Use versioned image (issue)
