"Basic" Normalization #782
Comments
For "basic normalization", since we already have JSON objects, what comes to my mind is using the basic tools of data scientists for dealing with this type of operation: However, this would be in Python. Would that be incompatible with the current destinations written in Java?
discussed offline. going to move forward with trying to create a container that uses dbt to run normalization. here's how we're splitting up the work, which we plan to have ready by wednesday morning
@cgardens working on:
FYI @cgardens this is the file that needs to be generated depending on the destination with the proper credentials: By default, dbt expects the
My questions aren't statements about any decisions that were made; I'm just trying to get the full context here.
sorry for lack of detail above. hopefully this clarifies!
right now we are planning on executing the normalization in its own docker container, but it will still happen as part of the sync worker. it will be an optional step that happens at the end of the sync worker. we wanted to maintain the encapsulation of the normalization stuff so that (1) we could move it later if we want and (2) we can use dbt (which is our tool of choice for this). in the future we can consider adding another worker to do the normalization, but for now we don't lose anything from a customer point of view by putting it in the sync worker, and it is far easier than adding a new worker.
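A rough sketch of that flow, for illustration only: the sync runs first, then an optional normalization step is launched in its own container. The function names, the image name, and the `/data` mount point are hypothetical and do not come from the codebase; only the standard `docker run --rm -v` flags are real.

```python
import subprocess


def normalization_command(image, workspace):
    """Build the argv that runs normalization in its own container.

    `image` and the `/data` mount point are placeholders, not real names.
    """
    return ["docker", "run", "--rm", "-v", f"{workspace}:/data", image]


def run_sync_worker(sync, normalize, image, workspace, runner=subprocess.run):
    """Run the sync, then optionally run normalization as a final step."""
    sync()
    if normalize:
        # check=True makes a failing container raise CalledProcessError.
        runner(normalization_command(image, workspace), check=True)
```

Keeping the container invocation behind one small function is what would make it easy to move normalization into a separate worker later.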
There are 2 main reasons to go with dbt here:
Writing our own normalization for n different databases is a lot of work. I'm skeptical we'd do it better than dbt, and we'd spend a lot of time doing it.
Remaining todos (as of 2020/11/09)
Tell us about the problem you're trying to solve
Describe the solution you’d like
We want to come up with rules that are:
For example, `{ "name": "vera", "age": 36 }` would be normalized to a table, in say postgres, with the name `users` and columns `ab_id` (airbyte's uuid for tracking the record), `name`, which is `varchar`, and `age`, which is an integer.

`{ "name": "vera", "age": 36, "jobs": [ "journalist", "revolutionary"] }` would be normalized to the following: a table named `users` with columns `ab_id`, `name` and `age`, and then a second table named `jobs` with columns `ab_id` (its own uuid), `parent_ab_id` (airbyte id of the foreign key relation) and `value` (e.g. "journalist").
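To make the proposed mapping concrete, here is a minimal Python sketch of it. It is an illustration only, not the dbt-based implementation discussed in the comments: `normalize_record` is a hypothetical helper, it handles only top-level scalars and arrays of scalars, and it does not infer column types.

```python
import uuid


def normalize_record(record, table):
    """Split one JSON record into rows, grouped by table name."""
    ab_id = str(uuid.uuid4())
    parent_row = {"ab_id": ab_id}
    tables = {table: [parent_row]}
    for key, value in record.items():
        if isinstance(value, list):
            # Each array element becomes a row in a child table named
            # after the key, linked back to the parent by parent_ab_id.
            tables[key] = [
                {"ab_id": str(uuid.uuid4()), "parent_ab_id": ab_id, "value": item}
                for item in value
            ]
        else:
            # Scalars become columns on the parent table.
            parent_row[key] = value
    return tables


tables = normalize_record(
    {"name": "vera", "age": 36, "jobs": ["journalist", "revolutionary"]}, "users"
)
```

Here `tables["users"]` holds one row with `ab_id`, `name` and `age`, and `tables["jobs"]` holds two rows whose `parent_ab_id` matches that row's `ab_id`.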