First sync for a large data set takes very long and can cause crashes #156
Comments
There are two possible ways to go about this:
My preference is for option 1 since the main tables are being populated incrementally. I will start by exploring whether there is a way to get the schema from the CHT Pipeline base table package so that we have only one source of truth for the schemas.
I ran some tests on batched dbt runs, with the following results.

With a single Couch2pg instance replicating about 700k docs at a rate of 1000 docs per batch, I set the dbt batch size to 10000 records. dbt took on average 3121 seconds to catch up and process all records. Without batching, the average time was 2263 seconds.

With 7 Couch2pg instances, each replicating ~700k docs in batches of 1000, regular dbt runs were much faster than batched dbt runs. The gap varied with the batch size I set, but regular dbt runs were still faster since there was no limit on the number of rows processed in each run.

I also tried 20 Couch2pg instances and added a delay of about 5 minutes between dbt runs to simulate models taking long to complete, but even in that scenario regular dbt runs were often faster.

My conclusion is that in 99% of cases, just using incremental tables with dbt runs as they are will be sufficient and performant. In rare cases like MoH Kenya, where we have a lot of instances with millions of docs and constrained database resources, batched dbt runs would definitely help, but we would be trading off speed for resource conservation. In such cases we can run dbt in batches by setting a flag as an environment variable. I have updated this PR to reflect this approach.
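To make the environment-variable idea concrete, here is a minimal sketch of how a runner script could decide between a regular and a batched dbt run. The variable name `DBT_BATCH_SIZE` and the `batch_size` dbt var are hypothetical names chosen for illustration and are not taken from the PR. The only dbt feature assumed is the `--vars` CLI option.

```python
import json
import os


def build_dbt_command(env=None):
    """Return the argv for one dbt run; add a batch limit only when the flag is set."""
    env = os.environ if env is None else env
    cmd = ["dbt", "run"]

    # DBT_BATCH_SIZE is a hypothetical variable name used for this sketch.
    raw = env.get("DBT_BATCH_SIZE", "").strip()
    if raw:
        batch_size = int(raw)
        if batch_size <= 0:
            raise ValueError("DBT_BATCH_SIZE must be a positive integer")
        # --vars hands a JSON/YAML string to dbt; models can read it with var().
        cmd += ["--vars", json.dumps({"batch_size": batch_size})]

    return cmd
```

With the variable unset, the function returns a plain `dbt run`, which keeps the faster unbatched behaviour as the default and makes batching opt-in for resource-constrained deployments.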
@njuguna-n dbt 1.9 has a new feature for microbatching incremental models. Could this new feature solve the problem this ticket describes? Additionally, you can see more info in this Coalesce presentation, starting from minute 25.
@andrablaj Yes, this feature can solve the problem because the incremental model is loaded in batches. The only drawback I see with this incremental strategy is that the batch size is currently only configurable to be a day, and depending on the Couch2pg sync rate and the available Postgres resources, the number of documents to be processed might still cause an issue. I will test this config today and assess whether the CHT Sync and CHT Pipeline PRs for this issue are still required.
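The drawback with day-sized batches can be illustrated with a small sketch. It is not dbt's implementation. It only assumes that each batch covers one calendar day of some event-time column, and it shows that a first sync which writes most rows on the same day still ends up with one very large batch. The function names are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def daily_batches(begin, end):
    """Yield (start, stop) day windows that cover the range from begin to end."""
    # Align the first window to midnight of the day that contains `begin`.
    start = begin.replace(hour=0, minute=0, second=0, microsecond=0)
    while start < end:
        stop = start + timedelta(days=1)
        yield start, stop
        start = stop


def rows_per_day(event_times):
    """Count how many rows would fall into each day-sized batch."""
    return Counter(t.date() for t in event_times)


if __name__ == "__main__":
    utc = timezone.utc
    # Simulated first sync: 5 rows land on one day, 1 row on the next.
    times = [datetime(2024, 10, 1, 9, m, tzinfo=utc) for m in range(5)]
    times.append(datetime(2024, 10, 2, 8, 0, tzinfo=utc))
    for day, count in sorted(rows_per_day(times).items()):
        print(day, count)
```

If the event-time column reflects when Couch2pg saved the document, a fast initial sync concentrates rows into very few days, so a day-sized batch does not bound the row count the way a fixed record limit does.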
Testing microbatching with a beta release of dbt 1.9 and this branch of CHT Pipeline I get the error below.
@njuguna-n per this discussion, the error above might happen because the dbt-core version is ahead of the postgres adapter version ( |
Actually, there is a dbt-postgres pre-release 1.9.0-b1, so never mind.
I am using |
I have not been able to get the document_metadata model to build successfully with the new microbatching incremental strategy, using the pre-release versions of dbt-core and dbt-postgres mentioned above. This is still worth pursuing and testing again once the generally available 1.9 release is ready.

Batching is currently an enhancement that will not be required in most CHT deployments. Additionally, the current solution involves code changes in both the CHT Sync and CHT Pipeline repos, with a hard-coded table name that makes future base model updates slightly more fragile. With incremental microbatching we would limit the required changes to the models defined in the CHT Pipeline repo.

There is no deployment of CHT or CHT Sync that is actively waiting for this improvement right now, so I suggest we hold off on merging the two linked PRs and close them once we confirm incremental microbatching works as expected.
I tried the
The only change I made in this repo was updating
RUN pip install --upgrade cffi \
    && pip install cryptography~=3.4 \
    && pip install dbt-core==1.9.0b1 dbt-postgres==1.9.0b1
Is there any extra configuration that you had locally, @njuguna-n? What was your setup?
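Since a mismatch between the dbt-core and adapter versions was raised earlier in the thread as a possible cause of the error, a small check like the one below could rule it out when comparing setups. This is a sketch under the assumption that matching major and minor versions is the property of interest. The helper names are hypothetical.

```python
import re


def major_minor(version):
    """Extract (major, minor) from a version string such as '1.9.0b1'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def adapter_matches_core(core_version, adapter_version):
    """Return True when dbt-core and the adapter share major.minor."""
    return major_minor(core_version) == major_minor(adapter_version)
```

In an environment where both packages are installed, the inputs could come from `importlib.metadata.version("dbt-core")` and `importlib.metadata.version("dbt-postgres")`.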
I have just tried it again with this branch and it worked just like in your test. I am not sure what was different about my setup but I will try to recreate it once I am back on Monday (28th October). One thing I have noted is that the microbatch models are handled differently; in the failed run it runs a batch for a specific date range |
It would be helpful to understand your setup:
@andrablaj I have tried to reproduce the error I got with no luck. I have changed the |
What feature do you want to improve?
When syncing data for the first time, couch2pg does a good job of copying over a lot of data fast but that means that dbt struggles to keep up and can end up trying to load millions of rows incrementally which has led to issues such as this one.
Describe the improvement you'd like
When the sync is running for the first time there should be a flag set so that either dbt or couch2pg can create the main base tables, i.e. document_metadata, contact, and data_record. These tables hold the majority of the data and take the most time during dbt runs, so pre-populating them would reduce the amount of time required for a sync to complete.
Describe alternatives you've considered
We have manually updated the document_metadata table and deleted some documents but these were temporary measures that should not be done for all production deployments.