
[ADAP-506] [Regression] post release 1.5.0 - models are failing with Runtime error - "Read operation timed out" #427

Closed
Kamalakar4Pelago opened this issue May 3, 2023 · 19 comments · Fixed by #433
Labels: bug, regression

Comments

@Kamalakar4Pelago

Kamalakar4Pelago commented May 3, 2023

Is this a new bug in dbt-redshift?

  • I believe this is a new bug in dbt-redshift
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

  • Airflow DAG tasks running dbt Core models against Redshift are failing with the runtime error "read operation timed out".
    As a workaround, we deployed a hotfix to pin version 1.4.6, the prior working version in our case.

Expected Behavior

All models that ran successfully on version 1.4.6 should also run on 1.5.0.

Steps To Reproduce

  1. A daily Airflow DAG runs our dbt models using the ECS Operator. The models suddenly started failing from 29 Apr 2023 onwards, and importantly, no change was introduced on our side.
  2. The dbt run task fails with "read operation timed out", as shown in the attached logs.

Relevant log output

[2023-04-30 16:53:05,159] INFO - [2023-04-30T16:51:21.609000] 16:51:21  Running with dbt=1.5.0
[2023-04-30 16:53:05,159] INFO - [2023-04-30T16:51:21.791000] 16:51:21  Unable to do partial parsing because saved manifest not found. Starting full parse.
[2023-04-30 16:53:05,159] INFO - [2023-04-30T16:51:23.277000] 16:51:23  Found 31 models, 14 tests, 0 snapshots, 0 analyses, 345 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics, 0 groups
[2023-04-30 16:53:05,159] INFO - [2023-04-30T16:51:23.280000] 16:51:23  
[2023-04-30 16:53:05,159] INFO - [2023-04-30T16:51:23.760000] 16:51:23  Concurrency: 4 threads (target='core')
[2023-04-30 16:53:05,159] INFO - [2023-04-30T16:51:23.760000] 16:51:23  
[2023-04-30 16:53:05,159] INFO - [2023-04-30T16:51:23.778000] 16:51:23  1 of 8 START sql table model core.mv_bookings .................................. [RUN]
[2023-04-30 16:53:05,159] INFO - [2023-04-30T16:51:23.780000] 16:51:23  2 of 8 START sql incremental model core.t1_bookings_all ........................ [RUN]
[2023-04-30 16:53:05,159] INFO - [2023-04-30T16:51:23.781000] 16:51:23  3 of 8 START sql incremental model core.t1_payments ............................ [RUN]
[2023-04-30 16:53:05,159] INFO - [2023-04-30T16:51:23.792000] 16:51:23  4 of 8 START sql table model core.t1_product_options ........................... [RUN]
[2023-04-30 16:53:05,159] INFO - [2023-04-30T16:51:43.399000] 16:51:43  4 of 8 OK created sql table model core.t1_product_options ...................... [SUCCESS in 19.60s]
[2023-04-30 16:53:05,159] INFO - [2023-04-30T16:51:43.402000] 16:51:43  5 of 8 START sql table model core.t1_product_units ............................. [RUN]
[2023-04-30 16:53:05,159] INFO - [2023-04-30T16:51:56.400000] 16:51:56  1 of 8 ERROR creating sql table model core.mv_bookings ......................... [ERROR in 32.62s]
[2023-04-30 16:53:05,159] INFO - [2023-04-30T16:51:56.402000] 16:51:56  6 of 8 START sql table model core.t1_products_all .............................. [RUN]
[2023-04-30 16:53:05,159] INFO - [2023-04-30T16:51:56.889000] 16:51:56  2 of 8 ERROR creating sql incremental model core.t1_bookings_all ............... [ERROR in 33.10s]
[2023-04-30 16:53:05,159] INFO - [2023-04-30T16:51:56.891000] 16:51:56  7 of 8 START sql incremental model core.t1_promos .............................. [RUN]
[2023-04-30 16:53:05,159] INFO - [2023-04-30T16:51:57.254000] 16:51:57  5 of 8 OK created sql table model core.t1_product_units ........................ [SUCCESS in 13.85s]
[2023-04-30 16:53:05,159] INFO - [2023-04-30T16:51:57.256000] 16:51:57  8 of 8 START sql incremental model core.t1_row_counts .......................... [RUN]
[2023-04-30 16:53:05,159] INFO - [2023-04-30T16:52:10.836000] 16:52:10  3 of 8 OK created sql incremental model core.t1_payments ....................... [SUCCESS in 47.04s]
[2023-04-30 16:53:05,159] INFO - [2023-04-30T16:52:25.167000] 16:52:25  7 of 8 OK created sql incremental model core.t1_promos ......................... [SUCCESS in 28.27s]
[2023-04-30 16:53:05,159] INFO - [2023-04-30T16:52:27.653000] 16:52:27  6 of 8 ERROR creating sql table model core.t1_products_all ..................... [ERROR in 31.25s]
[2023-04-30 16:53:05,159] INFO - [2023-04-30T16:52:38.781000] 16:52:38  8 of 8 OK created sql incremental model core.t1_row_counts ..................... [SUCCESS in 41.52s]
[2023-04-30 16:53:05,159] INFO - [2023-04-30T16:52:38.860000] 16:52:38  
[2023-04-30 16:53:05,159] INFO - [2023-04-30T16:52:38.860000] 16:52:38  Finished running 4 table models, 4 incremental models in 0 hours 1 minutes and 15.58 seconds (75.58s).
[2023-04-30 16:53:05,159] INFO - [2023-04-30T16:52:38.875000] 16:52:38  
[2023-04-30 16:53:05,159] INFO - [2023-04-30T16:52:38.876000] 16:52:38  Completed with 3 errors and 0 warnings:
[2023-04-30 16:53:05,160] INFO - [2023-04-30T16:52:38.876000] 16:52:38  
[2023-04-30 16:53:05,160] INFO - [2023-04-30T16:52:38.877000] 16:52:38  Runtime Error in model mv_bookings (models/core/mv_bookings.sql)
[2023-04-30 16:53:05,160] INFO - [2023-04-30T16:52:38.877000] 16:52:38    The read operation timed out
[2023-04-30 16:53:05,160] INFO - [2023-04-30T16:52:38.878000] 16:52:38  
[2023-04-30 16:53:05,160] INFO - [2023-04-30T16:52:38.879000] 16:52:38  Runtime Error in model t1_bookings_all (models/core/t1_bookings_all.sql)
[2023-04-30 16:53:05,160] INFO - [2023-04-30T16:52:38.879000] 16:52:38    The read operation timed out
[2023-04-30 16:53:05,160] INFO - [2023-04-30T16:52:38.880000] 16:52:38  
[2023-04-30 16:53:05,160] INFO - [2023-04-30T16:52:38.880000] 16:52:38  Runtime Error in model t1_products_all (models/core/t1_products_all.sql)
[2023-04-30 16:53:05,160] INFO - [2023-04-30T16:52:38.881000] 16:52:38    The read operation timed out

Environment

- OS: Ubuntu 18.04
- Python: 3.8
- dbt-core: 1.5.0
- dbt-redshift: 1.5.0

Additional Context

No response

@Kamalakar4Pelago Kamalakar4Pelago added the bug and triage labels May 3, 2023
@github-actions github-actions bot changed the title [Bug] post release 1.5.0 - models are failing with Runtime error - "Read operation timed out" [ADAP-506] [Bug] post release 1.5.0 - models are failing with Runtime error - "Read operation timed out" May 3, 2023
@mp24-git

mp24-git commented May 3, 2023

We are seeing the same issue when running our dbt code in Docker against Redshift Serverless. Going back to 1.4.6 fixed the issue for us.

Docker Image: public.ecr.aws/docker/library/python:3.10-slim-bullseye
Python 3.10
dbt-core 1.5.0
dbt-redshift 1.5.0

@fabientra

We had the same issue and had to roll back to a previous version.

@dbeatty10 dbeatty10 changed the title [ADAP-506] [Bug] post release 1.5.0 - models are failing with Runtime error - "Read operation timed out" [ADAP-506] [Regression] post release 1.5.0 - models are failing with Runtime error - "Read operation timed out" May 3, 2023
@dbeatty10
Contributor

Thanks for alerting us about this @Kamalakar4Pelago, @mp24-git, and @fabientra

Are all of you using Redshift Serverless? Are any of you using cross-database references or RA3 instances? Trying to hone in on what is common between all of you so that we can try reproducing the issue on our end.

@mp24-git

mp24-git commented May 3, 2023

We are using Redshift Serverless without cross-database references. The issue happens mostly with large tables that we read from an external schema (ultimately Parquet files on S3). Smaller tables read from the same schema are fine.

I see that big tables fail after ~31 seconds, but smaller tables seem to be fine:

1 of 35 ERROR creating sql table model fdc_staging.bkpf ........................ [ERROR in 31.24s]
2 of 35 OK created sql table model fdc_staging.bkpf_changes .................... [SUCCESS in 24.06s]
3 of 35 ERROR creating sql table model fdc_staging.bseg ........................ [ERROR in 31.69s]
4 of 35 ERROR creating sql table model fdc_staging.bseg_changes ................ [ERROR in 32.10s]

@njrs92

njrs92 commented May 3, 2023

We are getting the same issue but are using a plain RA3 cluster. There are some uses of cross-database references, but it fails on any table. The workaround was to bump connect_timeout to something obscenely high, which gives the tables time to run.

@dbeatty10
Contributor

Background context:

  • We switched from using psycopg2 in 1.4 to redshift_connector in 1.5

I'm wondering if the connection timeout is somehow coming into play? If so, that might explain the ~30 second threshold.

Within your profiles.yml, could you try setting the following to see if the errors start taking ~40 seconds?

connect_timeout: 40

@njrs92

njrs92 commented May 3, 2023

Yes, we found that connect_timeout is directly linked to the time it takes to fail. We had it at 20 seconds, then tested with 60, 600, and finally 6000.

@Kamalakar4Pelago
Author

> Thanks for alerting us about this @Kamalakar4Pelago, @mp24-git, and @fabientra
>
> Are all of you using Redshift Serverless? Are any of you using cross-database references or RA3 instances? Trying to hone in on what is common between all of you so that we can try reproducing the issue on our end.

We are on a Redshift dc2.large cluster.

@fabientra

We are using a plain RA3 cluster. The read timeout occurred on a query using Spectrum, I think. We will try to increase the connect_timeout.

@dbeatty10
Contributor

dbeatty10 commented May 5, 2023

Potentially related?

Edit: I'm guessing not related.

@dbeatty10 dbeatty10 removed the triage label May 5, 2023
@stisti

stisti commented May 6, 2023

The issue still exists with release 1.5.1. We are using a small cluster of ra3.xlplus nodes (not serverless).

Very short queries that read only a few hundred rows might work. Anything larger times out.

@elongl

elongl commented May 8, 2023

Experiencing the same issue.

@dbeatty10
Contributor

Thanks for confirming @stisti and @elongl.

The Redshift team is looking at this to determine a path forward.

Workaround

In the meantime, the workaround is to add an absurdly high timeout within your profiles.yml:

      connect_timeout: 999999
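
For anyone unsure where that setting lives: connect_timeout sits alongside the other connection fields of your Redshift target in profiles.yml. A minimal sketch, assuming a profile named my_redshift_project with a target named core (the profile name, host, and credentials below are illustrative, not taken from this thread):

      my_redshift_project:
        target: core
        outputs:
          core:
            type: redshift
            host: my-cluster.abc123.us-east-1.redshift.amazonaws.com  # illustrative host
            port: 5439
            user: dbt_user                                            # illustrative user
            password: "{{ env_var('REDSHIFT_PASSWORD') }}"            # illustrative; read from an environment variable
            dbname: analytics
            schema: core
            threads: 4
            connect_timeout: 999999  # absurdly high value to work around the 1.5.0 read-timeout regression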

@Fleid Fleid self-assigned this May 9, 2023
@jialerchew

Confirming that we're facing the same issue. We upgraded from 1.4 to 1.5 for redshift_connector.

@RohitKJha30

I faced the same issue after upgrading to version 1.5.0. Updating connect_timeout to 999999 in profiles.yml as suggested by @dbeatty10 fixed the issue.

@shaunavalon

We are also facing this issue after updating our ECS instances of dbt to 1.5.0. We are using a Redshift cluster with dc2.large nodes.

@mgoers-beekeeper

Having the same issue here running our jobs deployed via dbt Cloud. Had to roll back the upgrade to dbt 1.5 in dbt Cloud.

@davidwitk

Same issue here when trying to upgrade to v1.5 in dbt Cloud. I see that it seems to have been fixed with dbt-redshift 1.5.2.

@dbeatty10 How are the adapter versions updated in dbt Cloud? Is there a way to select the version of the Redshift adapter in dbt Cloud?

@dataders
Contributor

@chodera sorry to hear that you're still seeing this issue. You're right that you shouldn't be seeing it any longer.

Is there a way to select the version of the Redshift adapter in dbt Cloud?

You're also highlighting a great UX improvement opportunity for dbt Cloud. Coincidentally, just yesterday I opened an internal feature request ticket to have the adapter version displayed; everyone agrees it's a no-brainer. As for selecting which adapter version you're using, I think a core value proposition of dbt Cloud is that you shouldn't have to worry about which patch version you're on; it should just work.

For the time being, can you please reach out to dbt Cloud support? If you are able to determine that you are indeed using dbt-redshift v1.5.2, definitely come back here and let us know. Cheers!
