
[CT-248] Implement retryable errors for Spark/Databricks #293

Closed
grindheim opened this issue Feb 17, 2022 · 5 comments
Labels: bug, Stale

Comments

@grindheim
Contributor

grindheim commented Feb 17, 2022

Describe the bug

When using dbt-spark, we intermittently but fairly frequently hit errors that could be handled by treating them as retryable.

Based on our logs, I believe at least the following errors could be considered retryable; the affected models almost always run successfully the next time they're run.

They're listed in order from most frequently experienced to least:

  • HiveException: at least one column must be specified for the table
  • HiveException: Unable to alter table
  • A 503 response was returned but no Retry-After header was provided
  • Connection failed with error: Bad Status: No status code
  • ProtocolChangedException: The protocol version of the Delta table has been changed by a concurrent update. Please try the operation again

Ideally, the adapter could check whether the error message contains any of the above strings and, if so, retry the query a number of times, as in the BigQuery implementation. See this comment from jtcohen6 linking to the BigQuery implementation:
dbt-msft/dbt-sqlserver#119 (comment)
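
For illustration only, here is a minimal sketch of that substring-matching idea; the names RETRYABLE_MESSAGES and execute_with_retries are hypothetical and not part of dbt-spark or the BigQuery adapter:

import time

# Hypothetical list of substrings taken from the errors listed above
# (assumption: matching on the lowercased error message is enough to identify them).
RETRYABLE_MESSAGES = [
    "at least one column must be specified for the table",
    "unable to alter table",
    "a 503 response was returned but no retry-after header was provided",
    "bad status: no status code",
    "the protocol version of the delta table has been changed by a concurrent update",
]

def execute_with_retries(execute, sql, retries=3, delay_seconds=10):
    """Run execute(sql), retrying when the error message looks intermittent."""
    for attempt in range(retries + 1):
        try:
            return execute(sql)
        except Exception as exc:
            message = str(exc).lower()
            retryable = any(m in message for m in RETRYABLE_MESSAGES)
            if not retryable or attempt == retries:
                raise
            time.sleep(delay_seconds)

A real implementation would presumably hook into the adapter's connection manager rather than wrapping individual calls like this.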

Steps To Reproduce

Because these issues happen randomly, it's difficult to list steps that will consistently reproduce them.

Expected behavior

If any of the listed errors occurs, the connector should retry the given model X times, where X is ideally defined in the profile, as it is for the BigQuery adapter (https://docs.getdbt.com/reference/warehouse-profiles/bigquery-profile/#retries).

Screenshots and log output

N/A

System information

The output of dbt --version:

installed version: 1.0.1
   latest version: 1.0.1

Up to date!

Plugins:
  - spark: 1.0.0

The operating system you're using:
PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"
NAME="Debian GNU/Linux"
VERSION_ID="11"
VERSION="11 (bullseye)"
VERSION_CODENAME=bullseye
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"

The output of python --version:
Python 3.9.10

Additional context

N/A

@grindheim added the bug and triage labels on Feb 17, 2022
The github-actions bot changed the title from "Implement retryable errors for Spark/Databricks" to "[CT-248] Implement retryable errors for Spark/Databricks" on Feb 17, 2022
@pgoslatara

Hey @grindheim, is there a reason this issue has been closed? I'm currently using dbt-spark, and the lack of retry logic is definitely something I'd like to see addressed, so I'm wondering whether a discussion elsewhere led to this issue being closed.

@grindheim
Contributor Author

grindheim commented Mar 15, 2022

@pgoslatara No, there was just no feedback at all for 26 days, so I figured I'd close it and possibly reopen it for dbt-databricks. But reopening it now.

@grindheim grindheim reopened this Mar 15, 2022
@pgoslatara

Thanks @grindheim! I'd really like to see this implemented. I'm not sure I have the time or the knowledge to undertake it myself, but keeping this issue open for now may allow someone else to jump in with a solution.

@jtcohen6
Contributor

@grindheim Thanks for opening the issue, and @pgoslatara thanks for the prompt to keep it open! Apologies for the delay in response from us. We're revamping the way we triage and maintain adapter plugin repositories. I think the topic is a good one; intermittent errors frustrate many users, and implementing retry on the adapter/connection level is the right way to go.

The question is whether these errors are cropping up while:

  • opening initial connections
  • executing queries

We already have "naive" retry implemented for initial connection opening, in two different ways, if connect_retries and connect_timeout are set in the connection profile:

  • A retry_all config, to retry on any error raised while opening the connection
  • A set of retryable messages, to retry even when retry_all: False, which right now has handling only for terminated clusters while they resume:

from typing import Optional

def _is_retryable_error(exc: Exception) -> Optional[str]:
    message = getattr(exc, 'message', None)
    if message is None:
        return None
    message = message.lower()
    if 'pending' in message:
        return exc.message
    if 'temporarily_unavailable' in message:
        return exc.message
    return None

We don't have any retry implemented during query execution. I think adding a set of retryable errors is a good idea, if we can confirm that they are consistently intermittent for all users. If that guarantee of consistency proves impossible, we could also pursue the approach recommended in dbt-labs/dbt-core#3303, whereby users can "bring their own" retryable error statuses (list of exceptions defined in profiles.yml).
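
As a rough illustration of that "bring your own" idea, a matcher like the one below could be driven by a user-supplied list instead of hard-coded strings; the function name and the retryable_errors parameter are hypothetical, and no such profile setting exists today:

from typing import List, Optional

def is_user_retryable_error(exc: Exception, retryable_errors: List[str]) -> Optional[str]:
    """Return the error message if it matches a user-configured pattern, else None."""
    message = getattr(exc, "message", None) or str(exc)
    lowered = message.lower()
    for pattern in retryable_errors:
        if pattern.lower() in lowered:
            return message
    return None

# e.g. with patterns loaded from a (hypothetical) retryable_errors list in profiles.yml:
# is_user_retryable_error(exc, ["unable to alter table", "protocolchangedexception"])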

@github-actions
Contributor

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.
