
[CT-248] Implement retryable errors for Spark/Databricks #293

Closed
grindheim opened this issue Feb 17, 2022 · 5 comments
Labels: bug, Stale

Comments

@grindheim
Contributor

grindheim commented Feb 17, 2022

Describe the bug

When using dbt-spark, we intermittently but fairly frequently hit errors that could be handled by treating them as retryable.

Based on our logs, I believe at least the following errors could be considered retryable; the affected models almost always run successfully the next time they're run.

They're listed in order from most frequently experienced to least:

  • HiveException: at least one column must be specified for the table
  • HiveException: Unable to alter table
  • A 503 response was returned but no Retry-After header was provided
  • Connection failed with error: Bad Status: No status code
  • ProtocolChangedException: The protocol version of the Delta table has been changed by a concurrent update. Please try the operation again

Ideally, the adapter could check whether the error message contains any of the above strings and, if so, retry the query a number of times, as in the BigQuery implementation. See this comment from jtcohen6 linking to the BigQuery implementation:
dbt-msft/dbt-sqlserver#119 (comment)
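
For illustration only, here is a minimal sketch of that substring-matching idea; the names RETRYABLE_MESSAGES and execute_with_retries are hypothetical and not part of dbt-spark or the BigQuery adapter:

import time

# Hypothetical list of substrings taken from the errors listed above
# (assumption: matching on the lowercased error message is enough to identify them).
RETRYABLE_MESSAGES = [
    "at least one column must be specified for the table",
    "unable to alter table",
    "a 503 response was returned but no retry-after header was provided",
    "bad status: no status code",
    "the protocol version of the delta table has been changed by a concurrent update",
]

def execute_with_retries(execute, sql, retries=3, delay_seconds=10):
    """Run execute(sql), retrying when the error message looks intermittent."""
    for attempt in range(retries + 1):
        try:
            return execute(sql)
        except Exception as exc:
            message = str(exc).lower()
            retryable = any(m in message for m in RETRYABLE_MESSAGES)
            if not retryable or attempt == retries:
                raise
            time.sleep(delay_seconds)

A real implementation would presumably hook into the adapter's connection manager rather than wrapping individual calls like this.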

Steps To Reproduce

Because these issues happen randomly, it's difficult to list steps that will consistently reproduce them.

Expected behavior

If any of the listed errors occurs, the connector should retry the given model X times, where X is ideally defined in the profile, as it is for the BigQuery adapter (https://docs.getdbt.com/reference/warehouse-profiles/bigquery-profile/#retries).

Screenshots and log output

N/A

System information

The output of dbt --version:

installed version: 1.0.1
   latest version: 1.0.1

Up to date!

Plugins:
  - spark: 1.0.0

The operating system you're using:
PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"
NAME="Debian GNU/Linux"
VERSION_ID="11"
VERSION="11 (bullseye)"
VERSION_CODENAME=bullseye
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"

The output of python --version:
Python 3.9.10

Additional context

N/A

@grindheim added the bug and triage labels on Feb 17, 2022
The github-actions bot changed the title from "Implement retryable errors for Spark/Databricks" to "[CT-248] Implement retryable errors for Spark/Databricks" on Feb 17, 2022
@pgoslatara

Hey @grindheim, is there a reason this issue has been closed? I'm currently using dbt-spark, and the lack of retry logic is definitely something I'd like to see addressed, so I'm wondering whether a discussion elsewhere led to this issue being closed.

@grindheim
Contributor Author

grindheim commented Mar 15, 2022

@pgoslatara No, there was just no feedback at all for 26 days, so I figured I'd close it and possibly reopen it for dbt-databricks. But reopening it now.

@grindheim grindheim reopened this Mar 15, 2022
@pgoslatara

Thanks @grindheim! I'd really like to see this implemented. I'm not sure I have the time or the knowledge to undertake it myself, but keeping this issue open for now may allow someone else to jump in with a solution.

@jtcohen6
Contributor

@grindheim Thanks for opening the issue, and @pgoslatara thanks for the prompt to keep it open! Apologies for the delay in response from us. We're revamping the way we triage and maintain adapter plugin repositories. I think the topic is a good one; intermittent errors frustrate many users, and implementing retry on the adapter/connection level is the right way to go.

The question is whether these errors are cropping up while:

  • opening initial connections
  • executing queries

We already have "naive" retry implemented for initial connection opening, in two different ways, if connect_retries and connect_timeout are set in the connection profile:

  • A retry_all config, to retry on any error raised while opening the connection
  • A set of retryable messages, to retry even when retry_all: False, which right now has handling only for terminated clusters while they resume:

from typing import Optional

def _is_retryable_error(exc: Exception) -> Optional[str]:
    message = getattr(exc, 'message', None)
    if message is None:
        return None
    message = message.lower()
    if 'pending' in message:
        return exc.message
    if 'temporarily_unavailable' in message:
        return exc.message
    return None

We don't have any retry implemented during query execution. I think adding a set of retryable errors is a good idea, if we can confirm that they are consistently intermittent for all users. If that guarantee of consistency proves impossible, we could also pursue the approach recommended in dbt-labs/dbt-core#3303, whereby users can "bring their own" retryable error statuses (list of exceptions defined in profiles.yml).
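
As a rough illustration of that "bring your own" idea, a matcher like the one below could be driven by a user-supplied list instead of hard-coded strings; the function name and the retryable_errors parameter are hypothetical, and no such profile setting exists today:

from typing import List, Optional

def is_user_retryable_error(exc: Exception, retryable_errors: List[str]) -> Optional[str]:
    """Return the error message if it matches a user-configured pattern, else None."""
    message = getattr(exc, "message", None) or str(exc)
    lowered = message.lower()
    for pattern in retryable_errors:
        if pattern.lower() in lowered:
            return message
    return None

# e.g. with patterns loaded from a (hypothetical) retryable_errors list in profiles.yml:
# is_user_retryable_error(exc, ["unable to alter table", "protocolchangedexception"])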

@github-actions
Contributor

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.
