Snowflake queueing causes keypair auth'd connections to fail #14

Closed
1 task done
drewbanin opened this issue May 22, 2020 · 7 comments
Labels
type:enhancement New feature or request

Comments

@drewbanin
Contributor

Describe the feature

Snowflake keypair connections have 10 seconds to execute before the embedded JWT token expires. If many concurrent queries are opened against Snowflake when using keypair auth, Snowflake can fail to open connections with this error:

Failed to connect to DB: db.snowflakecomputing.com:443. JWT token is invalid.

I don't think there's much we can do on dbt's side to prevent this, but we can consider the following:

  • catch this error and retry opening the connection after some timeout
  • provide better tooling for users to understand how queueing affects dbt invocations
    • This could be as simple as a handy Snowflake query you can run to analyze concurrently running dbt queries over time
    • It would be awesome to include queued time in the logs or run_results.json file if Snowflake makes it available to us
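The first bullet above (catch the error and retry) could be sketched roughly as follows. This is an illustrative sketch only, not dbt's implementation: `is_jwt_error`, `connect_with_retry`, and the backoff parameters are all hypothetical names, and the only assumption taken from the issue is that the failure surfaces as an error message containing "JWT token is invalid".

```python
import time


def is_jwt_error(exc):
    # Hypothetical check, based on the error text reported in this issue.
    return "JWT token is invalid" in str(exc)


def connect_with_retry(open_connection, retries=3, backoff=1.0):
    """Call open_connection(), retrying on JWT-expiry errors with
    exponential backoff. Sketch only; not dbt's actual connection logic."""
    for attempt in range(retries + 1):
        try:
            return open_connection()
        except Exception as exc:
            # Give up if we're out of retries or it's some other error.
            if attempt == retries or not is_jwt_error(exc):
                raise
            time.sleep(backoff * (2 ** attempt))
```

A real implementation would likely catch the connector's specific exception class rather than bare `Exception`, and cap the total retry budget.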

Steps To Reproduce

  1. Configure a Snowflake connection with keypair auth
  2. Create a sample project with 64 models in it
  3. Run dbt run --threads 64
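The steps above can be scripted; the model contents here are arbitrary placeholders, and the script assumes you run it from the root of an existing dbt project:

```shell
# Create 64 trivial models (placeholder SQL, any valid select works)
mkdir -p models
for i in $(seq 1 64); do
  echo "select ${i} as id" > "models/model_${i}.sql"
done
# Then, with a keypair-auth'd Snowflake profile configured:
# dbt run --threads 64
```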

Expected behavior

TBD. We can address this issue with:

  • Better docs and tooling
  • Retries
  • More information in dbt log output

System information

Which database are you using dbt with?

  • snowflake

The output of dbt --version:
Any

The operating system you're using:
Any

The output of python --version:
Any

@cgkirici commented Dec 1, 2020

Hi Drew,

I'm getting this error even when running 1 model with 1 thread, and even when using dbt debug to test the connection, which makes me think the issue might not only be a case of running many concurrent queries. Or am I missing something?

Also, the Snowflake documentation (link) says this could be because the "Client machine's time is not synced with NTP server, and caused JWT Token to become invalid due to a token TTL timeout". Could that be a potential reason?

@drewbanin
Contributor Author

hey @cgEarth - is it possible that there are a lot of queries running against the warehouse that are not executed by dbt? I think the operative thing to investigate is whether the Snowflake database is so overloaded that any query generated by dbt would block for longer than the ~10 second window during which tokens are still valid.

I actually haven't heard anything about this one since this issue was opened ~6 months ago. I do think that time desynchronization could be a likely culprit as well, or there could be some other variable we're missing here.

@tnightengale
@drewbanin How would you feel about adding a retry option for the Snowflake connection more generally? I see that the BigQuery connection recently got retries on specific errors (https://github.com/fishtown-analytics/dbt/pull/2694/files) :)

Our need is that sometimes Snowflake Shares (shared database objects) from external orgs are dropped/missing momentarily, and unfortunately it coincides with our runs. If you agree that broadening retry support would be helpful for anyone experiencing transient errors, I'd be happy to submit a PR.

@jtcohen6
Contributor

@tnightengale I'm open to that! There's a more recent issue (dbt-labs/dbt-core#3303) discussing potential for retry on transient errors. I think that might be a good place to talk more about the errors associated with missing Snowflake shares; our ability to identify them as a transient error; and developing retry logic more generically.

@JoshuaHuntley
Contributor

@jtcohen6 Curious if there has been any additional information on this issue. My client is seeing these JWT issues pop up and our investigation has yielded no leads.

Snowflake load does not seem to predict the JWT failure issue. We've seen it happen at times when warehouse load is low with no queues, while at other times, when warehouse load has been high with longer queues, the issue has not happened.

We've tried clusters, separating different jobs out onto different warehouses, and playing with the number of threads. There seems to be no rhyme or reason to it.

I'm also on dbt Slack if a more active conversation might be beneficial.

@jtcohen6
Contributor

jtcohen6 commented Oct 4, 2021

@JoshuaHuntley No additional information from my end as far as getting to the bottom of this one. Ultimately, we're at the whims of Snowflake's internal queueing and the information available from snowflake-connector-python.

There's definitely interest in adding more retry logic for transient/timeout errors across several adapters. That seems like our only point of intervention for handling this bug as well. I haven't thought out the exact implementation details, but I imagine it might look like reimplementing the base execute method within the dbt-snowflake connections module to catch this error and resubmit the query.
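A rough sketch of what that reimplemented execute could look like. The class and method names here are illustrative, not dbt-snowflake's actual internals; the only detail taken from the thread is matching on the "JWT token is invalid" error text.

```python
class RetryingCursorWrapper:
    """Illustrative wrapper around a DB-API-style cursor that resubmits a
    query when Snowflake rejects it with an expired-JWT error. A sketch of
    the idea only, not dbt-snowflake's real connection manager."""

    def __init__(self, cursor, max_retries=1):
        self.cursor = cursor
        self.max_retries = max_retries

    def execute(self, sql, bindings=None):
        attempts = 0
        while True:
            try:
                return self.cursor.execute(sql, bindings)
            except Exception as exc:
                attempts += 1
                # Re-raise anything that isn't the JWT error, or once
                # the retry budget is exhausted.
                if attempts > self.max_retries or "JWT token is invalid" not in str(exc):
                    raise
```

In practice the resubmission would probably also need to reopen the connection (so a fresh JWT is minted) rather than reuse the same cursor, but the catch-and-resubmit shape would be similar.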

@jtcohen6 jtcohen6 transferred this issue from dbt-labs/dbt-core Oct 12, 2021
@jtcohen6 jtcohen6 added the type:enhancement New feature or request label Oct 12, 2021
@jtcohen6
Contributor

Resolved by #6
