
[CT-858] [Enhancement] Connection is always closed after each query #5489

Open

joshuataylor opened this issue Jul 19, 2022 · 10 comments
Labels
enhancement New feature or request Team:Adapters Issues designated for the adapter area of the code


joshuataylor (Contributor) commented Jul 19, 2022

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

Every time a query is executed, the connection is then closed. This occurs regardless of thread count (tested from 1 up to 16 threads).

When using dbt-snowflake, this forces a re-login every time you issue a query. If you have hundreds of models, this causes a massive slowdown, because authentication to Snowflake is slow, especially when you are far from the server (Perth, AU -> US East 1 is ~250 ms round trip, for example, so reconnecting on every query across hundreds of models is unpleasant).

I have logged an issue against dbt-snowflake here (dbt-labs/dbt-snowflake#201), but I believe this needs to be addressed at the dbt-core level.

Expected Behavior

A single connection is made and reused. It would be even better if the Snowflake login request were made in a single thread and the resulting session reused by all threads; I think that would also fix MFA, but that is out of scope.

Steps To Reproduce

  1. Run dbt-snowflake
  2. Set threads to, say, 2
  3. Have 4 queries and execute them all
  4. Run this query:
select *
from table(information_schema.login_history_by_user())
order by event_timestamp desc;

You can see it in the logs as well:

10:24:59.898232 [debug] [ThreadPool]: Opening a new connection, currently in state closed

Relevant log output

10:24:59.898232 [debug] [ThreadPool]: Opening a new connection, currently in state closed

Environment

- OS: Linux, Mac
- Python: 3.10.4/3.9
- dbt: 1.1.1

What database are you using dbt with?

snowflake

Additional Context

No response

@joshuataylor joshuataylor added bug Something isn't working triage labels Jul 19, 2022
@github-actions github-actions bot changed the title [Bug] Connection is always closed after each query [CT-858] [Bug] Connection is always closed after each query Jul 19, 2022
jtcohen6 (Contributor)

Thanks for opening here as well @joshuataylor!

I think you're right, the behavior going on here is defined in dbt-core, and could be relevant to other adapters as well. It does seem to yield the trickiest complications on Snowflake.

Today, during a dbt run, dbt opens separate connections for:

  • the caching (metadata) queries it runs at the start of dbt run (connection named 'master')
  • each model it compiles/runs

It then closes each connection as it completes.

The goal here would be to reuse/recycle more connections while dbt is running, while still guaranteeing that, at the end of a run, dbt always closes all of its connections. (In an ideal case, we'd also handle authentication in a single thread and be able to reuse that auth across multiple threads, but that feels like it might be out of scope for this effort.)

At its very simplest, the idea would be:

  • Don't close connections once they complete, but mark them as "done"
  • Instead of creating new connections, set_connection_name should try to grab a "done" connection from the pool, rename it, and use it (a rough sketch of this shape follows below)
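
Purely to illustrate that shape, a minimal sketch (hypothetical names throughout; this is not dbt's actual implementation, and open_fn stands in for whatever opens a real connection):

import threading

class ReusableConnectionPool:
    """Recycle "done" connections instead of closing them per node."""

    def __init__(self, open_fn):
        self._open_fn = open_fn       # callable that opens a fresh connection
        self._done = []               # finished, still-open connections
        self._lock = threading.Lock()

    def acquire(self, name):
        # Prefer recycling a "done" connection over re-authenticating.
        with self._lock:
            conn = self._done.pop() if self._done else self._open_fn()
        conn.name = name              # rename it for the node now using it
        return conn

    def release(self, conn):
        # Mark as "done" instead of closing, so the next node can reuse it.
        with self._lock:
            self._done.append(conn)

    def cleanup(self):
        # At the very end of the run, dbt must still close everything.
        with self._lock:
            while self._done:
                self._done.pop().close()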

I think this is likely to be a big lift, requiring a deep dive into some internals of dbt's adapter and execution classes that we haven't touched in some time. I'm not sure when we'll be able to prioritize it. I agree that it feels important.

Relevant code

The context managers. The default behavior is to release a connection as soon as it's done being used.

@contextmanager
def connection_named(
    self, name: str, node: Optional[CompileResultNode] = None
) -> Iterator[None]:
    try:
        if self.connections.query_header is not None:
            self.connections.query_header.set(name, node)
        self.acquire_connection(name)
        yield
    finally:
        self.release_connection()
        if self.connections.query_header is not None:
            self.connections.query_header.reset()

@contextmanager
def connection_for(self, node: CompileResultNode) -> Iterator[None]:
    with self.connection_named(node.unique_id, node):
        yield

Called once for each node that compiles/runs:

with self.adapter.connection_for(self.node):

"Releasing" a connection, which actually means closing it:

###
# Methods that pass through to the connection manager
###
def acquire_connection(self, name=None) -> Connection:
    return self.connections.set_connection_name(name)

def release_connection(self) -> None:
    self.connections.release()

def release(self) -> None:
    with self.lock:
        conn = self.get_if_exists()
        if conn is None:
            return
        try:
            # always close the connection. close() calls _rollback() if there
            # is an open transaction
            self.close(conn)
        except Exception:
            # if rollback or close failed, remove our busted connection
            self.clear_thread_connection()
            raise

For a connection named conn_name, check whether an existing connection by that name is available; otherwise open a new one:

def set_connection_name(self, name: Optional[str] = None) -> Connection:
    conn_name: str
    if name is None:
        # if a name isn't specified, we'll re-use a single handle
        # named 'master'
        conn_name = "master"
    else:
        if not isinstance(name, str):
            raise dbt.exceptions.CompilerException(
                f"For connection name, got {name} - not a string!"
            )
        assert isinstance(name, str)
        conn_name = name

    conn = self.get_if_exists()
    if conn is None:
        conn = Connection(
            type=Identifier(self.TYPE),
            name=None,
            state=ConnectionState.INIT,
            transaction_open=False,
            handle=None,
            credentials=self.profile.credentials,
        )
        self.set_thread_connection(conn)

    if conn.name == conn_name and conn.state == "open":
        return conn

    fire_event(NewConnection(conn_name=conn_name, conn_type=self.TYPE))

    if conn.state == "open":
        fire_event(ConnectionReused(conn_name=conn_name))
    else:
        conn.handle = LazyHandle(self.open)

    conn.name = conn_name
    return conn

Cleanup that happens at the very end of runnable tasks:

finally:
    adapter.cleanup_connections()

joshuataylor (Contributor, Author)

I'll have a dig through and see if I can find an elegant solution that hopefully (🤞) won't impact connections for adapters such as Spark.

As an alternative, in dbt-snowflake we could also check whether the connection is closed and the token is still valid, then reuse the connection, since the token should still be set on the handle.

As another alternative, if we could solve this at the dbt-snowflake level in the interim, it would be a big speed win. We could capture the token at login and store it on the connection. This would involve updating the connection contract, though, perhaps adding a metadata or other key that connectors could use?
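
As a rough sketch of that interim idea (every name below is hypothetical; authenticate() and open_with_token() are invented for illustration, not real dbt or Snowflake connector APIs):

_token_cache = {}

def open_connection(credentials):
    key = (credentials.account, credentials.user)
    token = _token_cache.get(key)
    if token is None or token.is_expired():
        token = authenticate(credentials)  # the slow round trip (and any MFA prompt)
        _token_cache[key] = token
    # Later connections reuse the cached token and skip the login handshake.
    return open_with_token(credentials, token)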

jtcohen6 (Contributor)

@joshuataylor Ah - so, rather than reusing connections, just reusing the result of authentication?

I don't have a good sense of whether there are any potential security risks in taking that approach. If it works, though, and substantially speeds up the process of opening new connections, then our current dbt-core approach (treating connections as a commodity) might pass muster for the foreseeable future.

I'll reopen the dbt-snowflake issue with that scope in mind, since the changes would be specific to that codebase.

joshuataylor (Contributor, Author)

Yes, for now if we can just reuse the token between requests that should be fine.

We still need to make an HTTP request to Snowflake anyway, but using a keep-alive connection would be faster, as we wouldn't have to handshake again. But we can leave that for later.
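
For illustration, this is the kind of reuse a shared requests.Session gives in plain Python: it pools connections per host, so repeated requests skip the TCP/TLS handshake. (The endpoint here is made up; the real Snowflake connector manages its own HTTP sessions internally.)

import requests

session = requests.Session()  # reuses the underlying TCP/TLS connection
for sql in ("select 1", "select 2"):
    # Hypothetical endpoint, purely to show keep-alive reuse across calls.
    session.post("https://example.invalid/query", json={"sql": sql})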

@leahwicz leahwicz added enhancement New feature or request and removed bug Something isn't working labels Jul 19, 2022
@leahwicz leahwicz changed the title [CT-858] [Bug] Connection is always closed after each query [CT-858] [Enhancement] Connection is always closed after each query Jul 19, 2022
github-actions (bot)

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

@github-actions github-actions bot added the stale Issues that have gone stale label Jan 16, 2023
github-actions (bot)

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Jan 24, 2023
@VersusFacit VersusFacit reopened this Feb 1, 2023
@github-actions github-actions bot removed the stale Issues that have gone stale label Feb 1, 2023
jmasonlee

We are running into this issue with dbt on Redshift, and it makes building models that are separated out for modularity annoying, since each model needs to create a connection to Redshift when it runs. If I have 5 separate models that each need a connection, this adds a significant time cost compared to a single model that needs only one.

I'm wondering if there are any plans to prioritize this?

joshuataylor (Contributor, Author)

Could dbt-labs/dbt-snowflake#428 be used? This has been working out great.

I also have a local dev version that does a few tricks to cache etc., but my hacks should only be used in development :).

ChiQuang98 commented Oct 10, 2023

Hi @jtcohen6,
I'm a little confused about this part:

Today, during a dbt run, dbt opens separate connections for:

  • the caching (metadata) queries it runs at the start of dbt run (connection named 'master')
  • each model it compiles/runs

Could you please help me confirm this? If I use dbt run to execute a model that refers to other objects, does it still use only one connection, even when it retrieves data from those other objects?
Thank you in advance.

denised commented Jun 5, 2024

Adding a "We're still interested" note to this item.
Right now we are more or less sidelined by the inability to reuse MFA results across multiple queries for dbt.
