
[CT-858] [Enhancement] Connection is always closed after each query #5489

Open

joshuataylor opened this issue Jul 19, 2022 · 10 comments
Labels
enhancement New feature or request Team:Adapters Issues designated for the adapter area of the code


joshuataylor (Contributor) commented Jul 19, 2022

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

Every time a query is executed, the connection is then closed. This occurs regardless of thread count (tested from 1 up to 16 threads).

When using dbt-snowflake, this forces a re-login every time you issue a query. If you have hundreds of models, this causes a massive slowdown, because authentication to Snowflake is slow, especially when you are far from the server (Perth, AU -> US East 1 is ~250 ms round trip, for example, so reconnecting on every query across hundreds of models is unpleasant).

I have logged an issue against dbt-snowflake here (dbt-labs/dbt-snowflake#201), but I believe this needs to be addressed at the dbt-core level.

Expected Behavior

A single connection is made and reused. It would be even better if the Snowflake login request were made in a single thread and the resulting session reused by all threads; I think that would also fix MFA, but that is out of scope.

Steps To Reproduce

  1. Run dbt-snowflake
  2. Set threads to, say, 2
  3. Have 4 queries and execute them all
  4. Run this query:
select *
from table(information_schema.login_history_by_user())
order by event_timestamp desc;

You can see it in the logs as well:

10:24:59.898232 [debug] [ThreadPool]: Opening a new connection, currently in state closed

Relevant log output

10:24:59.898232 [debug] [ThreadPool]: Opening a new connection, currently in state closed

Environment

- OS: Linux, Mac
- Python: 3.10.4/3.9
- dbt: 1.1.1

What database are you using dbt with?

snowflake

Additional Context

No response

@joshuataylor joshuataylor added bug Something isn't working triage labels Jul 19, 2022
@github-actions github-actions bot changed the title [Bug] Connection is always closed after each query [CT-858] [Bug] Connection is always closed after each query Jul 19, 2022
jtcohen6 (Contributor)

Thanks for opening here as well @joshuataylor!

I think you're right, the behavior going on here is defined in dbt-core, and could be relevant to other adapters as well. It does seem to yield the trickiest complications on Snowflake.

Today, during a dbt run, dbt opens separate connections for:

  • the caching (metadata) queries it runs at the start of dbt run (connection named 'master')
  • each model it compiles/runs

It then closes each connection as it completes.

The goal here would be to reuse/recycle more connections while dbt is running, while still guaranteeing that, at the end of a run, dbt always closes all of its connections. (In an ideal case, we'd also handle authentication in a single thread and be able to reuse that auth across multiple threads, but that feels like it might be out of scope for this effort.)

At its very simplest, the idea would be:

  • Don't close connections once they complete, but mark them as "done"
  • Instead of creating new connections, set_connection_name should try to grab a "done" connection from the pool, rename it, and use it (a rough sketch of this shape follows below)
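
Purely to illustrate that shape, a minimal sketch (hypothetical names throughout; this is not dbt's actual implementation, and open_fn stands in for whatever opens a real connection):

import threading

class ReusableConnectionPool:
    """Recycle "done" connections instead of closing them per node."""

    def __init__(self, open_fn):
        self._open_fn = open_fn       # callable that opens a fresh connection
        self._done = []               # finished, still-open connections
        self._lock = threading.Lock()

    def acquire(self, name):
        # Prefer recycling a "done" connection over re-authenticating.
        with self._lock:
            conn = self._done.pop() if self._done else self._open_fn()
        conn.name = name              # rename it for the node now using it
        return conn

    def release(self, conn):
        # Mark as "done" instead of closing, so the next node can reuse it.
        with self._lock:
            self._done.append(conn)

    def cleanup(self):
        # At the very end of the run, dbt must still close everything.
        with self._lock:
            while self._done:
                self._done.pop().close()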

I think this is likely to be a big lift, requiring a deep dive into some internals of dbt's adapter and execution classes that we haven't touched in some time. I'm not sure when we'll be able to prioritize it. I agree that it feels important.

Relevant code

The context managers. The default behavior is to release a connection as soon as it's done being used.

@contextmanager
def connection_named(
    self, name: str, node: Optional[CompileResultNode] = None
) -> Iterator[None]:
    try:
        if self.connections.query_header is not None:
            self.connections.query_header.set(name, node)
        self.acquire_connection(name)
        yield
    finally:
        self.release_connection()
        if self.connections.query_header is not None:
            self.connections.query_header.reset()

@contextmanager
def connection_for(self, node: CompileResultNode) -> Iterator[None]:
    with self.connection_named(node.unique_id, node):
        yield

Called once for each node that compiles/runs:

with self.adapter.connection_for(self.node):

"Releasing" a connection, which actually means closing it:

###
# Methods that pass through to the connection manager
###
def acquire_connection(self, name=None) -> Connection:
    return self.connections.set_connection_name(name)

def release_connection(self) -> None:
    self.connections.release()

def release(self) -> None:
    with self.lock:
        conn = self.get_if_exists()
        if conn is None:
            return
        try:
            # always close the connection. close() calls _rollback() if there
            # is an open transaction
            self.close(conn)
        except Exception:
            # if rollback or close failed, remove our busted connection
            self.clear_thread_connection()
            raise

For a connection named conn_name, check whether an existing connection by that name is available; otherwise open a new one:

def set_connection_name(self, name: Optional[str] = None) -> Connection:
    conn_name: str
    if name is None:
        # if a name isn't specified, we'll re-use a single handle
        # named 'master'
        conn_name = "master"
    else:
        if not isinstance(name, str):
            raise dbt.exceptions.CompilerException(
                f"For connection name, got {name} - not a string!"
            )
        assert isinstance(name, str)
        conn_name = name

    conn = self.get_if_exists()
    if conn is None:
        conn = Connection(
            type=Identifier(self.TYPE),
            name=None,
            state=ConnectionState.INIT,
            transaction_open=False,
            handle=None,
            credentials=self.profile.credentials,
        )
        self.set_thread_connection(conn)

    if conn.name == conn_name and conn.state == "open":
        return conn

    fire_event(NewConnection(conn_name=conn_name, conn_type=self.TYPE))

    if conn.state == "open":
        fire_event(ConnectionReused(conn_name=conn_name))
    else:
        conn.handle = LazyHandle(self.open)

    conn.name = conn_name
    return conn

Cleanup that happens at the very end of runnable tasks:

finally:
    adapter.cleanup_connections()

joshuataylor (Contributor, Author)

I'll have a dig through and see if I can find an elegant solution that hopefully (🤞) won't impact connections for adapters such as Spark.

As an alternative, in dbt-snowflake we could also check whether the connection is closed and the token is still valid, then reuse the connection, since the token should still be set on the handle.

As another alternative, if we could solve this at the dbt-snowflake level in the interim, it would be a big speed win. We could capture the token at login and store it on the connection. This would involve updating the connection contract, though, perhaps adding a metadata or other key that connectors could use?
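
As a rough sketch of that interim idea (every name below is hypothetical; authenticate() and open_with_token() are invented for illustration, not real dbt or Snowflake connector APIs):

_token_cache = {}

def open_connection(credentials):
    key = (credentials.account, credentials.user)
    token = _token_cache.get(key)
    if token is None or token.is_expired():
        token = authenticate(credentials)  # the slow round trip (and any MFA prompt)
        _token_cache[key] = token
    # Later connections reuse the cached token and skip the login handshake.
    return open_with_token(credentials, token)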

jtcohen6 (Contributor)

@joshuataylor Ah - so, rather than reusing connections, just reusing the result of authentication?

I don't have a good sense of whether there are any potential security risks in taking that approach. If it works, though, and substantially speeds up the process of opening new connections, then our current dbt-core approach (treating connections as a commodity) might pass muster for the foreseeable future.

I'll reopen the dbt-snowflake issue with that scope in mind, since the changes would be specific to that codebase.

joshuataylor (Contributor, Author)

Yes, for now if we can just reuse the token between requests that should be fine.

We still need to make an HTTP request to Snowflake anyway, but using a keep-alive connection would be faster, as we wouldn't have to handshake again. But we can leave that for later.
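
For illustration, this is the kind of reuse a shared requests.Session gives in plain Python: it pools connections per host, so repeated requests skip the TCP/TLS handshake. (The endpoint here is made up; the real Snowflake connector manages its own HTTP sessions internally.)

import requests

session = requests.Session()  # reuses the underlying TCP/TLS connection
for sql in ("select 1", "select 2"):
    # Hypothetical endpoint, purely to show keep-alive reuse across calls.
    session.post("https://example.invalid/query", json={"sql": sql})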

@leahwicz leahwicz added enhancement New feature or request and removed bug Something isn't working labels Jul 19, 2022
@leahwicz leahwicz changed the title [CT-858] [Bug] Connection is always closed after each query [CT-858] [Enhancement] Connection is always closed after each query Jul 19, 2022
github-actions (bot)

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

@github-actions github-actions bot added the stale Issues that have gone stale label Jan 16, 2023
github-actions (bot)

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Jan 24, 2023
@VersusFacit VersusFacit reopened this Feb 1, 2023
@github-actions github-actions bot removed the stale Issues that have gone stale label Feb 1, 2023
jmasonlee

We are running into this issue with dbt on Redshift, and it makes building models that are separated out for modularity annoying, since each model needs to create a connection to Redshift when it runs. If I have 5 separate models that each need a connection, this adds a significant time cost compared to a single model that needs only one.

I'm wondering if there are any plans to prioritize this?

joshuataylor (Contributor, Author)

Could dbt-labs/dbt-snowflake#428 be used? This has been working out great.

I also have a local dev version that does a few tricks to cache etc., but my hacks should only be used in development :).

ChiQuang98 commented Oct 10, 2023

Hi @jtcohen6,
I'm a little confused about this part:

Today, during a dbt run, dbt opens separate connections for:

  • the caching (metadata) queries it runs at the start of dbt run (connection named 'master')
  • each model it compiles/runs

Could you please help me confirm this? If I use dbt run to execute a model that refers to other objects, does it still use only one connection, even when it retrieves data from those other objects?
Thank you in advance.

denised commented Jun 5, 2024

Adding a "We're still interested" note to this item.
Right now we are more or less sidelined by the inability to reuse MFA results across multiple queries for dbt.
