Recover gracefully from laptop sleep #1558

enocom · 2023-09-28T17:13:09Z

Right now if a laptop goes to sleep with an active connection to a Cloud SQL instance, the socket factory won't recover on its own once the laptop has been restored.

There are limitations in Java to handling this:

https://stackoverflow.com/questions/52876556/how-does-java-calculate-sleep-time-when-pc-goes-into-hibernate-mode

Nonetheless, we should throw out a bad connection and force a refresh, such that the socket factory recovers gracefully from sleep.

See https://github.com/luryus/cloud-sql-jdbc-hang for a reproduction of this issue.

Related to GoogleCloudPlatform/cloud-sql-proxy#1788.

Whatever we do here, we should port to the AlloyDB Java Connector as well.

enocom · 2023-09-28T17:22:05Z

cc @luryus

This is a problem across Connectors and we do want to improve support for sleep.

luryus · 2023-12-20T07:17:20Z

Hi! I noticed that this (or a very similar issue) got fixed in the Go connector: GoogleCloudPlatform/cloud-sql-go-connector#686

Any chance this could now get ported to this Java library as well?

annafang-google · 2023-12-20T13:42:53Z

Ack. We'll port the fix over shortly.

enocom · 2023-12-20T16:43:16Z

I think this change should match the Go connector -- checking for an invalid cert and blocking until one is available if the cert is in fact expired.

luryus · 2024-02-18T07:50:17Z

Hi @ttosta-google ,

it seems that this is still not fixed in v1.16.0 and with #1771.

Even though the library now immediately starts a refresh when it notices that the current cert is expired, the force refresh does not succeed. Logs indicate that it gets stuck in the refresh rate limiter somehow. The library logs "Acquiring rate limiter permit.", but not "Rate limiter permit acquired". All subsequent connection retries then just get the good old bad_certificate errors.

This reproduces with both my reproducer project and DBeaver.

Here's sample logs:

Feb 17, 2024 1:33:22 PM com.google.cloud.sql.core.Refresher handleRefreshResult
FINE: [<instance name>] Refresh Operation: Completed refresh with new certificate expiration at 2024-02-17T12:33:21Z.
Feb 17, 2024 1:33:22 PM com.google.cloud.sql.core.Refresher handleRefreshResult
FINE: [<instance name>] Refresh Operation: Next operation scheduled at 2024-02-17T12:29:20Z.
Feb 17, 2024 1:33:37 PM com.lkoskela.googlejdbchang.App main
FINE: Connection is still valid
Feb 17, 2024 1:33:57 PM com.lkoskela.googlejdbchang.App main
FINE: Connection is still valid

...
<sleep for more than an hour>
...

Feb 17, 2024 3:48:29 PM org.postgresql.jdbc.PgConnection isValid
FINE: Validating connection.
org.postgresql.util.PSQLException: This connection has been closed.
        at org.postgresql.jdbc.PgConnection.checkClosed(PgConnection.java:993)
        at org.postgresql.jdbc.PgConnection.setNetworkTimeout(PgConnection.java:1694)
        at org.postgresql.jdbc.PgConnection.isValid(PgConnection.java:1524)
        at com.lkoskela.googlejdbchang.App.main(App.java:56)

Feb 17, 2024 3:48:29 PM com.lkoskela.googlejdbchang.App main
INFO: Connection no longer valid
Feb 17, 2024 3:48:30 PM com.lkoskela.googlejdbchang.App main
INFO: Opening connection...
Feb 17, 2024 3:48:30 PM org.postgresql.Driver connect
FINE: Connecting with URL: jdbc:postgresql:///<redacted>
Feb 17, 2024 3:48:30 PM org.postgresql.jdbc.PgConnection <init>
FINE: PostgreSQL JDBC Driver 42.6.0
Feb 17, 2024 3:48:30 PM org.postgresql.jdbc.PgConnection setDefaultFetchSize
FINE:   setDefaultFetchSize = 0
Feb 17, 2024 3:48:30 PM org.postgresql.jdbc.PgConnection setPrepareThreshold
FINE:   setPrepareThreshold = 5
Feb 17, 2024 3:48:30 PM org.postgresql.core.v3.ConnectionFactoryImpl openConnectionImpl
FINE: Trying to establish a protocol version 3 connection to :5432
Feb 17, 2024 3:48:30 PM com.google.cloud.sql.core.Refresher refreshIfExpired
FINE: [<instance name>] Client certificate has expired. Starting next refresh operation immediately.
Feb 17, 2024 3:48:30 PM com.google.cloud.sql.core.Refresher forceRefresh
FINE: [<instance name>] Force Refresh: the next refresh operation was cancelled. Scheduling new refresh operation immediately.
Feb 17, 2024 3:48:30 PM com.google.cloud.sql.core.Refresher startRefreshAttempt
FINE: [<instance name>] Refresh Operation: Acquiring rate limiter permit.
Feb 17, 2024 3:48:30 PM jdk.internal.event.EventHelper logX509CertificateEvent
<redacted>
Feb 17, 2024 3:48:30 PM jdk.internal.event.EventHelper logTLSHandshakeEvent
FINE:  TLSHandshake: <redacted>
Feb 17, 2024 3:48:30 PM org.postgresql.core.v3.ConnectionFactoryImpl tryConnect
FINE: Receive Buffer Size is 65,536
Feb 17, 2024 3:48:30 PM org.postgresql.core.v3.ConnectionFactoryImpl tryConnect
FINE: Send Buffer Size is 65,536
Feb 17, 2024 3:48:30 PM org.postgresql.Driver connect
FINE: Connection error:
org.postgresql.util.PSQLException: The connection attempt failed.
        at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:354)
        at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:54)
        at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:263)
        at org.postgresql.Driver.makeConnection(Driver.java:443)
        at org.postgresql.Driver.connect(Driver.java:297)
        at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:681)
        at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:190)
        at com.lkoskela.googlejdbchang.App.main(App.java:54)
Caused by: javax.net.ssl.SSLHandshakeException: Received fatal alert: bad_certificate
        at java.base/sun.security.ssl.Alert.createSSLException(Alert.java:131)

ttosta-google · 2024-02-20T14:57:00Z

Thanks for the info @luryus! I've reopened this issue and will investigate it further.

enocom · 2024-02-20T17:36:54Z

This is another example where our rate limiting approach to refreshes doesn't work so well.

The rate limiter should be only wait 30 seconds to try again, though.

luryus · 2024-04-08T12:00:21Z

Hi, any updates on this? With 1.17.1 this still happens. I just retested this with my test program and it seems that after the pc has been asleep, the rate limiter lets the force refresh succeed only after a long time - probably only after the duration for the next scheduled refresh is reached. Until that the connection attempts fail with the errors I pasted in my previous message, even if I wait more than 30 seconds between each attempt.

It looks like a thread pool is starved or something, and that's why the refresh operations get stuck in the rate limiter.

Please take a look at this again - I run to this almost daily while using DBeaver and it's super annoying.

enocom · 2024-04-10T17:42:58Z

We've recently fixed a similar issue in the Proxy. For now I'd recommend connecting through the Proxy (which handles sleep without issue). And meanwhile, we've moved this up in our list of priorities and will investigate soon.

luryus · 2024-08-29T05:44:43Z

This issue seems to be fixed by switching to the lazy refresh strategy. Makes sense - Windows sleeping disturbs timers just like idling on serverless platforms does, so avoiding them entirely fixes the problem.

There's a slight delay when the certificates are refreshed after I wake up my laptop and make a query in DBeaver, but that's not a problem at all.

Perhaps the docs could be updated to cover this use case?

jackwotherspoon · 2024-08-29T12:44:44Z

@luryus I would agree. Probably a good idea for us to update the README's Lazy Refresh section to mention the awaking from sleep use case.

enocom added priority: p2 Moderately-important priority. Fix may not be included in next release. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. labels Sep 28, 2023

blunderbuss-gcf bot assigned ttosta-google Sep 28, 2023

product-auto-label bot added the api: cloudsql label Sep 28, 2023

github-actions bot removed the api: cloudsql label Sep 28, 2023

product-auto-label bot added the api: cloudsql label Sep 29, 2023

github-actions bot removed the api: cloudsql label Oct 2, 2023

enocom added priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. and removed priority: p2 Moderately-important priority. Fix may not be included in next release. labels Oct 2, 2023

product-auto-label bot added the api: cloudsql label Oct 3, 2023

github-actions bot removed the api: cloudsql label Oct 4, 2023

product-auto-label bot added the api: cloudsql label Oct 5, 2023

github-actions bot removed the api: cloudsql label Oct 5, 2023

product-auto-label bot added the api: cloudsql label Oct 6, 2023

github-actions bot removed the api: cloudsql label Oct 6, 2023

product-auto-label bot added the api: cloudsql label Oct 7, 2023

github-actions bot removed the api: cloudsql label Oct 10, 2023

product-auto-label bot added the api: cloudsql label Oct 11, 2023

github-actions bot removed the api: cloudsql label Oct 11, 2023

product-auto-label bot added the api: cloudsql label Oct 12, 2023

github-actions bot removed the api: cloudsql label Oct 12, 2023

product-auto-label bot added the api: cloudsql label Oct 13, 2023

github-actions bot removed the api: cloudsql label Oct 13, 2023

product-auto-label bot added the api: cloudsql label Oct 14, 2023

github-actions bot removed the api: cloudsql label Oct 16, 2023

product-auto-label bot added the api: cloudsql label Oct 17, 2023

github-actions bot removed the api: cloudsql label Oct 17, 2023

product-auto-label bot added the api: cloudsql label Oct 18, 2023

github-actions bot removed the api: cloudsql label Oct 19, 2023

product-auto-label bot added the api: cloudsql label Oct 20, 2023

github-actions bot removed the api: cloudsql label Oct 23, 2023

product-auto-label bot added the api: cloudsql label Oct 24, 2023

github-actions bot removed the api: cloudsql label Oct 24, 2023

product-auto-label bot added the api: cloudsql label Oct 25, 2023

github-actions bot removed the api: cloudsql label Oct 25, 2023

product-auto-label bot added the api: cloudsql label Oct 26, 2023

github-actions bot removed the api: cloudsql label Oct 30, 2023

ttosta-google mentioned this issue Jan 10, 2024

fix: ensure cert refresh recovers from sleep #1771

Merged

ttosta-google closed this as completed in #1771 Jan 20, 2024

ttosta-google mentioned this issue Jan 22, 2024

Recover gracefully from laptop sleep GoogleCloudPlatform/alloydb-java-connector#386

Closed

ttosta-google reopened this Feb 20, 2024

ttosta-google assigned hessjcg and unassigned ttosta-google Apr 9, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Recover gracefully from laptop sleep #1558

Recover gracefully from laptop sleep #1558

enocom commented Sep 28, 2023 •

edited

Loading

enocom commented Sep 28, 2023

luryus commented Dec 20, 2023

annafang-google commented Dec 20, 2023

enocom commented Dec 20, 2023

luryus commented Feb 18, 2024

ttosta-google commented Feb 20, 2024

enocom commented Feb 20, 2024

luryus commented Apr 8, 2024

enocom commented Apr 10, 2024 •

edited

Loading

luryus commented Aug 29, 2024

jackwotherspoon commented Aug 29, 2024

Recover gracefully from laptop sleep #1558

Recover gracefully from laptop sleep #1558

Comments

enocom commented Sep 28, 2023 • edited Loading

enocom commented Sep 28, 2023

luryus commented Dec 20, 2023

annafang-google commented Dec 20, 2023

enocom commented Dec 20, 2023

luryus commented Feb 18, 2024

ttosta-google commented Feb 20, 2024

enocom commented Feb 20, 2024

luryus commented Apr 8, 2024

enocom commented Apr 10, 2024 • edited Loading

luryus commented Aug 29, 2024

jackwotherspoon commented Aug 29, 2024

enocom commented Sep 28, 2023 •

edited

Loading

enocom commented Apr 10, 2024 •

edited

Loading