
Add connection retrying on startup and SIGHUP, Fix #742 #869

Merged
merged 6 commits into PostgREST:master on May 7, 2017

Conversation

steve-chavez
Member

This adds a capped exponential backoff (base is 1 second, cap is 32 seconds; both arbitrarily picked and could be improved) for retrying the db connection in a background thread.

This thread is started at PostgREST startup and leaves the server rejecting all requests with a 503, with no parsing or further processing; it stays in this unavailable state until the connection succeeds. The thread can also be started on SIGHUP.
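
Roughly, the retry policy looks like this (a simplified sketch built from the retry library's combinators; the actual code in the PR differs in details, and checkConnection is just an illustrative stand-in for the connection test):

import Control.Retry (capDelay, exponentialBackoff, retrying)

-- Retry the connection check until it succeeds, sleeping 1s, 2s, 4s, ...
-- between attempts and never more than 32s (delays are in microseconds).
waitForConnection :: IO Bool -> IO Bool
waitForConnection checkConnection =
  retrying policy (\_ connected -> pure (not connected)) (const checkConnection)
  where
    policy = capDelay (32 * 1000000) (exponentialBackoff 1000000)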

I consider PostgREST to be in an available state only once the dbStructure has been successfully obtained, and that's what the connectionWorker works towards. I'll explain why:

At first I had a separate "available" IORef that was set when the connection succeeded, but I ran into issues when the dbStructure was not obtained. For example, PostgREST succeeded with requests like:

GET /clients
GET /clients?select=id,name

But gave an error with requests like:

GET /
GET /clients?select=id,name,projects{*}

From what I've seen, the first requests skipped the sections of the code that operate on dbStructure, so if the dbStructure is not obtained PostgREST is in a broken state. Instead of having two IORefs to maintain, I decided that the dbStructure IORef itself should indicate the available state.
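
To illustrate the resulting behavior, here is a self-contained sketch with placeholder types (not the actual handler code): requests read the dbStructure IORef and answer 503 right away when it is empty.

import Data.IORef (IORef, readIORef)

data DbStructure = DbStructure          -- placeholder for the real schema cache
type Response    = String               -- placeholder for a real WAI response

serviceUnavailable :: Response
serviceUnavailable = "HTTP 503 Service Unavailable"

handleRequest :: IORef (Maybe DbStructure) -> IO Response
handleRequest refDbStructure = do
  maybeDbStructure <- readIORef refDbStructure
  case maybeDbStructure of
    Nothing -> pure serviceUnavailable  -- unavailable: reject without any parsing
    Just _  -> pure "HTTP 200 OK"       -- normal request processing would go here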

I think this is useful as it is but I don't know an easy way to add tests. Thoughts?

@ruslantalpa
Contributor

this gets us to the "set it and forget it" state :)

@steve-chavez
Member Author

Btw, to manually test this, what I do is set the pool timeout to a higher value, like 3600, here: https://github.com/begriffs/postgrest/blob/master/main/Main.hs#L68 (otherwise PostgREST reconnects fast on its own). After that you can just stop or restart your PostgreSQL instance and check the reconnection output, whether at startup or on SIGHUP.

@begriffs
Member

begriffs commented May 1, 2017

This is helpful to make the server more robust, and to deal with docker-compose races too. Very nice.

I'll review the specifics of the code soon, and will think about how to test as well...

@jackfirth

jackfirth commented May 1, 2017

Does this also apply when PostgREST is running normally but the database temporarily goes down? Or is it only used during PostgREST's initialization?

@steve-chavez
Member Author

For now just at startup and on SIGHUP, but if it works well as it is I could extend the connection retrying to work automatically whenever a user request gets a 503 response.

@ajnsit

ajnsit commented May 1, 2017

@steve-chavez Wouldn't it make more sense to have some sort of a heartbeat process instead of waiting for a user request to fail?

@jackfirth

I consider recovering from temporary database connection failures necessary for closing #742. I could open a separate issue for that however.

@steve-chavez
Member Author

@ajnsit If the db is down, reconnection is not guaranteed to be fast; it could take a long time depending on the recovery of the db, so the user request would still fail. A heartbeat process would not avoid that.

Also, I think the heartbeat would be more wasteful, since to do that I'd have to run a background thread that sends something like SELECT 1 at intervals and watches for failure all the time; it's better to just run a thread occasionally.

@jackfirth I'll see if I can add the automatic connection retrying in this PR.

@diogob
Contributor

Great PR! Check my thoughts on using an IORef (Either ApiRequestError DbStructure) instead of an IORef (Maybe DbStructure).

main/Main.hs Outdated
4. If 2 or 3 fail to give their result it means the connection is down so it goes back to 1,
otherwise it finishes its work successfully.
-}
connectionWorker :: ThreadId -> P.Pool -> Schema -> IORef (Maybe DbStructure) -> IO ()
Contributor

Wouldn't it be better to use an Either with ApiRequestError on the Left instead of a Maybe?

return $ either (pgError authed) identity response
respond response
maybeDbStructure <- readIORef refDbStructure
case maybeDbStructure of
Contributor

This is where I see that having an Either ApiRequestError DbStructure type would be useful; it would make composition easier.

@@ -73,6 +74,10 @@ binaryFieldError =
simpleError HT.status406 (toS (toMime CTOctetStream) <>
" requested but a single column was not selected")

connectionLostError :: Response
Contributor

Using the Either type we could also remove this function and use only apiRequestError.

@steve-chavez
Member Author

A special message for the reconnection state is needed. I tried to make your suggestion work by adding a new constructor for ApiRequestError with this message, and that part was fine. The problem was that by making the ref an IORef (Either ApiRequestError DbStructure) I had to change two function type signatures, namely app and transactionMode; to avoid that I could have unwrapped the Left of the IORef, but that would be just as non-composable as the Maybe. In the end I was not able to make the code simpler with your suggestion.
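
To make the comparison concrete, the two shapes being discussed are roughly these (illustrative types only; ConnectionLost is a hypothetical stand-in for the new constructor mentioned above):

data DbStructure     = DbStructure      -- placeholder for the real schema cache
data ApiRequestError = ConnectionLost   -- hypothetical constructor carrying the 503 message

-- What the PR stores:        IORef (Maybe DbStructure)
-- What the review suggests:  IORef (Either ApiRequestError DbStructure)
-- Converting is a one-liner, but the Either then leaks into the signatures of
-- everything that reads the ref (app, transactionMode, ...):
toEither :: Maybe DbStructure -> Either ApiRequestError DbStructure
toEither = maybe (Left ConnectionLost) Right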

@begriffs
Member

begriffs commented May 3, 2017

Note it's getting too late for me to review this one tonight, but I'll get to it tomorrow night.

@begriffs
Member

begriffs commented May 4, 2017

Tried this locally and it works nicely when I kill the db and send postgrest a SIGHUP. During the interim it sends back 503, and reconnects when the db is back.

I agree with @jackfirth that detecting disconnection on a failed web request is important. When I kill the db and issue a request (no SIGHUP) postgrest currently sends back a lackluster error:

HTTP 500

{
  "details":"FATAL:  role \"postgrest_test_authenticator\" does not exist\n",
  "code":"",
  "message":"Database connection error"
}

The message of the error is good, but the details are misleading and HTTP 503 would be preferable to indicate the temporary nature of the problem.

Having trouble coming up with tests for the test suite. We'd need a way to mock the database or something and make it die while expecting reconnection attempts. Sounds hard and un-haskelly. I guess I'm good if we smoke test it and it works.

@begriffs
Member

begriffs commented May 4, 2017

I'll wait to look at the code in greater depth until after you've worked on the non-sighup disconnection detection in case that work causes reorganization of the code.

@ruslantalpa
Contributor

@begriffs the code in this PR is already useful in a lot of cases, especially Docker. Let's have it reviewed and merged and leave the "mid production disconnect" for another PR. It already took a long time to get this implemented.

@begriffs
Member

begriffs commented May 4, 2017

@ruslantalpa I tentatively disagree.

The automatic reconnection detection seems like the primary feature actually. Sighup handling is a special case that would be unnecessary if postgrest could auto reconnect on lost connection. (Well maybe it would still be useful to trigger an immediate retry if postgrest had reached a very large wait value for exponential backoff...) I'm imagining the reconnection attempt on startup to be a special case of reconnecting whenever the connection drops.

Merging this feature in the current state might be dangerous in that it's sorta kinda good enough for some cases and then we'd be less likely to revisit it. If we add features to postgrest I want us to do it right.

Thoughts?

@ruslantalpa
Contributor

ruslantalpa commented May 4, 2017

The initial connect failure is the important feature. Once the db structure is introspected, the db can go down and PostgREST will recover, because db connections are released from the pool after 1s (if not used). Without this feature, if the db is not present at startup, PostgREST never recovers, so you need to resort to tricks to control the boot order.
Also, this PR introduces the primitives (which will not need changing later) that will be used to implement the mid process disconnect.
The two features, while in the same category, are not dependent on one another; they can be added separately, in stages. I could understand your reasoning if this code would somehow need a rewrite later, but I don't think that is the case: the mid-process disconnect case was considered from the very beginning, and the primitives were created in such a way as to later support that feature (hence the global IORef and the 503).
Lastly, not merging a feature because we would then be "less likely to revisit" a second, related feature does not increase the chances of that second feature being implemented (rather, it decreases them 😃).

@jackfirth

jackfirth commented May 4, 2017

@ruslantalpa I'm not sure I understand what you mean, so I'd like to clarify what I think is necessary for closing #742. Specifically, I think this flow should be possible:

  • User sends request to PostgREST, receives normal response.
  • Database goes down, all attempts to connect fail.
  • User sends request to PostgREST, receives error response.
  • Database comes back up, new connections can now be opened.
  • User sends request to PostgREST, receives normal response.

This PR as far as I know only addresses how PostgREST behaves at startup. Are you saying that due to the way PostgREST pools connections, this currently happens already? Or are you saying that PostgREST currently does not handle this situation, but this PR somehow interacts with connection pooling to address it? I definitely recall testing this flow on PostgREST 3.x and finding PostgREST didn't recover.

And when you say "mid process disconnect", do you mean when the database (very) temporarily fails while PostgREST is processing a request? So retrying a temporary failure there would prevent the user from seeing a failure at the cost of greater latency?

@ruslantalpa
Contributor

Yes, because of how pool connections work, this happens already, provided more than 1s passes between HTTP requests, so that the db connections in the pool are closed and the new ones will work.
This does not happen if there is a constant stream of requests (more than one per second).

@ruslantalpa
Contributor

By "mid process" I mean the situation you describe, not mid-request.

@jackfirth

jackfirth commented May 4, 2017

So in the flow I described, if PostgREST gets more than one query per second in total traffic continuously, it will not recover from a temporary database failure? That seems like a pretty big deal; I would not call #742 fixed in that case.

@ruslantalpa
Contributor

It's a big deal, and no one is calling #742 fixed.

@jackfirth

Okay good, thanks for the clarification.

@steve-chavez
Member Author

steve-chavez commented May 4, 2017

@begriffs I've added the automatic connection retrying; it happens whenever a user gets a 503. I also corrected the error status code you mentioned to 503. Now it needs review and manual testing.

@ruslantalpa Since the added effort to make the reconnection automatic was small, I decided to add this feature in this PR; that way we can now test whether the connectionWorker also works fine, without changes, for this case.

@begriffs
Member

begriffs commented May 4, 2017

@steve-chavez you are such a rockstar! 🤘 I'll give it a look this evening and try hitting it with a bunch of concurrent requests/sighups while bringing the db up and down just to check for any concurrency issues.

@begriffs
Member

begriffs commented May 6, 2017

I was able to find odd behavior. To reproduce, create these scripts:

sighup.sh:

while [ 1 ]
do
    killall -HUP postgrest
    sleep 0.23
done

getrequest.sh:

while [ 1 ]
do
    curl -I http://localhost:3000 -s | head -n 1 | cut -d$' ' -f2
    sleep 0.1
done

  1. Start the db and then start postgrest
  2. Run getrequest.sh, which should give a bunch of 200s
  3. In another terminal run sighup.sh
  4. Getrequest will show alternating 200 and 503. Maybe this is because it takes time to reload the db schema and during the reload it temporarily wipes out the structure cache?

@steve-chavez
Member Author

Accidentally deleted my previous comment with Vimium. Yes, it's because of the time it takes to obtain the dbStructure, but that was deliberate: the idea was to put PostgREST in an unavailable state and respond with 503 quickly, without parsing, etc., while the connection retrying is happening.
Do you think this behavior should be improved? I think it's good enough; even if getting the dbStructure were avoided and we just retried the connection, it would still give a 503 to some concurrent requests.
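
In pseudo-Haskell, the order of operations is roughly this (hypothetical names, heavily compressed from the real worker): the cache is cleared first and only written back once the new structure is in hand, which is the window where concurrent requests see 503.

import Data.IORef (IORef, writeIORef)

data DbStructure = DbStructure                -- placeholder for the real schema cache

reloadSchemaCache :: IORef (Maybe DbStructure) -> IO DbStructure -> IO ()
reloadSchemaCache refDbStructure fetchDbStructure = do
  writeIORef refDbStructure Nothing           -- unavailable: handlers answer 503 immediately
  structure <- fetchDbStructure               -- connection retrying + schema introspection
  writeIORef refDbStructure (Just structure)  -- available again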

@begriffs
Member

begriffs commented May 6, 2017

Makes sense. Repeatedly sending sighup is not even a realistic thing, and a brief 503 should be fine. Just wanted to double check it with you.

(iTarget apiRequest) (iAction apiRequest)
response <- P.use pool $ HT.transaction HT.ReadCommitted txMode handleReq
return $ either (pgError authed) identity response
if isResponse503 response then do
Member

Might be a good use for the when function:

when (isResponse503 response) worker
respond response

@begriffs
Member

begriffs commented May 6, 2017

It's looking good, I verified that

  • concurrent sighup and requests work when taking db down and up
  • everything still shuts down properly if isServerVersionSupported returns false
  • it retries for connection on startup if needed and gets it successfully

My only question about functionality at this point is about the retry timing and sighup. Let's say postgrest has tried reconnecting a number of times and the next try won't be for quite a while. It would be nice if sighup could reset the backoff and make it check right away. That way if I notice the db is down I can bring it up and get postgrest reconnected fast.
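
One way that could look (purely a sketch of the idea, nothing like this is in the PR): the backoff sleep races against an MVar that the SIGHUP handler fills, so a signal wakes the worker immediately and it can restart from the 1 second delay.

import Control.Concurrent (MVar, takeMVar, threadDelay)
import Control.Concurrent.Async (race)
import Control.Monad (void)

-- Sleep for the current backoff delay, but return early if the SIGHUP handler
-- fills the MVar (e.g. with putMVar wakeup ()).
sleepOrWake :: MVar () -> Int -> IO ()
sleepOrWake wakeup delaySeconds =
  void $ race (threadDelay (delaySeconds * 1000000)) (takeMVar wakeup)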

Things are working great right now and that sighup improvement is totally optional. If you think it's a pain to make that change I can just merge this as it is, let me know.

@steve-chavez
Member Author

Right now the code ensures that only one connectionWorker thread can run at a time, no matter how many SIGHUPs are sent; this also holds when SIGHUPs arrive while the connectionWorker is already running because of a failed request that got a 503. Adding that timer reset and handling it concurrently would make this feature more complex.

The retrying is done with exponential backoff, and once it hits 32 seconds it doesn't keep increasing: the next retry is at 32s, the next at 32s, and so on (capped at 32s). So if the idea is to not make the user wait too long, one option could be to reduce this cap, but for now I don't know what number would be more appropriate.

Another option, and this was the original intention, would be a cyclic exponential backoff: when it hits 32s it restarts again from 1s, which would make the wait shorter. But I couldn't get that to work with the retry lib, see Soostone/retry#47. @begriffs, maybe you could check that; I may have overcomplicated it and it turns out to be a one-liner, in which case I could add it to this PR. If it's not simple, I think it'll be better to give it more time and implement this option in another PR.
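
For reference, a hand-rolled version of the cyclic idea, written without the retry lib so it sidesteps that issue (just a sketch, not code from this PR):

import Control.Concurrent (threadDelay)

-- Exponential backoff that wraps around: 1s, 2s, 4s, ..., 32s, then back to 1s,
-- repeating until the connection check finally succeeds.
cyclicBackoff :: IO Bool -> IO ()
cyclicBackoff checkConnection = go 1
  where
    go delaySeconds = do
      connected <- checkConnection
      if connected
        then pure ()
        else do
          threadDelay (delaySeconds * 1000000)
          go (if delaySeconds >= 32 then 1 else delaySeconds * 2)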

@jackfirth

As a frequent PostgREST and constant Docker user, I'm really glad to see this implemented. Thanks a ton @steve-chavez!

@begriffs
Member

begriffs commented May 7, 2017

I overlooked that the wait time maxes out at 32 seconds. That's not so bad.

I think it's merge time! 🚀

@begriffs begriffs merged commit 3e26c1a into PostgREST:master May 7, 2017
* Add connection retrying on startup and SIGHUP, Fix PostgREST#742
* Ensure that only one connection worker can run at a time
* Change ConnectionError status code to 503 and add automatic connection retrying