-
-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add connection retrying on startup and SIGHUP, Fix #742 #869
Conversation
this gets us to the "set it and forget it" state :) |
Btw, to manually test this what I do is to set the pool timeout to a higher value, like 3600, here https://github.com/begriffs/postgrest/blob/master/main/Main.hs#L68, otherwise postgREST reconnects fast on its own, after that you can just stop or restart your PostgreSQL instance and check the reconnection process output, whether at startup or at SIGHUP. |
This is helpful to make the server more robust, and to deal with docker-compose races too. Very nice. I'll review the specifics of the code soon, and will think about how to test as well... |
Does this also apply when PostgREST is running normally but the database temporarily goes down? Or is it only used during PostgREST's initialization? |
For now just at startup and at SIGHUP, but if it works well as it is I could extend the connection retrying to work automatically whenever an user request gets a 503 response. |
@steve-chavez Wouldn't it make more sense to have some sort of a heartbeat process instead of waiting for a user request to fail? |
I consider recovering from temporary database connection failures necessary for closing #742. I could open a separate issue for that however. |
@ajnsit If the db is down the reconnection is not guaranteed to be fast, could take a long time depending on the recovery of the db, so the user request would still fail, having a heartbeat process would not avoid that. Also I think that the hearbeat would be more wasteful since to do that I'll have to do a background thread that sends something like @jackfirth I'll see if I can add the automatic connection retrying in this PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great PR! Check my thoughts on using a type IORef (Either ApiRequestError DbStructure)
instead of a IORef (Maybe DbStructure)
main/Main.hs
Outdated
4. If 2 or 3 fail to give their result it means the connection is down so it goes back to 1, | ||
otherwise it finishes his work successfully. | ||
-} | ||
connectionWorker :: ThreadId -> P.Pool -> Schema -> IORef (Maybe DbStructure) -> IO () |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't be better to use an Either
with ApiRequestError
on the Left
instead of a Maybe
?
src/PostgREST/App.hs
Outdated
return $ either (pgError authed) identity response | ||
respond response | ||
maybeDbStructure <- readIORef refDbStructure | ||
case maybeDbStructure of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is where I see that having an Either ApiRequestError DbStructure
type would be useful, making composition easier
@@ -73,6 +74,10 @@ binaryFieldError = | |||
simpleError HT.status406 (toS (toMime CTOctetStream) <> | |||
" requested but a single column was not selected") | |||
|
|||
connectionLostError :: Response |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using the either type we could also remove this function and use only apiRequestError
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A special message for the reconnection state is needed. I tried to make your suggestion work by adding a new constructor for ApiRequestError
with this message, all good, the problem was that by making the IORef (Either ApiRequestError DbStructure)
I had to change two function type signatures, namely app
and transactionMode
, to avoid that I could have unwrapped the Left of the IORef, but that would be the same non composable way as of the Maybe. In the end I was not able to make the code simpler with your suggestion.
Note it's getting too late for me to review this one tonight, but I'll get to it tomorrow night. |
Tried this locally and it works nicely when I kill the db and send postgrest a SIGHUP. During the interim it sends back 503, and reconnects when the db is back. I agree with @jackfirth that detecting disconnection on a failed web request is important. When I kill the db and issue a request (no SIGHUP) postgrest currently sends back a lackluster error: HTTP 500 {
"details":"FATAL: role \"postgrest_test_authenticator\" does not exist\n",
"code":"",
"message":"Database connection error"
} The message of the error is good, but the details are misleading and HTTP 503 would be preferable to indicate the temporary nature of the problem. Having trouble coming up with tests for the test suite. We'd need a way to mock the database or something and make it die while expecting reconnection attempts. Sounds hard and un-haskelly. I guess I'm good if we smoke test it and it works. |
I'll wait to look at the code in greater depth until after you've worked on the non-sighup disconnection detection in case that work causes reorganization of the code. |
@begriffs the code in this pr is already useful in a lot of cases, especially docker. Let's have it reviewed and merged and leave the "mid production disconnect" for another PR. It already took a long time having this implemented. |
@ruslantalpa I tentatively disagree. The automatic reconnection detection seems like the primary feature actually. Sighup handling is a special case that would be unnecessary if postgrest could auto reconnect on lost connection. (Well maybe it would still be useful to trigger an immediate retry if postgrest had reached a very large wait value for exponential backoff...) I'm imagining the reconnection attempt on startup to be a special case of reconnecting whenever the connection drops. Merging this feature in the current state might be dangerous in that it's sorta kinda good enough for some cases and then we'd be less likely to revisit it. If we add features to postgrest I want us to do it right. Thoughts? |
The initial connect failure is the important feature. Once the db structure is introspected, the db can go down and PostgREST will recover because db connections are released from the pool after 1s (if not used). Without this feature, if the db is not present at startup, PostgREST never recovers, so you need to revert to tricks to control the boot order. |
@ruslantalpa I'm not sure I understand what you mean, so I'd like to clarify what I think is necessary for closing #742. Specifically, I think this flow should be possible:
This PR as far as I know only addresses how PostgREST behaves at startup. Are you saying that due to the way PostgREST pools connections, this currently happens already? Or are you saying that PostgREST currently does not handle this situation, but this PR somehow interacts with connection pooling to address it? I definitely recall testing this flow on PostgREST 3.x and finding PostgREST didn't recover. And when you say "mid process disconnect", do you mean when the database (very) temporarily fails while PostgREST is processing a request? So retrying a temporary failure there would prevent the user from seeing a failure at the cost of greater latency? |
Yes, because of how pool connections work, this happens already, provided more then 1s passes between http requests so that the db connections in the pool are closed and the new ones will work. |
by mid process i mean the situation you describe, not mid request |
So in the flow I described, if PostgREST gets more than one query per second in total traffic continuously it will not recover from a temporary database failure? That seems like a pretty big deal, I would not call #742 fixed in that case. |
it's a big deal, and no one is calling #742 fixed |
Okay good, thanks for the clarification. |
@begriffs I've added the automatic connection retrying, this happens whenever an user gets a 503. Also corrected that error code status you mentioned to 503. Now it needs review and manual testing. @ruslantalpa Since the added effort for making the reconnection automatic was not much I decided to add this feature on this PR, that way we can test now if the |
@steve-chavez you are such a rockstar! 🤘 I'll give it a look this evening and try hitting it with a bunch of concurrent requests/sighups while bringing the db up and down just to check for any concurrency issues. |
I was able to find odd behavior. To reproduce, create these scripts: sighup.sh: while [ 1 ]
do
killall -HUP postgrest
sleep 0.23
done getrequest.sh: while [ 1 ]
do
curl -I http://localhost:3000 -s | head -n 1 | cut -d$' ' -f2
sleep 0.1
done
|
Accidentally deleted my previous comment with Vimium. Yes, it's because the time it takes to obtain the |
Makes sense. Repeatedly sending sighup is not even a realistic thing, and a brief 503 should be fine. Just wanted to double check it with you. |
src/PostgREST/App.hs
Outdated
(iTarget apiRequest) (iAction apiRequest) | ||
response <- P.use pool $ HT.transaction HT.ReadCommitted txMode handleReq | ||
return $ either (pgError authed) identity response | ||
if isResponse503 response then do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be a good use for the when function:
when (isResponse503 response) worker
respond respose
It's looking good, I verified that
My only question about functionality at this point is about the retry timing and sighup. Let's say postgrest has tried reconnecting a number of times and the next try won't be for quite a while. It would be nice if sighup could reset the backoff and make it check right away. That way if I notice the db is down I can bring it up and get postgrest reconnected fast. Things are working great right now and that sighup improvement is totally optional. If you think it's a pain to make that change I can just merge this as it is, let me know. |
Right now the code is so that only a connectionWorker thread can run at a time no matter how many SIGHUP's are made, this also works when doing SIGHUP's and the connectionWorker is already on because of a failed request with 503; adding that timer reset and handling it concurrently would make this feature more complex. The retrying is done in exponential backoff and when it hits 32 seconds, it doesn't keep increasing and the next retrying is 32s, next 32s, and so on(capped at 32s), so if the idea is not making the user wait for too long one option could be to reduce this cap, but for now I don't know what number will be more appropriate. Other option would be, and this was the original intention, to do a cyclic exp backoff so if it hits 32s it restarts again from 1s, that would make the wait shorter, but I couldn't get that to work with the |
As a frequent PostgREST and constant Docker user, I'm really glad to see this implemented. Thanks a ton @steve-chavez! |
I overlooked that the wait time maxes out at 32 seconds. That's not so bad. I think it's merge time! 🚀 |
…tgREST#869) * Add connection retrying on startup and SIGHUP, Fix PostgREST#742 * Ensure that only one connection worker can run at a time * Change ConnectionError status code to 503 and add automatic connection retrying
This adds a capped exponential backoff(base is 1 second, cap is 32 seconds, arbitrarily picked, could be improved) for retrying the db connection in a background thread.
This thread is started at postgREST startup and leaves it rejecting all requests with a 503 with no parsing or further processing, it's in an
unavailable
state until connection succeeds; this thread can also be started on SIGHUP.I considered that postgREST is in an
available
state when the dbStructure is successfully obtained, and that's what theconnectionWorker
works on, I'll explain why:At first I had an
available
IORef for when the connection succeeded but got some issues when the dbStructure was not obtained, for example postgREST succeeded with requests like:But gived an error with requests like:
For what I've seen the first requests escaped the sections of the code that operated on dbStructure, so if the dbStructure is not obtained then postgREST is in a broken state, so instead of having two IORefs to maintain I decided that the IORef dbstructure should indicate the
available
state.I think this is useful as it is but I don't know an easy way to add tests. Thoughts?