
Add connection retrying on startup and SIGHUP, Fix #742 #869

Merged
merged 6 commits into PostgREST:master on May 7, 2017

Conversation

steve-chavez
Member

This adds a capped exponential backoff (base is 1 second, cap is 32 seconds; both arbitrarily picked and could be improved) for retrying the db connection in a background thread.

This thread is started at PostgREST startup and leaves the server rejecting all requests with a 503, with no parsing or further processing; it stays in this unavailable state until the connection succeeds. The thread can also be started on SIGHUP.
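
Roughly, the retry policy looks like this (a simplified sketch built from the retry library's combinators; the actual code in the PR differs in details, and checkConnection is just an illustrative stand-in for the connection test):

import Control.Retry (capDelay, exponentialBackoff, retrying)

-- Retry the connection check until it succeeds, sleeping 1s, 2s, 4s, ...
-- between attempts and never more than 32s (delays are in microseconds).
waitForConnection :: IO Bool -> IO Bool
waitForConnection checkConnection =
  retrying policy (\_ connected -> pure (not connected)) (const checkConnection)
  where
    policy = capDelay (32 * 1000000) (exponentialBackoff 1000000)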

I consider PostgREST to be in an available state only once the dbStructure has been successfully obtained, and that's what the connectionWorker works towards. I'll explain why:

At first I had a separate "available" IORef that was set when the connection succeeded, but I ran into issues when the dbStructure was not obtained. For example, PostgREST succeeded with requests like:

GET /clients
GET /clients?select=id,name

But gave an error with requests like:

GET /
GET /clients?select=id,name,projects{*}

From what I've seen, the first requests skipped the sections of the code that operate on dbStructure, so if the dbStructure is not obtained PostgREST is in a broken state. Instead of having two IORefs to maintain, I decided that the dbStructure IORef itself should indicate the available state.
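
To illustrate the resulting behavior, here is a self-contained sketch with placeholder types (not the actual handler code): requests read the dbStructure IORef and answer 503 right away when it is empty.

import Data.IORef (IORef, readIORef)

data DbStructure = DbStructure          -- placeholder for the real schema cache
type Response    = String               -- placeholder for a real WAI response

serviceUnavailable :: Response
serviceUnavailable = "HTTP 503 Service Unavailable"

handleRequest :: IORef (Maybe DbStructure) -> IO Response
handleRequest refDbStructure = do
  maybeDbStructure <- readIORef refDbStructure
  case maybeDbStructure of
    Nothing -> pure serviceUnavailable  -- unavailable: reject without any parsing
    Just _  -> pure "HTTP 200 OK"       -- normal request processing would go here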

I think this is useful as it is but I don't know an easy way to add tests. Thoughts?

@ruslantalpa
Contributor

this gets us to the "set it and forget it" state :)

@steve-chavez
Member Author

Btw, to manually test this, what I do is set the pool timeout to a higher value, like 3600, here: https://github.com/begriffs/postgrest/blob/master/main/Main.hs#L68 (otherwise PostgREST reconnects fast on its own). After that you can just stop or restart your PostgreSQL instance and check the reconnection output, whether at startup or on SIGHUP.

@begriffs
Member

begriffs commented May 1, 2017

This is helpful to make the server more robust, and to deal with docker-compose races too. Very nice.

I'll review the specifics of the code soon, and will think about how to test as well...

@jackfirth

jackfirth commented May 1, 2017

Does this also apply when PostgREST is running normally but the database temporarily goes down? Or is it only used during PostgREST's initialization?

@steve-chavez
Member Author

For now just at startup and on SIGHUP, but if it works well as it is I could extend the connection retrying to work automatically whenever a user request gets a 503 response.

@ajnsit

ajnsit commented May 1, 2017

@steve-chavez Wouldn't it make more sense to have some sort of a heartbeat process instead of waiting for a user request to fail?

@jackfirth

I consider recovering from temporary database connection failures necessary for closing #742. I could open a separate issue for that however.

@steve-chavez
Member Author

@ajnsit If the db is down, reconnection is not guaranteed to be fast; it could take a long time depending on the recovery of the db, so the user request would still fail. A heartbeat process would not avoid that.

Also, I think the heartbeat would be more wasteful, since to do that I'd have to run a background thread that sends something like SELECT 1 at intervals and watches for failure all the time; it's better to just run a thread occasionally.

@jackfirth I'll see if I can add the automatic connection retrying in this PR.

@diogob
Contributor

Great PR! Check my thoughts on using an IORef (Either ApiRequestError DbStructure) instead of an IORef (Maybe DbStructure).

main/Main.hs Outdated
4. If 2 or 3 fail to give their result it means the connection is down so it goes back to 1,
otherwise it finishes its work successfully.
-}
connectionWorker :: ThreadId -> P.Pool -> Schema -> IORef (Maybe DbStructure) -> IO ()
Contributor

Wouldn't it be better to use an Either with ApiRequestError on the Left instead of a Maybe?

return $ either (pgError authed) identity response
respond response
maybeDbStructure <- readIORef refDbStructure
case maybeDbStructure of
Contributor

This is where I see that having an Either ApiRequestError DbStructure type would be useful; it would make composition easier.

@@ -73,6 +74,10 @@ binaryFieldError =
simpleError HT.status406 (toS (toMime CTOctetStream) <>
" requested but a single column was not selected")

connectionLostError :: Response
Contributor

Using the Either type we could also remove this function and use only apiRequestError.

@steve-chavez
Member Author

A special message for the reconnection state is needed. I tried to make your suggestion work by adding a new constructor for ApiRequestError with this message, and that part was fine. The problem was that by making the ref an IORef (Either ApiRequestError DbStructure) I had to change two function type signatures, namely app and transactionMode; to avoid that I could have unwrapped the Left of the IORef, but that would be just as non-composable as the Maybe. In the end I was not able to make the code simpler with your suggestion.
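
To make the comparison concrete, the two shapes being discussed are roughly these (illustrative types only; ConnectionLost is a hypothetical stand-in for the new constructor mentioned above):

data DbStructure     = DbStructure      -- placeholder for the real schema cache
data ApiRequestError = ConnectionLost   -- hypothetical constructor carrying the 503 message

-- What the PR stores:        IORef (Maybe DbStructure)
-- What the review suggests:  IORef (Either ApiRequestError DbStructure)
-- Converting is a one-liner, but the Either then leaks into the signatures of
-- everything that reads the ref (app, transactionMode, ...):
toEither :: Maybe DbStructure -> Either ApiRequestError DbStructure
toEither = maybe (Left ConnectionLost) Right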

@begriffs
Member

begriffs commented May 3, 2017

Note it's getting too late for me to review this one tonight, but I'll get to it tomorrow night.

@begriffs
Member

begriffs commented May 4, 2017

Tried this locally and it works nicely when I kill the db and send postgrest a SIGHUP. During the interim it sends back 503, and reconnects when the db is back.

I agree with @jackfirth that detecting disconnection on a failed web request is important. When I kill the db and issue a request (no SIGHUP) postgrest currently sends back a lackluster error:

HTTP 500

{
  "details":"FATAL:  role \"postgrest_test_authenticator\" does not exist\n",
  "code":"",
  "message":"Database connection error"
}

The message of the error is good, but the details are misleading and HTTP 503 would be preferable to indicate the temporary nature of the problem.

Having trouble coming up with tests for the test suite. We'd need a way to mock the database or something and make it die while expecting reconnection attempts. Sounds hard and un-haskelly. I guess I'm good if we smoke test it and it works.

@begriffs
Member

begriffs commented May 4, 2017

I'll wait to look at the code in greater depth until after you've worked on the non-sighup disconnection detection in case that work causes reorganization of the code.

@ruslantalpa
Contributor

@begriffs the code in this PR is already useful in a lot of cases, especially Docker. Let's have it reviewed and merged and leave the "mid production disconnect" for another PR. It already took a long time to get this implemented.

@begriffs
Member

begriffs commented May 4, 2017

@ruslantalpa I tentatively disagree.

The automatic reconnection detection seems like the primary feature actually. Sighup handling is a special case that would be unnecessary if postgrest could auto reconnect on lost connection. (Well maybe it would still be useful to trigger an immediate retry if postgrest had reached a very large wait value for exponential backoff...) I'm imagining the reconnection attempt on startup to be a special case of reconnecting whenever the connection drops.

Merging this feature in the current state might be dangerous in that it's sorta kinda good enough for some cases and then we'd be less likely to revisit it. If we add features to postgrest I want us to do it right.

Thoughts?

@ruslantalpa
Contributor

ruslantalpa commented May 4, 2017

The initial connect failure is the important feature. Once the db structure is introspected, the db can go down and PostgREST will recover, because db connections are released from the pool after 1s (if not used). Without this feature, if the db is not present at startup, PostgREST never recovers, so you need to resort to tricks to control the boot order.
Also, this PR introduces the primitives (which will not need changing later) that will be used to implement the mid process disconnect.
The two features, while in the same category, are not dependent on one another; they can be added separately, in stages. I could understand your reasoning if this code would somehow need a rewrite later, but I don't think that is the case: the mid-process disconnect case was considered from the very beginning, and the primitives were created in such a way as to later support that feature (hence the global IORef and the 503).
Lastly, not merging a feature because we would then be "less likely to revisit" a second, related feature does not increase the chances of that second feature being implemented (rather, it decreases them 😃).

@jackfirth

jackfirth commented May 4, 2017

@ruslantalpa I'm not sure I understand what you mean, so I'd like to clarify what I think is necessary for closing #742. Specifically, I think this flow should be possible:

  • User sends request to PostgREST, receives normal response.
  • Database goes down, all attempts to connect fail.
  • User sends request to PostgREST, receives error response.
  • Database comes back up, new connections can now be opened.
  • User sends request to PostgREST, receives normal response.

This PR as far as I know only addresses how PostgREST behaves at startup. Are you saying that due to the way PostgREST pools connections, this currently happens already? Or are you saying that PostgREST currently does not handle this situation, but this PR somehow interacts with connection pooling to address it? I definitely recall testing this flow on PostgREST 3.x and finding PostgREST didn't recover.

And when you say "mid process disconnect", do you mean when the database (very) temporarily fails while PostgREST is processing a request? So retrying a temporary failure there would prevent the user from seeing a failure at the cost of greater latency?

@ruslantalpa
Contributor

Yes, because of how pool connections work, this happens already, provided more than 1s passes between HTTP requests, so that the db connections in the pool are closed and the new ones will work.
This does not happen if there is a constant stream of requests (more than one per second).

@ruslantalpa
Contributor

By "mid process" I mean the situation you describe, not mid-request.

@jackfirth

jackfirth commented May 4, 2017

So in the flow I described, if PostgREST gets more than one query per second in total traffic continuously, it will not recover from a temporary database failure? That seems like a pretty big deal; I would not call #742 fixed in that case.

@ruslantalpa
Contributor

It's a big deal, and no one is calling #742 fixed.

@jackfirth

Okay good, thanks for the clarification.

@steve-chavez
Member Author

steve-chavez commented May 4, 2017

@begriffs I've added the automatic connection retrying; it happens whenever a user gets a 503. I also corrected the error status code you mentioned to 503. Now it needs review and manual testing.

@ruslantalpa Since the added effort to make the reconnection automatic was small, I decided to add this feature in this PR; that way we can now test whether the connectionWorker also works fine, without changes, for this case.

@begriffs
Member

begriffs commented May 4, 2017

@steve-chavez you are such a rockstar! 🤘 I'll give it a look this evening and try hitting it with a bunch of concurrent requests/sighups while bringing the db up and down just to check for any concurrency issues.

@begriffs
Member

begriffs commented May 6, 2017

I was able to find odd behavior. To reproduce, create these scripts:

sighup.sh:

while [ 1 ]
do
    killall -HUP postgrest
    sleep 0.23
done

getrequest.sh:

while [ 1 ]
do
    curl -I http://localhost:3000 -s | head -n 1 | cut -d$' ' -f2
    sleep 0.1
done

  1. Start the db and then start postgrest
  2. Run getrequest.sh, which should give a bunch of 200s
  3. In another terminal run sighup.sh
  4. Getrequest will show alternating 200 and 503. Maybe this is because it takes time to reload the db schema and during the reload it temporarily wipes out the structure cache?

@steve-chavez
Member Author

Accidentally deleted my previous comment with Vimium. Yes, it's because of the time it takes to obtain the dbStructure, but that was deliberate: the idea was to put PostgREST in an unavailable state and respond with 503 quickly, without parsing, etc., while the connection retrying is happening.
Do you think this behavior should be improved? I think it's good enough; even if getting the dbStructure were avoided and we just retried the connection, it would still give a 503 to some concurrent requests.
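
In pseudo-Haskell, the order of operations is roughly this (hypothetical names, heavily compressed from the real worker): the cache is cleared first and only written back once the new structure is in hand, which is the window where concurrent requests see 503.

import Data.IORef (IORef, writeIORef)

data DbStructure = DbStructure                -- placeholder for the real schema cache

reloadSchemaCache :: IORef (Maybe DbStructure) -> IO DbStructure -> IO ()
reloadSchemaCache refDbStructure fetchDbStructure = do
  writeIORef refDbStructure Nothing           -- unavailable: handlers answer 503 immediately
  structure <- fetchDbStructure               -- connection retrying + schema introspection
  writeIORef refDbStructure (Just structure)  -- available again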

@begriffs
Member

begriffs commented May 6, 2017

Makes sense. Repeatedly sending sighup is not even a realistic thing, and a brief 503 should be fine. Just wanted to double check it with you.

(iTarget apiRequest) (iAction apiRequest)
response <- P.use pool $ HT.transaction HT.ReadCommitted txMode handleReq
return $ either (pgError authed) identity response
if isResponse503 response then do
Member

Might be a good use for the when function:

when (isResponse503 response) worker
respond response

@begriffs
Member

begriffs commented May 6, 2017

It's looking good, I verified that

  • concurrent sighup and requests work when taking db down and up
  • everything still shuts down properly if isServerVersionSupported returns false
  • it retries for connection on startup if needed and gets it successfully

My only question about functionality at this point is about the retry timing and sighup. Let's say postgrest has tried reconnecting a number of times and the next try won't be for quite a while. It would be nice if sighup could reset the backoff and make it check right away. That way if I notice the db is down I can bring it up and get postgrest reconnected fast.
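
One way that could look (purely a sketch of the idea, nothing like this is in the PR): the backoff sleep races against an MVar that the SIGHUP handler fills, so a signal wakes the worker immediately and it can restart from the 1 second delay.

import Control.Concurrent (MVar, takeMVar, threadDelay)
import Control.Concurrent.Async (race)
import Control.Monad (void)

-- Sleep for the current backoff delay, but return early if the SIGHUP handler
-- fills the MVar (e.g. with putMVar wakeup ()).
sleepOrWake :: MVar () -> Int -> IO ()
sleepOrWake wakeup delaySeconds =
  void $ race (threadDelay (delaySeconds * 1000000)) (takeMVar wakeup)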

Things are working great right now and that sighup improvement is totally optional. If you think it's a pain to make that change I can just merge this as it is, let me know.

@steve-chavez
Member Author

Right now the code ensures that only one connectionWorker thread can run at a time, no matter how many SIGHUPs are sent; this also holds when SIGHUPs arrive while the connectionWorker is already running because of a failed request that got a 503. Adding that timer reset and handling it concurrently would make this feature more complex.

The retrying is done with exponential backoff, and once it hits 32 seconds it doesn't keep increasing: the next retry is at 32s, the next at 32s, and so on (capped at 32s). So if the idea is to not make the user wait too long, one option could be to reduce this cap, but for now I don't know what number would be more appropriate.

Another option, and this was the original intention, would be a cyclic exponential backoff: when it hits 32s it restarts again from 1s, which would make the wait shorter. But I couldn't get that to work with the retry lib, see Soostone/retry#47. @begriffs, maybe you could check that; I may have overcomplicated it and it turns out to be a one-liner, in which case I could add it to this PR. If it's not simple, I think it'll be better to give it more time and implement this option in another PR.
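
For reference, a hand-rolled version of the cyclic idea, written without the retry lib so it sidesteps that issue (just a sketch, not code from this PR):

import Control.Concurrent (threadDelay)

-- Exponential backoff that wraps around: 1s, 2s, 4s, ..., 32s, then back to 1s,
-- repeating until the connection check finally succeeds.
cyclicBackoff :: IO Bool -> IO ()
cyclicBackoff checkConnection = go 1
  where
    go delaySeconds = do
      connected <- checkConnection
      if connected
        then pure ()
        else do
          threadDelay (delaySeconds * 1000000)
          go (if delaySeconds >= 32 then 1 else delaySeconds * 2)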

@jackfirth

As a frequent PostgREST and constant Docker user, I'm really glad to see this implemented. Thanks a ton @steve-chavez!

@begriffs
Member

begriffs commented May 7, 2017

I overlooked that the wait time maxes out at 32 seconds. That's not so bad.

I think it's merge time! 🚀

@begriffs begriffs merged commit 3e26c1a into PostgREST:master May 7, 2017
* Add connection retrying on startup and SIGHUP, Fix PostgREST#742
* Ensure that only one connection worker can run at a time
* Change ConnectionError status code to 503 and add automatic connection retrying