Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More robust listeners example and the handling of network errors #593

Closed
boromisp opened this issue Mar 6, 2019 · 14 comments
Closed

More robust listeners example and the handling of network errors #593

boromisp opened this issue Mar 6, 2019 · 14 comments

Comments

@boromisp
Copy link

boromisp commented Mar 6, 2019

Disclaimer: my networking knowledge is patchy, most of this came from about a week of testing and googling.

A few notes for the robust listeners example:

It could be helpful, to make a note of TCP keepalives in the Robust Listeners example. TCP keepalives could also be relevant if you set a long idleTimeoutMillis for your pool.

Without setting the keepalive time to the value recommended by your infrastructure provider (e.g., if your database is in AWS, you have to set it to at most 200 sec), a seemingly idle connection can be silently dropped.

I'm not sure if it is an AWS specific thing, standard TCP peculiarity or something about nodejs defaults, but in this case, both the server and the client computer thinks the connection is still alive (so no error event, no onLost callback, and the PostgreSQL backend still running).

Example solution:

const conn = await db.connect({ direct: true, onLost: handleLostConnection });
conn.client.connection.stream.setKeepAlive(true, 200 * 1000);

The keepalive interval and probes should probably also be set explicitly, but I found no way to do that without depending on net-keepalive.

While the section above deals with trying to keep the connection alive, it is not sufficient for detecting dead connections. The only reliable way I found so far is to use manual heartbeat queries, and call conn.done() on some application-specific timeout.

What seems to be missing (which is a node-postgres deficiency) is a connection timeout. The connectionTimeoutMillis parameter only applies to the wait for free capacity in the pg-pool. There is no way to abort connecting if it takes too long.
For the direct connection case, this can be somewhat worked around, with the caveat, that it will keep the process running for however long it takes the OS to decide the connection attempt was unsuccessful:

function connectWithTimeout(db, onLost, timeout) {
  return new Promise(async function (resolve, reject) {
    let timedOut = false;

    function handleLostConnection(err, e) {
      if (!timedOut) onLost(err, e);
    }

    const timeoutTimer = setTimeout(function () {
      timedOut = true;
      reject(new Error('Connection timeout'));
    }, timeout);

    try {
      const conn = await db.connect({ direct: true, onLost: handleLostConnection });
      if (!timedOut) resolve(conn);
    } catch (error) {
      if (!timedOut) reject(error);
    } finally {
      if (!timedOut) clearTimeout(timeoutTimer);
    }
  });
}

Patching directConnect could help with this, since there we can destroy the socket on timeout, but I will try to put together a PR for node-postges to address this.

@vitaly-t
Copy link
Owner

vitaly-t commented Mar 6, 2019

AWS has been the culprit when it comes to connectivity. They are awful in their connection management, and require a lot of extra work.

@boromisp
Copy link
Author

boromisp commented Mar 6, 2019

In their redshift documentation they seem to imply it is your local network, VPN provider, or other intermediaries who drop the connection. But, because of the scale of their operation, I'm sure they are aggressively pruning idle connections. I just wish they would document it somewhere.

The brianc/node-postgres#1847 PR adds two new connection parameters to address these issues:

const db = pgp({
    keepAlive: true,
    keepAliveIdleMillis: 200 * 1000,
    connectTimeoutMillis: 10 * 1000,
})

The manual heartbeat would still be required of course, but I couldn't find a general solution for that.

@vitaly-t
Copy link
Owner

vitaly-t commented Mar 6, 2019

And yet, all the problems related to dropped connections ever reported were around AWS. The library works fine with all other hosting providers.

@msageryd
Copy link

Thank you for writing this up. I have referenced this issue in the readme of my listener-module, I hope it's ok with you.

https://github.com/msageryd/pg-promise-listener#keeping-connections-alive-on-aws

@vitaly-t
Copy link
Owner

@msageryd The other open issue #585 is also related.

@msageryd
Copy link

I solved this problem from another perspective. It was the simplest solution to me right now. Please chime in if you think this approach is flawed.

Documentation
Code

@vitaly-t
Copy link
Owner

vitaly-t commented Mar 28, 2019

@msageryd That code creates a loose promise:

this.connection.none(...)

i.e. unchained promise, that does not have a catch handler.

@msageryd
Copy link

Oh, thanks. Fixed now. (code)
What do you think of the overall concept? It's kind of a catch-all, and it might be a little more delayed than a real keep alive solution. But it should fixe the worst case, i.e. when you have a non-working server without knowing about it.

@vitaly-t
Copy link
Owner

It would take some good testing to form an opinion. I hope there will be a proper fix at some point.

@vitaly-t
Copy link
Owner

vitaly-t commented May 9, 2019

@boromisp @msageryd Guys, I do not see the point in keeping this issue open, because nothing is being developed or changed in this library to cater for that.

@boromisp If your PR ever gets merged, this can be revisited.

@vitaly-t
Copy link
Owner

Closing, due to inactivity.

@vitaly-t
Copy link
Owner

vitaly-t commented Nov 10, 2019

@boromisp Thanks for weighting in on this. I've just spent whole weekend working on new releases, then by chance looked at that slonik project only to find out he published more bullshit about pg-promise. Naturally, I was very pissed. And when I asked him to remove it, he was quick to lock my account.

@k3n
Copy link

k3n commented Jun 1, 2022

@boromisp If your PR ever gets merged, this can be revisited.

I've followed the threads, but I'm confused and would appreciate some insight; what needs to be revisited here? A solution for the manual heartbeat to detect lost connections?

For what it's worth, @boromisp's PR was merged on 5/10, but this was closed on 5/15.

I'm wanting to use TCP keep alives but I'm unsure if the code in @boromisp's above comment is sufficient (and I'm not sure how to adequately test it locally).

Thank you.

@boromisp
Copy link
Author

boromisp commented Jun 2, 2022

The keepAlive: true parameter enables the tcp keepalives on the socket, and the keepAliveInitialDelayMillis: 200 * 1000 parameter overrides the default value for the tcp keepalive idle timeout of the socket. There are other keepalive options, but those cannot easily be overridden from node.js.

For long running queries to remote databases tcp keepalives (either client or server side) are the only way to keep the connection alive.

Heartbeat queries are an alternative for listener connections, eg: brianc/node-postgres#1611 (comment)
I found it easier to reason about heartbeat query timeouts than tcp keepalives in this context.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants