
Suggestion: Possibility for higher resilience for connect-once, never-disconnects use cases. #170

Closed
luan007 opened this issue Jul 3, 2021 · 19 comments


luan007 commented Jul 3, 2021

Hi,

This is not an issue but a suggestion, or maybe a question. I'm posting it here instead of nats.ws as it seems to be related to some logic within the base client.

We're using nats.js / nats.ws on unstable edge networks (4G or 3G). The application connects & subscribes once, and is expected to stay connected forever (and to reconnect quickly).

Thus, a relatively aggressive pinging / reconnecting config is used:

{
    noEcho: true,
    noRandomize: true,
    maxReconnectAttempts: -1,
    waitOnFirstConnect: true,
    reconnectTimeWait: 500,
    pingInterval: 3 * 1000, 
    maxPingOut: 3
}

Lately we've noticed that when devices are doing PUBs, the client is sometimes in the closed state, which causes an error, and I'm sure there's no application logic that calls close() on the client. It seems that _close() is being called from the inside.

In this state, all subs are lost since close() cleans them up, and the heartbeat also stops. If the dev wants a never-ending connection, he or she must re-init the connection manually and re-create all subs manually, which introduces a 'loop-like' structure:

async function init() {
    await connect();
    // ...re-create all subs here
}

on("closed", init); // pseudo-code: re-run init() whenever the client closes

This forces all subs into one function, or introduces an event-emitter-like structure into end users' code.

My suggestion is: if maxReconnectAttempts == -1, which means 'it never ends', can we keep the client instance intact no matter what, and keep retrying and pinging until the end of the application? This would simplify edge-app complexity.

For now, we're monkey-patching _close:

io.protocol._close = async () => {
    // it never dies!
    reset("Underlying structure breaks");
};

// wrap heartbeat cancellation: when the timer is cancelled because the
// connection went stale while the protocol still reports connected, force a reset
var cc = io.protocol.heartbeats.cancel.bind(io.protocol.heartbeats);
io.protocol.heartbeats.cancel = (stale) => {
    cc(stale);
    if (stale && io.protocol.connected) {
        reset("Protocol hidden crash!");
    }
};
var reset_busy = false;
async function reset(e) {
    if (reset_busy) {
        console.warn("io busy", e);
        return;
    }
    reset_busy = true;
    try {
        try {
            io.protocol.transport.close();
        } catch (_) {
            // the transport may already be gone
        }
        // re-prepare the protocol handler and dial until a server answers
        io.protocol.prepare();
        await io.protocol.dialLoop();
        io._closed = false;
        io.protocol._closed = false;
    } catch (err) {
        console.error(err);
    } finally {
        reset_busy = false;
    }
}

The above is some horrible logic, but it keeps the problem away.

Thank you for your attention.

luan007 (Author) commented Jul 3, 2021

Actually, to make things simpler:

If maxReconnectAttempts == -1, then in _close (as opposed to close(), which is called by the user), could the client reconnect / restart the underlying transport if possible?

aricart (Member) commented Jul 3, 2021

@luan007 I need to look at your description more closely, but a couple of things:

  • The outbound buffer is reset on disconnect; the reason for this is that it may have partials or be corrupt.
  • Subscriptions are not tracked on the buffer but are recreated on a connect.
  • It is OK for the client to have terminal states where it won't reconnect. You are given a promise when this happens, so at that point, if this resolves, you can restart your connection the same way you started it (see the sketch after this list).
  • If you are experiencing this on the WebSocket client, what is the host environment or library?
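A minimal sketch of that restart pattern, assuming the nats.js (node) API; the setup() helper, the server address and the subject name are illustrative only:

import { connect, StringCodec } from "nats";

const sc = StringCodec();

async function setup() {
    // (re)establish the connection with the same options every time
    const nc = await connect({ servers: "demo.nats.io", maxReconnectAttempts: -1 });

    // (re)create all subscriptions here
    nc.subscribe("updates", {
        callback: (err, msg) => {
            if (!err) console.log(sc.decode(msg.data));
        },
    });

    // when the connection reaches a terminal state, start over
    nc.closed().then((err) => {
        if (err) console.error("closed with error:", err.message);
        setup().catch(console.error);
    });
}

setup().catch(console.error);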

aricart (Member) commented Jul 3, 2021

Also, part of the issue could be reconnectTimeWait: 500; have you tried something less aggressive (1000)?

luan007 (Author) commented Jul 4, 2021

Thank you for the quick response.

I'll try reconnectTimeWait: 1000 and then come back.

Regarding the 4 points above: understood, but for point 3:

It is OK for the client to have terminal states where it won't reconnect. You are given a promise when this happens, so at that point, if this resolves, you can restart your connection the same way you started it.

When the client calls connect again after an unexpected closed state (caused internally, not by the application logic itself), all previous subs are lost (subscriptions.end() or clear, as I remember). Do you have a suggestion for restarting the connection while keeping all subs, like dialLoop does?

In our application, nats.io works like a transport layer and all subs are scattered around the code base (not concentrated), which means each component would have to manage the 'close' event by itself and would generate lots of extra logic around this. I understand this is more of a design thing and might not be an issue.
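One way to avoid handling the closed state in every component would be a tiny wrapper (a sketch only, not part of nats.js; register() and restart() are hypothetical helpers): components register their subjects once, and the wrapper replays them after every full re-connect.

let current = null;   // the live connection, if any
const registry = [];  // { subject, opts } recorded by components

// components call this once instead of nc.subscribe()
function register(subject, opts) {
    registry.push({ subject, opts });
    if (current) current.subscribe(subject, opts);
}

// (re)connect and replay every registered subscription; call once at startup,
// e.g. restart(() => connect(opts))
async function restart(connectFn) {
    current = await connectFn();
    for (const { subject, opts } of registry) {
        current.subscribe(subject, opts);
    }
    current.closed().then(() => restart(connectFn));
}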


For the client, I'm using nats.js (node) & nats.ws (electron) at the same time, and in many places it works fine (disconnects and reconnects by lib design). It's just those nasty 3G / 4G environments where lots of things are uncertain, like temporary DNS pollution, random upstream disconnects (causing half-open sockets) and so on.

luan007 (Author) commented Jul 7, 2021

Latest update: tried less aggressive intervals with no luck. Only one physical site (client deployment) has such issues. I'll try running nats-server in the LAN as a leaf node and see if the server-to-server reconnect logic is more robust.

luan007 (Author) commented Jul 8, 2021

Another update: by using nats-server as a leaf node, and connecting the local ws clients to this server (which relays all messages to the cloud), the system stays stable without any issue. I'll take this as a work-around for our local deployments.

It would still be great if the ws client could replicate this behavior without a local leaf node.

Thank you for your attention.

aricart (Member) commented Jul 8, 2021

@luan007 one more data point that may be interesting is actually running the code on node.js/deno as you have it. Except for the imports, the code should run as is, and it would test whether there's a transport issue in the ws implementation that is not in the node or deno tcp transports.

Would really like to simulate your environment and get to the bottom of this.

luan007 (Author) commented Jul 10, 2021

I've tested (in parallel) the ws version & the plain js version (connecting to nats via tcp); they both fail at the same time, so I assume it's some sort of logic issue. Observed side by side, the nats-server (also deployed in parallel) fails & reconnects just fine.

I'll try to provide a minimal environment when possible, but the network part is quite hard to replicate.

aricart (Member) commented Jul 12, 2021

@luan007 by the way, have you extracted the error from the close events?

    // this promise indicates the client closed
    const done = nc.closed();

The above promise resolves to void or an error; if it has an error, it is the reason why the client closed.
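For example (illustrative only), the resolved value can be checked like this:

    const err = await nc.closed();
    if (err) {
        console.error(`client closed because: ${err.message}`);
    } else {
        console.log("client closed without an error");
    }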

luan007 (Author) commented Jul 14, 2021

Got it, will do a field test as soon as I arrive at the deployment site. ETA next Monday or so. Sorry for the wait.

aricart (Member) commented Aug 16, 2021

@luan007 any updates on this issue?

luan007 (Author) commented Aug 30, 2021

Hi, I'm back...

First of all, thank you for your patience.

After lots of digging, it seems that the 4G network operator is doing weird stuff to the network (e.g. injecting code into http traffic, heavy jitter, and so on), which creates lots of half-open tcp connections and causes problems.

In short, I can sort of replicate this.

Somehow, after a period of time, or during 4G spikes, nc.protocol.transport.isClosed turns true and the underlying websocket connection is errored out. In this state, the lib no longer tries to 'reconnect' (as isClosed is true).

Here's how to 'sort of' trigger it (within nats.ws).

Config:

sname: "Emerge-Systems-WS-Front"
listen: "0.0.0.0:10222"
server_name: $sname

websocket {
    port: 10223
    no_tls: true
    same_origin: false
    compression: false
}

Client

// load the library
import { connect, StringCodec } from './nats.js'

// do something with it...
const sc = StringCodec();

async function init() {
    const nc = await connect({
        debug: true,
        servers: ["ws://172.20.10.3:10223"],
        noEcho: false,
        noRandomize: true,
        maxReconnectAttempts: -1,
        waitOnFirstConnect: true,
        reconnectTimeWait: 500,
        pingInterval: 3 * 1000,
        maxPingOut: 3
    });
    console.log("connected");
    const sub = nc.subscribe("*", {
        callback: (e, msg) => {
            console.log("Got->", sc.decode(msg.data))
        }
    });
    setInterval(() => {
        nc.publish("hello", sc.encode("world"));
    }, 1000)
    window.nc = nc;
}

init();

Not much code here. Here are the repro steps (tricky):

  1. Connect your laptop to your mobile phone's hotspot & change the ip '172.20.10.3' to yours
  2. Start the webpage & local server; everything should be working just fine
  3. Now, disconnect your laptop's Wi-Fi. You should see PUB - PING still going, but no more PONGs.
  4. After 3 failed PINGs, nats.ws did not complain - and this is where I think things might already be problematic.
  5. Now, reconnect to your phone immediately, before Chrome considers this a timed-out socket
  6. ...we've simulated a jitter in 4G traffic, where the ws is dead but not 'errored', & pings are not working (but should be)


During the test above, the lib will reconnect after a rather long wait. On-site, however, with the jitter always present, the lib will fail to connect after some time (I cannot repro this on a local network).

My question is: does PING really work as intended? It is the only way out when the underlying connection fails silently. If PING is robust enough, jitter is not a problem.
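For context, my understanding of the intended heartbeat mechanism (a rough sketch only, not the library's actual code; sendPing() and handleStale() are illustrative names, pingInterval and maxPingOut mirror the options):

// every pingInterval ms send a PING; once maxPingOut PINGs are outstanding
// without a PONG, the connection should be declared stale and torn down so
// that the reconnect logic can take over
let outstanding = 0;
const timer = setInterval(async () => {
    if (outstanding >= maxPingOut) {
        clearInterval(timer);
        handleStale(); // expected to disconnect and trigger a reconnect
        return;
    }
    outstanding++;
    await sendPing();  // assume this resolves when the matching PONG arrives
    outstanding = 0;
}, pingInterval);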

Thank you again for your time, and apologies for not being able to fully repro this & for the long wait.

gitrojones commented

I'm not sure if this is 100% related, but the issues you're having could be related to the waitOnFirstConnect option. I've been experiencing similar issues with durable connections on some services we run, which can take a bit of time to initialize and start accepting connections.

If the service exceeds the configured timeout (20s default), it looks like the initial connection will occur and everything will work as expected for upwards of 30 min. However, eventually we end up seeing a cycle of:

Nats disconnected
Nats reconnecting
Nats connecting
Nats reconnect

Followed shortly after by a Nats Protocol Error, which causes our services to cycle. After removing the waitOnFirstConnect option and upping the timeout so our services have enough time to initialize, I'm no longer seeing the same instability as before. A very hard issue to pinpoint, as it's only on some services and appears to be somewhat variable.
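For reference, a sketch of the kind of change described above (values are illustrative; timeout is the nats.js connect timeout in milliseconds, 20000 by default):

const nc = await connect({
    servers: ["nats://localhost:4222"],
    // waitOnFirstConnect removed; give the initial connection more time instead
    timeout: 60 * 1000,
    maxReconnectAttempts: -1,
});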

luan007 (Author) commented Oct 18, 2021

@gitrojones thanks, will try.
After the cycle you've mentioned above (disconnect -> reconnect), will it stay connected or enter an endless loop?

aricart (Member) commented Oct 18, 2021

> I'm not sure if this is 100% related but the issues you're having could be related to the waitOnFirstConnect option. […]

@gitrojones
Are these servers standalone, or running on Kubernetes?
Also, can you clarify which client implementation you are seeing this on (nats.ws, nats.js or nats.deno)?

aricart (Member) commented Oct 18, 2021

I will be doing a release very soon (it may be a few days), and I am wondering if the issue is related to this:

#201

It was possible for the client to start processing a partial frame during connect, which meant that it would fail during connection because the full connect JSON was not available even though it was expected to be there.

If you are willing, I can release a beta that you can try to see if the issue persists.

All the clients were updated:
nats-io/nats.ws#114
https://github.com/nats-io/nats.js/pull/456

luan007 (Author) commented Oct 19, 2021

@aricart amazing & will try, thank you!

aricart (Member) commented Oct 19, 2021

@luan007 I published dev: 1.3.1-3 for nats.ws; if you want to try that, it would be awesome.
npm update nats.ws@dev

luan007 (Author) commented Oct 20, 2021

@aricart Updated & testing, closing this now as there're workarounds & lack of log to pin down the issue. Will reopen if there's anything substantial related to this. Thank you all!

@luan007 luan007 closed this as completed Oct 20, 2021