
Suggestion: Possibility for higher resilience for connect-once, never-disconnects use cases. #170

Closed
luan007 opened this issue Jul 3, 2021 · 19 comments


luan007 commented Jul 3, 2021

Hi,

This is not an issue but a suggestion, or maybe a question. I'm posting it here instead of nats.ws as it seems to be related to some logic within the base client.

We're using nats.js / nats.ws on unstable edge networks (4G or 3G). The application connects & subscribes once, and is expected to stay connected forever (and to reconnect quickly).

Thus, a relatively aggressive pinging / reconnecting config is used:

{
    noEcho: true,
    noRandomize: true,
    maxReconnectAttempts: -1,
    waitOnFirstConnect: true,
    reconnectTimeWait: 500,
    pingInterval: 3 * 1000, 
    maxPingOut: 3
}

Lately we've noticed that when devices are doing PUBs, the client is sometimes in the closed state, which causes an error, and I'm sure there's no application logic that calls close() on the client. It seems that _close() is being called from the inside.

In this state, all subs are lost since close() cleans them up, and the heartbeat also stops. If the dev wants a never-ending connection, he or she must re-init the connection manually and re-create all subs manually, which introduces a 'loop-like' structure:

async function init() {
    await connect();
    // ...re-create all subs here
}

on("closed", init); // pseudo-code: re-run init() whenever the client closes

This forces all subs into one function, or introduces an event-emitter-like structure into end users' code.

My suggestion is: if maxReconnectAttempts == -1, which means 'it never ends', can we keep the client instance intact no matter what, and keep retrying and pinging until the end of the application? This would simplify edge-app complexity.

For now, we're monkey-patching _close:

io.protocol._close = async () => {
    // it never dies!
    reset("Underlying structure breaks");
};

// wrap heartbeat cancellation: when the timer is cancelled because the
// connection went stale while the protocol still reports connected, force a reset
var cc = io.protocol.heartbeats.cancel.bind(io.protocol.heartbeats);
io.protocol.heartbeats.cancel = (stale) => {
    cc(stale);
    if (stale && io.protocol.connected) {
        reset("Protocol hidden crash!");
    }
};
var reset_busy = false;
async function reset(e) {
    if (reset_busy) {
        console.warn("io busy", e);
        return;
    }
    reset_busy = true;
    try {
        try {
            io.protocol.transport.close();
        } catch (_) {
            // the transport may already be gone
        }
        // re-prepare the protocol handler and dial until a server answers
        io.protocol.prepare();
        await io.protocol.dialLoop();
        io._closed = false;
        io.protocol._closed = false;
    } catch (err) {
        console.error(err);
    } finally {
        reset_busy = false;
    }
}

The above is some horrible logic, but it keeps the problem away.

Thank you for your attention.

luan007 (Author) commented Jul 3, 2021

Actually, to make things simpler:

If maxReconnectAttempts == -1, then in _close (as opposed to close(), which is called by the user), could the client reconnect / restart the underlying transport if possible?

aricart (Member) commented Jul 3, 2021

@luan007 I need to look at your description more closely, but a couple of things:

  • The outbound buffer is reset on disconnect; the reason for this is that it may have partials or be corrupt.
  • Subscriptions are not tracked on the buffer but are recreated on a connect.
  • It is OK for the client to have terminal states where it won't reconnect. You are given a promise when this happens, so at that point, if this resolves, you can restart your connection the same way you started it (see the sketch after this list).
  • If you are experiencing this on the WebSocket client, what is the host environment or library?
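A minimal sketch of that restart pattern, assuming the nats.js (node) API; the setup() helper, the server address and the subject name are illustrative only:

import { connect, StringCodec } from "nats";

const sc = StringCodec();

async function setup() {
    // (re)establish the connection with the same options every time
    const nc = await connect({ servers: "demo.nats.io", maxReconnectAttempts: -1 });

    // (re)create all subscriptions here
    nc.subscribe("updates", {
        callback: (err, msg) => {
            if (!err) console.log(sc.decode(msg.data));
        },
    });

    // when the connection reaches a terminal state, start over
    nc.closed().then((err) => {
        if (err) console.error("closed with error:", err.message);
        setup().catch(console.error);
    });
}

setup().catch(console.error);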

aricart (Member) commented Jul 3, 2021

Also, part of the issue could be reconnectTimeWait: 500; have you tried something less aggressive (1000)?

luan007 (Author) commented Jul 4, 2021

Thank you for the quick response.

I'll try reconnectTimeWait: 1000 and then come back.

Regarding the 4 points above: understood, but for point 3:

It is OK for the client to have terminal states where it won't reconnect. You are given a promise when this happens, so at that point, if this resolves, you can restart your connection the same way you started it.

When the client calls connect again after an unexpected closed state (caused internally, not by the application logic itself), all previous subs are lost (subscriptions.end() or clear, as I remember). Do you have a suggestion for restarting the connection while keeping all subs, like dialLoop does?

In our application, nats.io works like a transport layer and all subs are scattered around the code base (not concentrated), which means each component would have to manage the 'close' event by itself and would generate lots of extra logic around this. I understand this is more of a design thing and might not be an issue.
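One way to avoid handling the closed state in every component would be a tiny wrapper (a sketch only, not part of nats.js; register() and restart() are hypothetical helpers): components register their subjects once, and the wrapper replays them after every full re-connect.

let current = null;   // the live connection, if any
const registry = [];  // { subject, opts } recorded by components

// components call this once instead of nc.subscribe()
function register(subject, opts) {
    registry.push({ subject, opts });
    if (current) current.subscribe(subject, opts);
}

// (re)connect and replay every registered subscription; call once at startup,
// e.g. restart(() => connect(opts))
async function restart(connectFn) {
    current = await connectFn();
    for (const { subject, opts } of registry) {
        current.subscribe(subject, opts);
    }
    current.closed().then(() => restart(connectFn));
}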


For the client, I'm using nats.js (node) & nats.ws (electron) at the same time, and in many places it works fine (disconnects and reconnects by lib design). It's just those nasty 3G / 4G environments where lots of things are uncertain, like temporary DNS pollution, random upstream disconnects (causing half-open sockets) and so on.

luan007 (Author) commented Jul 7, 2021

Latest update: tried less aggressive intervals with no luck. Only one physical site (client deployment) has such issues. I'll try running nats-server in the LAN as a leaf node and see if the server-to-server reconnect logic is more robust.

luan007 (Author) commented Jul 8, 2021

Another update: by using nats-server as a leaf node, and connecting the local ws clients to this server (which relays all messages to the cloud), the system stays stable without any issue. I'll take this as a work-around for our local deployments.

It would still be great if the ws client could replicate this behavior without a local leaf node.

Thank you for your attention.

aricart (Member) commented Jul 8, 2021

@luan007 one more data point that may be interesting is actually running the code on node.js/deno as you have it. Except for the imports, the code should run as is, and it would test whether there's a transport issue in the ws implementation that is not in the node or deno tcp transports.

Would really like to simulate your environment and get to the bottom of this.

luan007 (Author) commented Jul 10, 2021

I've tested (in parallel) the ws version & the plain js version (connecting to nats via tcp); they both fail at the same time, so I assume it's some sort of logic issue. Observed side by side, the nats-server (also deployed in parallel) fails & reconnects just fine.

I'll try to provide a minimal environment when possible, but the network part is quite hard to replicate.

aricart (Member) commented Jul 12, 2021

@luan007 by the way, have you extracted the error from the close events?

    // this promise indicates the client closed
    const done = nc.closed();

The above promise resolves to void or an error; if it has an error, it is the reason why the client closed.
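For example (illustrative only), the resolved value can be checked like this:

    const err = await nc.closed();
    if (err) {
        console.error(`client closed because: ${err.message}`);
    } else {
        console.log("client closed without an error");
    }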

luan007 (Author) commented Jul 14, 2021

Got it, will do a field test as soon as I arrive at the deployment site. ETA next Monday or so. Sorry for the wait.

aricart (Member) commented Aug 16, 2021

@luan007 any updates on this issue?

luan007 (Author) commented Aug 30, 2021

Hi, I'm back...

First of all, thank you for your patience.

After lots of digging, it seems that the 4G network operator is doing weird stuff to the network (e.g. injecting code into http traffic, heavy jitter, and so on), which creates lots of half-open tcp connections and causes problems.

In short, I can sort of replicate this.

Somehow, after a period of time, or during 4G spikes, nc.protocol.transport.isClosed turns true and the underlying websocket connection is errored out. In this state, the lib no longer tries to 'reconnect' (as isClosed is true).

Here's how to 'sort of' trigger it (within nats.ws).

Config:

sname: "Emerge-Systems-WS-Front"
listen: "0.0.0.0:10222"
server_name: $sname

websocket {
    port: 10223
    no_tls: true
    same_origin: false
    compression: false
}

Client

// load the library
import { connect, StringCodec } from './nats.js'

// do something with it...
const sc = StringCodec();

async function init() {
    const nc = await connect({
        debug: true,
        servers: ["ws://172.20.10.3:10223"],
        noEcho: false,
        noRandomize: true,
        maxReconnectAttempts: -1,
        waitOnFirstConnect: true,
        reconnectTimeWait: 500,
        pingInterval: 3 * 1000,
        maxPingOut: 3
    });
    console.log("connected");
    const sub = nc.subscribe("*", {
        callback: (e, msg) => {
            console.log("Got->", sc.decode(msg.data))
        }
    });
    setInterval(() => {
        nc.publish("hello", sc.encode("world"));
    }, 1000)
    window.nc = nc;
}

init();

Not much code here. Here are the repro steps (tricky):

  1. Connect your laptop to your mobile phone's hotspot & change the ip '172.20.10.3' to yours
  2. Start the webpage & local server; everything should be working just fine
  3. Now, disconnect your laptop's Wi-Fi. You should see PUB - PING still going, but no more PONGs.
  4. After 3 failed PINGs, nats.ws did not complain - and this is where I think things might already be problematic.
  5. Now, reconnect to your phone immediately, before Chrome considers this a timed-out socket
  6. ...we've simulated a jitter in 4G traffic, where the ws is dead but not 'errored', & pings are not working (but should be)


During the test above, the lib will reconnect after a rather long wait. On-site, however, with the jitter always present, the lib will fail to connect after some time (I cannot repro this on a local network).

My question is: does PING really work as intended? It is the only way out when the underlying connection fails silently. If PING is robust enough, jitter is not a problem.
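For context, my understanding of the intended heartbeat mechanism (a rough sketch only, not the library's actual code; sendPing() and handleStale() are illustrative names, pingInterval and maxPingOut mirror the options):

// every pingInterval ms send a PING; once maxPingOut PINGs are outstanding
// without a PONG, the connection should be declared stale and torn down so
// that the reconnect logic can take over
let outstanding = 0;
const timer = setInterval(async () => {
    if (outstanding >= maxPingOut) {
        clearInterval(timer);
        handleStale(); // expected to disconnect and trigger a reconnect
        return;
    }
    outstanding++;
    await sendPing();  // assume this resolves when the matching PONG arrives
    outstanding = 0;
}, pingInterval);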

Thank you again for your time, and apologies for not being able to fully repro this & for the long wait.

gitrojones commented

I'm not sure if this is 100% related, but the issues you're having could be related to the waitOnFirstConnect option. I've been experiencing similar issues with durable connections on some services we run, which can take a bit of time to initialize and start accepting connections.

If the service exceeds the configured timeout (20s default), it looks like the initial connection will occur and everything will work as expected for upwards of 30 min. However, eventually we end up seeing a cycle of:

Nats disconnected
Nats reconnecting
Nats connecting
Nats reconnect

Followed shortly after by a Nats Protocol Error, which causes our services to cycle. After removing the waitOnFirstConnect option and upping the timeout so our services have enough time to initialize, I'm no longer seeing the same instability as before. A very hard issue to pinpoint, as it's only on some services and appears to be somewhat variable.
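For reference, a sketch of the kind of change described above (values are illustrative; timeout is the nats.js connect timeout in milliseconds, 20000 by default):

const nc = await connect({
    servers: ["nats://localhost:4222"],
    // waitOnFirstConnect removed; give the initial connection more time instead
    timeout: 60 * 1000,
    maxReconnectAttempts: -1,
});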

luan007 (Author) commented Oct 18, 2021

@gitrojones thanks, will try.
After the cycle you've mentioned above (disconnect -> reconnect), will it stay connected or enter an endless loop?

aricart (Member) commented Oct 18, 2021

> I'm not sure if this is 100% related but the issues you're having could be related to the waitOnFirstConnect option. […]

@gitrojones
Are these servers standalone, or running on Kubernetes?
Also, can you clarify which client implementation you are seeing this on (nats.ws, nats.js or nats.deno)?

aricart (Member) commented Oct 18, 2021

I will be doing a release very soon (it may be a few days), and I am wondering if the issue is related to this:

#201

It was possible for the client to start processing a partial frame during connect, which meant that it would fail during connection because the full connect JSON was not available even though it was expected to be there.

If you are willing, I can release a beta that you can try to see if the issue persists.

All the clients were updated:
nats-io/nats.ws#114
https://github.com/nats-io/nats.js/pull/456

luan007 (Author) commented Oct 19, 2021

@aricart amazing & will try, thank you!

aricart (Member) commented Oct 19, 2021

@luan007 I published dev: 1.3.1-3 for nats.ws; if you want to try that, it would be awesome.
npm update nats.ws@dev

luan007 (Author) commented Oct 20, 2021

@aricart Updated & testing, closing this now as there're workarounds & lack of log to pin down the issue. Will reopen if there's anything substantial related to this. Thank you all!

@luan007 luan007 closed this as completed Oct 20, 2021