Suggestion: Possibility for higher resilience for connect-once, never-disconnects use cases. #170
Comments
Actually, to make things simpler: if …
@luan007 I need to look at your description closer, but a couple of things: …
Also, part of the issue could be …
Thank you for the quick response, I'll try reconnectTimeWait: 1000 and then come back. For the 4 points above:
When the client calls … As in our application, nats.io works like a transport layer: all subs are scattered around the code base (not concentrated), which means each component has to manage the 'close' event by itself and would generate lots of extra logic around this. I understand this is more of a design thing, and it might not be an issue. For the client, I'm using nats.js (node) and nats.ws (electron) at the same time, and in many places it works fine (disconnect, and reconnect by lib design); it's just those nasty 3G / 4G environments where lots of things are uncertain, like temporary DNS pollution, random upstream disconnects (causing half-open sockets) and so on.
Latest update: tried less aggressive update intervals with no luck. Only one physical site (client deployments) has such issues. I'll try to use nats-server in the LAN as a leaf node and see if the server-to-server reconnect logic is more robust.
Another update here: by using nats-server as a leaf node, … It would still be great if the ws-client could replicate such behavior without a local leaf node. Thank you for your attention.
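For anyone trying the same workaround: a minimal sketch of such a topology. The hostnames and ports here are placeholders, not the author's actual setup:

```conf
# hub.conf (remote nats-server): accept leaf node connections
leafnodes {
  port: 7422
}

# leaf.conf (nats-server in the LAN): local clients connect to this
# server over the stable LAN, while it maintains the flaky 4G uplink
leafnodes {
  remotes: [
    { url: "nats-leaf://hub.example.com:7422" }
  ]
}
```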
@luan007 one more metric that may be interesting: actually run the code on node.js/deno as you have it. Except for the imports, the code should run as is, and it would test whether there's a transport issue on the … side. Would really like to simulate your environment and get to the bottom of this.
I've tested (in parallel) the ws version and a plain js version (connects to nats via tcp); they both fail at the same time, thus I assume it's some sort of logic issue. As observed side-by-side, the nats-server (also deployed in parallel) fails and reconnects just fine. I'll try to provide a minimal env when possible, but the network part is quite hard to replicate.
@luan007 by the way, have you extracted the error on the close events?
That promise resolves to void or an Error; if it has an error, it is the reason why the client closed.
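For reference, that promise is the one returned by the client's closed() method in nats.js; a minimal sketch of extracting the reason:

```js
// closed() resolves to void on a clean close, or to the Error that
// caused the client to close
const err = await nc.closed();
if (err) {
  console.error("client closed because of:", err.message);
} else {
  console.log("client closed cleanly");
}
```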
Got it, will do a field test as soon as I arrive at the deployment site. ETA next Monday or so. Sorry for the wait.
@luan007 any updates on this issue?
Hi, I'm back... First of all, thank you for your patience. After lots of digging, it seems that the 4G network operator is doing weird stuff to the network (i.e. injecting code into http traffic, heavy jitter, and so on), and creates lots of half-open tcp connections, which causes problems. In short, I can sort of replicate this: somehow, after a period of time, or during 4G spikes, … Here's how to 'sort of' trigger it (within …). Config:
Client:

```js
// load the library
import { connect, StringCodec } from "./nats.js";

// do something with it...
const sc = StringCodec();

async function init() {
  const nc = await connect({
    debug: true,
    servers: ["ws://172.20.10.3:10223"],
    noEcho: false,
    noRandomize: true,
    maxReconnectAttempts: -1, // retry forever
    waitOnFirstConnect: true,
    reconnectTimeWait: 500,
    pingInterval: 3 * 1000,
    maxPingOut: 3,
  });
  console.log("connected");

  // log everything received on single-token subjects
  const sub = nc.subscribe("*", {
    callback: (e, msg) => {
      console.log("Got->", sc.decode(msg.data));
    },
  });

  // publish a test message every second
  setInterval(() => {
    nc.publish("hello", sc.encode("world"));
  }, 1000);

  window.nc = nc; // expose for poking at it in the devtools console
}

init();
```

Not much in code. Here's the repro (tricky): …
But during the testing above, the lib will reconnect after a rather long wait. On-site, however, with the jitter always present, the lib will fail to connect after some time (cannot repro in the local net)... My question is: does PING really work as intended? It is the only way out when the underlying connection fails silently. Thank you again for your time, and apologies for the long wait and for not being able to repro.
I'm not sure if this is 100% related, but the issues you're having could be related to the connect timeout. If the service exceeds the configured timeout (20s default), it looks like the initial connection will occur and everything will work as expected for upwards of 30 min. However, eventually we end up seeing a cycle of: …
Followed shortly after by a …
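For readers hitting the same thing: the knob being referred to appears to be the timeout connect option in nats.js/nats.ws (20000 ms by default); the value below is only an illustration:

```js
import { connect } from "./nats.js";

// raise the connect timeout so a slow first CONNECT on a high-latency
// link is not treated as a failure (the default is 20000 ms)
const nc = await connect({
  servers: ["ws://172.20.10.3:10223"],
  timeout: 60 * 1000,
});
```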
@gitrojones thanks & will try..
@gitrojones …
I will be doing a release very soon now (but it may be a few days), but I am wondering if the issue is related to this: it was possible for the client to start processing a partial frame during connect, which meant that it would fail during connection, because the full connect JSON was not available even if it was expected to be there. If you are willing to try, I can release a beta and you can see if the issue persists. All the clients were updated: …
@aricart amazing & will try, thank you!
@luan007 I published dev: 1.3.1-3 for nats.ws; if you want to try that, this would be awesome.
@aricart Updated & testing. Closing this now as there are workarounds and a lack of logs to pin down the issue. Will reopen if there's anything substantial related to this. Thank you all!
Hi,

This is not an issue but a suggestion, or a question maybe, and I'm posting it here instead of nats.ws as this seems to be related to some logic within the base client.

We're using nats.js / nats.ws in unstable edge networks (powered by 4G or 3G); the application connects & subs once, and is expected to stay connected forever (and to reconnect quickly).
Thus, a relatively aggressive pinging / reconnecting config is used here:
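(The config snippet didn't survive the copy here; judging from the reproduction code posted later in the thread, it is presumably something like the following:)

```js
{
  maxReconnectAttempts: -1, // retry forever
  waitOnFirstConnect: true, // keep retrying even if the first connect fails
  reconnectTimeWait: 500,   // retry every 500 ms
  pingInterval: 3 * 1000,   // ping every 3 s
  maxPingOut: 3             // declare the connection stale after 3 unanswered pings
}
```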
Lately we've noticed that when devices are doing PUBs, sometimes the client will be in closed state, which causes errors, and I'm sure there's no application logic that close()s the client. It seems that _close() is being called from the inside.

In this state, all subs are lost, as close() cleans those up, and the heartbeat also stops. If the dev wants to create a never-ending connection, he or she must re-init the connection manually, and re-init all subs manually, which introduces a 'loop like' structure (a sketch follows below). This forces all subs into a function, or will introduce an event-emitter-alike structure into end users' code.
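A sketch of that 'loop like' structure, using only the public connect()/closed() API; the helper and its names are illustrative, not the author's code:

```js
import { connect } from "./nats.js";

// Illustrative only: whenever the client reaches the closed state,
// re-create the connection and re-register every subscription.
async function runForever(opts, setupSubs) {
  for (;;) {
    try {
      const nc = await connect(opts);
      setupSubs(nc);     // all subs must be funneled through this function
      await nc.closed(); // resolves once the client closes, e.g. via the internal _close()
    } catch (err) {
      console.error("connect failed:", err);
    }
    await new Promise((r) => setTimeout(r, 1000)); // small back-off before re-init
  }
}
```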
My suggestion here is: if reconnect == -1, which suggests 'it never ends', can we keep the client instance intact no matter what, and keep trying and pinging until the end of the application? This will simplify edge-app complexity.

For now, we're monkey patching _close:
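(The actual patch didn't survive the copy; hypothetically it is something in this spirit. nc.protocol and _close() are private, version-dependent internals, so treat this purely as an illustration:)

```js
// Hypothetical illustration, not the author's actual patch: suppress the
// internal _close() so the instance, its subs and the heartbeat survive.
const protocol = nc.protocol;
const originalClose = protocol._close.bind(protocol);
protocol._close = (err) => {
  console.warn("suppressed internal _close:", err);
  // deliberately not calling originalClose(err) here, so the client
  // keeps its state and continues reconnect attempts
};
```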
The above is some horrible logic, but it keeps the problem away.
Thank you for your attention.