Continuous connection attempts between a deployed seed node and local agent #415
Comments
After following the lifecycle of a local node starting up, syncing its node graph with a deployed agent seed node, running for a few minutes, and then being killed, I believe this issue is likely closely related to (if not the same as) #413. The process of syncing the node graph at the beginning looks normal on both sides, but once it's done we get a continuous stream of NodeConnectionManager logs about connecting back to the local agent. Since we don't see any actual logs of a connection being made (on either side), I believe the deployed agent isn't actually making multiple connections - it just attempts to, realises there's an existing one, and drops the attempt. These logs can be very confusing though, so maybe we should rethink whether they're needed. Once the local agent is killed, we see the following logs on the deployed agent:
I actually can't replicate these logs when just simulating the setup using local agents - the continuous logs (both the NodeConnectionManager and ConnectionForward ones) only seem to appear on a deployed agent. To clarify, my local setup just uses 127.0.0.1 for all agents, so maybe it would be worth testing with namespaces as well?
I think this issue is actually not about the number or frequency of connection attempts, because these are easily configurable/adjustable. I think what this is really about is our log messages not being very clear. It doesn't matter what the purpose of a node connection is, the log for an attempted node connection always looks the same, so it can be very hard to 1) debug what part of the code is actually running at any given moment and 2) know what tasks are occurring in the background. I think a couple of things need to happen here:
Possibly unrelated, but I've also seen this error a couple of times - maybe too many connections at once?
For 1. I think it's still useful to know when connections are happening, background or not, but maybe we can reduce the noise a fair bit: just log that we're starting a node connection, starting a proxy connection, and any connection failures. Any extra information could fall under the debug level. For 2. it would be useful to know what's getting called, but I don't know what useful information to log out. Keep in mind that using a connection generally looks like this:

```ts
nodeConnectionManager.withConnF(
  targetNodeId,
  async (connection) => {
    const client = connection.getClient();
    await client.doThing(Message);
  },
);
```
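To illustrate the kind of log-level split being suggested here, a minimal sketch follows. All names below (Logger, ConnectionCache, dial) are illustrative stand-ins, not the real NodeConnectionManager API.

```ts
// Minimal sketch of the proposed log-level split. Names are illustrative,
// not the actual NodeConnectionManager implementation.
type NodeId = string;

interface NodeConnection {
  destroy(): Promise<void>;
}

interface Logger {
  info(msg: string): void;
  debug(msg: string): void;
}

class ConnectionCache {
  protected connections: Map<NodeId, NodeConnection> = new Map();

  constructor(
    protected logger: Logger,
    protected dial: (nodeId: NodeId) => Promise<NodeConnection>,
  ) {}

  async getConnection(nodeId: NodeId): Promise<NodeConnection> {
    const existing = this.connections.get(nodeId);
    if (existing != null) {
      // Reusing a memoised connection is not an async lifecycle event,
      // so it is only visible at debug level (or not logged at all).
      this.logger.debug(`Reusing existing connection to ${nodeId}`);
      return existing;
    }
    // Creating a new connection is a lifecycle event worth an info log.
    this.logger.info(`Creating connection to ${nodeId}`);
    try {
      const connection = await this.dial(nodeId);
      this.connections.set(nodeId, connection);
      return connection;
    } catch (e) {
      // Failures are always worth surfacing.
      this.logger.info(`Failed to connect to ${nodeId}: ${String(e)}`);
      throw e;
    }
  }
}
```

The point being that only creation and failure produce info-level output; the memoised re-use path stays at debug or below.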
Regarding the warning: it seems the default number of listeners is 10 (https://nodejs.org/api/events.html#eventsdefaultmaxlisteners), so if you add more than 10 listeners to a single emitter, it starts warning about it. (Does it prevent the additional listener - is it a soft limit or a hard limit?) Which event emitter is this? Is it the EventBus? We should be setting a sensible default for the maximum number of listeners... but if we are going to allow it to be unbounded, then we must also control that boundary so that we don't end up with a memory leak.
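For reference, a small sketch of how the listener limit behaves in Node.js: the threshold triggers a MaxListenersExceededWarning rather than acting as a hard cap, and it can be tuned per emitter or process-wide. The `bus` name is just a stand-in here.

```ts
import { EventEmitter } from 'events';

// By default an EventEmitter emits a MaxListenersExceededWarning once more
// than 10 listeners are registered for a single event name. It is a soft
// limit: the extra listeners are still added, only the warning is printed.
const bus = new EventEmitter();

// Raise the threshold for this emitter only:
bus.setMaxListeners(100);

// Or raise it process-wide for all emitters:
EventEmitter.defaultMaxListeners = 20;

// Setting it to 0 disables the warning entirely, which is only safe if
// listener growth is bounded elsewhere; otherwise leaks go unnoticed.
// bus.setMaxListeners(0);

console.log(bus.getMaxListeners()); // 100 (instance setting overrides default)
```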
These log messages should be removed. The whole idea is that we are memoising the node connections: if a connection already exists, we re-use it; if it doesn't, we start one up. Log messages indicating "async lifecycles" are useful and that's why we log them out, so it makes sense to log when we are creating a new node connection. It doesn't make sense to log when we are just checking whether a node connection already exists and reusing it. In the future a tracing system can be introduced to handle both async object lifecycles AND async function scope lifecycles, but for now these logs should just be removed.

At the same time, @tegefaulkes indicates that there are a lot of connection requests being made to the same seed node over and over, due to the refresh buckets. We can add some heuristics to "coalesce" the result set of the refresh buckets operation - basically, if we have already asked the seed node for some nodes recently, we shouldn't need to ask it again for every bucket.

Furthermore, we should probably add some random jitter to the refresh bucket TTL. Right now there's a thundering herd occurring 1 hour after the first sync node graph, where 256 (255?) refresh bucket tasks all do a `findNode` against the seed node at roughly the same time.

Not sure about the event emitter problem. This may be less of a problem if we change to using EventTarget in #444.
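As a minimal sketch of the jitter idea (the function and parameter names here are assumptions for illustration, not the actual NodeManager configuration):

```ts
// Sketch: spread refreshBucket delays with random jitter so the ~256 bucket
// refresh tasks do not all fire at the same time (the thundering herd above).
function jitteredDelay(baseDelay: number, spread: number = 0.5): number {
  // spread is the fraction of baseDelay to randomise over, e.g. 0.5 means
  // the actual delay lands anywhere in [0.75 * baseDelay, 1.25 * baseDelay].
  const jitter = (Math.random() - 0.5) * spread * baseDelay;
  return baseDelay + jitter;
}

// Example: 256 bucket refresh tasks with a 1 hour base delay now start at
// staggered times instead of all at the 1 hour mark.
const hour = 60 * 60 * 1000;
const delays = Array.from({ length: 256 }, () => jitteredDelay(hour));
```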
There are only 2 uses of it. The first is not likely to be the problem, but the flow count interceptor maybe. This was intended to allow us to have asynchronous interceptors... This may be an issue; we may wish to change this. We should also verify that there's no memory leak here...
@tegefaulkes, see if you can figure out a nice heuristic to limit the repeated refresh bucket queries to the same node.
I think this is reasonably separate from this issue. A new issue should be created for it.
I've added tasks to the issue description. Overall this should take 1-2 hours to complete.
- Removing excessive logging for using connections. We don't need 3 log messages each time we use an existing connection.
- Adding 'jitter' or spacing to the `refreshBucket` delays so that they don't all run at once. This is implemented with a `refreshBucketDelaySpread` parameter that specifies the multiple of the delay to spread across; it defaults to 0.5 for a 50% spread.
- Adding a 'heuristic' to `refreshBucket` to prevent it from contacting the same nodes repeatedly. Currently this is just a check in `getClosestGlobalNodes`: if we find fewer than `nodeBucketLimit` nodes, we reset the timer on all `refreshBucket` tasks.
- Adding tests for checking the spread of `refreshBucket` delays, and another test for resetting the timer on `refreshBucket` tasks if a `findNode` finds fewer than 20 nodes.

#415
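A rough sketch of the reset heuristic described above, assuming a simplified scheduler interface in place of the real tasks system (the names below are illustrative):

```ts
// Sketch of the reset heuristic: if a network search returns fewer nodes than
// a full bucket (20 per the issue), we have effectively seen the whole network
// this node can reach, so there is no point in every refreshBucket task
// re-querying the same seed node shortly afterwards.
interface RefreshBucketScheduler {
  resetDelay(bucketIndex: number, delay: number): void;
}

const nodeBucketLimit = 20; // maximum nodes per bucket

function handleFindNodeResult(
  foundNodes: Array<string>,
  scheduler: RefreshBucketScheduler,
  refreshBucketDelay: number,
  bucketCount: number = 256,
): void {
  if (foundNodes.length < nodeBucketLimit) {
    // Fewer than a full bucket found: push every refreshBucket task back to
    // its full delay instead of letting them all re-contact the seed node.
    for (let i = 0; i < bucketCount; i++) {
      scheduler.resetDelay(i, refreshBucketDelay);
    }
  }
}
```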
Describe the bug
After initially connecting to a deployed seed node, the seed node will also attempt to open a forward connection back to the local agent that initialised the original connection. The two nodes will then continue to connect to each other (how long this continues is yet to be determined, but it is at least several minutes).
To Reproduce
npm run polykey -- agent start --seednodes="[deployedAgentNodeId]@[deployedAgentHost]:[deployedAgentPort]" --verbose --format json
Expected behavior
Connections should not be re-attempted continuously, regardless of whether a successful connection was established in the beginning. Once a connection is established there should be no more logs about the state of the connection unless it changes. Some of the connections may also be to other nodes as part of syncing the node graph in the background; however, if each of these connections is logged it can get confusing, so maybe we should rethink this.
Additional context
Tasks
- Add spread/jitter to the `refreshBucket` delays to avoid large batches happening at once. (trivial)
- Add a heuristic to `refreshBucket` to avoid asking the same node over and over again. (1 hour)
  - If a `refreshBucket` operation returns less than 20 nodes then that's the WHOLE network it can see; reset the delays for all other buckets.
  - If a node has been queried recently then skip asking it again. (might not need this one)

This will be addressed alongside "Starting Connection Forward infinite loop" #413, if at all.