-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GRPC call failure when multiple nodes connecting to 1 seed node. #369
Comments
The https://grpc.github.io/grpc/core/md_doc_statuscodes.html It is one of:
|
It's important to setup wireshark to decode GRPC frames. We have plaintext GRPC frames being transferred between the GRPC client and the forward side of the proxy. As well as the reverse side of the proxy to the GRPC server. The wireshark system should be watching those packets, and together form an HTTP2 frame and GRPC frame trace. https://wiki.wireshark.org/gRPC We've done this before, using wireshark alone, no additional software. It's just some settings in wireshark. And I remember there being an issue for this. It's here! https://gitlab.com/MatrixAI/Employees/matrix-orientation/-/blob/master/Development/Debugging%20gRPC%20(Wireshark).md Doing this is to find out what the exact sequence of events is. Furthermore finding out whether the |
Additionally the exceptions that we saw in our logs are not in the same order of events as they actually occurred:
This means we need to figure out exactly a proper trace, because it could be that the |
I noticed that the:
The This should have been caught by the There's a test in However this problem still occurs even after fixing up the Additionally it is missing the
|
Logging using
|
The One can rename |
The
Which means it should be in the However atm you're doing:
There's too many ways for a node connection be destroyed. This should be reduced to one exception hierarchy for destroying a node connection. Atm, the Example of this in
|
Without destroying the connection:
|
Debug mode:
|
Failure on Agent 1:
|
It seems that the The watch connectivity state appears to trigger and go from READY to IDLE indicating the connection has been closed. Possibly from the internal GRPC client library due to the protocol error. The We will need wireshark GRPC dissector applied to the Establish a base case with 2 GRPC clients connected to a single GRPC server doing echo calls, and just download the GRPC log from wireshark. Then compare this to what happens when using the proxy, because the proxy is transparent. The logs should be equivalent. Then spot the difference in the frames. |
Furthermore don't forget about the |
Further tests involving GRPC. We've created a simple test script that doesn't show this problem. Note that I've changed:
as well in order to fix the ports. import type { Host, Port } from './src/network/types';
import Logger, { LogLevel, StreamHandler, formatting } from '@matrixai/logger';
import { GRPCServer, utils as grpcUtils } from './src/grpc';
import { GRPCClientTest, openTestServer, closeTestServer, createTestService } from './tests/grpc/utils';
import { Proxy } from './src/network';
import * as grpc from '@grpc/grpc-js';
import * as keysUtils from './src/keys/utils';
import { sleep } from './src/utils';
import * as utilsPB from './src/proto/js/polykey/v1/utils/utils_pb';
async function main () {
const loggerServer = new Logger('SERVER', LogLevel.DEBUG, [
new StreamHandler(
formatting.format`${formatting.level}:${formatting.keys}:${formatting.msg} - ${formatting.date}`
),
]);
const loggerClient1 = new Logger('CLIENT1', LogLevel.DEBUG, [
new StreamHandler(
formatting.format`${formatting.level}:${formatting.keys}:${formatting.msg} - ${formatting.date}`
),
]);
const loggerClient2 = new Logger('CLIENT2', LogLevel.DEBUG, [
new StreamHandler(
formatting.format`${formatting.level}:${formatting.keys}:${formatting.msg} - ${formatting.date}`
),
]);
const authenticate = async (_metaClient, metaServer = new grpc.Metadata()) =>
metaServer;
const serverKeyPair = await keysUtils.generateKeyPair(1024);
const serverKeyPairPem = keysUtils.keyPairToPem(serverKeyPair);
const serverCert = keysUtils.generateCertificate(
serverKeyPair.publicKey,
serverKeyPair.privateKey,
serverKeyPair.privateKey,
12332432423,
);
const serverCertPem = keysUtils.certToPem(serverCert);
const serverNodeId = keysUtils.certNodeId(serverCert)!;
// TARGET GRPC SERVER
const [server, serverPort] = await openTestServer(
authenticate,
loggerServer.getChild('GRPC SERVER')
);
const remoteProxy = new Proxy({
authToken: 'abc',
logger: loggerServer.getChild('PROXY'),
});
await remoteProxy.start({
tlsConfig: {
keyPrivatePem: serverKeyPairPem.privateKey,
certChainPem: serverCertPem,
},
// FORWARD HOST & PORT
forwardHost: '127.0.0.2' as Host,
forwardPort: 55553 as Port,
// GRPC SERVER HOST & PORT
serverHost: '127.0.0.2' as Host,
serverPort: serverPort as Port,
// PROXY HOST & PORT
proxyHost: '127.0.0.2' as Host,
proxyPort: 55555 as Port,
});
// C? F[ ]P
console.log('FORWARD');
console.log(remoteProxy.getForwardHost(), remoteProxy.getForwardPort());
console.log('PROXY');
console.log(remoteProxy.getProxyHost(), remoteProxy.getProxyPort());
// S R?[ ]P
console.log('SERVER');
console.log(remoteProxy.getServerHost(), remoteProxy.getServerPort());
// CREATE THE FIRST CLIENT
// this side does not have a GRPC server
// it's not necessary
const clientKeyPair1 = await keysUtils.generateKeyPair(1024);
const clientKeyPairPem1 = keysUtils.keyPairToPem(clientKeyPair1);
const clientCert1 = keysUtils.generateCertificate(
clientKeyPair1.publicKey,
clientKeyPair1.privateKey,
clientKeyPair1.privateKey,
12332432423,
);
const clientCertPem1 = keysUtils.certToPem(clientCert1);
const clientNodeId1 = keysUtils.certNodeId(clientCert1)!;
const localProxy1 = new Proxy({
authToken: 'abc',
logger: loggerClient1.getChild('PROXY'),
});
await localProxy1.start({
tlsConfig: {
keyPrivatePem: clientKeyPairPem1.privateKey,
certChainPem: clientCertPem1,
},
forwardHost: '127.0.0.3' as Host,
forwardPort: 44443 as Port,
serverHost: '127.0.0.3' as Host,
serverPort: 44444 as Port,
proxyHost: '127.0.0.3' as Host,
proxyPort: 44445 as Port,
});
const client1 = await GRPCClientTest.createGRPCClientTest({
nodeId: serverNodeId,
host: remoteProxy.getProxyHost(),
port: remoteProxy.getProxyPort(),
proxyConfig: {
host: localProxy1.getForwardHost(),
port: localProxy1.getForwardPort(),
authToken: localProxy1.authToken,
},
logger: loggerClient1.getChild('GRPC CLIENT'),
});
// CREATE SECOND CLIENT
const clientKeyPair2 = await keysUtils.generateKeyPair(1024);
const clientKeyPairPem2 = keysUtils.keyPairToPem(clientKeyPair2);
const clientCert2 = keysUtils.generateCertificate(
clientKeyPair2.publicKey,
clientKeyPair2.privateKey,
clientKeyPair2.privateKey,
12332432423,
);
const clientCertPem2 = keysUtils.certToPem(clientCert2);
const clientNodeId2 = keysUtils.certNodeId(clientCert2)!;
const localProxy2 = new Proxy({
authToken: 'abc',
logger: loggerClient2.getChild('PROXY'),
});
await localProxy2.start({
tlsConfig: {
keyPrivatePem: clientKeyPairPem2.privateKey,
certChainPem: clientCertPem2,
},
forwardHost: '127.0.0.4' as Host,
forwardPort: 33333 as Port,
serverHost: '127.0.0.4' as Host,
serverPort: 33334 as Port,
proxyHost: '127.0.0.4' as Host,
proxyPort: 33335 as Port,
});
const client2 = await GRPCClientTest.createGRPCClientTest({
nodeId: serverNodeId,
host: remoteProxy.getProxyHost(),
port: remoteProxy.getProxyPort(),
proxyConfig: {
host: localProxy2.getForwardHost(),
port: localProxy2.getForwardPort(),
authToken: localProxy2.authToken,
},
logger: loggerClient2.getChild('GRPC CLIENT'),
});
console.log('SENDING REQUEST FROM CLIENT 1');
const m = new utilsPB.EchoMessage();
const challenge = 'Hello!';
m.setChallenge(challenge);
const unaryResponse = await client1.unary(m);
console.log(unaryResponse.getChallenge());
console.log('SENDED REQUEST FROM CLIENT 1');
console.log('SENDING REQUEST FROM CLIENT 2');
const m2 = new utilsPB.EchoMessage();
const challenge2 = 'Hello!';
m2.setChallenge(challenge2);
const unaryResponse2 = await client2.unary(m2);
console.log(unaryResponse2.getChallenge());
console.log('SENDED REQUEST FROM CLIENT 2');
// DESTROY FIRST CLIENT
await client1.destroy();
await localProxy1.stop();
// DESTROY SECOND CLIENT
await client2.destroy();
await localProxy2.stop();
// STOP THE SERVER
await remoteProxy.stop();
await closeTestServer(server);
}
main(); The packet logs showed a proper procedure for both clients talking to a single server. There are 12 (even number) amount of frame sequences being sent.
|
We need to take the script I wrote above and incrementally build up to reproducing the problem that we are seeing in the |
Also further note when you are seeing UDP packets of length 4, where the data is either 4 bytes of |
I recreated the test using
|
Since our script can generate the keys, we can export these to a file, and then use wireshark to also decrypt the UDP packets on the fly. More understanding of wireshark will be needed though. That might be useful @emmacasolin. |
I may have found the problem... The test was trying to speed up generation of the root keypair by using the parameter The question now is, how does changing the root keypair size cause a protocol error in the GRPC? I'll dig into this more. |
Could be a TLS issue. Some things require larger key sizes. This is another reason we want to move away from RSA since key sizes are not a thing in ECDSA... etc. #168 |
Are you sure this is the problem, is it just intermittent? What if you change in between, and try other key sizes. Or somewhere in between 1024 and 2048. |
There is a pattern between the seed's keys bits and the agents. where
I still have no idea why the root keypair sizes would cause the GRPC to fail. still looking into it. |
That shows that 1024 & 1024 works. Didn't you say that causes problems? What is the row and column? Can you label them seeds vs agent. |
When I discovered the problem I was doing 4096 for the seed and 1024 for the agent. |
But in my |
I still don't understand why it's happening. But with them all set to 1024 only agent2 had the problem. If I use the default for the seed and 1024 for the agents then both agent 1 and 2 fail. |
Try reproducing this with the script that I created yesterday. |
Trying the same conditions in the test-connections.ts script. When using 4096 bits for the seed and 1024 bits for each agent the script runs fine with no problems. Both agents start and are able to echo the seed node. Could this be a timing issue then? |
I don't think key sizes are the problem here. Try and go back to what we discussed yesterday, slowly build up the |
For reference, packet summaries; working
failing
|
There are 7 places where By going into the |
Looks like the error is caused by a HTTP 505 protocol error. 505 refers to a
So it seems to ultimately be a problem with http2 negotiating the connection maybe? |
Can we found out what the HTTP version is being supplied? As in where is the condition? The MDN says this is because
|
Relevant documentation about HTTP2 negotiation.
|
That seems to mean that the server can send Furthermore if this is occurring inside the NodeJS HTTP2 core, that would be here: https://github.com/nodejs/node/blob/v14.17.3/lib/internal/http2/core.js#L755-L760 |
And then:
Is passed into the C++ bindings later at:
|
Could this be the bug? nodejs/node#35475 The bug doesn't reproduce anymore in our node version |
Can we use https://nodejs.org/api/http2.html#event-remotesettings to try and see if that event gets emitted for the HTTP2Session inside grpc-js library. Will require some monkeypatching here. |
Using the env variables
|
Current theory:
My reading of the HTTP2 spec is that isn't entirely clear whether it's allowed for the server to send the This is why we need add an event handler The next step after this might be to report it upstream, while a workaround would be on how we do our reverse proxy connection setup, maybe not create the TCP connection between proxy to GRPC server UNTIL we have composed a connection first. A further workaround might be the plug the reverse side UTP stream, until we have seen data coming from the forward side, this would allow us to provide a traffic stop for the GRPC server's |
So yes, in the working case the client receives a 'remoteSettings' event and in the failing case it does not receive that event. So we can confirm that the Settings frames are causing the problem for the failing case. |
I've implemented a solution. We can delay the proxy form sending the Settings frame out of order by waiting for data to be received first. On the forward proxy way can delay sending data to the client before we receive data from the client by using
This fixes the problematic test in |
There was an issue in obscure conditions that caused certain packets to be out of the expected order for HTTP2 leading to a protocol error. This ensure that the `MAGIC` frame comes before the server's `SETTING` frame. #369
There was an issue in obscure conditions that caused certain packets to be out of the expected order for HTTP2 leading to a protocol error. This ensure that the `MAGIC` frame comes before the server's `SETTING` frame. #369
There was an issue in obscure conditions that caused certain packets to be out of the expected order for HTTP2 leading to a protocol error. This ensure that the `MAGIC` frame comes before the server's `SETTING` frame. Updated `NodeConnectionManager` lifecycle logging. `Proxy` and `NodeConnection` rejects connections to wildcard `0.0.0.0` and `::` addresses. Fixed up any tests broken by this change. Added a test to check if you can connect to the target through the proxies after the proxies have created the UTP connection. Fixes #369
There was an issue in obscure conditions that caused certain packets to be out of the expected order for HTTP2 leading to a protocol error. This ensure that the `MAGIC` frame comes before the server's `SETTING` frame. Updated `NodeConnectionManager` lifecycle logging. `Proxy` and `NodeConnection` rejects connections to wildcard `0.0.0.0` and `::` addresses. Fixed up any tests broken by this change. Added a test to check if you can connect to the target through the proxies after the proxies have created the UTP connection. Fixes #369
Specification
There seems to be a problem with the GRPC client calls in a very specific circumstance. In the test
tests/nat/noNAT.test.ts:239
'agents in different namespaces can ping each other via seed node'
we are trying to connect to the same seed node from two agents. the sequence isThe problem here is during the polykey.start() sequence in agent2 we are calling
nodeConnectionManager.syncNodeGraph()
. this is making a agent-agent GRPC call. It seems that any client GRPC call to the seed1 causes a errorErrorGRPCClientCall: Generic call error - 13 INTERNAL: Received RST_STREAM with code 2 triggered by internal client error: Protocol error
to be thrown.This only happens within this test. trying the same procedure manually causes no problem. Replicating the circumstances using polykey agents and
NodeConnectionManager
s to make the GRPC calls works fine as well.The error implies that the the stream is being closed because of a protocol error during communication. it is hard to tell where this is coming from.
We need to work out why this problem is happening.
Some things to verify;
WARN:ConnectionForward 127.0.0.1:53659:Client Error: Error: write EPIPE
is it because theclientSocket
is closed and thetlsSocket
is trying to write data to it and thus emitting anEPIPE
.GRPCClient
is triggering theclientSocket
to be closed on theConnectionForward
GRPCClient
closing, and what is causing the subsequent protocol errorErrorGRPCClientCall: Generic call error - 13 INTERNAL: Received RST_STREAM with code 2 triggered by internal client error: Protocol error
.Additional context
Tasks
0.0.0.0
to prevent accidental usage: GRPC call failure when multiple nodes connecting to 1 seed node. #369 (comment)createConnection
works in NCM so it's not outputting that it's creating a connection multiple times, even though it fetches it from the cache GRPC call failure when multiple nodes connecting to 1 seed node. #369 (comment)withConnF
error handling to use theResourceRelease
and pass the error in there, and reduce all the relevant error instances to a limited set of exceptions, preferably only those that relate to connection failures. Consider how this may be delegated by connectivity state changes though (let's avoid duplicate logic between connectivity state and exceptions there). GRPC call failure when multiple nodes connecting to 1 seed node. #369 (comment)The text was updated successfully, but these errors were encountered: