
RPC Websocket buffer overflowing #966

Closed · morelazers opened this issue Dec 9, 2021 · 9 comments

@morelazers commented Dec 9, 2021

Description

Currently it appears as though there's a potential buffer overflow in our usage of the Websocket RPC connection between the CFE and the State Chain Node. The issue is visible when attempting Keygen with a large enough set of participants (~40+).

Others have reported issues arising due to the size of RPC responses.

The max buffer size was recently added as a config option to Substrate; we should cherry-pick this into our version of Substrate to see whether it works as a bandaid solution for now.

As a long term solution, we should evaluate the amount of data we're trying to push over the RPC connection at any one time, and determine whether we can decrease this.

@AlastairHolmes giving this a p0 since we want to test the bandaid solution ASAP.

cc @dandanlen @msgmaxim

@morelazers commented Dec 9, 2021

@msgmaxim can you give us an example of the packets (I'm mostly interested in their sizes) sent at each stage of the Keygen / Signing processes?

We determined yesterday that a KeygenRequested event with 150 * 32 byte Validator Ids is nearly 5 ~~mb~~ kb in size.

@andyjsbell

Don't you mean 5KB?

@msgmaxim

#954 actually mitigates the problem somewhat, as it reduces the sizes of all crypto primitives by a factor of 2 or 3 when serialized.

Pre-#954 each Point on the curve is 144 bytes, while a Scalar is 71 bytes. With the new serialization scheme a Point is 41 bytes and a Scalar is 40 bytes. (Annoyingly, each point and scalar is serialized with a "purpose" text field, which doesn't seem all that useful to me and adds about 10 bytes to every point/scalar, so we could probably drop it to optimise further.)

To give some idea, the message sent in Stage 1 of keygen has t+1 Points (where t is the threshold) and 1 Scalar, so for n=150, t=100 we get ~14.5KB (or ~4.1KB in the new scheme). Broadcast verification (e.g. during Stage 2) is much worse, as it requires putting the messages received from all other participants into a single message. So in keygen with 150 participants each Stage 2 message is roughly 150 * 14.5KB = 2.2MB (615KB in the new scheme). We've discussed optimising broadcast verification before, but decided we could make the first release as is. Messages for all other stages in keygen are much smaller: Stage 3 is only a single Scalar, so even broadcast verification for it (Stage 4) is not a problem.

Signing seems to be much more manageable:
Stage 1: 2 Points
Stage 2: T * 2 Points (broadcast verification) [so 100 * 2 * 41 bytes ≈ 8.2KB per message with the new scheme]
Stage 3: 1 Scalar
Stage 4: T * 1 Scalar (broadcast verification)
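
For reference, a rough back-of-envelope sketch of these sizes in plain Rust (the byte counts are the serialized sizes quoted above; the constant and function names are just illustrative, not the real crypto types):

```rust
// Back-of-envelope message sizes using the serialized sizes quoted above.
// The constants mirror the numbers in this comment; names are illustrative only.
const POINT_OLD: usize = 144; // bytes per Point, pre-#954
const SCALAR_OLD: usize = 71; // bytes per Scalar, pre-#954
const POINT_NEW: usize = 41; // bytes per Point, with the new scheme
const SCALAR_NEW: usize = 40; // bytes per Scalar, with the new scheme

/// Keygen Stage 1 message: t+1 Points and 1 Scalar.
fn keygen_stage1(t: usize, point: usize, scalar: usize) -> usize {
    (t + 1) * point + scalar
}

fn main() {
    let (n, t) = (150, 100);

    let stage1_old = keygen_stage1(t, POINT_OLD, SCALAR_OLD); // ~14.6KB
    let stage1_new = keygen_stage1(t, POINT_NEW, SCALAR_NEW); // ~4.2KB

    // Keygen Stage 2 (broadcast verification) bundles all n Stage 1 messages
    // into a single message: ~2.2MB old, ~0.6MB with the new scheme.
    let stage2_old = n * stage1_old;
    let stage2_new = n * stage1_new;

    // Signing Stage 2: T messages of 2 Points each, ~8.2KB with the new scheme.
    let signing_stage2_new = t * 2 * POINT_NEW;

    println!("keygen stage 1: {stage1_old} B old / {stage1_new} B new");
    println!("keygen stage 2: {stage2_old} B old / {stage2_new} B new");
    println!("signing stage 2: {signing_stage2_new} B new");
}
```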

@AlastairHolmes

I merged a possible fix into the merge-config-option branch, so we can try it out.

@morelazers

Thank you @msgmaxim! That makes sense. So during a broadcast verification stage, we have 150 * 2.2MB (330MB) of both outgoing and incoming messages, totalling 660MB. I can see why that would cause the 15MB buffer to have a bad time.
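
Spelling that out (the 2.2MB per-message and 15MB buffer figures are the ones from this thread; the rest is just arithmetic with illustrative names):

```rust
// Rough per-node traffic during a keygen broadcast-verification stage with 150
// participants; the 2.2MB message size and 15MB buffer come from this thread.
fn main() {
    let participants: u64 = 150;
    let stage2_message_bytes: u64 = 2_200_000; // ~2.2MB per Stage 2 message (old scheme)

    let outgoing = participants * stage2_message_bytes; // ~330MB sent out
    let incoming = participants * stage2_message_bytes; // ~330MB received
    let total = outgoing + incoming; // ~660MB through the RPC connection

    let buffer_bytes: u64 = 15_000_000; // ~15MB websocket buffer
    println!("~{}MB total vs ~{}MB buffer", total / 1_000_000, buffer_bytes / 1_000_000);
}
```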

We'll get your PR merged today and give it a crack, ideally with some profiling of the bandwidth used during the ceremony if possible @tomjohnburton.

@AlastairHolmes

I think the only currently feasible change to further decrease this problem is to batch the p2p messages, which will make all the broadcast stages very small; only the single private stage would be large. This is a refactor I would like to do anyway. Also, due to #983, I think we would like to move to handling the network connections ourselves instead of using Substrate, so we wouldn't use the RPC for p2p messages at all anymore.

@morelazers

Ok - does it make sense to close this and supersede it with #1103?

@msgmaxim also has ideas about how to reduce the amount of data being sent over the wire in the happy case quite significantly.

@AlastairHolmes commented Jan 3, 2022

I think so. I would also like to make a ticket for the p2p message batching refactor (low priority). It's basically an interface change between the p2p and the multisig code. Instead of the multisig passing each message to be sent individually on the channel to the p2p code, it could give the p2p code all the messages to be sent for a single stage at once (as a single item in the channel); this allows us to avoid duplicating the message in the broadcast. It will also simplify the multisig tests, as it lets the tests retrieve all the output messages from a given stage with a single wait on the channel, instead of having to wait on the channel once for each expected message separately.
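
A minimal sketch of what that interface change could look like (the type names and the use of std mpsc channels are illustrative assumptions, not the actual CFE types):

```rust
use std::sync::mpsc;

// Hypothetical stand-ins for the real CFE types.
type AccountId = [u8; 32];
type MultisigMessage = Vec<u8>;

// Current shape: one channel item per (recipient, message) pair, so a broadcast
// stage pushes n-1 copies of the same payload through the channel.
type PerMessageSender = mpsc::Sender<(AccountId, MultisigMessage)>;

// Batched shape: one channel item per stage. A broadcast carries its payload
// once plus the recipient list; a private (point-to-point) stage still lists a
// payload per recipient.
enum OutgoingStageMessages {
    Broadcast(Vec<AccountId>, MultisigMessage),
    Private(Vec<(AccountId, MultisigMessage)>),
}
type PerStageSender = mpsc::Sender<OutgoingStageMessages>;
```

Here the Broadcast variant is what avoids putting n-1 copies of the same payload on the channel, and a test can pull everything a stage sent with a single receive.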

Although there probably needs to be a ticket for what @msgmaxim is doing as well, if there isn't one.

@morelazers

Ok cheers - feel free to make the ticket and the plan 👍
