
RPC Websocket buffer overflowing #966

Closed · morelazers opened this issue Dec 9, 2021 · 9 comments

@morelazers commented Dec 9, 2021

Description

Currently it appears as though there's a potential buffer overflow in our usage of the Websocket RPC connection between the CFE and the State Chain Node. The issue is visible when attempting Keygen with a large enough set of participants (~40+).

Others have reported issues arising due to the size of RPC responses.

The max buffer size was recently added as a config option to Substrate; we should cherry-pick this into our version of Substrate to see whether it works as a bandaid solution for now.

As a long term solution, we should evaluate the amount of data we're trying to push over the RPC connection at any one time, and determine whether we can decrease this.

@AlastairHolmes giving this a p0 since we want to test the bandaid solution ASAP.

cc @dandanlen @msgmaxim

@morelazers commented Dec 9, 2021

@msgmaxim can you give us an example of the packets (I'm mostly interested in their sizes) sent at each stage of the Keygen / Signing processes?

We determined yesterday that a KeygenRequested event with 150 * 32 byte Validator Ids is nearly 5 ~~mb~~ kb in size.

@andyjsbell

Don't you mean 5KB?

@msgmaxim

#954 actually mitigates the problem somewhat, as it reduces the sizes of all crypto primitives by a factor of 2 or 3 when serialized.

Pre-#954 each Point on the curve is 144 bytes, while a Scalar is 71 bytes. With the new serialization scheme a Point is 41 bytes and a Scalar is 40 bytes. (Annoyingly, each point and scalar is serialized with a "purpose" text field, which doesn't seem all that useful to me and adds about 10 bytes to every point/scalar, so we could probably drop it to optimise further.)

To give some idea, the message sent in Stage 1 of keygen has t+1 Points (where t is the threshold) and 1 Scalar, so for n=150, t=100 we get ~14.5KB (or ~4.1KB in the new scheme). Broadcast verification (e.g. during Stage 2) is much worse, as it requires putting the messages received from all other participants into a single message. So in keygen with 150 participants each Stage 2 message is roughly 150 * 14.5KB = 2.2MB (615KB in the new scheme). We've discussed optimising broadcast verification before, but decided we could make the first release as is. Messages for all other stages in keygen are much smaller: Stage 3 is only a single Scalar, so even broadcast verification for it (Stage 4) is not a problem.

Signing seems to be much more manageable:
Stage 1: 2 Points
Stage 2: T * 2 Points (broadcast verification) [so 100 * 2 * 41 bytes ≈ 8.2KB per message with the new scheme]
Stage 3: 1 Scalar
Stage 4: T * 1 Scalar (broadcast verification)
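
For reference, a rough back-of-envelope sketch of these sizes in plain Rust (the byte counts are the serialized sizes quoted above; the constant and function names are just illustrative, not the real crypto types):

```rust
// Back-of-envelope message sizes using the serialized sizes quoted above.
// The constants mirror the numbers in this comment; names are illustrative only.
const POINT_OLD: usize = 144; // bytes per Point, pre-#954
const SCALAR_OLD: usize = 71; // bytes per Scalar, pre-#954
const POINT_NEW: usize = 41; // bytes per Point, with the new scheme
const SCALAR_NEW: usize = 40; // bytes per Scalar, with the new scheme

/// Keygen Stage 1 message: t+1 Points and 1 Scalar.
fn keygen_stage1(t: usize, point: usize, scalar: usize) -> usize {
    (t + 1) * point + scalar
}

fn main() {
    let (n, t) = (150, 100);

    let stage1_old = keygen_stage1(t, POINT_OLD, SCALAR_OLD); // ~14.6KB
    let stage1_new = keygen_stage1(t, POINT_NEW, SCALAR_NEW); // ~4.2KB

    // Keygen Stage 2 (broadcast verification) bundles all n Stage 1 messages
    // into a single message: ~2.2MB old, ~0.6MB with the new scheme.
    let stage2_old = n * stage1_old;
    let stage2_new = n * stage1_new;

    // Signing Stage 2: T messages of 2 Points each, ~8.2KB with the new scheme.
    let signing_stage2_new = t * 2 * POINT_NEW;

    println!("keygen stage 1: {stage1_old} B old / {stage1_new} B new");
    println!("keygen stage 2: {stage2_old} B old / {stage2_new} B new");
    println!("signing stage 2: {signing_stage2_new} B new");
}
```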

@AlastairHolmes

I merged a possible fix into the merge-config-option branch, so we can try it out.

@morelazers

Thank you @msgmaxim! That makes sense. So during a broadcast verification stage, we have 150 * 2.2MB (330MB) of both outgoing and incoming messages, totalling 660MB. I can see why that would cause the 15MB buffer to have a bad time.
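
Spelling that out (the 2.2MB per-message and 15MB buffer figures are the ones from this thread; the rest is just arithmetic with illustrative names):

```rust
// Rough per-node traffic during a keygen broadcast-verification stage with 150
// participants; the 2.2MB message size and 15MB buffer come from this thread.
fn main() {
    let participants: u64 = 150;
    let stage2_message_bytes: u64 = 2_200_000; // ~2.2MB per Stage 2 message (old scheme)

    let outgoing = participants * stage2_message_bytes; // ~330MB sent out
    let incoming = participants * stage2_message_bytes; // ~330MB received
    let total = outgoing + incoming; // ~660MB through the RPC connection

    let buffer_bytes: u64 = 15_000_000; // ~15MB websocket buffer
    println!("~{}MB total vs ~{}MB buffer", total / 1_000_000, buffer_bytes / 1_000_000);
}
```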

We'll get your PR merged today and give it a crack, ideally with some profiling of the bandwidth used during the ceremony if possible @tomjohnburton.

@AlastairHolmes

I think the only currently feasible change to further decrease this problem is to batch the p2p messages, which will make all the broadcast stages very small; only the single private stage would be large. This is a refactor I would like to do anyway. Also, due to #983, I think we would like to move to handling the network connections ourselves instead of using Substrate, so we wouldn't use the RPC for p2p messages at all anymore.

@morelazers

Ok - does it make sense to close this and supersede it with #1103?

@msgmaxim also has ideas about how to reduce the amount of data being sent over the wire in the happy case quite significantly.

@AlastairHolmes commented Jan 3, 2022

I think so. I would also like to make a ticket for the p2p message batching refactor (low priority). It's basically an interface change between the p2p and the multisig code. Instead of the multisig passing each message to be sent individually on the channel to the p2p code, it could give the p2p code all the messages to be sent for a single stage at once (as a single item in the channel); this allows us to avoid duplicating the message in the broadcast. It will also simplify the multisig tests, as it lets the tests retrieve all the output messages from a given stage with a single wait on the channel, instead of having to wait on the channel once for each expected message separately.
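
A minimal sketch of what that interface change could look like (the type names and the use of std mpsc channels are illustrative assumptions, not the actual CFE types):

```rust
use std::sync::mpsc;

// Hypothetical stand-ins for the real CFE types.
type AccountId = [u8; 32];
type MultisigMessage = Vec<u8>;

// Current shape: one channel item per (recipient, message) pair, so a broadcast
// stage pushes n-1 copies of the same payload through the channel.
type PerMessageSender = mpsc::Sender<(AccountId, MultisigMessage)>;

// Batched shape: one channel item per stage. A broadcast carries its payload
// once plus the recipient list; a private (point-to-point) stage still lists a
// payload per recipient.
enum OutgoingStageMessages {
    Broadcast(Vec<AccountId>, MultisigMessage),
    Private(Vec<(AccountId, MultisigMessage)>),
}
type PerStageSender = mpsc::Sender<OutgoingStageMessages>;
```

Here the Broadcast variant is what avoids putting n-1 copies of the same payload on the channel, and a test can pull everything a stage sent with a single receive.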

Although there probably needs to be a ticket for what @msgmaxim is doing as well, if there isn't one.

@morelazers

Ok cheers - feel free to make the ticket and the plan 👍
