
Prototype using QUIC instead of tcp and proprietary handshakes #1385

Closed · evan-forbes opened this issue Jun 10, 2024 · 4 comments

Labels: p2p, refactor, WS: Big Blonks 🔭 Improving consensus critical gossiping protocols

@evan-forbes

While not technically purpose-built for p2p applications, QUIC streams could replace our existing mechanisms for multiplexing messages and offer significant benefits. Besides removing multiple round trips from the handshake, QUIC also avoids head-of-line (HOL) blocking and lets us rely on more widely used (and potentially more battle-tested and debugged) software.

AC

Create a prototype that uses QUIC instead of TCP and SecretConnection. Compare its high-level performance (consensus throughput and peer drops) using our network tests and report the results here.
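
As a rough illustration of the stream-per-channel idea described above (not a design proposal), the sketch below opens one QUIC stream per reactor channel using the go-quic (quic-go) library. The address, ALPN tag, and channel names are made up, and the API signatures follow recent quic-go releases, which have changed over time:

// Hypothetical sketch: one QUIC stream per reactor channel, replacing the
// MConnection channel multiplexing and the SecretConnection handshake.
// Signatures follow recent quic-go releases and may differ in older ones.
package main

import (
    "context"
    "crypto/tls"
    "log"

    quic "github.com/quic-go/quic-go"
)

func main() {
    ctx := context.Background()

    // QUIC runs TLS 1.3 inside its own handshake, so there is no separate
    // secret-conn round trip. A real node would authenticate the peer's key
    // here instead of skipping verification.
    tlsConf := &tls.Config{
        InsecureSkipVerify: true,            // placeholder for real peer auth
        NextProtos:         []string{"cmt"}, // hypothetical ALPN tag
    }

    conn, err := quic.DialAddr(ctx, "peer.example.com:26656", tlsConf, nil)
    if err != nil {
        log.Fatal(err)
    }

    // One stream per channel: a slow channel can no longer head-of-line
    // block the others the way a single multiplexed TCP connection can.
    channels := []string{"consensus", "mempool", "evidence"} // illustrative names
    streams := make(map[string]quic.Stream, len(channels))
    for _, ch := range channels {
        s, err := conn.OpenStreamSync(ctx)
        if err != nil {
            log.Fatal(err)
        }
        streams[ch] = s
    }

    // Each stream is an independent, ordered, flow-controlled byte pipe.
    if _, err := streams["consensus"].Write([]byte("vote bytes would go here")); err != nil {
        log.Fatal(err)
    }
}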

@evan-forbes evan-forbes added the WS: Big Blonks 🔭 Improving consensus critical gossiping protocols label Jun 10, 2024
@evan-forbes

The only reason we would want to prototype this without libp2p is to help debug the current issues in comet, where we are unable to utilize all of the bandwidth once multiple nodes and latency are involved.

To be clear, the goal of this issue is not to write a proprietary p2p stack based on QUIC.

rach-id commented Dec 3, 2024

Investigation setup

We added support for QUIC streams in tendermint in #1466, using the go-quic library.

Then, to benchmark the performance, we implemented a mock reactor that floods the network with raw data and traces the data sent/received. This reactor was enabled both on the QUIC refactor branch: https://github.com/celestiaorg/celestia-core/tree/quic-bench-reactor and on the native tm stack branch: https://github.com/celestiaorg/celestia-core/tree/updating-the-mock-reactor-updated.

Then, for cleaner results, we disabled all the reactors except PEX so that the only data sent across the network is the mock reactor's data.

After running the experiments, we noticed that the QUIC connections in the QUIC refactor hang after a while. So, to get more accurate results, we implemented a simple project, https://github.com/rach-id/quic-bench, that uses go-quic to create a QUIC network where each peer floods its peers with random data. This allowed benchmarking QUIC performance without the overhead added by tendermint: protobuf encoding/decoding, peer handling, etc.
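
For reference, the flooding idea boils down to something like the following sketch. This is not the actual quic-bench code; the address and payload size are arbitrary, and the quic-go signatures are the recent ones. Each peer dials, keeps writing incompressible random data on a stream, and logs the achieved rate once per second:

// Rough sketch of a per-peer flooder (not the actual quic-bench code).
package main

import (
    "context"
    "crypto/rand"
    "crypto/tls"
    "log"
    "sync/atomic"
    "time"

    quic "github.com/quic-go/quic-go"
)

func main() {
    ctx := context.Background()
    tlsConf := &tls.Config{InsecureSkipVerify: true, NextProtos: []string{"bench"}}

    conn, err := quic.DialAddr(ctx, "peer.example.com:9000", tlsConf, nil)
    if err != nil {
        log.Fatal(err)
    }
    stream, err := conn.OpenStreamSync(ctx)
    if err != nil {
        log.Fatal(err)
    }

    // Incompressible random payload so nothing along the path can cheat.
    payload := make([]byte, 64*1024)
    if _, err := rand.Read(payload); err != nil {
        log.Fatal(err)
    }

    // Writer loop: keep the stream saturated.
    var sent atomic.Int64
    go func() {
        for {
            n, err := stream.Write(payload)
            if err != nil {
                return
            }
            sent.Add(int64(n))
        }
    }()

    // Reporter loop: the send rate achieved over the last second, which is
    // roughly what the collected traces measure.
    for range time.Tick(time.Second) {
        log.Printf("upload: %.2f Mbps", float64(sent.Swap(0)*8)/1e6)
    }
}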

To run the benchmarks, we used this branch of congest, https://github.com/celestiaorg/congest/tree/tm-quic-benchmarks, which creates a network of servers, provisions them, runs the network, and then collects the logs.

The plots below were generated using the traces collected from the network using congest and processed using https://github.com/celestiaorg/traces_analysis.

The servers' setup:

  • 16-CPU, 32 GB RAM servers with ~5 Gbps download and ~2.5 Gbps upload speed (as measured using speedtest-cli)
  • 52 peers distributed across the world. The plots below show the regions.
  • For the QUIC benchmarks, we increased the UDP buffers on all the servers:
sudo sysctl -w net.core.rmem_max=16777216
sudo sysctl -w net.core.wmem_max=16777216
sudo sysctl -w net.core.rmem_default=8388608
sudo sysctl -w net.core.wmem_default=8388608
sudo sysctl -w net.ipv4.udp_mem="8388608 8388608 16777216"
sudo sysctl -w net.ipv4.udp_rmem_min=1638400
sudo sysctl -w net.ipv4.udp_wmem_min=1638400
  • For the tendermint native stack benchmarks, we used BBR + mptcp on all the servers:
# Load the BBR module
echo "Loading BBR module..."
modprobe tcp_bbr
# Verify if the BBR module is loaded
if lsmod | grep -q "tcp_bbr"; then
  echo "BBR module loaded successfully."
else
  echo "Failed to load BBR module."
  exit 1
fi
# Set fq as the default qdisc and BBR as the congestion control algorithm
echo "Updating sysctl settings..."
sysctl -w net.core.default_qdisc=fq
sysctl -w net.ipv4.tcp_congestion_control=bbr

# Enable MPTCP
sysctl -w net.mptcp.enabled=1

# Set the path manager to ndiffports
sysctl -w net.mptcp.mptcp_path_manager=ndiffports

# Specify the number of subflows
SUBFLOWS=16
sysctl -w net.mptcp.mptcp_ndiffports=$SUBFLOWS
# Make the changes persistent across reboots
echo "Making changes persistent..."
echo "net.core.default_qdisc=fq" >> /etc/sysctl.conf
echo "net.ipv4.tcp_congestion_control=bbr" >> /etc/sysctl.conf

Also, we increased the send/receive rates to 100 Gbps.
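
The exact knob isn't shown above, but in tendermint/celestia-core the per-connection rate limits are the p2p SendRate/RecvRate settings (send_rate/recv_rate in config.toml, in bytes per second). A hedged sketch of raising them to roughly 100 Gbps programmatically; the import path is upstream tendermint's, of which celestia-core is a fork:

// Hedged sketch: raising the MConnection rate limits to ~100 Gbps.
package main

import (
    "fmt"

    "github.com/tendermint/tendermint/config"
)

func main() {
    cfg := config.DefaultP2PConfig()

    // SendRate/RecvRate are expressed in bytes per second: 100 Gbps ≈ 12.5 GB/s.
    cfg.SendRate = 100_000_000_000 / 8
    cfg.RecvRate = 100_000_000_000 / 8

    fmt.Printf("send_rate=%d recv_rate=%d bytes/s\n", cfg.SendRate, cfg.RecvRate)
}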

Note: the words validator and peer are used interchangeably here; wherever a validator is mentioned, it refers to a p2p peer rather than a tendermint validator. All experiments used the mock reactor or the quic-bench tool; no actual consensus network was benchmarked in these results.

Findings

After running the different networks for a few minutes, we collected the logs and generated the following plots.

Native tendermint stack + BBR + mptcp

[Figure: peer upload vs download progression]
[Figure: region upload vs download average speed]

Native tendermint stack + CUBIC

[Figure: peer upload vs download progression]
[Figure: region upload vs download average speed]

Native tendermint stack + RENO

[Figure: peer upload vs download progression]
[Figure: region upload vs download average speed]

Tendermint stack QUIC refactor + large UDP buffers

[Figure: peer upload vs download progression]
[Figure: region upload vs download average speed]

QUIC benchmark tool + large UDP buffers + 4 streams per peer

[Figure: peer upload vs download progression]
[Figure: region upload vs download average speed]

QUIC benchmark tool + large UDP buffers + 32 streams per peer

[Figure: peer upload vs download progression]
[Figure: region upload vs download average speed]

QUIC benchmark tool + large UDP buffers + 1 stream per peer

[Figure: peer upload vs download progression]
[Figure: region upload vs download average speed]

Insights

We can see in the plots that the QUIC benchmark tool yields better results than the tendermint QUIC refactor, so we will use it, rather than the refactor, as the reference for QUIC performance. The reason is that the QUIC refactor was still a WIP and its performance can still be improved, whereas the QUIC benchmarking tool uses go-quic out of the box without any changes. That makes it a good reference, since it represents the best performance we can expect.

If we compare the QUIC benchmarking tool with the native tendermint stack + BBR, we see that the native stack + BBR performs better and utilizes the bandwidth more fully, both in the worst cases, where servers are far apart (sometimes on different continents), and in the best cases, where the servers are in the same site.

Additional insights

  • QUIC with 1 stream performs way better than QUIC with multiple streams, which is expected.
  • CUBIC and RENO reach higher speeds when exchanging data with closer nodes but underperform when the nodes are far away. One reason CUBIC/RENO behave this way for far nodes is that the closer nodes consume most of the bandwidth.
  • QUIC with 1 stream performs almost the same as TM native with BBR for closer nodes, but performs way worse for far nodes.
  • QUIC with 1 stream performs slightly better than or the same as CUBIC or RENO in the worst cases (far-away nodes), but performs worse than them in the best cases (closer nodes).

Conclusions

Given the above numbers, it seems more reasonable to keep the existing tendermint stack and pair it with BBR to get the best possible performance. Then we will need to focus on improving the message-passing mechanisms so that we can fully utilize the bandwidth. This will be tracked in #1531.

We can consider switching to QUIC at some point if one of the following proves promising:

  • go-quic implements BBR and its benchmark results beat TCP + BBR.
  • Head-of-line blocking starts showing up in the native tendermint stack and we cannot find a way to solve it.

@evan-forbes

I feel comfortable closing this issue for now

rach-id commented Dec 6, 2024

More insights: comparing different data sizes and stream counts in QUIC vs TCP + BBR + mptcp

The setups are the same as above.

QUIC

500 bytes of data in 4 streams

[Figure: 500 bytes in 4 streams]

5 MB of data in 4 streams

[Figure: 5 MB in 4 streams]

Note: when running the experiment, CPU usage increases with the number of streams we open.

Resource usage with 1 stream

[Figure: resource usage with 1 stream]

Resource usage with 256 streams

[Figure: resource usage with 256 streams]

Opening a new stream every minute

[Figure: opening a new stream every minute]

We see that after opening ~20-40 streams, QUIC keeps performing the same way; with fewer than 20 streams, the performance is best. This could also be because the amount of data being sent increases with every stream.
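
For context, the "opening a new stream every minute" run amounts to something like the sketch below. This is hypothetical code rather than the benchmark itself; conn and payload would come from a setup like the earlier flooder sketch, and the quic-go type names follow recent releases:

// Hypothetical reproduction of the "new stream every minute" experiment:
// each tick opens another stream on the same connection and floods it,
// so the offered load grows with the stream count.
package bench

import (
    "context"
    "log"
    "time"

    quic "github.com/quic-go/quic-go"
)

func addStreamEveryMinute(ctx context.Context, conn quic.Connection, payload []byte) {
    ticker := time.NewTicker(time.Minute)
    defer ticker.Stop()

    open := 0
    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            s, err := conn.OpenStreamSync(ctx)
            if err != nil {
                return // connection closed or context cancelled
            }
            open++
            log.Printf("streams open: %d", open)
            go func() {
                for {
                    if _, err := s.Write(payload); err != nil {
                        return
                    }
                }
            }()
        }
    }
}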

TCP + BBR + mptcp

500 bytes

[Figure: 500 bytes over TCP + BBR + mptcp]

Apparently, the smaller the data being sent, the better the performance we get from comet's p2p stack.

5 MB

There is apparently a memory leak somewhere that makes the nodes OOM after a short while.

We will try to fix it in #1548
