
Performance Results


Here we present the performance results we obtained from running the microbenchmarks-async benchmark. For this test, we used 150 clients performing 60 concurrent requests.

The system was tested using 5 machines connected locally over a 1 GbE LAN, each with the following specs:

  • 2x AMD EPYC 7272 (12 cores, 2.9 GHz) with SMT enabled
  • 256 GB DDR4 RAM
  • 256 GB NVMe SSD
  • Ubuntu 20.04

We assigned 4 machines to run as replicas and used the remaining machine as the client machine, hosting all of the clients.

Results

Performance:

Operations per second:

ops_per_second

The 95th percentile average of operations per second (used to exclude the initial measurement and other issues unrelated to the protocol) is 121,914 operations per second, while the plain average is 111,000.
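As a side note on how such a trimmed figure can be computed, below is a minimal, hypothetical sketch (in Rust, the language FeBFT is written in) of averaging per-second throughput samples after discarding the lowest 5%; the exact method used for the numbers above is not shown on this page, so the details here are an assumption.

```rust
// Hypothetical helper: average per-second throughput samples after dropping
// the lowest 5% (e.g. start-up seconds), one plausible way to compute a
// "95th percentile average". Illustrative only.
fn trimmed_average(mut samples: Vec<u64>) -> f64 {
    samples.sort_unstable();
    // Discard the lowest 5% of samples, keep the rest.
    let cut = (samples.len() as f64 * 0.05).ceil() as usize;
    let kept = &samples[cut..];
    kept.iter().sum::<u64>() as f64 / kept.len() as f64
}

fn main() {
    // Toy data: one slow warm-up second followed by steady throughput.
    let samples = vec![40_000, 118_000, 120_000, 121_000, 122_500, 123_000];
    println!("trimmed average: {:.0} ops/s", trimmed_average(samples));
}
```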

Batch Size

batch_size

CPU Usage:

CPU Usage in replicas:

cpu_usage_0

As is visible in the graph, the CPU usage is quite low, except for one thread that is constantly at 100%. This is due to the proposer and the way it works: it never blocks waiting for requests in the client pool, because we want to propose the batch it is currently working on as soon as consensus is ready for it. If it blocked waiting for client messages, a lull in client traffic could leave an already assembled batch unproposed, with the proposer stuck waiting for more requests. We also did not want to introduce sleeps, since we want to keep the latency of batch proposing as short as possible.
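To make that control flow concrete, here is a minimal sketch of such a proposer loop. This is not FeBFT's actual code: `RequestPool`, `Consensus`, the batch size and the method names are assumptions made for illustration; only the shape of the loop (poll the client pool without blocking, propose as soon as consensus is ready) reflects the description above.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};

// Hypothetical stand-ins for FeBFT's internals.
struct RequestPool {
    pending: Mutex<VecDeque<Vec<u8>>>,
}

impl RequestPool {
    /// Non-blocking: drain whatever requests are currently queued, possibly none.
    fn try_collect(&self, max: usize) -> Vec<Vec<u8>> {
        let mut q = self.pending.lock().unwrap();
        let n = q.len().min(max);
        q.drain(..n).collect()
    }
}

struct Consensus;

impl Consensus {
    /// Placeholder: in the real system this would check whether the next
    /// consensus instance can accept a new proposal.
    fn ready_for_proposal(&self) -> bool {
        true
    }

    fn propose(&self, batch: Vec<Vec<u8>>) {
        println!("proposing batch of {} requests", batch.len());
    }
}

fn proposer_loop(pool: Arc<RequestPool>, consensus: Arc<Consensus>) {
    let mut batch: Vec<Vec<u8>> = Vec::new();
    loop {
        // Keep accumulating requests without ever blocking on the clients...
        batch.extend(pool.try_collect(1024 - batch.len()));

        // ...and hand over whatever we have the moment consensus is ready.
        // This busy loop is what pins one core at ~100% in the graph above.
        if consensus.ready_for_proposal() && !batch.is_empty() {
            consensus.propose(std::mem::take(&mut batch));
        }
    }
}

fn main() {
    let pool = Arc::new(RequestPool { pending: Mutex::new(VecDeque::new()) });
    pool.pending.lock().unwrap().push_back(b"req-1".to_vec());
    pool.pending.lock().unwrap().push_back(b"req-2".to_vec());

    let consensus = Arc::new(Consensus);
    let p = Arc::clone(&pool);
    std::thread::spawn(move || proposer_loop(p, consensus));

    // Give the proposer a moment to run before the example exits.
    std::thread::sleep(std::time::Duration::from_millis(100));
}
```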

This low CPU usage indicates that we can still scale the number of operations handled by FeBFT significantly. We explain why we did not push further in the following sections.

CPU Usage in clients:

cpu_usage_1000

As we can see, much like the replica CPU usage, the CPU usage here is not very high, which could be taken as an indication that we could increase the client count. That is not really the case: the measurement is misleading. In reality, if we tried to increase the number of operations the client machine produces, we would find that the CPU is already being put through its paces. This is due to the very large number of threads the client machine has to handle (around 10 per client, thanks to our network multiplexing choice), which means the CPU usage looks low because the CPU spends most of its time handling context switches instead of performing useful work. We chose this networking approach because there were no options in Rust for connection multiplexing (like we have with Netty for Java), so the best choice for performance was having one thread per socket direction (incoming and outgoing), leading to this excess of threads, as sketched below.
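The sketch below illustrates that thread-per-socket layout using only the standard library. It is a hypothetical illustration, not the actual FeBFT networking code; the function name and the line-based message format are made up, but it shows where the two threads per connection come from.

```rust
use std::io::{BufRead, BufReader, Write};
use std::net::TcpStream;
use std::sync::mpsc;
use std::thread;

/// Hypothetical illustration of a thread-per-socket layout: each connection
/// gets a dedicated reader thread and a dedicated writer thread. With ~150
/// clients and several connections each, the thread count on the client
/// machine quickly reaches the thousands.
fn spawn_connection_threads(stream: TcpStream) -> mpsc::Sender<String> {
    let (tx, rx) = mpsc::channel::<String>();

    // Writer thread: serializes outgoing messages onto the socket.
    let mut write_half = stream.try_clone().expect("clone socket");
    thread::spawn(move || {
        for msg in rx {
            if write_half.write_all(msg.as_bytes()).is_err() {
                break;
            }
        }
    });

    // Reader thread: blocks on the socket and hands incoming messages upward.
    thread::spawn(move || {
        let reader = BufReader::new(stream);
        for line in reader.lines() {
            match line {
                Ok(msg) => println!("received: {msg}"),
                Err(_) => break,
            }
        }
    });

    tx
}
```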

RAM Usage:

RAM usage in replica:

ram_usage_0

As we can see, the RAM usage is always increasing (starting at around 11.8 GB and ending at around 12.6 GB). This increase is caused by the logging of messages and by the fact that we never actually perform enough operations to trigger a checkpoint, which would clear the log and make the RAM usage drop. The RAM usage measured is that of the entire system. Because we do not use a garbage-collected programming language, memory handling is smooth and causes no performance problems for the overall system.
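As a rough illustration of the checkpointing behaviour described above, here is a hypothetical sketch of a decision log that grows until a checkpoint interval is reached and is then cleared. The structure, names and threshold are assumptions made for illustration, not FeBFT's actual implementation.

```rust
/// Hypothetical decision log: entries accumulate until a checkpoint interval
/// is hit, at which point the log is cleared, releasing the memory behind the
/// slow RAM growth seen in the graph. The threshold is illustrative only.
struct DecisionLog {
    entries: Vec<Vec<u8>>,
    checkpoint_interval: usize,
    sequence_no: u64,
}

impl DecisionLog {
    fn new(checkpoint_interval: usize) -> Self {
        Self { entries: Vec::new(), checkpoint_interval, sequence_no: 0 }
    }

    fn append(&mut self, operation: Vec<u8>) {
        self.entries.push(operation);
        self.sequence_no += 1;

        // Once enough operations have been decided, take a checkpoint and
        // truncate the log; until that happens, memory usage keeps climbing.
        if self.entries.len() >= self.checkpoint_interval {
            self.checkpoint();
        }
    }

    fn checkpoint(&mut self) {
        // In the real system a state snapshot would be produced here before
        // the log is discarded.
        println!(
            "checkpoint at seq {}, clearing {} log entries",
            self.sequence_no,
            self.entries.len()
        );
        self.entries.clear();
    }
}
```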