optimize node startup speed and memory allocation #6952

lyfsn · 2024-04-15T12:38:38Z

PR description

In summary, if a user employs a large genesis file and activates the --genesis-state-hash-cache-enabled parameter, their node can start in under 10 seconds, requiring less than 2GB of memory. This is the essence of what this PR accomplishes.

Regarding the process of parsing the string into a GenesisConfigFile, many parts of the code follow the same pattern: GenesisConfigFile.fromConfig(genesisContent).getConfigOptions(), which only obtains the config options from the GenesisConfigFile. These call sites do not care about the rest of the content, especially the accounts information, even though accounts make up most of a large genesis file.

Therefore, this PR introduces a new method, fromConfigWithoutAccounts, for all places where accounts are not needed. While parsing the string into a GenesisConfigFile, it ignores the accounts information. This change should be safe because the only time accounts information is needed is during the calculation of the genesis state hash.
Based on the --genesis-state-hash-cache-enabled parameter added in feat: add --use-cached-genesis-state-hash paramater #6758, this PR further enhances the performance improvements that the parameter can offer.

Originally, the parameter's purpose was to cache and skip the calculation of the genesis state hash value. Building upon the previous improvements, this PR determines that if this parameter is enabled and there is a cached genesis state hash in the database, then it will directly ignore the accounts content when reading the genesis file.

The table below outlines the comparison of node startup time and total memory reclaimed before and after the optimizations made in this PR. All tests utilized a genesis file of 1.1GB in size.

Configuration	Startup time (before)	Memory allocated (before)	Startup time (this PR)	Memory allocated (this PR)
--genesis-state-hash-cache-enabled=false && block number=0	7min37s	184.95GB	7min24s	158.83GB
--genesis-state-hash-cache-enabled=false && block number=1	2min53s	125.47GB	2min36s	96.53GB
--genesis-state-hash-cache-enabled=true && block number=0	7min35s	183.33GB	7min34s	154.68GB
--genesis-state-hash-cache-enabled=true && block number=1	0min50s	66.75GB	0min9s	1.31GB

With --genesis-state-hash-cache-enabled=false and block number=1, the startup time dropped from 2 minutes and 53 seconds to 2 minutes and 36 seconds, and the total memory reclaimed decreased from 125.47GB to 96.53GB. The optimization effect here is modest.

With --genesis-state-hash-cache-enabled=true and block number=1, the startup time dropped from 50 seconds to 9 seconds, and the total memory reclaimed from 66.75GB to 1.31GB, a significant optimization.

This is the unit test coverage report for the PR: Unit Test Coverage Report.

Fixed Issue(s)

Thanks for sending a pull request! Have you done the following?

Checked out our contribution guidelines?
Considered documentation and added the doc-change-required label to this PR if updates are required.
Considered the changelog and included an update if required.
For database changes (e.g. KeyValueSegmentIdentifier) considered compatibility and performed forwards and backwards compatibility tests

Locally, you can run these tests to catch failures early:

unit tests: ./gradlew build
acceptance tests: ./gradlew acceptanceTest
integration tests: ./gradlew integrationTest
reference tests: ./gradlew ethereum:referenceTests:referenceTests

Signed-off-by: lyfsn <dev.wangyu@proton.me>

fab-10

I understand the Besu startup code is not easy, so I added some initial comments to try to improve the code in BesuCommand

fab-10 · 2024-04-16T09:22:04Z

besu/src/main/java/org/hyperledger/besu/cli/BesuCommand.java

  }

+  private String genesisConfigString = "";


A Suppliers.memoize seems better suited for this field

I made a commit to use Suppliers.memoize: 8ed9e4c
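
The idea behind the suggestion can be sketched as follows. This is a minimal, dependency-free stand-in for Guava's Suppliers.memoize, not the code from commit 8ed9e4c; the class name MemoizeSketch and the placeholder loader in main are hypothetical.

```java
import java.util.function.Supplier;

class MemoizeSketch {
  /** Wraps a supplier so the delegate runs at most once; later calls return the cached value. */
  static <T> Supplier<T> memoize(final Supplier<T> delegate) {
    return new Supplier<T>() {
      private T value;
      private boolean computed;

      @Override
      public synchronized T get() {
        if (!computed) {
          value = delegate.get(); // the expensive read happens only on first use
          computed = true;
        }
        return value;
      }
    };
  }

  public static void main(final String[] args) {
    // Hypothetical loader standing in for reading the genesis file from disk.
    final Supplier<String> genesisConfig = memoize(() -> "{\"config\":{}}");
    System.out.println(genesisConfig.get());
    System.out.println(genesisConfig.get()); // served from the cached value
  }
}
```

Compared with a mutable String field initialized to "", a memoized supplier avoids the empty-string sentinel and keeps the load-once logic in one place.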

fab-10 · 2024-04-16T09:25:34Z

besu/src/main/java/org/hyperledger/besu/cli/BesuCommand.java

  }

+  private String genesisConfigString = "";
+
  private String genesisConfig() {


now this method does not only return the genesis config, so the name no longer describes its behavior

Do you mean the method genesisConfig()?

fab-10 · 2024-04-16T09:32:07Z

besu/src/main/java/org/hyperledger/besu/cli/BesuCommand.java

+      // If the genesis state hash is present in the database, we can use the genesis file without
+      pluginCommonConfiguration.init(
+          dataDir(),
+          dataDir().resolve(DATABASE_PATH),
+          getDataStorageConfiguration(),
+          getMiningParameters());
+      final KeyValueStorageProvider storageProvider = keyValueStorageProvider(keyValueStorageName);
+      if (storageProvider != null) {
+        boolean isGenesisStateHashPresent;
+        try {
+          // A null pointer exception may be thrown here if the database is not initialized.
+          VariablesStorage variablesStorage = storageProvider.createVariablesStorage();
+          Optional<Hash> genesisStateHash = variablesStorage.getGenesisStateHash();
+          isGenesisStateHashPresent = genesisStateHash.isPresent();
+        } catch (Exception ignored) {
+          isGenesisStateHashPresent = false;
+        }
+        if (isGenesisStateHashPresent) {
+          genesisConfigString = JsonUtil.getJsonFromFileWithout(genesisFile, "alloc");
+        }
+      }


please explain why this code is needed here now, so we can find a better organization for it

In the original code flow: genesis file (disk) --(1)--> genesis string --(2)--> genesis struct.

Here, the code is positioned at step (1).

If the node has the --genesis-state-hash-cache-enabled parameter enabled, the state root hash is saved in variablesStorage.

The intent of this code is to determine whether the state root hash already exists in the node's variablesStorage and, if so, to skip the alloc field of the genesis.json file during step (1).

If the state root hash already exists, the account information in the genesis.json file has already been written into the state database, so there is no need to calculate the state root hash again later.
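
The decision described above can be sketched in isolation. This is a simplified illustration, not the code in BesuCommand: HashStore is a hypothetical stand-in for VariablesStorage, the hash is modelled as a plain String, and the flag check follows the PR description rather than the diff hunk shown.

```java
import java.util.Optional;

class GenesisLoadSketch {
  /** Hypothetical stand-in for the storage that holds the cached genesis state hash. */
  interface HashStore {
    Optional<String> getGenesisStateHash();
  }

  /** Returns true when the alloc section can be skipped while reading the genesis file. */
  static boolean canSkipAlloc(final boolean cacheEnabled, final HashStore store) {
    if (!cacheEnabled) {
      return false; // without the flag, the state hash is always recomputed from alloc
    }
    try {
      return store.getGenesisStateHash().isPresent();
    } catch (RuntimeException e) {
      // An uninitialized database is treated as "no cached hash yet".
      return false;
    }
  }

  public static void main(final String[] args) {
    final HashStore firstStart = () -> Optional.empty();
    final HashStore restart = () -> Optional.of("0xabc");
    System.out.println(canSkipAlloc(true, firstStart));
    System.out.println(canSkipAlloc(true, restart));
  }
}
```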

fab-10 · 2024-04-16T09:59:48Z

config/src/main/java/org/hyperledger/besu/config/JsonUtil.java

+  public static String getJsonFromFileWithout(
+      final File genesisFile, final String excludedFieldName) {
+    StringBuilder jsonBuilder = new StringBuilder();
+    JsonFactory jsonFactory =
+        JsonFactory.builder()
+            .configure(JsonFactory.Feature.INTERN_FIELD_NAMES, false)
+            .configure(JsonFactory.Feature.CANONICALIZE_FIELD_NAMES, false)
+            .build();
+    try (JsonParser parser = jsonFactory.createParser(genesisFile)) {
+      JsonToken token;
+      while ((token = parser.nextToken()) != null) {
+        if (token == JsonToken.START_OBJECT) {
+          jsonBuilder.append(handleObject(parser, excludedFieldName));
+        }
+      }
+    } catch (Exception e) {
+      throw new RuntimeException(e);
+    }
+    return jsonBuilder.toString();
+  }


Have you checked if Jackson natively supports excluding fields without having to implement the parsing methods?

Yes, I have checked it, and this is the fastest method with the least memory usage. I will check it again in a moment.
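
For readers unfamiliar with the approach, the sketch below shows the general idea of copying a JSON object while dropping one top-level field. It is not the PR's implementation: it avoids Jackson to stay dependency-free, works on an in-memory string rather than streaming from the file, assumes well-formed input, and all names are hypothetical.

```java
class TopLevelFieldFilter {
  /** Returns the index just past the closing quote of the string starting at s[i]. */
  static int endOfString(final String s, final int i) {
    int j = i + 1;
    while (s.charAt(j) != '"') {
      if (s.charAt(j) == '\\') j++; // skip the escaped character
      j++;
    }
    return j + 1;
  }

  /** Returns the index just past the JSON value starting at s[i]. */
  static int endOfValue(final String s, final int i) {
    final char first = s.charAt(i);
    if (first == '"') return endOfString(s, i);
    int j = i;
    if (first == '{' || first == '[') {
      int depth = 0;
      while (true) {
        final char c = s.charAt(j);
        if (c == '"') { j = endOfString(s, j); continue; } // brackets inside strings don't count
        if (c == '{' || c == '[') depth++;
        if (c == '}' || c == ']') { depth--; if (depth == 0) return j + 1; }
        j++;
      }
    }
    // Number, true, false or null: runs until a delimiter or whitespace.
    while (j < s.length() && ",}] \t\r\n".indexOf(s.charAt(j)) < 0) j++;
    return j;
  }

  static int skipWhitespace(final String s, int i) {
    while (i < s.length() && Character.isWhitespace(s.charAt(i))) i++;
    return i;
  }

  /** Copies the top-level object, omitting the member whose key equals excluded. */
  static String without(final String json, final String excluded) {
    final StringBuilder out = new StringBuilder("{");
    int i = skipWhitespace(json, 0) + 1; // step past the opening '{'
    boolean first = true;
    while (true) {
      i = skipWhitespace(json, i);
      if (json.charAt(i) == '}') break;
      final int keyEnd = endOfString(json, i);
      final String key = json.substring(i + 1, keyEnd - 1);
      i = skipWhitespace(json, keyEnd) + 1; // step past ':'
      final int valueStart = skipWhitespace(json, i);
      final int valueEnd = endOfValue(json, valueStart);
      if (!key.equals(excluded)) {
        if (!first) out.append(',');
        out.append('"').append(key).append("\":").append(json, valueStart, valueEnd);
        first = false;
      }
      i = skipWhitespace(json, valueEnd);
      if (json.charAt(i) == ',') i++;
    }
    return out.append('}').toString();
  }

  public static void main(final String[] args) {
    final String genesis = "{\"config\":{\"chainId\":1},\"alloc\":{\"0x01\":{\"balance\":\"1\"}}}";
    System.out.println(without(genesis, "alloc"));
  }
}
```

The excluded value is never copied into the output, which is where the savings come from when alloc dominates the file; a real implementation would read tokens from the file incrementally instead of holding the whole input in memory.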

Signed-off-by: lyfsn <dev.wangyu@proton.me>

ahamlat · 2024-04-16T15:23:06Z

@lyfsn Great work on the PR, I have a few questions if you don't mind:

Is there a way to reproduce the numbers on my side, by sharing the big (fake) genesis file and the configuration you used for your test?
Is "requiring less than 2GB of memory" a strong prerequisite for your use case?
Can you share the -Xmx value used during your test?
Have you checked GC activity during the test period?
Have you done some profiling to see what is taking more than 7 minutes in the first and third test with this PR's code?

lyfsn · 2024-04-16T15:39:35Z

You can use besu.json from this repository (though it's smaller than the one I used). Link
This repository has 8 JFR files with memory allocation records. Link
The same repository also has some profiler records. Link
I run the Besu binary directly, so I do not set -Xmx explicitly. I run this on a 64GB server, so the default should be 16GB.

You can refer to this information first, and if you have more specific needs, please feel free to let me explain!
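
The 16GB figure matches the usual HotSpot default of one quarter of physical memory when -Xmx is not set. A small check like the one below (the class name HeapCheck is hypothetical) prints the maximum heap the running JVM will actually use, which is an easy way to confirm the value on a given machine.

```java
class HeapCheck {
  /** Maximum heap size, in bytes, that this JVM will attempt to use. */
  static long maxHeapBytes() {
    return Runtime.getRuntime().maxMemory();
  }

  public static void main(final String[] args) {
    final double gib = maxHeapBytes() / (1024.0 * 1024.0 * 1024.0);
    System.out.printf("Max heap: %.2f GiB%n", gib);
  }
}
```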

fab-10 · 2024-04-17T15:06:34Z

While reviewing your PR, I saw the need to refactor how the genesis file and the genesis options are managed, since some tech debt has accumulated over time. That gave me the idea of making the loading of the accounts lazy, with the hope of moving that logic into the class that manages the genesis file. I will let you know when I have something ready to test.

ahamlat · 2024-04-18T11:10:20Z

@lyfsn thanks for the troubleshooting data, awesome work.
I was able to reproduce some of the numbers with this PR for --genesis-state-hash-cache-enabled=true && block number=0, but then I ran the test on a previous release, 23.10.2 (without both PRs), and it was faster for the block 0 test:

block number=0 : 5 minutes
block number=1 : 4 minutes 17 seconds

I'm currently analyzing the regression, as we know what code is causing it.
I would like to know which test you shared in your blog where Besu took more than 100 minutes to start.

lyfsn · 2024-04-18T11:43:33Z

I guess there are two main reasons:

At that time, we used our devnet file, which is about 1.1GB, a bit larger than the mainnet file, which is about 800MB. The mainnet file was provided to you for convenience in using publicly available files for testing.
The server's hardware was limited at that time, with only 16GB of RAM and a set 16GB swap space. The swap space obviously slowed down the processing speed.

These were the parameters for the test at that time:

Client	Startup command	Version	OS	Arch	CPU core	RAM	Swap space
Besu	env BESU_OPTS=-Xmx16g ./besu --(....)	24.3.0/openjdk17	ubuntu 22.04	x64	2	16GB	16GB

Additionally, are you referring to this blog article? The article also detailed the testing conditions:
https://devlog.fusionist.io/posts/ethereum-execution-clients-memory-comparison-in-endurance/

fab-10 · 2024-04-18T13:35:20Z

are your genesis files public? I would like to test them on the refactor, if not I will try to generate one with random data

lyfsn · 2024-04-18T13:47:37Z

I sent an email to your public email address; it contains the file access document.

fab-10 · 2024-04-22T19:00:48Z

@lyfsn could you test this PR?
The allocations are lazy loaded only when needed, and parsing uses less memory, with fewer object copies and allocations thanks to streaming.

fab-10 · 2024-06-03T08:18:56Z

Closing this as superseded by #6977

optimize node startup speed and memory allocation

6bcf2c7

Signed-off-by: lyfsn <dev.wangyu@proton.me>

fab-10 reviewed Apr 16, 2024

Merge branch 'main' into optimize-startup-speed-and-memory-alloc

56eb890

fab-10 reviewed Apr 16, 2024

genesisConfigString use Suppliers.memoize

8ed9e4c

Signed-off-by: lyfsn <dev.wangyu@proton.me>

fab-10 mentioned this pull request Jun 3, 2024

Improve genesis state performance at startup #6977

Merged

8 tasks

fab-10 closed this Jun 3, 2024
