Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

optimize node startup speed and memory allocation #6952

Closed

Conversation

lyfsn
Copy link
Contributor

@lyfsn lyfsn commented Apr 15, 2024

PR description

In summary, if a user employs a large genesis file and activates the --genesis-state-hash-cache-enabled parameter, their node can start in under 10 seconds, requiring less than 2GB of memory. This is the essence of what this PR accomplishes.

  1. Regarding the process of serializing the string into a GenesisConfigFile, many parts of the code follow a similar logic pattern: GenesisConfigFile.fromConfig(genesisContent).getConfigOptions(), which means obtaining config options from the GenesisConfigFile. In fact, it does not care about the rest of the content, especially the accounts information, while large genesis files often have accounts occupying a large portion.

    Therefore, this PR introduces a new method, fromConfigWithoutAccounts, for all places where accounts are not needed. During the serialization of the string into a GenesisConfigFile, it ignores the accounts information. This change is should be safe because the only time accounts information is needed is during the calculation of the genesis state hash.

  2. Based on the -genesis-state-hash-cache-enabled parameter added in feat: add --use-cached-genesis-state-hash paramater #6758, this PR further enhances the performance improvements that the parameter can offer.

    Originally, the parameter's purpose was to cache and skip the calculation of the genesis state hash value. Building upon the previous improvements, this PR determines that if this parameter is enabled and there is a cached genesis state hash in the database, then it will directly ignore the accounts content when reading the genesis file.

The table below outlines the comparison of node startup time and total memory reclaimed before and after the optimizations made in this PR. All tests utilized a genesis file of 1.1GB in size.

Time Consuming - Before optimization Memory Allocated - Before optimization Time Consuming - Optimized (this PR) Memory Allocated - Optimized (this PR)
--genesis-state-hash-cache-enabled=false && block number=0 7min37s 184.95GB 7min24s 158.83GB
--genesis-state-hash-cache-enabled=false && block number=1 2min53s 125.47GB 2min36s 96.53GB
--genesis-state-hash-cache-enabled=true && block number=0 7min35s 183.33GB 7min34s 154.68GB
--genesis-state-hash-cache-enabled=true && block number=1 0min50s 66.75GB 0min9s 1.31GB

Without the --genesis-state-hash-cache-enabled parameter enabled, the startup time reduced from 2 minutes and 53 seconds to 2 minutes and 36 seconds, and the total memory reclaimed decreased from 125.47GB to 96.53GB. The optimization effect here is modest.

With the --genesis-state-hash-cache-enabled parameter enabled, the startup time can be reduced from 50 seconds to 9 seconds, and the total memory reclaimed from 66.75GB to 1.31GB, marking a significant optimization effect.

This is the unit test coverage report for the PR: Unit Test Coverage Report.

Fixed Issue(s)

Thanks for sending a pull request! Have you done the following?

  • Checked out our contribution guidelines?
  • Considered documentation and added the doc-change-required label to this PR if updates are required.
  • Considered the changelog and included an update if required.
  • For database changes (e.g. KeyValueSegmentIdentifier) considered compatibility and performed forwards and backwards compatibility tests

Locally, you can run these tests to catch failures early:

  • unit tests: ./gradlew build
  • acceptance tests: ./gradlew acceptanceTest
  • integration tests: ./gradlew integrationTest
  • reference tests: ./gradlew ethereum:referenceTests:referenceTests

Signed-off-by: lyfsn <dev.wangyu@proton.me>
Copy link
Contributor

@fab-10 fab-10 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand Besu startup code is not easy, so added some initial comments to try to improve the code in BesuCommand

}

private String genesisConfigString = "";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A Suppliers.memoize seems better suited for this field

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made a commit to use Suppliers.memoize: 8ed9e4c

}

private String genesisConfigString = "";

private String genesisConfig() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

now this method does not only return the genesis config, so the name is no more descriptive of its behavior

Copy link
Contributor Author

@lyfsn lyfsn Apr 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean the method genesisConfig()?

Comment on lines 2368 to 2388
// If the genesis state hash is present in the database, we can use the genesis file without
pluginCommonConfiguration.init(
dataDir(),
dataDir().resolve(DATABASE_PATH),
getDataStorageConfiguration(),
getMiningParameters());
final KeyValueStorageProvider storageProvider = keyValueStorageProvider(keyValueStorageName);
if (storageProvider != null) {
boolean isGenesisStateHashPresent;
try {
// A null pointer exception may be thrown here if the database is not initialized.
VariablesStorage variablesStorage = storageProvider.createVariablesStorage();
Optional<Hash> genesisStateHash = variablesStorage.getGenesisStateHash();
isGenesisStateHashPresent = genesisStateHash.isPresent();
} catch (Exception ignored) {
isGenesisStateHashPresent = false;
}
if (isGenesisStateHashPresent) {
genesisConfigString = JsonUtil.getJsonFromFileWithout(genesisFile, "alloc");
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please explain why this code is needed here now, so we can find a better organization for it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the original code flow, genesis file (disk) --(1)--> genesis string --(2)--> genesis struct.

Here, the code is positioned at step (1).

If the node has the --genesis-state-hash-cache-enabled parameter enabled, the value of state root hash will be saved in variablesStorage.

The intent of this code is to determine if the value of state root hash already exists in the node's variablesStorage, and if so, during step (1), it will directly ignore the alloc field in the gensis.json file.

Because if the state root hash value already exists, it indicates that the account information in the genesis.json file has already been written into the state database, and there is no need to calculate the state root hash again later.

Comment on lines +522 to +541
public static String getJsonFromFileWithout(
final File genesisFile, final String excludedFieldName) {
StringBuilder jsonBuilder = new StringBuilder();
JsonFactory jsonFactory =
JsonFactory.builder()
.configure(JsonFactory.Feature.INTERN_FIELD_NAMES, false)
.configure(JsonFactory.Feature.CANONICALIZE_FIELD_NAMES, false)
.build();
try (JsonParser parser = jsonFactory.createParser(genesisFile)) {
JsonToken token;
while ((token = parser.nextToken()) != null) {
if (token == JsonToken.START_OBJECT) {
jsonBuilder.append(handleObject(parser, excludedFieldName));
}
}
} catch (Exception e) {
throw new RuntimeException(e);
}
return jsonBuilder.toString();
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you checked if Jackson natively support excluding fields without having to implement the parsing methods?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I have checked it, and this is the fastest method with the least memory usage. I will check it again in a moment.

Signed-off-by: lyfsn <dev.wangyu@proton.me>
@ahamlat
Copy link
Contributor

ahamlat commented Apr 16, 2024

@lyfsn Great work on the PR, I have few questions if you don't mind :

  • Is there a way to reproduce the numbers on my side, by sharing the big (fake) genesis file, and the configuration you used for your test ?
  • Is "requiring less than 2GB of memory" a strong prerequisite for your use case ?
  • Can you share the -Xmx value used during your test?
  • Have you checked GC activity during the test period ?
  • Have you done some profiling to see what is taking more than 7 minutes in the first and third test with this PR's code ?

@lyfsn
Copy link
Contributor Author

lyfsn commented Apr 16, 2024

@lyfsn Great work on the PR, I have few questions if you don't mind :

  • Is there a way to reproduce the numbers on my side, by sharing the big (fake) genesis file, and the configuration you used for your test ?
  • Is "requiring less than 2GB of memory" a strong prerequisite for your use case ?
  • Can you share the -Xmx value used during your test?
  • Have you checked GC activity during the test period ?
  • Have you done some profiling to see what is taking more than 7 minutes in the first and third test with this PR's code ?
  • You can use besu.json from this repository (though it's smaller than the one I used). LInk
  • This repository has 8 jfr files; they are about memory allocation records. LInk
  • Also, in this repository, there are some records about profiler records. Link
  • I run the binary Besu directly, so I do not set -Xmx explicitly. I run this on a 64GB server, so the default should be 16GB.

You can refer to this information first, and if you have more specific needs, please feel free to let me explain!

@fab-10
Copy link
Contributor

fab-10 commented Apr 17, 2024

While reviewing your PR, I saw the need for a refactor of how the genesis file and the genesis options are managed, since with the time some tech debt has accumulated, and with that I had the idea of making the load of the accounts lazy, with the hope of moving that logic in the class that manage the genesis file, will let you know when I have something ready to test.

@ahamlat
Copy link
Contributor

ahamlat commented Apr 18, 2024

@lyfsn thanks for the troubleshooting data, awesome work.
I was able to reproduce some of the number with this PR, for --genesis-state-hash-cache-enabled=true && block number=0, but then I did the test on a previous release, 23.10.2 (without both PRs), and it was faster for the block 0 test :

  • block number=0 : 5 minutes
  • block number=1 : 4 minutes 17 seconds

I'm currently analyzing the regression as we know what code is causing it.
I would like to know what was the test you shared in you blog where Besu took more than 100 minutes to start

@lyfsn
Copy link
Contributor Author

lyfsn commented Apr 18, 2024

@lyfsn thanks for the troubleshooting data, awesome work. I was able to reproduce some of the number with this PR, for --genesis-state-hash-cache-enabled=true && block number=0, but then I did the test on a previous release, 23.10.2 (without both PRs), and it was faster for the block 0 test :

  • block number=0 : 5 minutes
  • block number=1 : 4 minutes 17 seconds

I'm currently analyzing the regression as we know what code is causing it. I would like to know what was the test you shared in you blog where Besu took more than 100 minutes to start

I guess there are two main reasons:

  • At that time, we used our devnet file, which is about 1.1GB, a bit larger than the mainnet file, which is about 800MB. The mainnet file was provided to you for convenience in using publicly available files for testing.
  • The server's hardware was limited at that time, with only 16GB of RAM and a set 16GB swap space. The swap space obviously slowed down the processing speed.

These were the parameters for the test at that time:

Client Startup command Version OS Arch CPU core RAM Swap space
Besu env BESU_OPTS=-Xmx16g ./besu --(....) 24.3.0/openjdk17 ubuntu 22.04 x64 2 16GB 16GB

Additionally, are you referring to this blog article? The article also detailed the testing conditions:
https://devlog.fusionist.io/posts/ethereum-execution-clients-memory-comparison-in-endurance/

@fab-10
Copy link
Contributor

fab-10 commented Apr 18, 2024

are your genesis files public? I would like to test them on the refactor, if not I will try to generate one with random data

@lyfsn
Copy link
Contributor Author

lyfsn commented Apr 18, 2024

are your genesis files public? I would like to test them on the refactor, if not I will try to generate one with random data

I sent an email to your public email address; it contains the file access document.

@fab-10
Copy link
Contributor

fab-10 commented Apr 22, 2024

@lyfsn could you test this PR?
The allocations are lazy loaded only when needed, and there is reduced memory usage when parsing it, with less object copies and allocations streaming.

@fab-10
Copy link
Contributor

fab-10 commented Jun 3, 2024

Closing this as superseded by #6977

@fab-10 fab-10 closed this Jun 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants