[spike] rocksdb investigation #8534

Closed
4 tasks done
rolfyone opened this issue Aug 27, 2024 · 6 comments · Fixed by #8532

rolfyone commented Aug 27, 2024

  • Disk space ballooning
  • Memory management with rocksdb being outside of JVM bounds
  • Updating version to newer release
  • Validate that migration from a current rocksdb instance to the new version goes smoothly (rocksdb version upgraded and LZ4 compression now used by default)
@gfukushima

The last attempt to use RocksDB as the Teku database ran into high disk usage issues.
RocksDB is currently on 7.7.3; the latest release at the time of writing is 9.5.2.

I've made an attempt to run Teku with RocksDB on Holesky and have already experienced high disk usage, which may be related to some outdated default or may require further tweaks to the DB config.
The plan for this ticket is to upgrade the rocksdb library and make it a sustainable choice for Teku users, especially since LevelDB no longer has regular releases (the latest was Feb 24, 2021).

Further tweaks might be required to get the best performance out of RocksDB. For now I'm borrowing some of the battle-tested configs used in the Besu project. I'll add more details once more conclusive numbers come out of the tests.
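For context, a minimal sketch of the kind of tuning being borrowed (not Teku's or Besu's actual config; sizes and the DB path are placeholders), using the org.rocksdb Java API: a bounded LRU block cache, bloom filters, and LZ4 compression.

```java
import org.rocksdb.BlockBasedTableConfig;
import org.rocksdb.BloomFilter;
import org.rocksdb.CompressionType;
import org.rocksdb.LRUCache;
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

public class RocksDbTuningSketch {
  public static void main(final String[] args) throws RocksDBException {
    RocksDB.loadLibrary();

    // Block-based table settings: a bounded block cache, bloom filters for point
    // lookups, and index/filter blocks charged against the cache budget.
    final BlockBasedTableConfig tableConfig = new BlockBasedTableConfig()
        .setBlockCache(new LRUCache(128 * 1024 * 1024L)) // placeholder size
        .setFilterPolicy(new BloomFilter(10))
        .setCacheIndexAndFilterBlocks(true);

    try (final Options options = new Options()
            .setCreateIfMissing(true)
            .setCompressionType(CompressionType.LZ4_COMPRESSION) // LZ4, per the issue description
            .setTableFormatConfig(tableConfig);
         final RocksDB db = RocksDB.open(options, "/tmp/rocksdb-tuning-sketch")) { // placeholder path
      db.put("key".getBytes(), "value".getBytes());
    }
  }
}
```

Whether index/filter blocks live in the block cache is exactly the kind of disk-vs-memory trade-off to be tested here.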


gfukushima commented Sep 2, 2024

Flushing some of the latest attempts and findings:

1- The simple upgrade of RocksDB to 9.5.2 seems fairly stable with the current configs; the only con is the database size, which is roughly double what we currently have with LevelDB.
Pros: updated library / releases with optimisations / possibility of further tweaks to get the DB into a better state (find the right balance between disk and memory?)

2- Tested a few tweaks up until commit 90ef567.
Results were mixed: we do achieve a smaller database size, but at the cost of memory.
[screenshot: 2024-09-02 4:36 PM]

The Teku process gets OOM-killed every few minutes:

[Mon Sep  2 00:28:56 2024] Out of memory: Killed process 374043 (java) total-vm:11921472kB, anon-rss:8925172kB, file-rss:3124kB, shmem-rss:0kB, UID:1027 pgtables:18272kB oom_score_adj:0
[Mon Sep  2 00:41:38 2024] Out of memory: Killed process 375204 (java) total-vm:11728264kB, anon-rss:8927108kB, file-rss:3120kB, shmem-rss:0kB, UID:1027 pgtables:18192kB oom_score_adj:0
[Mon Sep  2 00:54:36 2024] Out of memory: Killed process 376074 (java) total-vm:12270680kB, anon-rss:8919796kB, file-rss:3296kB, shmem-rss:0kB, UID:1027 pgtables:18484kB oom_score_adj:0
[Mon Sep  2 01:07:16 2024] Out of memory: Killed process 376945 (java) total-vm:11969932kB, anon-rss:8911880kB, file-rss:2924kB, shmem-rss:0kB, UID:1027 pgtables:18412kB oom_score_adj:0
[Mon Sep  2 01:20:08 2024] Out of memory: Killed process 377721 (java) total-vm:11885324kB, anon-rss:8890804kB, file-rss:3020kB, shmem-rss:0kB, UID:1027 pgtables:18104kB oom_score_adj:0
[Mon Sep  2 01:33:03 2024] Out of memory: Killed process 379135 (java) total-vm:12189184kB, anon-rss:8877488kB, file-rss:3140kB, shmem-rss:0kB, UID:1027 pgtables:18096kB oom_score_adj:0
[Mon Sep  2 01:45:43 2024] Out of memory: Killed process 379894 (java) total-vm:11977648kB, anon-rss:8927728kB, file-rss:3096kB, shmem-rss:0kB, UID:1027 pgtables:18188kB oom_score_adj:0
[Mon Sep  2 01:58:32 2024] Out of memory: Killed process 380645 (java) total-vm:11733648kB, anon-rss:8912288kB, file-rss:3256kB, shmem-rss:0kB, UID:1027 pgtables:18140kB oom_score_adj:0
[Mon Sep  2 02:11:17 2024] Out of memory: Killed process 381392 (java) total-vm:11666340kB, anon-rss:8887764kB, file-rss:3176kB, shmem-rss:0kB, UID:1027 pgtables:18220kB oom_score_adj:0
[Mon Sep  2 02:23:17 2024] Out of memory: Killed process 382821 (java) total-vm:11697672kB, anon-rss:8836272kB, file-rss:3148kB, shmem-rss:0kB, UID:1027 pgtables:18316kB oom_score_adj:0
[Mon Sep  2 02:37:19 2024] Out of memory: Killed process 383606 (java) total-vm:11772800kB, anon-rss:8885020kB, file-rss:2944kB, shmem-rss:0kB, UID:1027 pgtables:18088kB oom_score_adj:0
[Mon Sep  2 03:02:48 2024] Out of memory: Killed process 384441 (java) total-vm:11637084kB, anon-rss:8815580kB, file-rss:3104kB, shmem-rss:0kB, UID:1027 pgtables:18104kB oom_score_adj:0
[Mon Sep  2 03:15:25 2024] Out of memory: Killed process 385581 (java) total-vm:12010148kB, anon-rss:8890032kB, file-rss:2864kB, shmem-rss:0kB, UID:1027 pgtables:18584kB oom_score_adj:0
[Mon Sep  2 03:28:13 2024] Out of memory: Killed process 386956 (java) total-vm:12080796kB, anon-rss:8885268kB, file-rss:3172kB, shmem-rss:0kB, UID:1027 pgtables:18120kB oom_score_adj:0
[Mon Sep  2 03:38:43 2024] Out of memory: Killed process 387780 (java) total-vm:11630104kB, anon-rss:8824312kB, file-rss:3008kB, shmem-rss:0kB, UID:1027 pgtables:18116kB oom_score_adj:0
[Mon Sep  2 03:47:20 2024] Out of memory: Killed process 388596 (java) total-vm:11885608kB, anon-rss:8865624kB, file-rss:3028kB, shmem-rss:0kB, UID:1027 pgtables:18064kB oom_score_adj:0
[Mon Sep  2 04:00:43 2024] Out of memory: Killed process 389321 (java) total-vm:11717372kB, anon-rss:8793064kB, file-rss:3068kB, shmem-rss:0kB, UID:1027 pgtables:18296kB oom_score_adj:0
[Mon Sep  2 04:16:05 2024] Out of memory: Killed process 390159 (java) total-vm:11805660kB, anon-rss:8825008kB, file-rss:3172kB, shmem-rss:0kB, UID:1027 pgtables:17980kB oom_score_adj:0
[Mon Sep  2 04:30:16 2024] Out of memory: Killed process 391655 (java) total-vm:11688440kB, anon-rss:8842716kB, file-rss:3036kB, shmem-rss:0kB, UID:1027 pgtables:18244kB oom_score_adj:0
[Mon Sep  2 04:44:54 2024] Out of memory: Killed process 392612 (java) total-vm:11845524kB, anon-rss:8846968kB, file-rss:3072kB, shmem-rss:0kB, UID:1027 pgtables:18020kB oom_score_adj:0
[Mon Sep  2 04:51:19 2024] Out of memory: Killed process 393481 (java) total-vm:11832408kB, anon-rss:8837012kB, file-rss:3228kB, shmem-rss:0kB, UID:1027 pgtables:18012kB oom_score_adj:0
[Mon Sep  2 05:10:41 2024] Out of memory: Killed process 394066 (java) total-vm:11645784kB, anon-rss:8714256kB, file-rss:3216kB, shmem-rss:0kB, UID:1027 pgtables:17748kB oom_score_adj:0
[Mon Sep  2 05:23:14 2024] Out of memory: Killed process 395649 (java) total-vm:11775476kB, anon-rss:8828072kB, file-rss:3180kB, shmem-rss:0kB, UID:1027 pgtables:18144kB oom_score_adj:0
[Mon Sep  2 05:36:06 2024] Out of memory: Killed process 396439 (java) total-vm:11994008kB, anon-rss:8818412kB, file-rss:2996kB, shmem-rss:0kB, UID:1027 pgtables:17992kB oom_score_adj:0
[Mon Sep  2 05:48:59 2024] Out of memory: Killed process 397245 (java) total-vm:12107152kB, anon-rss:8967616kB, file-rss:3032kB, shmem-rss:0kB, UID:1027 pgtables:18264kB oom_score_adj:0
[Mon Sep  2 06:01:49 2024] Out of memory: Killed process 398015 (java) total-vm:11969852kB, anon-rss:8960672kB, file-rss:3156kB, shmem-rss:0kB, UID:1027 pgtables:18244kB oom_score_adj:0
[Mon Sep  2 06:27:18 2024] Out of memory: Killed process 398779 (java) total-vm:11761520kB, anon-rss:8919484kB, file-rss:3112kB, shmem-rss:0kB, UID:1027 pgtables:18408kB oom_score_adj:0
[Mon Sep  2 06:40:06 2024] Out of memory: Killed process 400614 (java) total-vm:12025004kB, anon-rss:8905456kB, file-rss:2936kB, shmem-rss:0kB, UID:1027 pgtables:18388kB oom_score_adj:0

I've spun up instances with 32GB to test the previous version and noticed that memory consumption for the teku-node process keeps growing (currently sitting at 13GB), which indicates that off-heap memory is increasing well above what's expected, since we set -Xmx5g for those instances.
[screenshot: 2024-09-02 5:26 PM]
Update: memory has stabilized around 16GB.
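One way to attribute that off-heap growth (a sketch using the RocksDB Java API's built-in properties, not necessarily what was run here) is to ask the database itself how much memory its block cache, memtables, and table readers are holding:

```java
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

public class RocksDbMemoryReport {
  // Prints RocksDB's own accounting of its main off-heap memory consumers.
  public static void report(final RocksDB db) throws RocksDBException {
    System.out.println("block cache usage:    " + db.getProperty("rocksdb.block-cache-usage"));
    System.out.println("block cache pinned:   " + db.getProperty("rocksdb.block-cache-pinned-usage"));
    System.out.println("memtables (all):      " + db.getProperty("rocksdb.cur-size-all-mem-tables"));
    System.out.println("table readers memory: " + db.getProperty("rocksdb.estimate-table-readers-mem"));
  }
}
```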

Currently testing a version with compression disabled: fa9270e

And reducing the cache to its original size: db412bd
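For illustration, a sketch (with placeholder sizes, not the values from the commits above) of how the main off-heap consumers can be given a single fixed budget via the Java API, so the block cache and memtables can't grow far past the JVM's -Xmx:

```java
import org.rocksdb.LRUCache;
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.WriteBufferManager;

public class RocksDbMemoryBudgetSketch {
  static Options boundedOptions() {
    RocksDB.loadLibrary();
    // One shared cache; the WriteBufferManager charges memtable memory to it,
    // so block cache + memtables share a single ~256 MiB budget (placeholder).
    final LRUCache sharedCache = new LRUCache(256 * 1024 * 1024L);
    return new Options()
        .setCreateIfMissing(true)
        .setWriteBufferManager(new WriteBufferManager(256 * 1024 * 1024L, sharedCache))
        .setWriteBufferSize(64 * 1024 * 1024L) // per-memtable size (placeholder)
        .setMaxWriteBufferNumber(3);           // cap on in-memory memtables (placeholder)
  }
}
```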

@gfukushima

Reducing the cache to the previous default has brought nodes to a stable state: memory is slightly higher than on nodes using LevelDB (roughly 1GB higher with RocksDB).
Disk usage is comparable (we're currently using LZ4 compression; LevelDB uses Snappy by default).
I think this is a sustainable config we can start stressing and tweaking as necessary.

Metrics have also shown some improvements in GC time.

[screenshot: 2024-09-04 12:02 PM]

@gfukushima

The upgrade scenario has been tested and doesn't introduce any issues/regressions.


gfukushima commented Sep 12, 2024

The experiment with jemalloc has shown good results, using roughly 500MB less memory than the same config with LevelDB.
Memory allocation also seems more stable (you can see fewer and lower spikes over time). For context, this is a Besu+Teku pair.

[screenshot: 2024-09-12 7:15 PM]

@gfukushima

The node running without jemalloc didn't show the same stability and uses a little more memory than when it was running with LevelDB. The change was deployed on the 9th. This is a Geth+Teku pair.
[screenshot: 2024-09-17 6:00 PM]
