Reth killed by OOM killer w/ 64GB RAM #5408

Closed
gbrew opened this issue Nov 12, 2023 · 16 comments

Labels
C-bug An unexpected or incorrect behavior
S-needs-investigation This issue requires detective work to figure out what's going wrong

Comments

gbrew (Contributor) commented Nov 12, 2023

Describe the bug

Reth syncs and tracks the chain head with no issues for days, but every week or two it gets killed by the Linux OOM killer. This happens during normal operation, not during sync, so I've opened a new issue since the other OOM issues look to be sync-related. The machine is running Ubuntu 22.04 and has 64GB RAM. There are several other processes using a decent amount of memory, but there should be plenty left over for reth. You can see the detailed breakdown of per-process memory usage in the attached dmesg.txt.

The machine is running reth, Lighthouse, an Arbitrum node which uses reth, and an RPC client program which primarily calls eth_call and eth_subscribe on reth. It's not a particularly heavy RPC load, but there may be some bursts of activity. I don't see signs of anything crazy happening right before the OOM in any of the client logs.

I've had this happen twice now. You can see the kernel logs for both OOM events in the attached logs. In the most recent one, reth is using 39612664kB of anon-rss at the time it is killed. I've also pasted a screenshot from Grafana of the memory stats before the latest OOM. Eyeballing the rest of the Grafana stats, I don't see anything concerning in that time period. There are 10 in-flight requests and 14 read-only transactions open right before the crash, but this is not unusual compared to the days prior, where no issues were observed. I'd be happy to send more data from Grafana if you have a convenient way for me to export it. The reth log before the crash is attached as well, but nothing in it caught my eye.

From the Grafana jemalloc stats, there are two quick spikes in memory usage about 20 minutes before the crash which look like they might be similar in nature to the crash, but didn't quite trigger the OOM killer. At the crash, the RSS goes from 20GB to 33GB in one tick of the graph, and appears to hit about 40GB when reth is killed. At the same time, the jemalloc stats show "active", "allocated", and "mapped" all jumping from <30GB to >400GB.

reth.log
dmesg.txt
reth-oom-dashboard

Let me know if there's anything else I can do to help track this down. Do you have any guesses as to what would be using ~40GB of anon-rss?

Steps to reproduce

  1. Sync reth
  2. Run some load against it
  3. Wait
  4. Boom!

Node logs

attached above

Platform(s)

Linux (x86)

What version/commit are you on?

reth Version: 0.1.0-alpha.10
Commit SHA: a9fa281
Build Timestamp: 2023-10-28T03:44:10.194397854Z
Build Features: default,jemalloc
Build Profile: release

What database version are you on?

1

What type of node are you running?

Archive (default)

What prune config do you use, if any?

None

If you've built Reth from source, provide the full command you used

cargo install --locked --path bin/reth --bin reth

Code of Conduct

  • I agree to follow the Code of Conduct
gbrew added the C-bug (An unexpected or incorrect behavior) and S-needs-triage (This issue needs to be labelled) labels on Nov 12, 2023
shekhirin added the S-needs-investigation (This issue requires detective work to figure out what's going wrong) label and removed the S-needs-triage label on Nov 14, 2023
shekhirin (Collaborator) commented Nov 14, 2023

Can you please show the "Transaction Pool" section of dashboard for the same time range?

gbrew (Contributor, Author) commented Nov 14, 2023

> Can you please show the "Transaction Pool" section of dashboard for the same time range?

Of course, here it is:
Screenshot 2023-11-14 at 7 52 50 AM
Screenshot 2023-11-14 at 7 53 24 AM

gbrew (Contributor, Author) commented Nov 14, 2023

Also, I took a quick try at running the jemalloc heap profiler in reth, but I didn't manage to get any output. I just changed the default feature in bin/reth/Cargo.toml to be:

default = ["jemalloc-prof"]

And then tried running various commands like:

MALLOC_CONF="prof:true,lg_prof_interval:25" reth node

Has anybody gotten heap profiles using this? I'm sure I'm missing something simple, but would appreciate a recipe for getting this working if anybody has one.

Rjected (Member) commented Nov 14, 2023

> Has anybody gotten heap profiles using this? I'm sure I'm missing something simple, but would appreciate a recipe for getting this working if anybody has one.

I should be able to finish this #3819 in the next few days, but in the meantime you'll need to use _RJEM_MALLOC_CONF="prof:true,lg_prof_interval:25" instead
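
For anyone trying this, here is a minimal sketch (assuming a tikv-jemallocator dependency with its "profiling" cargo feature; not necessarily reth's exact setup) of how a Rust binary is wired up so that the _RJEM_MALLOC_CONF settings above apply to every allocation in the process:

    // Sketch only: install jemalloc as the global allocator so that heap profiling
    // driven by _RJEM_MALLOC_CONF (e.g. prof:true,lg_prof_interval:25) sees every
    // allocation. Assumes tikv-jemallocator with the "profiling" feature enabled;
    // reth's own wiring may differ.
    use tikv_jemallocator::Jemalloc;

    #[global_allocator]
    static GLOBAL: Jemalloc = Jemalloc;

    fn main() {
        // Allocations below are served by jemalloc and show up in the sampled
        // heap profiles that jemalloc periodically dumps to disk.
        let buf: Vec<u8> = vec![0; 16 * 1024 * 1024];
        println!("allocated {} bytes via jemalloc", buf.len());
    }

The resulting .heap dumps can then be rendered with the jeprof tool that ships with jemalloc.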

gbrew (Contributor, Author) commented Nov 20, 2023

> I should be able to finish this #3819 in the next few days, but in the meantime you'll need to use _RJEM_MALLOC_CONF="prof:true,lg_prof_interval:25" instead

Thanks @Rjected, that was super helpful. I've now been running reth with heap profiling & symbols for about a day using _RJEM_MALLOC_CONF=prof:true,prof_gdump:true,lg_prof_sample:19. I'm hoping the high-water-mark profiling will be helpful when I hit the OOM condition, but we'll see. So far it has generated 160GB of heap profiles, so hopefully that doesn't grow too fast to be useful.

I did look at the output for one of the profiles after an hour or two of running when reth seemed to be in a pretty steady-state memory-wise. I'm attaching the svg just in case anybody wants to have a look. I'll admit I can't make much sense of it with all the runtime stuff going on, but it almost looks like it is saying that revm_interpreter::interpreter::stack::Stack::new has 33GB allocated at the time. Seems crazy, so I'll assume I'm wrong about that, but I'd love to hear if anybody can make some sense of it:

reth_memory_profile

Rjected (Member) commented Nov 20, 2023

> I did look at the output for one of the profiles after an hour or two of running when reth seemed to be in a pretty steady-state memory-wise. I'm attaching the svg just in case anybody wants to have a look. I'll admit I can't make much sense of it with all the runtime stuff going on, but it almost looks like it is saying that revm_interpreter::interpreter::stack::Stack::new has 33GB allocated at the time. Seems crazy, so I'll assume I'm wrong about that, but I'd love to hear if anybody can make some sense of it:

Hmm, are you tracing a lot with record_stack_snapshots? It could be that a stack is being initialized, with STACK_CAPACITY, many times, on each step:
https://github.com/bluealloy/revm/blob/5aff2f1146d3650f59fdd8e148143013b1ecd7c2/crates/interpreter/src/interpreter/stack.rs#L43-L44
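
To make that concern concrete, here is a stand-in sketch (not revm's actual code, though STACK_CAPACITY really is 1024 words of 32 bytes) showing how a fresh stack per traced step reserves 32 KiB each time, regardless of whether anything is ever pushed onto it:

    // Stand-in for the revm pattern linked above: every Stack::new() reserves
    // STACK_CAPACITY * 32 bytes = 32 KiB up front.
    const STACK_CAPACITY: usize = 1024;

    struct Stack {
        data: Vec<[u8; 32]>, // stand-in for Vec<U256>
    }

    impl Stack {
        fn new() -> Self {
            Stack { data: Vec::with_capacity(STACK_CAPACITY) }
        }
    }

    fn main() {
        // One snapshot per traced EVM step adds up quickly for large traces.
        let snapshots: Vec<Stack> = (0..1_000).map(|_| Stack::new()).collect();
        let reserved_kib: usize = snapshots.iter().map(|s| s.data.capacity() * 32 / 1024).sum();
        println!("{} snapshots reserve ~{} KiB", snapshots.len(), reserved_kib);
    }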

gbrew (Contributor, Author) commented Nov 21, 2023

> https://github.com/bluealloy/revm/blob/5aff2f1146d3650f59fdd8e148143013b1ecd7c2/crates/interpreter/src/interpreter/stack.rs#L43-L44

I agree that does look like the allocation site that the profile is pointing to. I make a lot of debug_traceCall RPC calls using the CallTracer. I don't know the reth internals, but that would certainly involve allocating an EVM stack. Given the STACK_CAPACITY, it looks like each stack should be 32KB, and the profile indicates 33GB, so that would indicate reth keeping 1 million stacks alive? Also, because I'm not seeing linear memory growth, it's not just a plain memory leak. The only thing I can come up with which would be consistent with this would be if there was a caching mechanism which was keeping all these EVM stacks alive. Or that the profile data I got is just bogus.
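
For reference, the arithmetic checks out (assuming 1024-word stacks of 32-byte values, as in revm):

    // Back-of-the-envelope check: 1024 words * 32 bytes = 32 KiB per stack, and
    // roughly a million live stacks lands at ~32-33 GB, which matches what the
    // jemalloc profile attributes to Stack::new.
    fn main() {
        let bytes_per_stack: u64 = 1024 * 32;
        let live_stacks: u64 = 1_000_000;
        let total_gib = (bytes_per_stack * live_stacks) as f64 / (1u64 << 30) as f64;
        println!("{} B per stack, ~{:.1} GiB across {} stacks", bytes_per_stack, total_gib, live_stacks);
    }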

Rjected (Member) commented Nov 21, 2023

> I agree that does look like the allocation site that the profile is pointing to. I make a lot of debug_traceCall RPC calls using the CallTracer. I don't know the reth internals, but that would certainly involve allocating an EVM stack. Given the STACK_CAPACITY, it looks like each stack should be 32KB, and the profile indicates 33GB, so that would indicate reth keeping 1 million stacks alive? Also, because I'm not seeing linear memory growth, it's not just a plain memory leak. The only thing I can come up with which would be consistent with this would be if there was a caching mechanism which was keeping all these EVM stacks alive. Or that the profile data I got is just bogus.

do you have any example debug_traceCall RPC that we can use to try to reproduce?

gbrew (Contributor, Author) commented Nov 21, 2023

> do you have any example debug_traceCall RPC that we can use to try to reproduce?

I don't have any RPC logs handy, but I'm making the calls from ethers-rs using the following code, with pending transactions from the mempool. So to reproduce, you could subscribe to pending transactions and use them in something like this:

    // Note: these geth-style tracing types come from ethers-rs (ethers::types),
    // and `client` below is an ethers Middleware (e.g. a Provider) pointed at reth.
    let tracing_options = GethDebugTracingOptions {
        enable_return_data: Some(false),
        enable_memory: Some(false),
        disable_stack: Some(true),
        disable_storage: Some(true),
        tracer: Some(GethDebugTracerType::BuiltInTracer(GethDebugBuiltInTracerType::CallTracer)),
        tracer_config: Some(GethDebugTracerConfig::BuiltInTracer(GethDebugBuiltInTracerConfig::CallTracer(CallConfig{only_top_call: Some(false), with_log: Some(true)}))),
        ..Default::default()
    };
    let tracing_options = GethDebugTracingCallOptions { tracing_options };

    let mut tx_req: TransactionRequest = tx.into();
    tx_req.nonce = None;
    tx_req.gas_price = None;

    let debug_trace = client.debug_trace_call(tx_req, block, tracing_options).await?;

I suspect it doesn't matter what the transaction is, but if there is an issue and it does turn out to be caching related you may need a bunch of unique ones.

gbrew (Contributor, Author) commented Nov 21, 2023

Looks like the jemalloc profile was right on the money. Having looked at the code, TracingInspector::start_step() does allocate and save a full stack, even if stacks are turned off, as the unwrap_or_default will call Stack::new():

self.config.record_stack_snapshots.then(|| interp.stack.clone()).unwrap_or_default();

If we're allocating a full 32KB stack per EVM instruction executed, then I think tracing a single txn could reasonably end up with ~1 million stacks allocated at once, and hit the 32GB of memory that I was seeing in the jemalloc profile. This lines up with my observation of getting an occasional huge memory spike and OOM too, as I imagine those were just from transactions with a very large number of entries in the trace.

I'm testing this right now by changing Stack::new() in revm to allocate a zero-capacity vec. So far memory usage reported by jemalloc looks greatly improved: resident & mapped are more like 800MB than the previous steady state around 8GB. Still a bit early to say, but I'll run for a few days to confirm.

I'll see if I can come up with a nice way to fix this. It'd be easy to fix just the case where stacks are not being used in TracingInspector by using an Option, but I'm hoping there's a way that will fix the case where the stack is actually needed as well.
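
In case it helps the discussion, here is a minimal sketch of the Option-based variant mentioned above (illustrative names only, not the actual TracingInspector/revm types): the snapshot only clones, and therefore only allocates, when stack recording was requested, instead of falling back to a full-capacity default stack.

    // Illustrative sketch, not the actual reth code. `Stack` stands in for the
    // revm type whose Default would otherwise reserve a full 32 KiB.
    #[derive(Clone)]
    struct Stack(Vec<[u8; 32]>);

    struct StepSnapshot {
        // None when stack recording is disabled, so nothing is cloned or reserved.
        stack: Option<Stack>,
    }

    fn snapshot_step(record_stack_snapshots: bool, live_stack: &Stack) -> StepSnapshot {
        StepSnapshot {
            // Compare with the original pattern quoted above:
            //   record_stack_snapshots.then(|| stack.clone()).unwrap_or_default()
            // which builds a full default stack even when recording is off.
            stack: record_stack_snapshots.then(|| live_stack.clone()),
        }
    }

    fn main() {
        let live = Stack(vec![[0u8; 32]; 8]);
        assert!(snapshot_step(false, &live).stack.is_none());
        assert!(snapshot_step(true, &live).stack.is_some());
        println!("no per-step stack allocation when recording is disabled");
    }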

mattsse (Collaborator) commented Nov 21, 2023

thanks for digging!

fixed in #5521 by not using default, hehe

gbrew (Contributor, Author) commented Nov 21, 2023

> thanks for digging!
>
> fixed in #5521 by not using default, hehe

No problem, thanks @mattsse! You guys made the digging pretty easy with all the jemalloc tooling. I just read the PR and it looks to me like it will work, but I'll leave the review to somebody who knows the codebase better. The case where stacks are enabled will still be really bad, but that can be a separate issue. I can open an issue for that if it would be helpful.

Here's the before and after with the relevant jemalloc stats. Pretty dramatic!

Screenshot 2023-11-21 at 11 57 42 AM

mattsse (Collaborator) commented Nov 21, 2023

> I can open an issue for that if it would be helpful.

yeah that would be helpful, pretty sure we can make this a bit smarter if we can double check what we actually need from the stack and maybe add some additional configs

gbrew (Contributor, Author) commented Nov 21, 2023

> yeah that would be helpful, pretty sure we can make this a bit smarter if we can double check what we actually need from the stack and maybe add some additional configs

Cool, just opened #5522 for this.

gbrew (Contributor, Author) commented Nov 21, 2023

Here's the jemalloc profile after my basic fix, just in case anybody's curious:

jemalloc_reth_profile_fixed

gbrew (Contributor, Author) commented Nov 25, 2023

@mattsse I'll close this unless there's some reason to keep it open. I've verified that the changes above have greatly decreased memory consumption when tracing.

gbrew closed this as completed on Nov 27, 2023
github-project-automation bot moved this from Todo to Done in Reth Tracker on Nov 27, 2023