
Emit warnings when locks are held for too long #141

Merged (1 commit into ethereum:master on Oct 22, 2021)

Conversation

@lithp (Contributor) commented Oct 15, 2021

Quick note: I'm not sure who to give this to so I gave it to you @njgheorghita, but feel free to pass it off to somebody else if you think that's appropriate!

This PR is not yet finished; there's some cleanup to do (such as removing code duplication, and logging when rw_write!() calls hold onto the lock for too long), but I'm opening a PR now to get feedback on the general approach.


  • Add rw_read and rw_write macros which wrap access to locks

    • They emit warnings if lock acquisition takes over 100ms
    • They emit warnings if locks are held for over 100ms
  • Use rw_read and rw_write macros in most places RwLocks are accessed

Using these macros I was able to immediately identify the cause of a
deadlock during boot.


Example output:

$ RUST_LOG=debug TRIN_INFURA_PROJECT_ID=XXX cargo run -- --networks state --bootnodes $ENR
Launching trin
[more output]
Oct 15 14:25:25.497 DEBUG trin_state::network: Attempting bond with bootnode ENR: NodeId: 0x76f9..6f28, Socket: Some(71.202.127.37:4567)    
Oct 15 14:25:25.598  WARN trin_core::portalnet::events: [trin-core/src/portalnet/events.rs:31] took more than 100 ms to acquire lock    
Oct 15 14:25:25.598  WARN trin_core: [trin-state/src/network.rs:38] lock held for over 100ms, not yet released    
Oct 15 14:25:25.699  WARN trin_core::portalnet::overlay: [trin-core/src/portalnet/overlay.rs:279] waiting more than 100ms to acquire lock, still waiting    

This makes it trivial to figure out where the deadlock is!
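A minimal sketch of the idea, assuming a tokio RwLock and the log crate for output. TimedReadGuard, WARN_AFTER, and the message strings are placeholders rather than this PR's actual code, and the real macros warn while the lock is still held instead of only on release:

use std::time::{Duration, Instant};

pub const WARN_AFTER: Duration = Duration::from_millis(100);

// Wraps the read guard so that dropping it can report how long the lock was held.
pub struct TimedReadGuard<'a, T> {
    pub guard: tokio::sync::RwLockReadGuard<'a, T>,
    pub acquired_at: Instant,
    pub location: &'static str,
}

impl<T> Drop for TimedReadGuard<'_, T> {
    fn drop(&mut self) {
        if self.acquired_at.elapsed() > WARN_AFTER {
            log::warn!("[{}] lock held for over 100ms", self.location);
        }
    }
}

// Times lock acquisition and returns the wrapped guard. A real version would
// also implement Deref/DerefMut on TimedReadGuard to expose the inner guard.
// (Assumes WARN_AFTER and TimedReadGuard live at the crate root.)
#[macro_export]
macro_rules! rw_read {
    ($lock:expr) => {{
        let start = std::time::Instant::now();
        let guard = $lock.read().await;
        if start.elapsed() > $crate::WARN_AFTER {
            log::warn!(
                "[{}:{}] took more than 100 ms to acquire lock",
                file!(),
                line!()
            );
        }
        $crate::TimedReadGuard {
            guard,
            acquired_at: std::time::Instant::now(),
            location: concat!(file!(), ":", line!()),
        }
    }};
}

Because file!() and line!() are expanded at the call site of rw_read!, the warnings point at the code that touched the lock, which is what makes output like the example above useful.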

@carver (Collaborator) left a comment

Neat. I didn't do a deep dive on the macro, but generally LGTM 👍🏻

}

#[macro_export]
macro_rules! rw_read {
Collaborator:

If we're going to leave this lock macro in place everywhere (which I'm cool with), then we should probably only run this tooling overhead on non-release builds. Something like branching on debug_assertions (I found that trick in this post).
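For illustration, the branch might look roughly like this (a sketch only; the commit linked below may do it differently):

#[macro_export]
macro_rules! rw_read {
    ($lock:expr) => {{
        // Debug builds: time the acquisition and warn if it was slow.
        #[cfg(debug_assertions)]
        {
            let start = std::time::Instant::now();
            let guard = $lock.read().await;
            if start.elapsed() > std::time::Duration::from_millis(100) {
                log::warn!(
                    "[{}:{}] took more than 100 ms to acquire lock",
                    file!(),
                    line!()
                );
            }
            guard
        }
        // Release builds: plain lock access, no timing overhead.
        #[cfg(not(debug_assertions))]
        {
            $lock.read().await
        }
    }};
}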

Author (@lithp):

I experimented with using debug_assertions: 0cb2445

I think I'd prefer to drop this commit though. It makes the code a fair amount less readable, doesn't seem like it drops very much overhead, and prevents us from being able to diagnose deadlocks if someone running a release build tries to send us their logs.

What do you think?

}

#[macro_export]
macro_rules! rw_read {
Collaborator:

How would you feel about implementing this as an extension trait, using std::panic::Location::caller and #[track_caller]?

Would be more natural than a macro, with much less code to compile.
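As a rough illustration of that shape on the blocking std::sync::RwLock (trait and method names here are invented for the sketch; the async lock is where this later hits a snag):

use std::panic::Location;
use std::sync::{RwLock, RwLockReadGuard};
use std::time::{Duration, Instant};

pub trait ReadWithWarn<T> {
    #[track_caller]
    fn read_with_warn(&self) -> RwLockReadGuard<'_, T>;
}

impl<T> ReadWithWarn<T> for RwLock<T> {
    fn read_with_warn(&self) -> RwLockReadGuard<'_, T> {
        // #[track_caller] on the trait method makes Location::caller() report
        // the call site of read_with_warn(), not this function body.
        let caller = Location::caller();
        let start = Instant::now();
        let guard = self.read().expect("lock poisoned");
        if start.elapsed() > Duration::from_millis(100) {
            log::warn!(
                "[{}:{}] took more than 100 ms to acquire lock",
                caller.file(),
                caller.line()
            );
        }
        guard
    }
}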

Author (@lithp):

Oh perfect, thank you. Yes, I used a macro exactly because I was ignorant of the existence of track_caller and friends. I'll give using them a try!

Author (@lithp):

I gave it a try but ended up failing. I don't know enough Rust to feel productive at figuring out what's going on, so I think I'll leave this as a macro for now.

When I apply #[track_caller] to the method everything builds cleanly but the caller is not tracked. I'm using cargo 1.55.0, and support for #[track_caller] in extension traits is from 1.46.0, so this should work. My best guess is that the trait is somehow interacting badly with the #[async_trait] macro I needed to use; I expect that when this macro rewrites the method signature it doesn't faithfully copy over the attribute.

Author (@lithp):

Deleted the previous comment, I was misreading the output, still haven't been able to get #[track_caller] to work.

Author (@lithp):

The problem seems to be that #[track_caller] does not work for async functions. rust-lang/rust#78840

Collaborator:

Would something like this work?

Author (@lithp):

Indeed, it does! Pushed some commits which incorporate it, thanks for the help

let now = std::time::Instant::now();

loop {
tokio::select! {
Collaborator:

Since there's already a dependency on futures, might be more understandable to write this using futures::future::select?
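For illustration, a helper built on futures::future::select might look roughly like this (placeholder names; log::warn! stands in for the crate's actual logging):

use std::time::Duration;

use futures::future::{self, Either};
use tokio::sync::{RwLock, RwLockReadGuard};

// Placeholder helper: race lock acquisition against a 100 ms timer using
// futures::future::select; if the timer wins, warn and keep waiting.
async fn read_with_warn<T>(lock: &RwLock<T>) -> RwLockReadGuard<'_, T> {
    let acquire = Box::pin(lock.read());
    let timer = Box::pin(tokio::time::sleep(Duration::from_millis(100)));

    match future::select(acquire, timer).await {
        // The lock arrived within 100 ms.
        Either::Left((guard, _timer)) => guard,
        // The timer fired first: warn, then keep waiting for the lock.
        Either::Right((_, acquire)) => {
            log::warn!("waiting more than 100ms to acquire lock, still waiting");
            acquire.await
        }
    }
}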

Author (@lithp):

I'll give it a try

Member:

> Since there's already a dependency on futures, might be more understandable to write this using futures::future::select?

What are the pros of using future::select here if we already have tokio as a dependency? Looking at this PR, it seems to me that tokio::select may be the better alternative?

Collaborator:

I find the function-based approach easier to read, though I don't really have that strong of an opinion on the subject.

Author (@lithp):

I wasn't able to find much of a difference between the two select approaches, but I refactored this code a bit by unrolling the loop, which should make it a lot more understandable for first-time readers.
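Roughly the shape of the unrolled version, as a sketch with placeholder names (warning once and then simply continuing to wait):

use std::time::Duration;

use tokio::sync::{RwLock, RwLockReadGuard};

// Placeholder helper showing the unrolled version: warn once if acquisition
// takes more than 100 ms, then keep waiting without a loop.
async fn read_with_warn<T>(lock: &RwLock<T>) -> RwLockReadGuard<'_, T> {
    let acquire = lock.read();
    tokio::pin!(acquire);

    tokio::select! {
        // Got the lock within 100 ms: nothing to report.
        guard = &mut acquire => return guard,
        // Timer fired first: warn, then fall through and keep waiting.
        _ = tokio::time::sleep(Duration::from_millis(100)) => {
            log::warn!("waiting more than 100ms to acquire lock, still waiting");
        }
    }

    acquire.await
}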

@njgheorghita (Collaborator) left a comment

lgtm! nothing to add from my end

@lithp force-pushed the lithp/add-lock-debug-logging branch 2 times, most recently from 8ecb3b9 to e57ada4 on October 21, 2021 at 21:19
- Add read_with_warn() and write_with_warn() methods to RwLock
  - They emit warnings if lock acquisition takes over 100ms
  - They emit warnings if locks are held for over 100ms

- Use read_with_warn() and write_with_warn() in most places RwLocks are accessed

Using these methods I was able to immediately identify the cause of a
deadlock during boot.
@lithp force-pushed the lithp/add-lock-debug-logging branch from e57ada4 to 00f7588 on October 21, 2021 at 21:30
@lithp merged commit 140d2d1 into ethereum:master on Oct 22, 2021
@lithp deleted the lithp/add-lock-debug-logging branch on October 22, 2021 at 16:15