
Emit warnings when locks are held for too long #141

Merged (1 commit into ethereum:master on Oct 22, 2021)

Conversation

@lithp (Contributor) commented Oct 15, 2021

Quick note: I'm not sure who to give this to so I gave it to you @njgheorghita, but feel free to pass it off to somebody else if you think that's appropriate!

This PR is not yet finished; there's some cleanup to do (such as removing code duplication, and logging when rw_write!() calls hold onto the lock for too long), but I'm opening a PR now to get feedback on the general approach.


  • Add rw_read and rw_write macros which wrap access to locks

    • They emit warnings if lock acquisition takes over 100ms
    • They emit warnings if locks are held for over 100ms
  • Use rw_read and rw_write macros in most places RwLocks are accessed

Using these macros I was able to immediately identify the cause of a
deadlock during boot.


Example output:

$ RUST_LOG=debug TRIN_INFURA_PROJECT_ID=XXX cargo run -- --networks state --bootnodes $ENR
Launching trin
[more output]
Oct 15 14:25:25.497 DEBUG trin_state::network: Attempting bond with bootnode ENR: NodeId: 0x76f9..6f28, Socket: Some(71.202.127.37:4567)    
Oct 15 14:25:25.598  WARN trin_core::portalnet::events: [trin-core/src/portalnet/events.rs:31] took more than 100 ms to acquire lock    
Oct 15 14:25:25.598  WARN trin_core: [trin-state/src/network.rs:38] lock held for over 100ms, not yet released    
Oct 15 14:25:25.699  WARN trin_core::portalnet::overlay: [trin-core/src/portalnet/overlay.rs:279] waiting more than 100ms to acquire lock, still waiting    

This makes it trivial to figure out where the deadlock is!
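A minimal sketch of the idea, assuming a tokio RwLock and the log crate for output. TimedReadGuard, WARN_AFTER, and the message strings are placeholders rather than this PR's actual code, and the real macros warn while the lock is still held instead of only on release:

use std::time::{Duration, Instant};

pub const WARN_AFTER: Duration = Duration::from_millis(100);

// Wraps the read guard so that dropping it can report how long the lock was held.
pub struct TimedReadGuard<'a, T> {
    pub guard: tokio::sync::RwLockReadGuard<'a, T>,
    pub acquired_at: Instant,
    pub location: &'static str,
}

impl<T> Drop for TimedReadGuard<'_, T> {
    fn drop(&mut self) {
        if self.acquired_at.elapsed() > WARN_AFTER {
            log::warn!("[{}] lock held for over 100ms", self.location);
        }
    }
}

// Times lock acquisition and returns the wrapped guard. A real version would
// also implement Deref/DerefMut on TimedReadGuard to expose the inner guard.
// (Assumes WARN_AFTER and TimedReadGuard live at the crate root.)
#[macro_export]
macro_rules! rw_read {
    ($lock:expr) => {{
        let start = std::time::Instant::now();
        let guard = $lock.read().await;
        if start.elapsed() > $crate::WARN_AFTER {
            log::warn!(
                "[{}:{}] took more than 100 ms to acquire lock",
                file!(),
                line!()
            );
        }
        $crate::TimedReadGuard {
            guard,
            acquired_at: std::time::Instant::now(),
            location: concat!(file!(), ":", line!()),
        }
    }};
}

Because file!() and line!() are expanded at the call site of rw_read!, the warnings point at the code that touched the lock, which is what makes output like the example above useful.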

@carver (Collaborator) left a comment

Neat. I didn't do a deep dive on the macro, but generally LGTM 👍🏻

}

#[macro_export]
macro_rules! rw_read {
Collaborator:

If we're going to leave this lock macro in place everywhere (which I'm cool with), then we should probably only run this tooling overhead on non-release builds. Something like branching on debug_assertions (I found that trick in this post).
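For illustration, the branch might look roughly like this (a sketch only; the commit linked below may do it differently):

#[macro_export]
macro_rules! rw_read {
    ($lock:expr) => {{
        // Debug builds: time the acquisition and warn if it was slow.
        #[cfg(debug_assertions)]
        {
            let start = std::time::Instant::now();
            let guard = $lock.read().await;
            if start.elapsed() > std::time::Duration::from_millis(100) {
                log::warn!(
                    "[{}:{}] took more than 100 ms to acquire lock",
                    file!(),
                    line!()
                );
            }
            guard
        }
        // Release builds: plain lock access, no timing overhead.
        #[cfg(not(debug_assertions))]
        {
            $lock.read().await
        }
    }};
}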

Author (@lithp):

I experimented with using debug_assertions: 0cb2445

I think I'd prefer to drop this commit though. It makes the code a fair amount less readable, doesn't seem like it drops very much overhead, and prevents us from being able to diagnose deadlocks if someone running a release build tries to send us their logs.

What do you think?

}

#[macro_export]
macro_rules! rw_read {
Collaborator:

How would you feel about implementing this as an extension trait, using std::panic::Location::caller and #[track_caller]?

Would be more natural than a macro, with much less code to compile.
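As a rough illustration of that shape on the blocking std::sync::RwLock (trait and method names here are invented for the sketch; the async lock is where this later hits a snag):

use std::panic::Location;
use std::sync::{RwLock, RwLockReadGuard};
use std::time::{Duration, Instant};

pub trait ReadWithWarn<T> {
    #[track_caller]
    fn read_with_warn(&self) -> RwLockReadGuard<'_, T>;
}

impl<T> ReadWithWarn<T> for RwLock<T> {
    fn read_with_warn(&self) -> RwLockReadGuard<'_, T> {
        // #[track_caller] on the trait method makes Location::caller() report
        // the call site of read_with_warn(), not this function body.
        let caller = Location::caller();
        let start = Instant::now();
        let guard = self.read().expect("lock poisoned");
        if start.elapsed() > Duration::from_millis(100) {
            log::warn!(
                "[{}:{}] took more than 100 ms to acquire lock",
                caller.file(),
                caller.line()
            );
        }
        guard
    }
}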

Author (@lithp):

Oh perfect, thank you. Yes, I used a macro exactly because I was ignorant of the existence of track_caller and friends. I'll give using them a try!

Author (@lithp):

I gave it a try but ended up failing. I don't know enough Rust to feel productive at figuring out what's going on, so I think I'll leave this as a macro for now.

When I apply #[track_caller] to the method everything builds cleanly but the caller is not tracked. I'm using cargo 1.55.0, and support for #[track_caller] in extension traits is from 1.46.0, so this should work. My best guess is that the trait is somehow interacting badly with the #[async_trait] macro I needed to use; I expect that when this macro rewrites the method signature it doesn't faithfully copy over the attribute.

Author (@lithp):

Deleted the previous comment, I was misreading the output, still haven't been able to get #[track_caller] to work.

Author (@lithp):

The problem seems to be that #[track_caller] does not work for async functions. rust-lang/rust#78840

Collaborator:

Would something like this work?

Author (@lithp):

Indeed, it does! Pushed some commits which incorporate it, thanks for the help

let now = std::time::Instant::now();

loop {
tokio::select! {
Collaborator:

Since there's already a dependency on futures, might be more understandable to write this using futures::future::select?
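For illustration, a helper built on futures::future::select might look roughly like this (placeholder names; log::warn! stands in for the crate's actual logging):

use std::time::Duration;

use futures::future::{self, Either};
use tokio::sync::{RwLock, RwLockReadGuard};

// Placeholder helper: race lock acquisition against a 100 ms timer using
// futures::future::select; if the timer wins, warn and keep waiting.
async fn read_with_warn<T>(lock: &RwLock<T>) -> RwLockReadGuard<'_, T> {
    let acquire = Box::pin(lock.read());
    let timer = Box::pin(tokio::time::sleep(Duration::from_millis(100)));

    match future::select(acquire, timer).await {
        // The lock arrived within 100 ms.
        Either::Left((guard, _timer)) => guard,
        // The timer fired first: warn, then keep waiting for the lock.
        Either::Right((_, acquire)) => {
            log::warn!("waiting more than 100ms to acquire lock, still waiting");
            acquire.await
        }
    }
}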

Author (@lithp):

I'll give it a try

Member:

> Since there's already a dependency on futures, might be more understandable to write this using futures::future::select?

What are the pros of using future::select here if we already have tokio as a dependency? Looking at this PR, it seems to me that tokio::select may be the better alternative?

Collaborator:

I find the function-based approach easier to read, though I don't really have that strong of an opinion on the subject.

Author (@lithp):

I wasn't able to find much of a difference between the two select approaches, but I refactored this code a bit by unrolling the loop, which should make it a lot more understandable for first-time readers.
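Roughly the shape of the unrolled version, as a sketch with placeholder names (warning once and then simply continuing to wait):

use std::time::Duration;

use tokio::sync::{RwLock, RwLockReadGuard};

// Placeholder helper showing the unrolled version: warn once if acquisition
// takes more than 100 ms, then keep waiting without a loop.
async fn read_with_warn<T>(lock: &RwLock<T>) -> RwLockReadGuard<'_, T> {
    let acquire = lock.read();
    tokio::pin!(acquire);

    tokio::select! {
        // Got the lock within 100 ms: nothing to report.
        guard = &mut acquire => return guard,
        // Timer fired first: warn, then fall through and keep waiting.
        _ = tokio::time::sleep(Duration::from_millis(100)) => {
            log::warn!("waiting more than 100ms to acquire lock, still waiting");
        }
    }

    acquire.await
}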

@njgheorghita (Collaborator) left a comment

lgtm! nothing to add from my end

@lithp force-pushed the lithp/add-lock-debug-logging branch 2 times, most recently from 8ecb3b9 to e57ada4 on October 21, 2021 at 21:19
- Add read_with_warn() and write_with_warn() methods to RwLock
  - They emit warnings if lock acquisition takes over 100ms
  - They emit warnings if locks are held for over 100ms

- Use read_with_warn() and write_with_warn() in most places RwLocks are accessed

Using these methods I was able to immediately identify the cause of a
deadlock during boot.
@lithp force-pushed the lithp/add-lock-debug-logging branch from e57ada4 to 00f7588 on October 21, 2021 at 21:30
@lithp merged commit 140d2d1 into ethereum:master on Oct 22, 2021
@lithp deleted the lithp/add-lock-debug-logging branch on October 22, 2021 at 16:15