Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Write panic errors to logs (#445) #446

Merged
merged 6 commits into from
May 1, 2024
Merged

Conversation

gvbgduh
Copy link
Contributor

@gvbgduh gvbgduh commented Apr 4, 2024

Loggin from panic::set_hook.

It uses error macro from the same package module log

It works, but only after the logger is initialized by the kaspad daemon, before that it does nothing, so there is a gap.

>:~/.rusty-kaspa/kaspa-testnet-11/logs$ cat rusty-kaspa_err.log
2024-04-05 01:17:20.187+11:00 [ERROR] Panic at kaspad/src/daemon.rs:239: Test logger panic
>:~/.rusty-kaspa/kaspa-testnet-11/logs$ cat rusty-kaspa.log
2024-04-05 01:17:20.187+11:00 [INFO ] Logger is initiated with the log level: INFO and the log directory: "/home/george/.rusty-kaspa/kaspa-testnet-11/logs"
2024-04-05 01:17:20.187+11:00 [INFO ] kaspad v0.13.5
2024-04-05 01:17:20.187+11:00 [INFO ] Application directory: /home/george/.rusty-kaspa
2024-04-05 01:17:20.187+11:00 [INFO ] Data directory: /home/george/.rusty-kaspa/kaspa-testnet-11/datadir
2024-04-05 01:17:20.187+11:00 [INFO ] Logs directory: /home/george/.rusty-kaspa/kaspa-testnet-11/logs
2024-04-05 01:17:20.187+11:00 [ERROR] Panic at kaspad/src/daemon.rs:239: Test logger panic

@@ -8,6 +9,23 @@ pub fn configure_panic() {
default_hook(panic_info);
println!("Exiting...");
// TODO: setup a wait time and fold the log system properly
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is also this comment, it's probably partially addressed, but I'm not sure about the case when the logger is not initialized yet and also what's meant by the wait time.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this means we should make sure that logs are actually flushed into the files before the process exits. This can possibly be done by making special calls to the underlying log library (maybe @tiram88 would know how), or at least adding a short sleep here

Copy link
Contributor Author

@gvbgduh gvbgduh Apr 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, it's also the requirement I haven't clarified yet. When is it ok to lose the logs?

Even if we try to wait, the logger might never be initialized as the panic condition might happen before.
Another approach is probably to initialize a fallback logger within the hook or even write to the log file directly if needed.
But, it can cause a race condition, as the logger system can only be initialized once (link), unless we use the different mechanism for the fallback.
But, even with this, logs still will be lost in cases like no disk space or potentially the lack of resources/permissions.

Also, the default behavior of the log facade is to ignore failures writing logs:

And it seems that there is not much can be done other than deciding whether to panic on log failures globally or not. (Unless we can rely on the specific loggers implementations, but more research needed.)
It's not very helpful in this case, as panicking on a log failure inside the panic itself should causes an infinite recursion code flow, but it should be possible to avoid.

So, what would be the reasonable approach and what trade offs are acceptable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd probably propose to keep this implementation, but also write the panic error to the dedicated file.
In this case, we write to logs when the logger is available, and we write the panic file regardless of the logger status (as long as we can write to the file system).
The issue here is it probably needs some standardization, like location of the file and its content/structure, as it might also be helpful to extend it to the panic report of sorts, providing additional details about the host.
But, I'm not too sure how suitable this approach is.

@gvbgduh
Copy link
Contributor Author

gvbgduh commented Apr 4, 2024

I'm still thinking about unit tests

@aspect
Copy link
Collaborator

aspect commented Apr 7, 2024

I am concerned that calling error!() before the logger is initialized could cause a panic by itself. If this is the case perhaps a separate function that sets a static atomic bool after the logger init indicating that it is safe to use error!() would make sense. If this is not the case I guess this comment can be ignored.

@gvbgduh
Copy link
Contributor Author

gvbgduh commented Apr 7, 2024

I am concerned that calling error!() before the logger is initialized could cause a panic by itself. If this is the case perhaps a separate function that sets a static atomic bool after the logger init indicating that it is safe to use error!() would make sense. If this is not the case I guess this comment can be ignored.

It's quite the opposite, the default behavior is to silently swallow all log failures, I have corresponding links in another comment. But a static atomic bool might still be of the help.

@@ -16,6 +17,8 @@ mod logger;
}
}

pub static LOGGING_INITIALIZED: AtomicBool = AtomicBool::new(false);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

on a second thought, it might be better to keep it private and have only a public getter in the module

Copy link
Collaborator

@coderofstuff coderofstuff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested this with these configs, adding a panic in the code

  • --logdir=./logs - panic is logged in both std and error log files as an Error
  • --nologfiles - still works fine

PR is functional.

@biryukovmaxim
Copy link
Collaborator

biryukovmaxim commented Apr 23, 2024

@gvbgduh hi! Thanks for your efforts, it's nice to see our gang of contributors getting bigger!

  1. there are some locations where panic handler is settled before logger initialization. like here
impl Runtime {
    pub fn from_args(args: &Args) -> Self {
        // Configure the panic behavior
        kaspa_core::panic::configure_panic();

        let log_dir = get_log_dir(args);

        // Initialize the logger
        kaspa_core::log::init_logger(log_dir.as_deref(), &args.log_level);

        Self { log_dir: log_dir.map(|log_dir| log_dir.to_owned()) }
    }
}

could you check them?

  1. log::logger() function already has a fallback
    we can rely on it.
    so if we set hook using the function. when we panic before initialization - it won't produce logs
    after - it will work.

  2. Imo no need to use atomic bool, moreover the current approach blocks the thread(since it's sync function) waiting for logger initialization. it means the only way to set the logger - is from another thread, unlikely it will ever be done

  3. to handle the case where we panic before the initialization. (if we really want to handle it now and here) I would suggest to use OnceBool or lazy_static unit () to make it impossible to set false in purposes to determine if logger is initialized.
    the state machine would look like:

   None before logger initialized 
   Some(()) after initialization
   No way to make it None again
Only one possible transition

and direct writing to file when it's not initialized

@gvbgduh
Copy link
Contributor Author

gvbgduh commented Apr 23, 2024

thanks @biryukovmaxim, very interesting.

thb, i haven't exercised the thought about 3, but it makes perfect sense as i call sleep in the sync function.

i'd agree we don't really need atomic bool as i see it just recreates the fallback logic from the link in 2, it's just that the initialization state is private to the log crate.
i think it should probably be possible to hack around some logic about the logger itself, all in all, it'll be NoOp logger otherwise, but, i think it's a bit too hacky.

if we really want to handle it now and here

yes, from the practical point of view, i'm really on the fence it's worth the trouble, as i'm not sure there is a lot of business value in covering the gaps for panic before the logging system is up as in most cases it would be within the interaction with CLI.

i'd personally propose to revert to the first revision: we do our best effort to log, but it'll be up to the logging system;
and have a separate ticket for the panic report: on every panic write a dedicate report directly to a file. that file should be modelled and have additional context like host details, host health, etc. that would cover cases when logging wasn't up, but there was still unattended panic during the execution.

but, i'm happy to follow the OnceBool path as well.

Copy link
Collaborator

@coderofstuff coderofstuff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please address @biryukovmaxim's comments

Reason: Waiting here will block the thread, and as the thread is locked
and we wait for the logger system to be initiated that would wait for
the thread to be released itself, making the waiting pointless
@gvbgduh
Copy link
Contributor Author

gvbgduh commented May 1, 2024

As waiting there is pointless, reverting to the initial commit, where logging is done relying on the logging fallback itself.
Errors caused by panic before the logging system is ready would be ignored (follow up issue #465).

},
};
// Log the panic
error!("Panic at {}:{}: {}", file, line, message);
Copy link
Collaborator

@biryukovmaxim biryukovmaxim May 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PanicInfo implements Display almost the same way as yours. Could you check how the hook work with panic_any passing custom type?

There's method message , however it's only accessable on nightly toolchain. Although std uses it when implements Display

I would suggest to use Display fmt if it works with panic_any, otherwise there's no difference to me

Copy link
Collaborator

@coderofstuff coderofstuff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I retested this by injecting a panic! in the code and also running with --nologfiles and it's functional.

Sample logged panic found in the log files:

2024-04-30 22:52:39.202-06:00 [INFO ] kaspad v0.13.6
2024-04-30 22:52:39.202-06:00 [INFO ] Application directory: /home/dev/.rusty-kaspa
2024-04-30 22:52:39.202-06:00 [INFO ] Data directory: /home/dev/.rusty-kaspa/kaspa-mainnet/datadir
2024-04-30 22:52:39.202-06:00 [INFO ] Logs directory: ./logs
thread 'main' panicked at kaspad/src/daemon.rs:249:5:
At the disco
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Exiting...
2024-04-30 22:52:39.203-06:00 [ERROR] Panic at kaspad/src/daemon.rs:249: At the disco

PR looks much simpler than before.

@gvbgduh
Copy link
Contributor Author

gvbgduh commented May 1, 2024

Testing
Setup:

cargo run --release --bin kaspad -- --testnet --netsuffix=11

Manually introduced panic after login init:

>~/rusty-kaspa$ git diff
diff --git a/kaspad/src/daemon.rs b/kaspad/src/daemon.rs
index dd8a215..d60c3b0 100644
--- a/kaspad/src/daemon.rs
+++ b/kaspad/src/daemon.rs
@@ -240,6 +240,9 @@ pub fn create_core_with_runtime(runtime: &Runtime, args: &Args, fd_total_budget:
         }
     }

+    // Point of failure after the logs are initialized
+    panic!("This is a test panic with logging initiated");
+
     let consensus_db_dir = db_dir.join(CONSENSUS_DB);

Stderr:

2024-05-01 14:53:43.032+10:00 [INFO ] kaspad v0.13.6
2024-05-01 14:53:43.033+10:00 [INFO ] Application directory: /home/george/.rusty-kaspa
2024-05-01 14:53:43.033+10:00 [INFO ] Data directory: /home/george/.rusty-kaspa/kaspa-testnet-11/datadir
2024-05-01 14:53:43.033+10:00 [INFO ] Logs directory: /home/george/.rusty-kaspa/kaspa-testnet-11/logs
thread 'main' panicked at kaspad/src/daemon.rs:244:5:
This is a test panic with logging initiated
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Exiting...
2024-05-01 14:53:43.033+10:00 [ERROR] Panic at kaspad/src/daemon.rs:244: This is a test panic with logging initiated

Logs:

george@Darksteel:~/rusty-kaspa$ ls /home/george/.rusty-kaspa/kaspa-testnet-11/logs
rusty-kaspa.log  rusty-kaspa_err.log
george@Darksteel:~/rusty-kaspa$ cat /home/george/.rusty-kaspa/kaspa-testnet-11/logs/rusty-kaspa.log
2024-05-01 14:57:01.610+10:00 [INFO ] kaspad v0.13.6
2024-05-01 14:57:01.610+10:00 [INFO ] Application directory: /home/george/.rusty-kaspa
2024-05-01 14:57:01.610+10:00 [INFO ] Data directory: /home/george/.rusty-kaspa/kaspa-testnet-11/datadir
2024-05-01 14:57:01.611+10:00 [INFO ] Logs directory: /home/george/.rusty-kaspa/kaspa-testnet-11/logs
2024-05-01 14:57:01.611+10:00 [ERROR] Panic at kaspad/src/daemon.rs:244: This is a test panic with logging initiated
$ cat /home/george/.rusty-kaspa/kaspa-testnet-11/logs/rusty-kaspa_err.log
2024-05-01 14:57:01.611+10:00 [ERROR] Panic at kaspad/src/daemon.rs:244: This is a test panic with logging initiated

@biryukovmaxim
Copy link
Collaborator

@gvbgduh could you go through the code and reorder all places where panic handler is settled before logger initialization?

Copy link
Collaborator

@coderofstuff coderofstuff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, my bad. Please check the comments again.

Also, I noticed in the full panic log it looks like this:

thread 'main' panicked at kaspad/src/daemon.rs:249:5:
At the disco

Notice the :249:5: above but the log in the file only has this:

2024-04-30 22:52:39.203-06:00 [ERROR] Panic at kaspad/src/daemon.rs:249: At the disco

The panic log in the file is missing the :5 - this could be useful in lines where there are multiple things happening.

@gvbgduh
Copy link
Contributor Author

gvbgduh commented May 1, 2024

@biryukovmaxim yep, can do
@coderofstuff indeed, thanks, let me fix it

@gvbgduh
Copy link
Contributor Author

gvbgduh commented May 1, 2024

Updated test:

$ cat /home/george/.rusty-kaspa/kaspa-testnet-11/logs/rusty-kaspa.log | tail -n 2
2024-05-01 15:25:36.190+10:00 [INFO ] Logs directory: /home/george/.rusty-kaspa/kaspa-testnet-11/logs
2024-05-01 15:25:36.190+10:00 [ERROR] Panic at kaspad/src/daemon.rs:244:5: This is a test panic with logging initiated
$ cat /home/george/.rusty-kaspa/kaspa-testnet-11/logs/rusty-kaspa_err.log | tail -n 2
2024-05-01 15:05:35.678+10:00 [ERROR] Panic at kaspad/src/daemon.rs:202: This is a test panic without logging initiated
2024-05-01 15:25:36.190+10:00 [ERROR] Panic at kaspad/src/daemon.rs:244:5: This is a test panic with logging initiated

@gvbgduh
Copy link
Contributor Author

gvbgduh commented May 1, 2024

@biryukovmaxim just for a thought exercise, is it worth considering cases when there is panic before kaspa_core::panic::configure_panic();?

@biryukovmaxim
Copy link
Collaborator

biryukovmaxim commented May 1, 2024

@biryukovmaxim just for a thought exercise, is it worth considering cases when there is panic before kaspa_core::panic::configure_panic();?

I think it's a rare case, that mostly can appear during development phase, where regular panic should be okay.
What can we do with that? Setting another panic(call it prehook) handler before current? Should we considering panics before prehook again? =D

@gvbgduh
Copy link
Contributor Author

gvbgduh commented May 1, 2024

@biryukovmaxim thanks, the last commit should cover all places where the panic hook is set

@michaelsutton
Copy link
Contributor

The name of the panicking thread is important too

@gvbgduh
Copy link
Contributor Author

gvbgduh commented May 1, 2024

good point, update the error message

    Finished release [optimized] target(s) in 1m 06s
     Running `target/release/kaspad --testnet --netsuffix=11`
thread 'main' panicked at kaspad/src/daemon.rs:225:5:
Test panic with logging
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Exiting...
2024-05-01 20:46:43.740+10:00 [ERROR] Panic at the thread main at kaspad/src/daemon.rs:225:5: Test panic with logging

~/rusty-kaspa$ cat /home/george/.rusty-kaspa/kaspa-testnet-11/logs/rusty-kaspa.log | tail -n 2
2024-05-01 20:15:01.498+10:00 [ERROR] Panic at kaspad/src/daemon.rs:225:5: Test panic with logging
2024-05-01 20:46:43.740+10:00 [ERROR] Panic at the thread main at kaspad/src/daemon.rs:225:5: Test panic with logging

~/rusty-kaspa$ cat /home/george/.rusty-kaspa/kaspa-testnet-11/logs/rusty-kaspa_err.log | tail -n 2
2024-05-01 20:15:01.498+10:00 [ERROR] Panic at kaspad/src/daemon.rs:225:5: Test panic with logging
2024-05-01 20:46:43.740+10:00 [ERROR] Panic at the thread main at kaspad/src/daemon.rs:225:5: Test panic with logging

@@ -19,18 +15,17 @@ pub fn configure_panic() {
Some(s) => *s,
None => match panic_info.payload().downcast_ref::<String>() {
Some(s) => &s[..],
None => "unknown",
None => "Box<dyn Any>",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

interesting, what would it produce?

Copy link
Contributor Author

@gvbgduh gvbgduh May 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or it's still the string? is there a reason for such a value?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i see, thanks :)

Copy link
Contributor

@michaelsutton michaelsutton May 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can also find a way to use this https://doc.rust-lang.org/src/core/panic/location.rs.html#198 but I think we're good

@michaelsutton michaelsutton merged commit 034d25d into kaspanet:master May 1, 2024
6 checks passed
@michaelsutton
Copy link
Contributor

Merged. Congratulations @gvbgduh for the first contribution 💯

smartgoo pushed a commit to smartgoo/rusty-kaspa that referenced this pull request Jun 18, 2024
* Write panic errors to logs (kaspanet#445)

* Remove the comment as it doesn't need to be addressed

Reason: Waiting here will block the thread, and as the thread is locked
and we wait for the logger system to be initiated that would wait for
the thread to be released itself, making the waiting pointless

* Add column to the log message for the panic error

* Move panic setup after the logger init in all places

* Add the thread name to the panic error

* default hook invoke position + minor style changes
D-Stacks pushed a commit to D-Stacks/rusty-kaspa that referenced this pull request Jul 12, 2024
* Write panic errors to logs (kaspanet#445)

* Remove the comment as it doesn't need to be addressed

Reason: Waiting here will block the thread, and as the thread is locked
and we wait for the logger system to be initiated that would wait for
the thread to be released itself, making the waiting pointless

* Add column to the log message for the panic error

* Move panic setup after the logger init in all places

* Add the thread name to the panic error

* default hook invoke position + minor style changes
D-Stacks pushed a commit to D-Stacks/rusty-kaspa that referenced this pull request Jul 17, 2024
* Write panic errors to logs (kaspanet#445)

* Remove the comment as it doesn't need to be addressed

Reason: Waiting here will block the thread, and as the thread is locked
and we wait for the logger system to be initiated that would wait for
the thread to be released itself, making the waiting pointless

* Add column to the log message for the panic error

* Move panic setup after the logger init in all places

* Add the thread name to the panic error

* default hook invoke position + minor style changes
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants