Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow more verbose logging in release binaries #1365

Open
bnaecker opened this issue Jul 7, 2022 · 10 comments
Open

Allow more verbose logging in release binaries #1365

bnaecker opened this issue Jul 7, 2022 · 10 comments
Labels
Debugging For when you want better data in debugging an issue (log messages, post mortem debugging, and more)

Comments

@bnaecker
Copy link
Collaborator

bnaecker commented Jul 7, 2022

This came up during the demo dry-run today. We had an issue whose resolution might have been helped by inspecting trace-level log messages, which we can get via slog-dtrace. However, most (all?) of our crates actually filter out anything above debug at compile time, using this slog feature. We should probably allow more verbose logging in release mode, at least until we have specific reason to remove them.

@bnaecker
Copy link
Collaborator Author

Ugh, so this is not as simple as just changing the slog feature from release_max_level_debug to release_max_level_trace. The main problem is that most of Omicron's crates rely on Dropshot's ConfigLogging to configure slog. That includes a LevelFilter at the root of the slog::Logger hierarchy. We then call slog_dtrace::with_drain in most places to duplicate Dropshot's log messages to DTrace. However, by that time, messages are already filtered by the LevelFilter.

Instead, we'll need to create the slog_dtrace::Drain type first, then stack on the other loggers after that.

@davepacheco
Copy link
Collaborator

As I understand it, this makes it a problem to enable DTrace at any more fine-grained level than what's been configured. But I assume that the slog feature would be enough for us to enable trace-level debugging in the config file.

@bnaecker
Copy link
Collaborator Author

I think that's true, it just unfortunately means you can't have a distinction between the level of log messages sent to say stdout and to DTrace. I don't know how big of a deal that'll be in practice. I'm also not sure I understand why we're not getting messages. I wrote a test that was supposed to verify exactly this behavior, getting messages in the DTrace probes that are filtered out by one of the other drains. It's possible that Dropshot's use of slog_async here is different from slog_term as used in the test. Or something else, slog is honestly a pretty opaque library and ecosystem.

@jclulow
Copy link
Collaborator

jclulow commented Jul 13, 2022

I wonder if we could turn this on its head a bit. Today we use, say, the slog::trace!() macro directly to emit TRACE-level messages. These get compiled out.

What if, instead, we had magic_dtrace_slog::trace!(), a macro that was identical in signature and operation to slog::trace!(), except that we inject a USDT probe at the macro site as well as calling the "real" slog::trace!() immediately after, all from within the macro expansion. slog::trace!() calls we make may be compiled out as they are today, but our probe would not be. It's possible we could even be somewhat clever and use is-enabled probes to reduce the disable probe effect.

Of course, that requires source-level changes to use our magic slog wrapper, but I suspect that may be OK for our purposes; we'd at least be able to hit all of Omicron, Propolis, Crucible, any Dropshot server, any clients built by Progenitor, etc.

@bnaecker
Copy link
Collaborator Author

I'm reasonably sure that Cargo features are biting us again here. slog implements its static maximum logging level as features, e.g., release_max_level_warn or release_max_level_trace. The __slog_static_max_level function has some cfg-style branches that return the statically-configured logging level. That's used in the main log! macro, to emit messages only if they're not filtered. So far so good.

But Omicron has dependencies which also use the feature flags to control the maximum release logging level, such as Dropshot, and those request the release_max_level_debug feature. Because of feature unification, when we compile slog as a dependency of Omicron, we'll get both that feature, and also release_max_level_trace. However, due to the way that __slog_static_max_level function is written, it'll return the minimum over all the values implied by the features of the crates in the dependency graph, not the maximum. That is, if one crate requests release_max_level_off, we'll never get any logging messages at all.

I tested this, by using a local patched copy of dropshot, which lowers the maximum logging level to release_max_level_warn. Now, both DTrace and the standard output logging streams only receive warning messages (or lower). No info! for you!

So what do we do about this? Josh's suggestion is one approach, which effectively bypasses slog altogether and just emits our own messages. I'm not exactly sure I understand how we'll get all the key-value pairs in the whole chain, since Drains only provide those during "serialization", and they are free to ignore or omit messages basically whenever they want. That is, slog conflates which key-value pairs exist with which are written out to the logging stream.

Another option would be to ensure that all of our dependencies use release_max_level_trace. We'll need to make sure we stick level-filters in to prevent those from going to stdout or log-files (if that's what we want), which is a bit frustrating.

One more option would be to submit a patch to slog which reverses the order of the branches in __slog_static_max_level. One could argue that, even with feature unification, one should respect the most-verbose dependency rather than the least. That's probably debatable though. Another patch could be to not abuse features in this way, but rely on some other mechanism to compile out log messages in release builds, though I'm not sure what that would be.

@davepacheco
Copy link
Collaborator

However, due to the way that __slog_static_max_level function is written, it'll return the minimum over all the values implied by the features of the crates in the dependency graph, not the maximum.

That's wild. It'd be worth checking if that's intended. I could see going either way, but either way, this seems to violates Cargo's expectations here. Assuming this doesn't change:

Another option would be to ensure that all of our dependencies use release_max_level_trace.

I'd propose that any crates used as dependencies not specify any slog max_level features at all, which allows the top-level target to decide what it wants. This would be a problem if a crate was both used as a dependency and gets built independently for its binaries [and we want those binaries to have debug- or trace-level logging]. If we have anything like that today I wonder if we should separate them anyway.

We'll need to make sure we stick level-filters in to prevent those from going to stdout or log-files (if that's what we want), which is a bit frustrating.

There are two things here. First is building something to support trace level but allowing it to be filtered at debug level at runtime. We already do this in Omicron, Dropshot, and Steno -- they all use the same config block (processed by Dropshot, I think), which includes a level. Even if we don't use the same implementation everywhere, it's pretty reasonable to me to say that any logger construction has a required parameter which is the desired level.

Then there's the earlier problem where you might want the log file level to be info but the DTrace level to be trace.


Sorry if this is a silly question, but can we have dropshot's ConfigLog::to_logger() create a pipeline that goes from Logger -> branch 1: slog_dtrace drain + branch 2: the existing branch (level filter + the async drain to terminal or file)?

@bnaecker
Copy link
Collaborator Author

bnaecker commented Jul 13, 2022

@davepacheco Still reading everything else, but on the last question, that's basically what we already have. Dropshot calls Logger::root, but since Logger impls Drain, we can kinda ignore that and put the slog-dtrace drain at the root instead. This is done in here in Nexus, for example. You can see that works by setting the log-level to something like warn in the configuration file. The stdout stream only shows warn or lower, but DTrace still gets up to whatever the feature-implied level is.

@bnaecker
Copy link
Collaborator Author

I'd propose that any crates used as dependencies not specify any slog max_level features at all, which allows the top-level target to decide what it wants.

I like this. It's at least good that there's no default feature enabled in slog, which means it can be overridden like this.

There are two things here. First is building something to support trace level but allowing it to be filtered at debug level at runtime. We already do this in Omicron, Dropshot, and Steno -- they all use the same config block (processed by Dropshot, I think), which includes a level. Even if we don't use the same implementation everywhere, it's pretty reasonable to me to say that any logger construction has a required parameter which is the desired level

I agree, this should mostly be handled by Dropshot's logging configuration. I only meant in the case where we don't use that path, we'll need to remember this.

@davepacheco
Copy link
Collaborator

@davepacheco Still reading everything else, but on the last question, that's basically what we already have. Dropshot calls Logger::root, but since Logger impls Drain, we can kinda ignore that and put the slog-dtrace drain at the root instead. This is done in here in Nexus, for example. You can see that works by setting the log-level to something like warn in the configuration file. The stdout stream only shows warn or lower, but DTrace still gets up to whatever the feature-implied level is.

That makes sense, and that's great. Then I'm not sure what this meant earlier:

That includes a LevelFilter at the root of the slog::Logger hierarchy. We then call slog_dtrace::with_drain in most places to duplicate Dropshot's log messages to DTrace. However, by that time, messages are already filtered by the LevelFilter.

@bnaecker
Copy link
Collaborator Author

That makes sense, and that's great. Then I'm not sure what this meant earlier:

I was trying to root-cause it, and I was just wrong :)

I thought the issue was putting something that filters messages at the root. That's not it, the problem is the feature unification done by Cargo.

@smklein smklein added the Debugging For when you want better data in debugging an issue (log messages, post mortem debugging, and more) label Nov 16, 2022
leftwo pushed a commit that referenced this issue Jun 26, 2024
Added a new package, crucible-dtrace that pulls from buildomat a package
that contains a set of DTrace scripts.  These scripts are extracted into
the global zone at /opt/oxide/crucible_dtrace/

Update Crucible to latest includes these updates:
Clean up dependency checking, fixing space leak (#1372)
Make a DTrace package (#1367)
Use a single context in all messages (#1363)
Remove `DownstairsWork`, because it's redundant (#1371)
Remove `WorkState`, because it's implicit (#1370)
Do work immediately upon receipt of a job, if possible (#1366)
Move 'do work for one job' into a helper function (#1365)
Remove `DownstairsWork` from map when handling it (#1361)
Using `block_in_place` for IO operations (#1357)
update omicron deps; use re-exported dropshot types in oximeter-producer configuration (#1369)
Parameterize more tests (#1364)
Misc cleanup, remove sqlite references. (#1360)
Fix `Extent::close` docstring (#1359)
Make many `Region` functions synchronous (#1356)
Remove `Workstate::Done` (unused) (#1355)
Return a sorted `VecDeque` directly (#1354)
Combine `proc_frame` and `do_work_for` (#1351)
Move `do_work_for` and `do_work` into `ActiveConnection` (#1350)
Support arbitrary Volumes during replace compare (#1349)
Remove the SQLite backend (#1352)
Add a custom timeout for buildomat tests (#1344)
Move `proc_frame` into `ActiveConnection` (#1348)
Remove `UpstairsConnection` from `DownstairsWork` (#1341)
Move Work into ConnectionState (#1340)
Make `ConnectionState` an enum type (#1339)
Parameterize `test_repair.sh` directories (#1345)
Remove `Arc<Mutex<Downstairs>>` (#1338)
Send message to Downstairs directly (#1336)
Consolidate `on_disconnected` and `remove_connection` (#1333)
Move disconnect logic to the Downstairs (#1332)
Remove invalid DTrace probes. (#1335)
Fix outdated comments (#1331)
Use message passing when a new connection starts (#1330)
Move cancellation into Downstairs, using a token to kill IO tasks (#1329)
Make the Downstairs own per-connection state (#1328)
Move remaining local state into a `struct ConnectionState` (#1327)
Consolidate negotiation + IO operations into one loop (#1322)
Allow replacement of a target in a read_only_parent (#1281)
Do all IO through IO tasks (#1321)
Make `reqwest_client` only present if it's used (#1326)
Move negotiation into Downstairs as well (#1320)
Update Rust crate clap to v4.5.4 (#1301)
Reuse a reqwest client when creating Nexus clients (#1317)
Reuse a reqwest client when creating repair client (#1324)
Add % to keep buildomat happy (#1323)
Downstairs task cleanup (#1313)
Update crutest replace test, and mismatch printing. (#1314)
Added more DTrace scripts. (#1309)
Update Rust crate async-trait to 0.1.80 (#1298)
leftwo added a commit that referenced this issue Jun 26, 2024
Update Crucible and Propolis to the latest

Added a new package, crucible-dtrace that pulls from buildomat a package
that contains a set of DTrace scripts. These scripts are extracted into the 
global zone at /opt/oxide/crucible_dtrace/

Crucible latest includes these updates:
Clean up dependency checking, fixing space leak (#1372) Make a DTrace
package (#1367)
Use a single context in all messages (#1363)
Remove `DownstairsWork`, because it's redundant (#1371) Remove
`WorkState`, because it's implicit (#1370)
Do work immediately upon receipt of a job, if possible (#1366) Move 'do
work for one job' into a helper function (#1365) Remove `DownstairsWork`
from map when handling it (#1361) Using `block_in_place` for IO
operations (#1357)
update omicron deps; use re-exported dropshot types in oximeter-producer
configuration (#1369) Parameterize more tests (#1364)
Misc cleanup, remove sqlite references. (#1360)
Fix `Extent::close` docstring (#1359)
Make many `Region` functions synchronous (#1356)
Remove `Workstate::Done` (unused) (#1355)
Return a sorted `VecDeque` directly (#1354)
Combine `proc_frame` and `do_work_for` (#1351)
Move `do_work_for` and `do_work` into `ActiveConnection` (#1350) Support
arbitrary Volumes during replace compare (#1349) Remove the SQLite
backend (#1352)
Add a custom timeout for buildomat tests (#1344)
Move `proc_frame` into `ActiveConnection` (#1348)
Remove `UpstairsConnection` from `DownstairsWork` (#1341) Move Work into
ConnectionState (#1340)
Make `ConnectionState` an enum type (#1339)
Parameterize `test_repair.sh` directories (#1345)
Remove `Arc<Mutex<Downstairs>>` (#1338)
Send message to Downstairs directly (#1336)
Consolidate `on_disconnected` and `remove_connection` (#1333) Move
disconnect logic to the Downstairs (#1332)
Remove invalid DTrace probes. (#1335)
Fix outdated comments (#1331)
Use message passing when a new connection starts (#1330) Move
cancellation into Downstairs, using a token to kill IO tasks (#1329)
Make the Downstairs own per-connection state (#1328) Move remaining
local state into a `struct ConnectionState` (#1327) Consolidate
negotiation + IO operations into one loop (#1322) Allow replacement of a
target in a read_only_parent (#1281) Do all IO through IO tasks (#1321)
Make `reqwest_client` only present if it's used (#1326) Move negotiation
into Downstairs as well (#1320)
Update Rust crate clap to v4.5.4 (#1301)
Reuse a reqwest client when creating Nexus clients (#1317) Reuse a
reqwest client when creating repair client (#1324) Add % to keep
buildomat happy (#1323)
Downstairs task cleanup (#1313)
Update crutest replace test, and mismatch printing. (#1314) Added more
DTrace scripts. (#1309)
Update Rust crate async-trait to 0.1.80 (#1298)

Propolis just has this one update:
Allow boot order config in propolis-standalone
---------

Co-authored-by: Alan Hanson <alan@oxide.computer>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Debugging For when you want better data in debugging an issue (log messages, post mortem debugging, and more)
Projects
None yet
Development

No branches or pull requests

4 participants