-
Notifications
You must be signed in to change notification settings - Fork 848
Reduce overhead incurred by enabling debug output. #8581
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
This replaces: #7674 |
|
Here is the result of performance testing in Yahoo prod of this PR: https://gist.github.com/ywkaras/ed6ebc39518d88f9b0d3ee007d0573de It shows that, with this change, debug output can be enabled globally with no loss of performance. |
|
@SolidWallOfCode I'll give you the Grigor pony if Linkedin will merge this into their internal ATS. On a darker note, I could also burn Box Alan if you don't keep me happy. |
5e8a031 to
0cd5a7b
Compare
0cd5a7b to
042212c
Compare
Setting proxy.config.diags.debug.enabled to 3 enables most core debug output (whose tag matches the debug regex), and plugin debug output from the new API macro TSDbg(). Setting it to 1 enables all (matching) debug output, but with the previous high overhead. Note that enabling a large qualtity of debug output will still have a big impact on performance. This commit could be merge into proprietary branches of trafficserver, without causing routine merge conflicts when aligning with the open source branch.
|
[approve ci autest] |
|
This change is deployed in Yahoo Prod, no problems seen, has been found to be very helpful. |
|
@cmcfarlen is going to take a look at this. |
| #include <tscore/Diags.h> | ||
|
|
||
| // The resistry of fast debug controllers has a ugly implementation to handle the whole-program initialization | ||
| // order problem with C++. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain how this handles the whole-program init order problem? Also, a general summary of the approach here how the debug tags are assembled into this registry and efficiently checked.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The registry data is "naturally" just a static member variable. But the registry may be accessed before initialization for the .cc file defining the registry data. So the data is instead defined as a static local in a static function that returns a referent as to the data. This is an ugly but standard solution to the order of initialization problem.
Under the old design, each "Debug()" call stores the debug tag. When debug was globally enabled, the tag was tested for a match with the configured PCRE for enabled tags. If it matched, the output was generated. But, the test for a match required significant computation, causing a big degradation in ATS performance.
Under the new design, each "Debug()" call stores a pointer into entry in the registry set for the tag. A set entry contains the tag string and an on/off flag for the tag. So each Debug call only has to dereference the pointer to the set entry, and then check the on/off flag. When the PCRE for enabled tags is changed, the registry set is iterated over, and only then is each tag checked for a match to the PCRE. The on/off flag for the tag is then set based on the check result.
include/tscore/Diags.h
Outdated
| #define Debug(tag__, ...) \ | ||
| do { \ | ||
| if (unlikely(diags()->on())) { \ | ||
| static DbgCtl ctl__(tag__); \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
identifiers with double underscores are reserved for the implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK gone.
| // Tag is a debug tag. Debug output associated with this control will be output when debug output | ||
| // is enabled globally, and the tag matches the configured debug tag regular expression. | ||
| // | ||
| DbgCtl(char const *tag) : _ptr(_get_ptr(tag)) {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could call TSDbgCtlCreate instead of _get_ptr directly to get the asserts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean checking that tag is not null? I don't think we normally call TS API functions from core TS.
| ink_assert(sz > 0); | ||
|
|
||
| { | ||
| char *t = new char[sz + 1]; // Deleted implicitly by program termination. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤔Not sure a memcheck tool would agree.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True, it's likely to be flagged.
cmcfarlen
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR does dramatically increase performance of debug tracing. The way the tags are memoized leaves a lot of static variables all over the code (one static pointer per debug statement, one entry per tag in a global map). This is a time vs space trade-off, but doesn't add additional locking after the first time the statement is executed. I tested this without the volatile declaration on the tag struct and it still works fine so I recommend dropping that.
| } TSHttpConnectOptions; | ||
|
|
||
| typedef struct TSDbgCtl_s { | ||
| char volatile on; // Flag |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The way this volatile declaration is used is undefined behavior. See usage for volatile after C++11. particularly:
This makes volatile objects suitable for communication with a signal handler, but not with another thread of execution, see std::memory_order). Any attempt to refer to a volatile object through a glvalue of non-volatile type (e.g. through a reference or pointer to non-volatile type) results in undefined behavior.
If you want to guarantee that changes are seen immediately in another thread, you will need to use some memory ordering mechanism. Otherwise, other threads will see changes the next time that memory is read.
I tested this without volatile and it still adjusts the debug that are printed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reality is that ATS often relies on de facto portability, rather than portability guaranteed by the Standard.
I guess memory ordering has been added to straight C, but I don't know the specifics, I'll have to research this. The point of the volatile is to emphasize that that the on flag should not be register cached. But it seems implausible the compiler could do that indefinitely anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's correct that (when using Debug() instead of Dbg() ) there's a significant increase in loaded data segment size, But, it's a delta on top of the high resource use of trace output even with the old design. Which seems worthwhile, to make tracing usable in production.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree this is a worthwhile trade off for performant debugging! It's good stuff.
I also think that in this case, any changes to on values will be seen "soon enough" by other threads and that volatile just isn't necessary here(especially since its accessed through a non-volatile pointer).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although it's unlikely, this code illustrates the sort of optimization that could happen to a Debug statement in a loop, if we don't do something to prevent the compiler from register-caching the 'on' flag: https://godbolt.org/z/r4K9TP1qW
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The alternate to simply relying on volatile would be to use atomic_load_explicit() with relaxed memory order. But that would mean we'd have to require plugins to be build with C11 support, which I don't think we've done up till now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that the Standard says that the behavior of reinterpret_cast is undefined, but ATS uses it over 1,400 times.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is closer to (but still not quite) the situation we have here. We are accessing a volatile through a (non-volatile) static pointer. The compiler must read from memory at least one time when the containing function is called. For the non-volatile case, the compiler can then assume the value won't change until another function is called or the original function is called again. This is what I meant by it will be re-read "soon enough".
All that said, if you are happier with the volatile qualifier and no-one else objects, I'm not overly concerned about it.
Setting proxy.config.diags.debug.enabled to 3 enables most core debug output (whose tag matches the debug
regex), and plugin debug output from the new API macro TSDbg(). Setting it to 1 enables all (matching) debug
output, but with the previous high overhead.
Note that enabling a large qualtity of debug output will still have a big impact on performance.
This commit could be merge into proprietary branches of trafficserver, without causing routine
merge conflicts when aligning with the open source branch.