[Bug]: Clone behavior is different when scoped #1170
Comments
By running dmesg after running the above test, we saw a segfault like this:
When we dug into this (using objdump to see what was at this offset in libscope.so), it pointed to where we call a pcre2 function. When libscope.so sees writes of console data, we call pcre2, which implements a regex filter. This filter lets users define a whitelist of strings they want us to see, aka SCOPE_EVENT_CONSOLE_VALUE. We make this call to pcre2 on the client's stack (here, the child process's stack). From previous experience (in go, which uses very small stacks) we know that pcre2 has the potential to use a lot of stack space. There was another way to confirm that the console write was the cause of this problem: we did not see the problem (did not blow the stack) if we commented out this code in initHook(). This code is what allows libscope.so to see the console writes in the first place.
To prevent this from happening, we need to stop using the client stack when pcre2 is involved. We've done this in our go code, so we need to apply a similar pattern here too (even when the app we're in was not written in go).
We already have two existing wrappers that do the required stack switching.
W.r.t. completeness of our solution, we need to make sure that we use these _wrapper() functions everywhere. We don't have anything in contrib that depends on the pcre2 library, so nothing in the contrib directory needs to use these wrapper functions. That only leaves the source code we've written for libscope.so. These searches demonstrate that we already use the required wrapper functions everywhere:
The only reason we have this problem is that we had clauses at the top of our wrapper functions which effectively made it so the stack was only switched for go applications. The final solution is to just remove these clauses from our wrapper functions, so non-go applications will switch to our own stack just like the go applications have been doing.
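To illustrate the general idea of running a stack-hungry call on a stack we own rather than on the client's stack, here is a minimal sketch using the glibc ucontext API. It is not the libscope.so wrapper code: run_on_alt_stack(), trampoline() and demo_work() are hypothetical names, demo_work() merely stands in for a pcre2 call, and the 32k size is borrowed from the stack size mentioned later in this thread.

```c
#include <assert.h>
#include <stdlib.h>
#include <ucontext.h>

#define ALT_STACK_SIZE (32 * 1024)  /* assumption: 32k, as mentioned in this thread */

/* Hypothetical description of one call to run on the alternate stack. */
typedef struct {
    int (*fn)(int);
    int arg;
    int result;
} alt_call_t;

/* makecontext() only passes int arguments portably, so the call
 * description is handed over through a thread-local pointer instead. */
static __thread alt_call_t *t_call;

/* Entry point of the new context; it executes on the alternate stack. */
static void trampoline(void)
{
    alt_call_t *c = t_call;
    c->result = c->fn(c->arg);
    /* Returning here resumes the context stored in uc_link. */
}

/* Stand-in for a stack-hungry function such as a regex match. */
int demo_work(int n)
{
    volatile char scratch[4096];  /* uses the alternate stack, not the caller's */
    int sum = 0;
    for (int i = 1; i <= n; i++) {
        scratch[i % sizeof scratch] = (char)i;
        sum += i;
    }
    return sum;
}

/* Runs fn(arg) on a freshly allocated stack. Returns 0 on success, -1 on error. */
int run_on_alt_stack(int (*fn)(int), int arg, int *result)
{
    ucontext_t caller, callee;
    alt_call_t call = { fn, arg, 0 };

    void *stack = malloc(ALT_STACK_SIZE);
    if (!stack) return -1;

    if (getcontext(&callee) == -1) { free(stack); return -1; }
    callee.uc_stack.ss_sp = stack;
    callee.uc_stack.ss_size = ALT_STACK_SIZE;
    callee.uc_link = &caller;            /* where to continue when trampoline returns */

    t_call = &call;
    makecontext(&callee, trampoline, 0);

    /* Save the current context in caller and jump onto the new stack. */
    if (swapcontext(&caller, &callee) == -1) { free(stack); return -1; }

    *result = call.result;
    free(stack);
    return 0;
}
```

With this sketch, `run_on_alt_stack(demo_work, 10, &r)` leaves the sum 1..10 in `r` while demo_work's scratch buffer lives on the allocated stack. Note that every call here pays for a malloc/free of the stack, which is the kind of cost discussed in the performance comment further down.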
Needed when we run pcre2 on threads we don't own.
Before the last commit, the splunk integration test had very poor performance results. By very poor I mean the overhead of scoped vs unscoped went from <10% to more than 100%. This was traced back to our memory allocator, which was being used by our stack switching code. Every malloc of a 32k stack resulted in an mmap, and every free of a 32k stack resulted in an munmap. These are both syscalls, resulting in context switches which were killing our performance. The last commit adds a pool of stacks that can be reused, which avoids the context switching from the mmap/munmap syscalls. With this, the performance measured by the splunk integration test was restored to the level before we added this stack switching. Also of note: the pcre2 code was being called many (>700) times per http request/response. I've noted this in #968 for further investigation.
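As a rough sketch of the stack pool idea (not the actual commit), freed stacks can be kept on a small free list and handed back out, so the allocator is only hit when the pool is empty. The names stack_pool_get() and stack_pool_put() and the POOL_MAX cap are assumptions made for illustration; only the 32k stack size comes from the comment above.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

#define STACK_SIZE (32 * 1024)  /* 32k stacks, as described above */
#define POOL_MAX   8            /* hypothetical cap on cached stacks */

/* A stack sitting in the pool is unused, so its first bytes can hold
 * the free-list link; the pool needs no extra allocations. */
typedef struct stack_node {
    struct stack_node *next;
} stack_node_t;

static stack_node_t *g_pool_head = NULL;
static size_t g_pool_count = 0;
static pthread_mutex_t g_pool_lock = PTHREAD_MUTEX_INITIALIZER;

/* Returns a stack from the pool, or allocates one if the pool is empty. */
void *stack_pool_get(void)
{
    pthread_mutex_lock(&g_pool_lock);
    stack_node_t *node = g_pool_head;
    if (node) {
        g_pool_head = node->next;
        g_pool_count--;
    }
    pthread_mutex_unlock(&g_pool_lock);

    if (node) return node;
    return malloc(STACK_SIZE);  /* slow path: may end up in mmap */
}

/* Returns a stack to the pool; frees it only when the pool is full. */
void stack_pool_put(void *stack)
{
    if (!stack) return;

    pthread_mutex_lock(&g_pool_lock);
    if (g_pool_count < POOL_MAX) {
        stack_node_t *node = stack;
        node->next = g_pool_head;
        g_pool_head = node;
        g_pool_count++;
        stack = NULL;           /* ownership moved to the pool */
    }
    pthread_mutex_unlock(&g_pool_lock);

    free(stack);                /* free(NULL) is a no-op */
}
```

In steady state a get/put pair only takes and releases a mutex, so the per-call mmap/munmap disappears. A mutex is the simplest way to keep the sketch safe when called from threads we don't own; a per-thread cache or a lock-free list would be alternative design choices.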
Wow. This issue is one that keeps on giving. The sub-issues:
On aarch64 machines, go processes crash intermittently with a "bad g". This has only been seen on aarch64. It was uncovered in go integration tests (go_20) in the test cases for "signalHandlerStatic" and "signalHandlerStaticStripped". By adding a go routine to constantly write to the console, these tests now fail within a small number of iterations (on this branch). To see them ASAP, I'd recommend commenting out the other test cases and running the go_20 signalHandler tests in a loop to see the "bad g" failure.
We may not have a perfect understanding, but we believe that when a signal happens while we are on the switched stack (for pcre2), go tries to retrieve the "g" from the stack. It's not going to find "g" on the stack when we're currently on our own stack executing pcre2 code, so go crashes. Without the new go routine that writes to the console, the "bad g" happens very infrequently, but importantly, "bad g" in the signalHandlerStatic tests has also been observed on branches that do not have the stack pool implementation here. To the best of our knowledge, we could see this any time a signal handler interrupts us while we're running our c code. If this description is correct, we think a possible solution is to mimic how go knows whether c code is currently running.
It's neither required nor part of the problem described here, but we'd also like to note that we don't need to do stack switching while running pcre2 on our own reporting thread. It may make sense to stop switching the stack when in our own thread.
I've created #1469 to address 3. This allows us to merge in the solutions for the original issue (sub-issues 1 and 2).
I've created a branch bug/1469-go-signals-arm that contains the go routine that writes to the console. See #1469 for more info. Closing this issue because it's been merged into release/1.4.
Steps To Reproduce
With following code:
We can observe a difference in run unscoped:
and scoped:
Environment
No response
Requested priority
No response
Relevant log output
No response