proposal: runtime/pprof: add SetMaxDumpGoroutine to limit the number of goroutines dumped in a profile #50794
Comments
Could you explain your use case more? Usually "online profiling" is done using a CPU profile, not a goroutine profile. |
This proposal has been added to the active column of the proposals project |
Goroutine profiles stop the world, iterate over the entire list of live goroutines, and then restart the world. That whole-app pause can be hundreds of milliseconds when there are hundreds of thousands of goroutines. @doujiang24 , is the problem for your apps the length of time that the app cannot do any other useful work? If so, it sounds similar to what's described in #33250. Or is it the length of time that the specific goroutine calling …? |
@aclements , I've found regular collection of goroutine profiles to be one of the best tools (combination of effectiveness and ease-of-use) for tracking down problems of "why did my app get slow", including ones like the net/http deadlock in issue 32388. When I encountered that one, it affected a single service instance and self-resolved after about 15 minutes, so we were only able to understand it thanks to regular and automated collection of goroutine profiles. See also fgprof. Though @doujiang24 likely has other uses. |
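(For reference, a goroutine profile can be requested programmatically via runtime/pprof, or over HTTP via net/http/pprof's /debug/pprof/goroutine endpoint. A minimal sketch of the programmatic form:)

```go
package main

import (
	"os"
	"runtime/pprof"
)

func main() {
	// debug=2 writes the full stack of every live goroutine; collecting it
	// requires the stop-the-world pause discussed in this thread.
	if err := pprof.Lookup("goroutine").WriteTo(os.Stdout, 2); err != nil {
		panic(err)
	}
}
```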
@aclements Sorry for the delay. |
I came up with a new idea about the API. Usage:
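(The original usage snippet is not preserved in this thread; what follows is a purely illustrative sketch. SetMaxDumpGoroutine is the name from the proposal title, not an existing runtime/pprof function, and this exact shape of the API is an assumption.)

```go
// Hypothetical sketch only: SetMaxDumpGoroutine does not exist in runtime/pprof today,
// and this exact signature is an assumption based on the proposal title.
func dumpGoroutines(w io.Writer) error {
	pprof.SetMaxDumpGoroutine(1000) // assumed knob: dump a random sample of at most 1000 goroutines
	return pprof.Lookup("goroutine").WriteTo(w, 1)
}
```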
|
The big question is whether the stop-the-world is the problem versus the size of the profile. If the former, then writing less data won't take appreciably less time. |
@rsc Thanks for your point. I did more testing and I'm sharing the results. Writing less data does help for the previous case that takes ~300ms total (https://gist.github.com/doujiang24/9d066bce3a2bdd0f1b9fe1ef49699e4e). I also gathered more results by adding extra logging in pprof.go:
100k goroutines, without the max-goroutine limitation (i.e. the standard behavior): …
100k goroutines, with a 1k max-goroutine limitation, using PR #50771: …
Maybe the STW takes more time in some other cases, but I think this shows that writing less data really does take appreciably less time. |
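(A rough sketch of the kind of measurement behind those numbers, assuming the gist's general shape of parking N goroutines and timing a single dump; the actual harness in the gist may differ.)

```go
package main

import (
	"fmt"
	"io"
	"runtime/pprof"
	"sync"
	"time"
)

func main() {
	const n = 100_000
	var wg sync.WaitGroup
	stop := make(chan struct{})
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			<-stop // park the goroutine on a channel receive
		}()
	}
	time.Sleep(time.Second) // let the goroutines reach their blocking point

	start := time.Now()
	pprof.Lookup("goroutine").WriteTo(io.Discard, 1) // time one full goroutine dump
	fmt.Println("goroutine dump took", time.Since(start))

	close(stop)
	wg.Wait()
}
```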
PR 50771 seems to be doing a random sample of the goroutines. Is that helpful in your use case, not to see all of them? /cc @aclements @mknyszek @prattmic for thoughts |
I suspect we could eliminate the STW here at the cost of losing "snapshot consistency" across the goroutine stacks. Just looking over the current implementation, we could certainly improve the performance of this, though I doubt we can squeeze 10x out of it short of substantially redesigning how traceback works (e.g., using frame pointers). |
I think it's still not clear to me where exactly the problem lies for your use case, @doujiang24. Is the 300 ms delay between when you request the profile and when you get it the problem, or is the 130 ms (or however long the world is actually stopped) the problem, because it's a global latency hit? If it's the latter, one thing we could do instead of limiting the number of goroutines is relax the consistency of the goroutine dump. Depending on how strict we want to be on backwards compatibility with this, this might not require a new API. If the goal is statistical goroutine sampling anyway, then I think that's fine. If a major use case is getting a consistent snapshot that always 100% makes sense, then either this isn't really an option or it needs a knob. I imagine that the global latency impact of this approach would be the same latency impact as stack scanning by the GC, which is definitely much, much less than 130 ms even for 100,000 goroutines. However, this would not help the end-to-end latency of the operation; it would still take 300 ms for that many goroutines, limiting how often you can sample. |
Here are some of my initial thoughts; I haven't thought too deeply about this. Currently, the goroutine profile is a complete instantaneous (one-off) view of what every goroutine is doing. Since it involves a stop-the-world (STW), we even get a consistent view of where every goroutine is at a single point in time [1] [2]. On the other hand, this proposal effectively turns this profile into an instantaneous (one-off) sampled view of what goroutines are doing, since we don't capture every goroutine. It would still maintain the consistent view due to STW. My question for brainstorming is whether this is the best interface for the use-cases of this profile. Is sampling OK, or do we typically need all goroutines? Is a consistent snapshot time needed? If so, maybe this API is best. But I could also imagine perhaps we could provide a continuous sampling goroutine profile. Rather than a single instantaneous profile point, we sample goroutine state over time, eventually capturing all of them but never all at the same time. Would such a profile be useful? I'm not sure. [1] As opposed to collecting traces while goroutines are running, which would result in each goroutine getting traced at a slightly different time of execution. In this case it would be possible to observe what appear to be "impossible" states between 2 goroutines. [2] Modulo that some small bits of code are not preemptible and thus cannot be observed in the profile, but I don't think this matters much. |
It's worth asking what the purpose of the goroutine profile is. It's not particularly helpful for analyzing performance or memory usage. Presumably it is helpful for finding leaked goroutines: goroutines that are waiting for something to happen that is never going to happen. Are there other uses for this profile? If that is the main use can we have a more targeted goroutine profile? Such as, only list goroutines that have been blocked for some period of time? |
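(One can approximate that kind of targeted view today by post-filtering the debug=2 text dump, which annotates goroutines blocked for a minute or more, e.g. "goroutine 12 [chan receive, 7 minutes]:". A rough sketch; the threshold, parsing, and helper name are assumptions:)

```go
package main

import (
	"bytes"
	"fmt"
	"regexp"
	"runtime/pprof"
	"strconv"
)

// longBlockedGoroutines returns the debug=2 stack blocks of goroutines that the
// runtime reports as blocked for at least minMinutes minutes.
func longBlockedGoroutines(minMinutes int) []string {
	var buf bytes.Buffer
	pprof.Lookup("goroutine").WriteTo(&buf, 2)
	re := regexp.MustCompile(`^goroutine \d+ \[[^,\]]+, (\d+) minutes\]:`)
	var out []string
	for _, block := range bytes.Split(buf.Bytes(), []byte("\n\n")) {
		if m := re.FindSubmatch(block); m != nil {
			if mins, _ := strconv.Atoi(string(m[1])); mins >= minMinutes {
				out = append(out, string(block))
			}
		}
	}
	return out
}

func main() {
	for _, g := range longBlockedGoroutines(5) {
		fmt.Println(g + "\n")
	}
}
```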
@rhysh Could you expand more on this comment? How was this helpful? I'm imagining noticing some goroutines that, while not deadlocked, tend to be blocked or temporarily stopped somewhere we don't expect to be hot, but I'd like to hear if that is what you did. |
A common workflow is to collect hundreds or thousands of goroutine profiles (from every instance of an app, every few minutes, for some time range), to get a count in each profile of the number of goroutines with a function matching …

Goroutine profiles were important in tracking down #33747, where a large number of waiters on a single lock would lead to performance collapse.

A team I work with encountered the effects of #32388 last October. We used goroutine profiles to figure out what was going on there, and then discovered that someone had already reported the issue (and they had used goroutine profiles too).

Sometimes the reason an app slows down is that its log collection daemon is running behind, and (with a blocking log library) its calls to log end up waiting on the logger's …

None of those are exactly leaks, because the thing those goroutines are waiting for does eventually happen. In the first and third cases, if load on the application decreases then the goroutines will each get a turn with the …

We also used them to debug the problem that led to aws/aws-sdk-go#3127, where a …

Around Go 1.13, an app I support with a typical response time of 1ms had occasional delays of 100ms. I used goroutine profiles to learn that those requests were delayed in particular functions by calls to …

It's a very general tool, very easy to deploy (vs the opt-in of block/mutex profiles), very easy to use (vs execution trace). |
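(A minimal sketch of that kind of regular, automated collection; the interval, file naming, and error handling are assumptions:)

```go
package main

import (
	"fmt"
	"os"
	"runtime/pprof"
	"time"
)

func main() {
	// Collect a goroutine profile every few minutes so that transient slowdowns
	// can be investigated after the fact.
	for range time.Tick(5 * time.Minute) {
		f, err := os.Create(fmt.Sprintf("goroutine-%d.pb.gz", time.Now().Unix()))
		if err != nil {
			continue
		}
		pprof.Lookup("goroutine").WriteTo(f, 0) // debug=0: gzip-compressed protobuf profile
		f.Close()
	}
}
```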
Thanks for @rhysh 's use cases.
In the 1st case, the slowdown may be caused by some unexpected issue, e.g. a network stall or an unexpected lock in some corner case. Actually, we care about two parts of the overhead: the STW pause, and the total time/CPU cost of the whole dump.
The STW time is the more important one for us. @rsc Yeah, we mostly want to see what the goroutines are doing or waiting on. I think random sampling is a good choice, just like the sampling in the CPU and memory profiles.
@aclements @mknyszek I think this is a good choice for our use case. Glad to see it can be implemented, but there is still the question of whether we just break backward compatibility or introduce another new API to control it.
@mknyszek yes, that's it.
@prattmic sampling is usually OK for me, but I'm not sure it is a good idea for everyone.
@ianlancetaylor yeah, in general we are interested in the blocked goroutines when taking a goroutine profile. Thank you, everyone! |
Change https://go.dev/cl/387415 mentions this issue: |
Assuming CL 387415 lands, do we need SetMaxDumpGoroutine at all? Or should this proposal be closed? Thanks. |
@rsc I think it's still better to have SetMaxDumpGoroutine, from my side, since it will reduce the total CPU cost. @rhysh Glad to see CL 387415 started. Do you think sampling with a limit conflicts with CL 387415? |
❤️ the discussion and proposed improvements, especially CL 387415 from @rhysh. Before deciding on a new API here: I'm planning to submit a separate proposal and a working proof of concept (see this video for a teaser) for this soon, but the general idea would be to use pprof labels to allow users to mark a small subset of their goroutines (e.g. every Nth goroutine handling incoming user requests) as "tracer goroutines" and then sample their stack traces at 100 Hz, recording timestamps + goroutine ids + stacks. The output format could be the trace event format or the protobuf format used by Perfetto. Perhaps this is entirely orthogonal to this proposal. |
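(A minimal sketch of the pprof-labels part of that idea, using the existing runtime/pprof label API; the label name, sampling rule, and serve function are illustrative assumptions:)

```go
package main

import (
	"context"
	"runtime/pprof"
)

// handle marks every nth request-handling goroutine as a "tracer" goroutine via a
// pprof label, so a sampling profiler could restrict itself to that small subset.
func handle(ctx context.Context, reqID, n int) {
	if reqID%n == 0 {
		pprof.Do(ctx, pprof.Labels("tracer", "1"), func(ctx context.Context) {
			serve(ctx) // the label is attached to this goroutine while serve runs
		})
		return
	}
	serve(ctx)
}

func serve(ctx context.Context) { /* handle the request */ }

func main() {
	handle(context.Background(), 10, 10) // assumption: every 10th request is traced
}
```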
Collecting the profile concurrently would complicate the sampling algorithm, but otherwise I think the two changes would be able to work together, @doujiang24 . (I don't have an algorithm off the top of my head for how to do that without O(samples) light work to prepare, but at least it's less than O(goroutines) of heavy work.) I want to be clear that as far as I know, CL 387415 (at PS 3) does not compromise on the consistency of the goroutine profile (based on the test infrastructure in CL 387416). The key information that goroutine profiles provide is sometimes the presence/absence/state of a single goroutine. In #32388 it's a single goroutine holding up the http connection pool because of a slow network write (and 10k goroutines waiting). In #33747 it's the mysterious absence of any goroutine actually holding the lock (and 10k goroutines waiting). So I'm not convinced that sampling the goroutines is going to give good results for its users. An additional field on an extensible API like @felixge mentioned seems like a decent middleground. As for the CPU overhead, I'm not sure how to tell how much is too much. Here's a possible framework: The machines I use often provide CPU and Memory allocations in a particular ratio, such as 2 GB of memory per 1 CPU thread. Filling the memory with 2kB goroutine stacks allows 1e6 goroutines. Collecting a goroutine profile for all of those takes about 2µs each, for a total of 2 seconds of CPU time. How often does an app need to collect a goroutine profile before the collection takes an unacceptable fraction of the compute resources, when a goroutine profile every five minutes leads to a CPU overhead of less than 1%? Faster is nice, but maybe it's already fast enough. @doujiang24 , what framework and target do you have in mind for "We hope less overhead for profiling."? |
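(Restating that back-of-envelope arithmetic; the machine shape, stack size, and per-goroutine cost are the assumptions named in the comment above.)

```go
package main

import "fmt"

func main() {
	const (
		memPerCPU    = 2 << 30 // assumed: 2 GB of memory per CPU thread
		stackSize    = 2 << 10 // assumed: ~2 kB per goroutine stack
		perGoroutine = 2e-6    // assumed: ~2 µs of CPU to profile one goroutine
		profileEvery = 300.0   // one goroutine profile every five minutes, in seconds
	)
	goroutines := memPerCPU / stackSize                     // ≈ 1e6 goroutines fill memory
	profileCPUSeconds := float64(goroutines) * perGoroutine // ≈ 2 s of CPU per profile
	fmt.Printf("overhead ≈ %.2f%% of one CPU thread\n", 100*profileCPUSeconds/profileEvery)
}
```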
It sounds like there are enough available performance improvements (some of which are already being implemented) that we should hold off on adding any new API right now, which we would be stuck with forever. Do I have that right? |
That's my view, yes. |
@rhysh Sorry for the delay. Also, sampling is not a good idea for all cases, but it is really useful in some.
Yeah, the reduced STW in CL 387415 really helps; the STW could otherwise introduce large latency. |
Based on the discussion above, this proposal seems like a likely decline. |
Okay, just to confirm it clearly: is this proposal being declined?
@doujiang24 As @rsc says above, our current opinion is "we should hold off on adding any new API right now, which we would be stuck with forever." Let's try the new changes out for a while and see to what extent there is still a real problem here to solve. Thanks. |
Okay, got it. Thank you all. |
No change in consensus, so declined. |
The goroutine profile needs to stop the world to get a consistent snapshot of all goroutines in the app. Leaving the world stopped while iterating over allgs leads to a pause proportional to the number of goroutines in the app (or its high-water mark).

Instead, do only a fixed amount of bookkeeping while the world is stopped. Install a barrier so the scheduler confirms that a goroutine appears in the profile, with its stack recorded exactly as it was during the stop-the-world pause, before it allows that goroutine to execute. Iterate over allgs while the app resumes normal operations, adding each to the profile unless they've been scheduled in the meantime (and so have profiled themselves). Stop the world a second time to remove the barrier and do a fixed amount of cleanup work.

This increases both the fixed overhead and per-goroutine CPU-time cost of GoroutineProfile. It also increases the wall-clock latency of the call to GoroutineProfile, since the scheduler may interrupt it to execute other goroutines.

name old time/op new time/op delta
GoroutineProfile/small/loaded-8 1.05ms ± 5% 4.99ms ±31% +376.85% (p=0.000 n=10+9)
GoroutineProfile/sparse/loaded-8 1.04ms ± 4% 3.61ms ±27% +246.61% (p=0.000 n=10+10)
GoroutineProfile/large/loaded-8 7.69ms ±17% 20.35ms ± 4% +164.50% (p=0.000 n=10+10)
GoroutineProfile/small/idle 958µs ± 0% 1820µs ±23% +89.91% (p=0.000 n=10+10)
GoroutineProfile/sparse/idle-8 1.00ms ± 3% 1.52ms ±17% +51.18% (p=0.000 n=10+10)
GoroutineProfile/small/idle-8 1.01ms ± 4% 1.47ms ± 7% +45.28% (p=0.000 n=9+9)
GoroutineProfile/sparse/idle 980µs ± 1% 1403µs ± 2% +43.22% (p=0.000 n=9+10)
GoroutineProfile/large/idle-8 7.19ms ± 8% 8.43ms ±21% +17.22% (p=0.011 n=10+10)
PingPongHog 511ns ± 8% 585ns ± 9% +14.39% (p=0.000 n=10+10)
GoroutineProfile/large/idle 6.71ms ± 0% 7.58ms ± 3% +13.08% (p=0.000 n=8+10)
PingPongHog-8 469ns ± 8% 509ns ±12% +8.62% (p=0.010 n=9+10)
WakeupParallelSyscall/5µs 216µs ± 4% 229µs ± 3% +6.06% (p=0.000 n=10+9)
WakeupParallelSyscall/5µs-8 147µs ± 1% 149µs ± 2% +1.12% (p=0.009 n=10+10)
WakeupParallelSyscall/2µs-8 140µs ± 0% 142µs ± 1% +1.11% (p=0.001 n=10+9)
WakeupParallelSyscall/50µs-8 236µs ± 0% 238µs ± 1% +1.08% (p=0.000 n=9+10)
WakeupParallelSyscall/1µs-8 138µs ± 0% 140µs ± 1% +1.05% (p=0.013 n=10+9)
Matmult 8.52ns ± 1% 8.61ns ± 0% +0.98% (p=0.002 n=10+8)
WakeupParallelSyscall/10µs-8 157µs ± 1% 158µs ± 1% +0.58% (p=0.003 n=10+8)
CreateGoroutinesSingle-8 328ns ± 0% 330ns ± 1% +0.57% (p=0.000 n=9+9)
WakeupParallelSpinning/100µs-8 343µs ± 0% 344µs ± 1% +0.30% (p=0.015 n=8+8)
WakeupParallelSyscall/20µs-8 178µs ± 0% 178µs ± 0% +0.18% (p=0.043 n=10+9)
StackGrowthDeep-8 22.8µs ± 0% 22.9µs ± 0% +0.12% (p=0.006 n=10+10)
StackGrowth 1.06µs ± 0% 1.06µs ± 0% +0.09% (p=0.000 n=8+9)
WakeupParallelSpinning/0s 10.7µs ± 0% 10.7µs ± 0% +0.08% (p=0.000 n=9+9)
WakeupParallelSpinning/5µs 30.7µs ± 0% 30.7µs ± 0% +0.04% (p=0.000 n=10+10)
WakeupParallelSpinning/100µs 411µs ± 0% 411µs ± 0% +0.03% (p=0.000 n=10+9)
WakeupParallelSpinning/2µs 18.7µs ± 0% 18.7µs ± 0% +0.02% (p=0.026 n=10+10)
WakeupParallelSpinning/20µs-8 93.0µs ± 0% 93.0µs ± 0% +0.01% (p=0.021 n=9+10)
StackGrowth-8 216ns ± 0% 216ns ± 0% ~ (p=0.209 n=10+10)
CreateGoroutinesParallel-8 49.5ns ± 2% 49.3ns ± 1% ~ (p=0.591 n=10+10)
CreateGoroutinesSingle 699ns ±20% 748ns ±19% ~ (p=0.353 n=10+10)
WakeupParallelSpinning/0s-8 15.9µs ± 2% 16.0µs ± 3% ~ (p=0.315 n=10+10)
WakeupParallelSpinning/1µs 14.6µs ± 0% 14.6µs ± 0% ~ (p=0.513 n=10+10)
WakeupParallelSpinning/2µs-8 24.2µs ± 3% 24.1µs ± 2% ~ (p=0.971 n=10+10)
WakeupParallelSpinning/10µs 50.7µs ± 0% 50.7µs ± 0% ~ (p=0.101 n=10+10)
WakeupParallelSpinning/20µs 90.7µs ± 0% 90.7µs ± 0% ~ (p=0.898 n=10+10)
WakeupParallelSpinning/50µs 211µs ± 0% 211µs ± 0% ~ (p=0.382 n=10+10)
WakeupParallelSyscall/0s-8 137µs ± 1% 138µs ± 0% ~ (p=0.075 n=10+10)
WakeupParallelSyscall/1µs 216µs ± 1% 219µs ± 3% ~ (p=0.065 n=10+9)
WakeupParallelSyscall/2µs 214µs ± 7% 219µs ± 1% ~ (p=0.101 n=10+8)
WakeupParallelSyscall/50µs 317µs ± 5% 326µs ± 4% ~ (p=0.123 n=10+10)
WakeupParallelSyscall/100µs 450µs ± 9% 459µs ± 8% ~ (p=0.247 n=10+10)
WakeupParallelSyscall/100µs-8 337µs ± 0% 338µs ± 1% ~ (p=0.089 n=10+10)
WakeupParallelSpinning/5µs-8 32.2µs ± 0% 32.2µs ± 0% -0.05% (p=0.026 n=9+10)
WakeupParallelSpinning/50µs-8 216µs ± 0% 216µs ± 0% -0.12% (p=0.004 n=10+10)
WakeupParallelSpinning/1µs-8 20.6µs ± 0% 20.5µs ± 0% -0.22% (p=0.014 n=10+10)
WakeupParallelSpinning/10µs-8 54.5µs ± 0% 54.2µs ± 1% -0.57% (p=0.000 n=10+10)
CreateGoroutines-8 213ns ± 1% 211ns ± 1% -0.86% (p=0.002 n=10+10)
CreateGoroutinesCapture 1.03µs ± 0% 1.02µs ± 0% -0.91% (p=0.000 n=10+10)
CreateGoroutinesCapture-8 1.32µs ± 1% 1.31µs ± 1% -1.06% (p=0.001 n=10+9)
CreateGoroutines 188ns ± 0% 186ns ± 0% -1.06% (p=0.000 n=9+10)
CreateGoroutinesParallel 188ns ± 0% 186ns ± 0% -1.27% (p=0.000 n=8+10)
WakeupParallelSyscall/0s 210µs ± 3% 207µs ± 3% -1.60% (p=0.043 n=10+10)
StackGrowthDeep 121µs ± 1% 119µs ± 1% -1.70% (p=0.000 n=9+10)
Matmult-8 1.82ns ± 3% 1.78ns ± 3% -2.16% (p=0.020 n=10+10)
WakeupParallelSyscall/20µs 281µs ± 3% 269µs ± 4% -4.44% (p=0.000 n=10+10)
WakeupParallelSyscall/10µs 239µs ± 3% 228µs ± 9% -4.70% (p=0.001 n=10+10)
GoroutineProfile/sparse-nil/idle-8 485µs ± 2% 12µs ± 4% -97.56% (p=0.000 n=10+10)
GoroutineProfile/small-nil/idle-8 484µs ± 2% 12µs ± 1% -97.60% (p=0.000 n=10+7)
GoroutineProfile/small-nil/loaded-8 487µs ± 2% 11µs ± 3% -97.68% (p=0.000 n=10+10)
GoroutineProfile/sparse-nil/loaded-8 507µs ± 4% 11µs ± 6% -97.78% (p=0.000 n=10+10)
GoroutineProfile/large-nil/idle-8 709µs ± 2% 11µs ± 4% -98.38% (p=0.000 n=10+10)
GoroutineProfile/large-nil/loaded-8 717µs ± 2% 11µs ± 3% -98.43% (p=0.000 n=10+10)
GoroutineProfile/sparse-nil/idle 465µs ± 3% 1µs ± 1% -99.84% (p=0.000 n=10+10)
GoroutineProfile/small-nil/idle 493µs ± 3% 1µs ± 0% -99.85% (p=0.000 n=10+9)
GoroutineProfile/large-nil/idle 716µs ± 1% 1µs ± 2% -99.89% (p=0.000 n=7+10)

name old alloc/op new alloc/op delta
CreateGoroutinesCapture 144B ± 0% 144B ± 0% ~ (all equal)
CreateGoroutinesCapture-8 144B ± 0% 144B ± 0% ~ (all equal)

name old allocs/op new allocs/op delta
CreateGoroutinesCapture 5.00 ± 0% 5.00 ± 0% ~ (all equal)
CreateGoroutinesCapture-8 5.00 ± 0% 5.00 ± 0% ~ (all equal)

name old p50-ns new p50-ns delta
GoroutineProfile/small/loaded-8 1.01M ± 3% 3.87M ±45% +282.15% (p=0.000 n=10+10)
GoroutineProfile/sparse/loaded-8 1.02M ± 3% 2.43M ±41% +138.42% (p=0.000 n=10+10)
GoroutineProfile/large/loaded-8 7.43M ±16% 17.28M ± 2% +132.43% (p=0.000 n=10+10)
GoroutineProfile/small/idle 956k ± 0% 1559k ±16% +63.03% (p=0.000 n=10+10)
GoroutineProfile/small/idle-8 1.01M ± 3% 1.45M ± 7% +44.31% (p=0.000 n=10+9)
GoroutineProfile/sparse/idle 977k ± 1% 1399k ± 2% +43.20% (p=0.000 n=10+10)
GoroutineProfile/sparse/idle-8 1.00M ± 3% 1.41M ± 3% +40.47% (p=0.000 n=10+10)
GoroutineProfile/large/idle-8 6.97M ± 1% 8.41M ±25% +20.54% (p=0.003 n=8+10)
GoroutineProfile/large/idle 6.71M ± 1% 7.46M ± 4% +11.15% (p=0.000 n=10+10)
GoroutineProfile/sparse-nil/idle-8 483k ± 3% 13k ± 3% -97.41% (p=0.000 n=10+9)
GoroutineProfile/small-nil/idle-8 483k ± 2% 12k ± 1% -97.43% (p=0.000 n=10+8)
GoroutineProfile/small-nil/loaded-8 484k ± 3% 10k ± 2% -97.93% (p=0.000 n=10+8)
GoroutineProfile/sparse-nil/loaded-8 492k ± 2% 10k ± 4% -97.97% (p=0.000 n=10+8)
GoroutineProfile/large-nil/idle-8 708k ± 2% 12k ±15% -98.36% (p=0.000 n=10+10)
GoroutineProfile/large-nil/loaded-8 714k ± 2% 10k ± 2% -98.60% (p=0.000 n=10+8)
GoroutineProfile/sparse-nil/idle 459k ± 1% 1k ± 1% -99.85% (p=0.000 n=10+10)
GoroutineProfile/small-nil/idle 477k ± 1% 1k ± 0% -99.85% (p=0.000 n=10+9)
GoroutineProfile/large-nil/idle 712k ± 1% 1k ± 1% -99.90% (p=0.000 n=7+10)

name old p90-ns new p90-ns delta
GoroutineProfile/small/loaded-8 1.13M ±10% 7.49M ±35% +562.07% (p=0.000 n=10+10)
GoroutineProfile/sparse/loaded-8 1.10M ±12% 4.58M ±31% +318.02% (p=0.000 n=10+9)
GoroutineProfile/large/loaded-8 8.78M ±24% 27.83M ± 2% +217.00% (p=0.000 n=10+10)
GoroutineProfile/small/idle 967k ± 0% 2909k ±50% +200.91% (p=0.000 n=10+10)
GoroutineProfile/sparse/idle-8 1.02M ± 3% 1.96M ±76% +92.99% (p=0.000 n=10+10)
GoroutineProfile/small/idle-8 1.07M ±17% 1.55M ±12% +45.23% (p=0.000 n=10+10)
GoroutineProfile/sparse/idle 992k ± 1% 1417k ± 3% +42.79% (p=0.000 n=9+10)
GoroutineProfile/large/idle 6.73M ± 0% 7.99M ± 8% +18.80% (p=0.000 n=8+10)
GoroutineProfile/large/idle-8 8.20M ±25% 9.18M ±25% ~ (p=0.315 n=10+10)
GoroutineProfile/sparse-nil/idle-8 495k ± 3% 13k ± 1% -97.36% (p=0.000 n=10+9)
GoroutineProfile/small-nil/idle-8 494k ± 2% 13k ± 3% -97.36% (p=0.000 n=10+10)
GoroutineProfile/small-nil/loaded-8 496k ± 2% 13k ± 1% -97.41% (p=0.000 n=10+10)
GoroutineProfile/sparse-nil/loaded-8 544k ±11% 13k ± 1% -97.62% (p=0.000 n=10+9)
GoroutineProfile/large-nil/idle-8 724k ± 1% 13k ± 3% -98.20% (p=0.000 n=10+10)
GoroutineProfile/large-nil/loaded-8 729k ± 3% 13k ± 2% -98.23% (p=0.000 n=10+10)
GoroutineProfile/sparse-nil/idle 476k ± 4% 1k ± 1% -99.85% (p=0.000 n=9+10)
GoroutineProfile/small-nil/idle 537k ±10% 1k ± 0% -99.87% (p=0.000 n=10+9)
GoroutineProfile/large-nil/idle 729k ± 0% 1k ± 1% -99.90% (p=0.000 n=7+10)

name old p99-ns new p99-ns delta
GoroutineProfile/sparse/loaded-8 1.27M ±33% 20.49M ±17% +1514.61% (p=0.000 n=10+10)
GoroutineProfile/small/loaded-8 1.37M ±29% 20.48M ±23% +1399.35% (p=0.000 n=10+10)
GoroutineProfile/large/loaded-8 9.76M ±23% 39.98M ±22% +309.52% (p=0.000 n=10+8)
GoroutineProfile/small/idle 976k ± 1% 3367k ±55% +244.94% (p=0.000 n=10+10)
GoroutineProfile/sparse/idle-8 1.03M ± 3% 2.50M ±65% +142.30% (p=0.000 n=10+10)
GoroutineProfile/small/idle-8 1.17M ±34% 1.70M ±14% +45.15% (p=0.000 n=10+10)
GoroutineProfile/sparse/idle 1.02M ± 3% 1.45M ± 4% +42.64% (p=0.000 n=9+10)
GoroutineProfile/large/idle 6.92M ± 2% 9.00M ± 7% +29.98% (p=0.000 n=8+9)
GoroutineProfile/large/idle-8 8.74M ±23% 9.90M ±24% ~ (p=0.190 n=10+10)
GoroutineProfile/sparse-nil/idle-8 508k ± 4% 16k ± 2% -96.90% (p=0.000 n=10+9)
GoroutineProfile/small-nil/idle-8 508k ± 4% 16k ± 3% -96.91% (p=0.000 n=10+9)
GoroutineProfile/small-nil/loaded-8 542k ± 5% 15k ±15% -97.15% (p=0.000 n=10+10)
GoroutineProfile/sparse-nil/loaded-8 649k ±16% 15k ±18% -97.67% (p=0.000 n=10+10)
GoroutineProfile/large-nil/idle-8 738k ± 2% 16k ± 2% -97.86% (p=0.000 n=10+10)
GoroutineProfile/large-nil/loaded-8 765k ± 4% 15k ±17% -98.03% (p=0.000 n=10+10)
GoroutineProfile/sparse-nil/idle 539k ±26% 1k ±17% -99.84% (p=0.000 n=10+10)
GoroutineProfile/small-nil/idle 659k ±25% 1k ± 0% -99.84% (p=0.000 n=10+8)
GoroutineProfile/large-nil/idle 760k ± 2% 1k ±22% -99.88% (p=0.000 n=9+10)

Fixes #33250
For #50794
Change-Id: I862a2bc4e991cec485f21a6fce4fca84f2c6435b
Reviewed-on: https://go-review.googlesource.com/c/go/+/387415
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Than McIntosh <thanm@google.com>
Run-TryBot: Rhys Hiltner <rhys@justin.tv>
TryBot-Result: Gopher Robot <gobot@golang.org>
We are using pprof for online profiling, and we found that dumping goroutines can take too much time when there are many goroutines.
In our case, it takes ~300ms when there are 100k goroutines:
https://gist.github.com/doujiang24/9d066bce3a2bdd0f1b9fe1ef49699e4e
That is too long, since there are typically 10k-100k goroutines in our system.
I think the easier way is to introduce a new API, SetMaxDumpGoroutineNum. I have implemented it in this PR: #50771. But the SetMaxDumpGoroutineNum API introduces a new global variable, which may not be a good idea.
Any feedback would be greatly appreciated. Thank you!
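(For context, a minimal sketch of how such an online goroutine dump is typically exposed with net/http/pprof; the address is illustrative:)

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers, including /debug/pprof/goroutine
)

func main() {
	// Fetching http://localhost:6060/debug/pprof/goroutine?debug=2 dumps the stack of
	// every live goroutine; with ~100k goroutines this is the ~300ms case described above.
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```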