Occasional blocking when using Provide() #453
Same problem as in ipfs/kubo#3657?
Could I get some stack traces when this buildup is happening? The yamux, handleNewMessage, etc. goroutines are usually sleeping goroutines waiting on streams/connections. However, there are a couple of places where this could be happening:
Related to #453 but not a fix. This will cause us to actually return early when we start blocking on sending to some peers, but it won't really _unblock_ those peers. For that, we need to write with a context.
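A minimal sketch of what "write with a context" could look like, assuming a libp2p stream that exposes `SetWriteDeadline`; the helper is hypothetical and only illustrates the idea, not the eventual fix:

```go
package main

import (
	"context"
	"time"

	"github.com/libp2p/go-libp2p-core/network"
)

// writeWithContext bounds a stream write by the context's deadline so that a
// peer that stops reading cannot block the sender forever.
func writeWithContext(ctx context.Context, s network.Stream, msg []byte) error {
	if deadline, ok := ctx.Deadline(); ok {
		if err := s.SetWriteDeadline(deadline); err != nil {
			return err
		}
		// Clear the write deadline again once this write is done.
		defer s.SetWriteDeadline(time.Time{})
	}
	_, err := s.Write(msg)
	return err
}
```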
A few days ago I noticed that I had a bug in my auto-scaling code that resulted in occasionally miscounting the terminating tasks, so I entirely rewrote that part into a much cleaner and, more importantly, well-tested package to remove that variable from the equation. While doing that I also added better telemetry and better stuck-task detection. Here is what I found out.
Here is a sample distribution of how long it takes to return after the context is done: Not surprisingly, using a shorter timeout (40s) significantly increases the number of stuck tasks. It looks like they just finish naturally, without any care for the context deadline. @Stebalien I see your
This pattern is the result of the auto-scaling doing its thing: This scaling down is to keep a relatively constant CPU usage: Eventually, this means that During all that, the time it takes to publish a single value stays quite constant: Just for the sake of completeness, memory usage stays constant once it reaches a plateau:
To finish, here is a deduped goroutine dump of a worker under load, past the point of increased CPU usage: ... and two pprof CPU profiles: one from a recently launched worker, one from a long-running one.
@MichaelMure my instinct here is that it's not a coincidence that CPU usage increases as the memory plateaus. It may be that Go is starting to aggressively run the garbage collector in order to free up memory, and that is eating up all your CPU cycles. How much total memory does your box have and how much is being used by Go (e.g. by running …)?
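The exact command suggested above is elided in the thread; one generic way to see how much memory the Go runtime itself accounts for is `runtime.ReadMemStats` (a minimal sketch):

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	// HeapAlloc is live heap memory; Sys is the total obtained from the OS.
	fmt.Printf("HeapAlloc=%d MiB Sys=%d MiB NumGC=%d\n",
		m.HeapAlloc>>20, m.Sys>>20, m.NumGC)
}
```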
Apparently I forgot to actually PR a fix: #462. Note: this is not a real fix. The real underlying issue is that we don't pipeline or open multiple streams.
@aschmahmann I think you are likely correct, but what is actually happening still escapes me. This is an EC2 instance with 8GB of memory. As you can see, the memory usage plateaus close to that value. That said, this datadog metric is a bit weird as it reports the used memory + the cached memory (the one used for disk caching, which can be freed anytime). When ssh-ing, Digging further, pprof reports the following: So, 870MB instead of 3.4G; that's a lot of memory not accounted for. It doesn't explain either why that becomes a problem even though there is supposedly more RAM to use if needed. I guess my next move is to collect expvar metrics and see if something shows up.
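For reference, a minimal sketch of exposing expvar metrics using only the standard library (the listen address is arbitrary); importing `expvar` registers `/debug/vars`, which already includes runtime memstats:

```go
package main

import (
	_ "expvar" // registers /debug/vars, including runtime memstats
	"log"
	"net/http"
)

func main() {
	// Scrape http://localhost:6060/debug/vars periodically to watch memstats.
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```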
@Stebalien FYI, it seems that #462 is doing a good job at reducing the stuck tasks problem. Some are still not returning properly though (this is the duration after the context deadline):
Could I get another goroutine dump after the patch? Also note, that patch doesn't really fix the problem, unfortunately. The real problem is that these calls are blocked until they time out.
Here it is: goroutine-dedup.txt
Sorry if this is turning into a debugging session, but looking at a heap profile, that's a lot of timers still in use. Could that be a leak? This particular worker has been running for two days, with about 3k concurrent provides at the moment. Edit:
FYI, I also did a test with go 1.14: As the auto-scaling maintains a constant CPU usage, the task count, already lower at the start, plummets to ~1/5 of what it is with go 1.13. In turn, as the concurrency is lower, a bunch of other metrics are affected on go 1.14: fewer and shorter GC pauses, ~1/2 the heap allocation, same for the RES memory.
Yes. #466
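For context, a generic illustration of how this kind of `time.Timer` buildup commonly arises in Go (not necessarily the exact code path addressed by #466): calling `time.After` on every iteration of a select loop allocates a timer that is only reclaimed when it fires; reusing a single `time.Timer` avoids that.

```go
package main

import "time"

// consumeWithIdleTimeout reuses one timer instead of calling time.After on
// every loop iteration, which would allocate a timer per iteration that is
// only reclaimed once it fires.
func consumeWithIdleTimeout(in <-chan string, handle func(string), idle time.Duration) {
	t := time.NewTimer(idle)
	defer t.Stop()
	for {
		select {
		case msg := <-in:
			handle(msg)
			// Reset for the next wait; drain the channel first if the
			// timer already fired while the message was being handled.
			if !t.Stop() {
				<-t.C
			}
			t.Reset(idle)
		case <-t.C:
			return // idle timeout reached
		}
	}
}
```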
No, this is great! Well, sort of. We're about to replace that exact code so it's kind of a non-issue, but we might as well fix it.
(FYI, normal stack traces are preferred as they can be parsed with https://github.com/whyrusleeping/stackparse/.) Also, that gives me how long it's been stuck.
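For anyone following along, a hedged sketch of one way to get such a plain-text dump of all goroutines from a running worker, using only the standard library (the signal choice is arbitrary):

```go
package main

import (
	"os"
	"os/signal"
	"runtime"
	"syscall"
)

// dumpGoroutinesOnSignal writes a full goroutine dump to stderr whenever the
// process receives SIGUSR1, in the plain-text format stackparse understands.
func dumpGoroutinesOnSignal() {
	c := make(chan os.Signal, 1)
	signal.Notify(c, syscall.SIGUSR1)
	go func() {
		for range c {
			buf := make([]byte, 16<<20) // 16 MiB is plenty for thousands of goroutines
			n := runtime.Stack(buf, true)
			os.Stderr.Write(buf[:n])
		}
	}()
}
```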
What datastore are you using? Edit: Oh, you're using a map one. Try using an in-memory leveldb. Finding providers in a map datastore is very slow.
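A hedged sketch of wiring a leveldb-backed datastore into the DHT instead of the default map datastore, assuming the `go-ds-leveldb` and `dhtopts.Datastore` APIs of the versions in use around that time (the helper name is hypothetical):

```go
package main

import (
	"context"

	levelds "github.com/ipfs/go-ds-leveldb"
	"github.com/libp2p/go-libp2p-core/host"
	dht "github.com/libp2p/go-libp2p-kad-dht"
	dhtopts "github.com/libp2p/go-libp2p-kad-dht/opts"
)

// newDHTWithLevelDB builds a DHT whose provider/record storage is backed by
// leveldb rather than the default in-memory map datastore.
func newDHTWithLevelDB(ctx context.Context, h host.Host, path string) (*dht.IpfsDHT, error) {
	// An empty path gives an in-memory leveldb store; a real path gives the
	// on-disk store used later in this thread.
	dstore, err := levelds.NewDatastore(path, nil)
	if err != nil {
		return nil, err
	}
	return dht.New(ctx, h, dhtopts.Datastore(dstore))
}
```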
There are 2000 stuck dials. Did you change some dial limits somewhere?
I think #467 will fix the leak.
Thanks for the fixes 👍
Ha, that's the one I keep losing track of :) It's not happening often now with #462: "only" 377 times in 12h which makes it hard to catch on a goroutine dump (and you can't automate that as there is no way to take a stacktrace of another goroutine as far as I know). Here is the distribution (again, this is after a 150s context timeout):
I just changed to an on-disk leveldb store. I'll keep an eye on it but I don't think the extra latency will be a problem.
Connection manager is set to 2000/3000.
We have a limit on concurrent file-descriptor-consuming dials inside the swarm. I may have misread something, but it looked like we had 2000 concurrent dials, which should be impossible (we limit to ~100). But I'll look again.
Are you absolutely sure you didn't change https://github.com/libp2p/go-libp2p-swarm/blob/158818154931f12368cc99787728e3bd27dff9ba/swarm_dial.go#L62? |
I have changed that limit. Edit: the reasoning was that we are running powerful machines in good networking conditions and are only doing DHT publishing/bitswap. It should be OK to unleash more throughput.
Ah, yes, that's it. That should be fine and is probably good for your use-case. I was just confused (and a bit concerned) when I saw so many parallel TCP dials.
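For reference, a hedged sketch of what such an override can look like, assuming the exported `ConcurrentFdDials` variable at the linked line of go-libp2p-swarm (the value shown is purely illustrative, not the one actually used here):

```go
package main

import (
	swarm "github.com/libp2p/go-libp2p-swarm"
)

func init() {
	// Raise the cap on concurrent file-descriptor-consuming (e.g. TCP) dials.
	// The upstream default is deliberately conservative; raising it trades
	// file descriptors and kernel TCP state for dial throughput.
	swarm.ConcurrentFdDials = 500
}
```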
FYI, with the fixed metrics of #464, this is what I get: I'll let you figure out 1) if that is a sign of an actual problem and 2) if those should indeed be recorded as errors.
Thanks for highlighting that. That's normal and we shouldn't be recording that as an error. That's the error we get when the remote side resets the stream (e.g., the connection is closing, they no longer need to make requests, etc.).
Very interesting... Are you sure your ISP/router didn't just start rate-limiting your dials?
It's running on AWS EC2 so that's unlikely.
Ignoring the mentioned oddities, the actual problems have been resolved. Closing. Thanks for the help :)
I have a worker node specialized in DHT `Provide`-ing, based on go-ipfs core with the latest DHT release (`go-libp2p-kad-dht` v0.5.0, `go-ipfs` 8e9725fd0009, master from Feb 7). What I do on this worker is a highly concurrent `Provide` (up to 5k concurrent tasks) to publish a high number of values in a reasonable amount of time.

While this initially works quite well, I found out that some `Provide` calls get stuck over time, even though I pass them a context with a 5 minute timeout. To detect that problem I set up an extra goroutine for each task that checks at 5 minutes 30s whether the `Provide` returned.
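A minimal sketch of such a per-task watchdog, assuming a `provide` function standing in for the real DHT call and `go-cid` for the key type (names are illustrative, not the worker's actual code):

```go
package main

import (
	"context"
	"log"
	"time"

	cid "github.com/ipfs/go-cid"
)

// provideWithWatchdog runs one provide call with a 5 minute timeout and logs
// if the call still has not returned 30 seconds after its context expired.
func provideWithWatchdog(parent context.Context, provide func(context.Context, cid.Cid) error, c cid.Cid) error {
	ctx, cancel := context.WithTimeout(parent, 5*time.Minute)
	defer cancel()

	done := make(chan error, 1)
	go func() { done <- provide(ctx, c) }()

	// The call should come back almost immediately after the context
	// deadline; flag it as stuck if it overshoots by 30 seconds.
	stuck := time.NewTimer(5*time.Minute + 30*time.Second)
	defer stuck.Stop()

	select {
	case err := <-done:
		return err
	case <-stuck.C:
		log.Printf("Provide for %s is stuck past its deadline", c)
		return <-done // keep waiting, so we can see how late it eventually returns
	}
}
```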
Here is an example run:
As you can see, there are indeed stuck `Provide` calls. As those never return, they eventually bring down the publish rate, since they don't pick up new values to publish. By the way, don't be too concerned by the progressive ramp-up at the beginning; it's just the worker spawning new tasks gradually to avoid congestion.

To track this problem down further, I wrote some code that, once a first blocked task is detected, changes the concurrency factor to zero. The result is that all the sane `Provide` calls complete and get removed, leaving only the blocked ones. When this happens, I found that hundreds of those are left, even though the teardown starts as soon as the first one is found.

After all the sane `Provide` calls returned, I took a goroutine dump:

raw goroutine dump
deduped with goroutine-inspect (notice the duplicate count for each stacktrace)
Here is what I could infer from that:
A few points:

- `Provide` never returns an error.
- `Provide` actually blocks. I suspect a congestion or deadlock of some sort but can't pinpoint where.