Safer socketmode #1150
Conversation
…nel to close when it is finished, and consumer to see the close
…outer goroutine did not wait on all inner goroutines that had a chance to set it, also make sure to check error for context.Canceled appropriately
…l via a method that has an escape hatch; unable to change public Events field without breaking api, though
… when getting buffered (first) error
…ntexts for channel ops, though they are very similar now
…her small changes
Hi @kanata2, I know I haven't followed up with this as I intended. Unfortunately some other things came up and I haven't been able to add to or otherwise refine this MR further. However, that doesn't mean it cannot go in as-is, with a follow-up if needed. Have you or any other devs/maintainers had a chance to look it over?
Functionally revert slack-go#1170 as drafted approach takes correct path of closing chan on goroutine producer side
Hey @kanata2, no problem on the slow response. Btw, I've functionally reverted #1170 in favor of what I had already fixed here (including that panic), in unison with the other issues I've described above. I would appreciate it if you, any other maintainers, and those involved with 1170 (@lololozhkin & @parsley42) double-check, as the scope of this MR is creeping, which I would rather it not do 🙂
…er and ping time channel. Remove deadman file as it is no longer needed
@iaburton FYI, I'm running your branch in production Gopherbot right now. I've reached out to @kanata2, but it may be that he's gotten too busy with other things to devote much time to this. I've been reading your patch for a while now this morning, but can't really take the time to fully ingest it (I'm not a daily Go programmer, just a hobbyist w/ Gopherbot, which helps me with my real day job, DevOps). Your thinking and explanations are clear and sensible, so my inclination is to approve/merge as this fix covers a potential deadlock and race, and has a nicer implementation for the timeout. I'm going to wait a few more days to hear back from Naoki, though. My other inclination is to inquire if you'd be willing to serve as a maintainer. Looking at the commit history, Naoki has been doing most of the reviews and merges. Let me know if you'd be up for helping out.
Hi @parsley42, thanks for running my branch! Pardon my absence as I've been dealing with a few things IRL 🙂 Before I comment on helping with maintenance, may I ask how the branch has fared w/ Gopherbot? Is there anything I can help explain w/ respect to you having looked over the patch?
Hi @iaburton yeah, work and RL are keeping me busy, too, as well as Naoki I guess. The original issue (IIRC) was that when the library stopped getting a ping from slack, that wasn't sufficient to generate a reconnect - instead nothing happened until the ReadJSON timed out, which took a couple of minutes, causing a long pause when a connection went bad. The first bad fix just stopped waiting on the ReadJSON, and errored out on missed ping. I didn't see how this could lead to a panic later. Right now I'm heavily in AWS/Terraform land, and it would be a major context switch to fully grok your patch; I've been reading it for 20 minutes (about all I can allot) and still don't completely follow your changes in e.g. … In any case, as far as Gopherbot goes, I can only tell you that two of my production robots haven't panicked in the last two weeks, and I'm not seeing any issues. When I looked at your GH profile, it appeared you spend a lot of time writing Go, so I thought maybe this would be a good fit - if not, I'll continue examining contributors to see who else might be able to pitch in. What we don't want is me trying to review and merge patches. 😬
@iaburton Ok, it looks like I'm going to need to fill in for a while. I'd like the checks to run, but I think you'll need to close and re-open the PR to get them out of the expired state.
Closing and re-opening for tests.
@lololozhkin Have you looked at this patch? It deals with the same issue as your patch, but with some additions and changes. I wanted to get your thoughts on it.
@parsley42, hi, I hadn't seen your messages, thank you for the mention. Cool patch, I hadn't seen the opportunity to not wait on one of the goroutines, because I didn't know that this goroutine was the only producer, well done! I am not a maintainer of this repo, so I cannot approve it on GitHub; I can only do it in the comments. Approve) Yes, this code fixes my issue too. It's a pity that my code goes to the trash, but this project doesn't need it)
I've read this discussion and don't quite understand: did my patch fix your problem with slow reconnects? Because my patch fixed the panics; it's been running in my production and zero panics have occurred for more than a month. And yes, reconnects are handled normally. If a ping is not received before the deadman timer elapses, the context will be cancelled and propagated to the JSON reading, and the goroutine will return after the ctx cancel. Am I missing something?
@lololozhkin Sorry - more context; this patch is by @iaburton , not me. I am a maintainer and I could merge this patch, but wanted your opinion since it is an alternate fix. Based on the description, it may have value, even if only allowing the caller to pass in a context. What do you think?
@parsley42 As I said, yes, from my point of view that patch fixes the problem from my PR. Moreover, it fixes it in a better way, I think. So from my point of view, this PR should be merged fully, including allowing the use of contexts in Send and Ack, and the fix for panics on reconnects :)
The patch looks fine and has been pretty heavily tested; also reviewed by @lololozhkin - merging.
@lololozhkin Ok, thanks - you approved in your first message but then replied again with other questions, so I just wanted to clarify. Thanks for being responsive. I am in the process of trying to recruit more active maintainers for this library and have sent you an invite - I think more seasoned Go developers probably wouldn't pick me as "the" maintainer, but right now it seems others are busy, so I'm trying to get some things moved through. If you'd like to help out, that'd be great; if you know other devs you'd recommend, I'd be happy to hear.
@parsley42 TY! Accepted your invitation, I think I could be useful to this project :) I don't know other interested Go devs, but if any appear in my life, I'll let you know.
PR preparation
Run make pr-prep from the root of the repository to run formatting, linting and tests.
Done, though I only modified socketmode and I was running race tests with -count 1000 while making changes anyways.
Should this be an issue instead
Eh, one of the things it fixes was already an issue, #1125, as mentioned in the relevant commit message. I decided not to make a(nother) larger issue for my other changes, for reasons listed below.
API changes
As far as I can tell, there should be no breaking API changes (though there is a small behavior change in one location that used to have the potential to panic). There are, however, small API additions: two methods that allow context use where previous versions did not.
The why
Hello!
I'm looking to use the socketmode package of this repo for a potentially largish app, and while browsing the code & existing issues I decided to fix some of them up myself, as well as address some I found on my own. Below is a TLDR-ish summary of the changes:
Fixes "Socketmode panic, write on closed channel" #1125, as it allows the producer of the messages on the affected channel to close it when messages are no longer being written; this lets the consumer safely see the close, and neither side will panic.
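To illustrate the producer-side close described above, here is a minimal, generic sketch; the channel and its contents are illustrative, not the actual socketmode types:

```go
// Sketch of the producer-closes pattern: the sole writer closes the channel,
// so the consumer sees the close and neither side can panic. Illustrative
// only; not the actual socketmode implementation.
package main

import "fmt"

func produce() <-chan string {
	events := make(chan string)
	go func() {
		// The producer is the only writer, so it alone closes the channel
		// once it has nothing more to send.
		defer close(events)
		for _, e := range []string{"hello", "connected", "disconnect"} {
			events <- e
		}
	}()
	return events
}

func main() {
	// The consumer simply ranges; the loop ends when the producer closes
	// the channel.
	for e := range produce() {
		fmt.Println("event:", e)
	}
}
```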
Fix the potential for a race condition on firstErr. While all of these goroutines go through a sync.Once to write to the variable, one is not 'waited' on by the outer goroutine. In this case, since the outer goroutine does not go through a sync type to read the variable, a race was possible between the read and the write linked earlier. Simple enough to fix: use a buffered channel and some selects; the first goroutine to exit with an error sends it into the buffer and the others fall through the default. The outer goroutine then selects for the buffered error.
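A minimal sketch of that buffered first-error pattern, under the assumption of several worker goroutines and one outer goroutine collecting the result (names are illustrative, not the library's code):

```go
// Illustrative sketch of the "first error wins a one-slot buffer" pattern;
// not the actual socketmode implementation.
package main

import (
	"fmt"
	"sync"
)

func main() {
	firstErr := make(chan error, 1) // one slot holds only the first error

	var wg sync.WaitGroup
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			err := fmt.Errorf("worker %d failed", i)
			select {
			case firstErr <- err: // the first error to arrive fills the buffer
			default: // later errors fall through instead of blocking
			}
		}(i)
	}
	wg.Wait()

	// The outer goroutine reads the buffered error instead of racing on a
	// shared variable.
	select {
	case err := <-firstErr:
		fmt.Println("first error:", err)
	default:
		fmt.Println("no errors")
	}
}
```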
Fix potential deadlock in socketmode.Client.Events due to the use of buffering and single channel sends. The fact that the Events chan is a public field is unfortunate, but not reason enough for this problem. If consumer goroutines are slow and the buffer fills while another goroutine is writing to Events, while yet another is canceling the context provided to RunContext, you may have deadlocked goroutines unable to send, since the buffer is full and they are not in a select they can use to break out of the send. Again a relatively easy fix: change all Events writes to go through a method that selects on the propagated context, negating any chance of getting stuck. It would be better to force Events to be a receive-only channel on the Client struct, or use a method to return it, but that isn't doable without breaking API changes. (A sketch of the context-aware send pattern follows below.)

Add SendCtx and AckCtx, as those too write to an internal channel in an unsafe manner, so context-aware methods were needed to break a potential deadlock. The original methods are still there, and perhaps a comment should be added about the context methods being preferred.

Other, more miscellaneous changes are mostly around error handling and use of the error (un)wrapping introduced in Go 1.13.
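As a rough illustration of the context-aware send that both the Events fix and the new SendCtx/AckCtx methods rely on, here is a generic sketch; the client type, field, and method names below are placeholders, not the real slack-go API:

```go
// Generic sketch of a context-aware channel send. The names (client, Events,
// sendCtx) are illustrative, not the real API.
package main

import (
	"context"
	"fmt"
	"time"
)

type Event struct{ Type string }

type client struct {
	Events chan Event // buffered; consumers may be slow
}

// sendCtx never blocks forever: if the buffer is full and the run context is
// cancelled, the send is abandoned instead of deadlocking the goroutine.
func (c *client) sendCtx(ctx context.Context, ev Event) error {
	select {
	case c.Events <- ev:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	c := &client{Events: make(chan Event, 1)}
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()

	c.sendCtx(ctx, Event{Type: "hello"}) // fills the buffer
	// No consumer is draining Events, so this send would block forever with a
	// plain channel send; the context lets the goroutine escape instead.
	if err := c.sendCtx(ctx, Event{Type: "stuck"}); err != nil {
		fmt.Println("send abandoned:", err)
	}
}
```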
Fix race on the deadman timer Elapsed+Reset. The deadman timer's channel was being read from one goroutine and reset from the pingHandler, which is not allowed; in the !timer.Stop() case you could have more than one goroutine doing concurrent receives. This was not a data race, but a race condition affecting ping/pong/timeout behavior. (A sketch of the single-owner reset pattern appears at the end of this description.)

DRAFT Status
While I'm technically submitting this now, I don't consider it 100% yet, as I'd like to verify some things in more real-world use, plus I'm considering performance and other changes, though those are likely to come in another PR. That, and it lets me get some feedback on this early if any maintainers have comments/questions/etc. 🙂

EDIT: Avoiding scope creep and stale MRs; performance and other changes can be followed up.
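And the promised sketch of keeping a deadman-style timer owned by a single goroutine, with pings delivered over a channel so no other goroutine ever touches the timer; this is generic code under assumed names, not the actual patch:

```go
// Illustrative sketch: one goroutine owns the timer; the ping handler only
// sends on a channel, so all Stop/Reset/receive operations on the timer
// happen in a single place. Not the actual socketmode implementation.
package main

import (
	"fmt"
	"time"
)

func main() {
	pings := make(chan struct{})
	timeout := 200 * time.Millisecond

	// Stand-in for the ping handler: a few pings arrive, then silence.
	go func() {
		for i := 0; i < 3; i++ {
			time.Sleep(50 * time.Millisecond)
			pings <- struct{}{}
		}
	}()

	timer := time.NewTimer(timeout)
	defer timer.Stop()

	for {
		select {
		case <-pings:
			// Only this goroutine touches the timer: stop it, drain any
			// stale value if it already fired (relevant before Go 1.23),
			// then reset. No second goroutine ever receives from timer.C.
			if !timer.Stop() {
				select {
				case <-timer.C:
				default:
				}
			}
			timer.Reset(timeout)
			fmt.Println("ping received, timer reset")
		case <-timer.C:
			fmt.Println("no ping before timeout, would trigger a reconnect")
			return
		}
	}
}
```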