Fix bug that will cause concurrency access to search attributes map #6262
I merged it as a hot fix. I'll add more tests to this package later.
From digging around in the code with the stack trace we caught on
this is definitely not a fix. the crash is occurring on this read on line 447:

        key := fmt.Sprintf(definition.HeaderFormat, sanitizedKey)
    >>> if _, ok := attr[key]; ok { // skip if key already exists
            continue
        }

which means the write needs to be something happening concurrently somewhere else, and copying it in this goroutine will just move the point where the still-dangerous read occurs.

since this is run in a goroutine (which is at the bottom of the stack trace):

        func (p *parallelTaskProcessorImpl) taskWorker(shutdownCh chan struct{}) {
            defer p.shutdownWG.Done()
            for {
                select {
                case <-shutdownCh:
                    return
    >>>         case task := <-p.tasksCh:
                    p.executeTask(task, shutdownCh)
                }
            }
        }

we'll need to perform the copy before it crosses threads, i.e. before that channel is written to.

so we've got code somewhere that's pushing a task to a concurrent processor, and then mutating what it pushed. that's pretty much always unsafe, and that's the flaw that needs to be fixed.

aaah. and this data is coming from mutable state as well. we already knew that thing is a mess of race conditions due to no defensive copying. this is probably just another to add to the pile.
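To make "copy before it crosses threads" concrete, here is a minimal sketch rather than the actual Cadence code. The `task` type, `copySearchAttr`, and `submit` are hypothetical names invented for illustration; the only point is that the sender copies the map before writing to the channel, so a later mutation on the sender's side cannot race with the worker's read.

```go
package main

import (
	"fmt"
	"sync"
)

// task is a hypothetical stand-in for the real task type; only the map matters here.
type task struct {
	searchAttr map[string][]byte
}

// copySearchAttr returns a new map holding the same entries as src.
// The byte-slice values are still shared; copy them too if they can be mutated.
func copySearchAttr(src map[string][]byte) map[string][]byte {
	if src == nil {
		return nil
	}
	dst := make(map[string][]byte, len(src))
	for k, v := range src {
		dst[k] = v
	}
	return dst
}

// submit (hypothetical) copies the map before the task is written to the
// channel, so the receiving goroutine never shares a map with the sender.
func submit(tasksCh chan<- task, attr map[string][]byte) {
	tasksCh <- task{searchAttr: copySearchAttr(attr)}
}

func main() {
	tasksCh := make(chan task, 1)
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		t := <-tasksCh
		_, ok := t.searchAttr["CustomKeyword"] // safe: the worker owns this map
		fmt.Println("worker sees key:", ok)
	}()

	attr := map[string][]byte{"CustomKeyword": []byte(`"a"`)}
	submit(tasksCh, attr)
	attr["AddedLater"] = []byte(`"b"`) // mutates only the sender's map
	wg.Wait()
}
```

Copying inside the worker instead would leave the same window open, since the copy loop itself reads the shared map.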
Good digging. I didn't check the stack trace of the crash, assuming it was the recent change in this file that caused it. Looks like the write is somewhere else.
From more collaborative digging, we are currently at three possibilities:
There's also the suspicion that this may be failover-related, due to timing.

1 seems to require flawed locking around the mutable state, but the lock does seem to be getting acquired during this code's execution: https://github.com/uber/cadence/blob/c0cd4c51116edde8542e993460fe425034975ae7/service/history/task/transfer_standby_task_executor.go#L551

If ^ that lock is working correctly, there would be no self-concurrent read/write because there would only be one processing at a time.

In that case, this is a useful PR either way. That map probably should not have these headers added. We're treating them roughly like workflow status or close time: a pseudo-attr that can be searched.

2 and 3 are hard to trace down. There's a LOT of shallow-copying of search-attrs, and mutable state info as a whole, through multiple types (hundreds of references, plus transitives). All of the ones I've looked at appear to just be passed through to things reading it, and probably protected by the lock at a higher level. I've looked at probably less than 10% of them though, and those were only easier ones.

I'm inclined to believe that 3 is not happening, because copying data from one workflow to another seems unlikely....

If we ignore 3 and assume that 1's linked locking code should be correct, that means we either have a flawed mutable-state lock (i.e. Both of them mean that there is another deeper problem, and fixing that would also fix this crash (and may be causing other crashes).

The former looks... probably fine? There's a lot of indirection, but the core lock looks alright and the surrounding stuff seems reasonable to me.
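As a small illustration of why the shallow copies are worrying, here is a sketch with a hypothetical, trimmed-down `executionInfo` struct (not the real mutable-state type). Copying a Go struct by value copies the map header but not the map's storage, so both copies still read and write the same map; only an explicit per-entry copy separates them.

```go
package main

import "fmt"

// executionInfo is a hypothetical, trimmed-down stand-in for the mutable state info.
type executionInfo struct {
	WorkflowID       string
	SearchAttributes map[string][]byte
}

// deepCopy (hypothetical helper) duplicates the map as well as the struct.
func (e executionInfo) deepCopy() executionInfo {
	out := e
	if e.SearchAttributes != nil {
		out.SearchAttributes = make(map[string][]byte, len(e.SearchAttributes))
		for k, v := range e.SearchAttributes {
			out.SearchAttributes[k] = v
		}
	}
	return out
}

func main() {
	orig := executionInfo{
		WorkflowID:       "wf",
		SearchAttributes: map[string][]byte{"k": []byte("v")},
	}
	shallow := orig         // struct copy: map storage is still shared with orig
	deep := orig.deepCopy() // independent map

	orig.SearchAttributes["new"] = []byte("x")

	// The shallow copy observes the write; the deep copy does not.
	fmt.Println(len(shallow.SearchAttributes), len(deep.SearchAttributes))
}
```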
What changed?
Copy search attributes instead of mutating them.
Why?
Mutating the map was causing concurrent access to it and crashed production instances.
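A rough sketch of the copy-instead-of-mutate approach, not the code in this PR: `withHeaders` and `headerFormat` are hypothetical names (the real code uses `definition.HeaderFormat` and sanitizes the key first, which is omitted here). The function builds a fresh map and leaves the caller's map untouched.

```go
package main

import "fmt"

// headerFormat is a hypothetical stand-in for definition.HeaderFormat;
// the real format string may differ.
const headerFormat = "Header_%s"

// withHeaders returns a new map containing attr's entries plus one entry per
// header. The input map is only read, never written.
func withHeaders(attr, headers map[string][]byte) map[string][]byte {
	out := make(map[string][]byte, len(attr)+len(headers))
	for k, v := range attr {
		out[k] = v
	}
	for name, v := range headers {
		key := fmt.Sprintf(headerFormat, name)
		if _, ok := out[key]; ok { // skip if key already exists
			continue
		}
		out[key] = v
	}
	return out
}

func main() {
	attr := map[string][]byte{"CustomKeyword": []byte(`"a"`)}
	merged := withHeaders(attr, map[string][]byte{"trace": []byte(`"t1"`)})
	fmt.Println(len(attr), len(merged)) // attr keeps its original size
}
```

This removes the write from this code path, but the copy loop still reads `attr`, so it stays unsafe if another goroutine writes to the same map at the same time.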
How did you test it?
Potential risks
Release notes
Documentation Changes