-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
code review: potential thread safety issue in cache.go #5699
Comments
…a#5699) Cache.Deduplicate is designed to be execute while the caller is not holding any locks. The entry.deduplicate() method contains two statements: e.values = e.values.Deduplicate() e.needSort = false which need to be executed in that order (as observed by all threads). However, without a synchronization construct won't be. This fix makes a clone of each entry before performing the deduplication, then while briefly holding the lock, updates the store with the deduplicated clones Signed-off-by: Jon Seymour <jon@wildducktheories.com>
/cc @jwilder @jonseymour I think we may just need to add a cache lock to
|
Yep, I have added a lock to Cache.Deduplicate, but only around the updating of c.store. The de-duplication itself is still done outside of the lock, but on a clone of the original which won't be visible to another thread. |
Not sure I agree that there is a problem here. Those statements should never be re-ordered because they are executed in the same goroutine and re-ordering them would change the behavior of the program. This would violate the Go memory model: https://golang.org/ref/mem#tmp_2 |
The statement about "re-ordering them would change the behavior of the program" is true, but only for the gorourtine in which they occur. So yes, from the perspective of a single go-routine (the one executing the statements) the two statements must appear to execute in the specified order. The key part of the Go memory model that applies in this case are these words; "Within a single goroutine, there is no concurrency, so the two definitions are equivalent: a read r observes the value written by the most recent write w to v. When multiple goroutines access a shared variable v, they must use synchronization events to establish happens-before conditions that ensure reads observe the desired writes." This is the problem here - all go routines accessing the same data structure "must use synchronization events to establish happens-before conditions that ensure that reads observe the desired writes" (per the memory model) |
@jonseymour but all this seems to be a moot point if there is simply a lock put on |
The problem with doing that is that it locks out every other goroutine from access to that snapshot for the duration of the deduplication process which @jwilder has already determined (correctly, I believe) to causing severe performance issues in high load situations. The reason is, if the lock goes around the body of the entire method then that lock is in the request path of virtually every write and query request. With the lock in the position I have placed it, the lock is held for the smallest possible amount of time and requests will not be impeded during the expensive phase of the deduplication. (Sorry @e-dard - I may have misread what you were replying to. To be clear, I agree with you that a lock is required, but I think it needs be of limited scope because of the performance implications of doing otherwise). |
@jwilder I've created a test case that uses the go race detector to demonstrate the race condition I was concerned about and is the subject of this report.
To reproduce yourself, execute the following commands:
|
Fixes influxdata#5699. Cache.Deduplicate is designed to be execute while the caller is not holding any locks. The entry.deduplicate() method contains two statements: e.values = e.values.Deduplicate() e.needSort = false which need to be executed in that order (as observed by all threads). However, without a synchronization construct won't be. This fix makes a clone of each entry before performing the deduplication, then while briefly holding the lock, updates the store with the deduplicated clones A second fix guards access to snapshots with a readlock. Also a small tweak to remove unnecessary deduping of single element lists. Signed-off-by: Jon Seymour <jon@wildducktheories.com>
Signed-off-by: Jon Seymour <jon@wildducktheories.com>
tsm: cache: improve thread safety of Cache.Deduplicate (see #5699)
This series re-rolls the fixes on influxdata#5719, influxdata#5699, influxdata#5832 without any other changes from 0.11.0 onto 0.10.1 for the purpose of addressing issue influxdata#5857. Signed-off-by: Jon Seymour <jon@wildducktheories.com>
I have been working with the cache code over the last few days (see #5689) and have noticed some lines in the cache code that appear to be lacking thread safety:
In particular:
https://github.com/influxdata/influxdb/blob/master/tsdb/engine/tsm1/cache.go#L61-L62
contains these lines:
It appears the assumption in the code is that every thread will observe those two statements executing in that order. In other words, that the deduplicated e.values will be assigned before e.needStore is set to false.
However, AFAICT these lines are not guarded with any kind of memory barrier or synchronization. Indeed, this comment in one of the ultimate callers of these lines (https://github.com/influxdata/influxdb/blob/master/tsdb/engine/tsm1/engine.go#L435-L438) indicates that locks are not held precisely to avoid inducing lengthy delays in concurrently executing code.
The problem is, however, that without some kind of synchronization construct, a CPU is free re-order the assignments so that the assignment to needsSort actually happens before the assignment to values - at least from the perspective of concurrently executing threads/goroutines.
One implication of this is that a query might obtain a reference to an undeduplicated values slice when it is expecting to receive a reference to a deduplicated slice. I can't actually predict how this would manifest itself in user level symptoms, but it might be worth considering whether some thread safety disciplines need to be applied to this code.
The text was updated successfully, but these errors were encountered: