Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

code review: potential thread safety issue in cache.go #5699

Closed
jonseymour opened this issue Feb 16, 2016 · 7 comments · Fixed by #5701
Closed

code review: potential thread safety issue in cache.go #5699

jonseymour opened this issue Feb 16, 2016 · 7 comments · Fixed by #5701

Comments

@jonseymour
Copy link
Contributor

I have been working with the cache code over the last few days (see #5689) and have noticed some lines in the cache code that appear to be lacking thread safety:

In particular:

https://github.com/influxdata/influxdb/blob/master/tsdb/engine/tsm1/cache.go#L61-L62

contains these lines:

e.values = e.values.Deduplicate()
e.needSort = false

It appears the assumption in the code is that every thread will observe those two statements executing in that order. In other words, that the deduplicated e.values will be assigned before e.needStore is set to false.

However, AFAICT these lines are not guarded with any kind of memory barrier or synchronization. Indeed, this comment in one of the ultimate callers of these lines (https://github.com/influxdata/influxdb/blob/master/tsdb/engine/tsm1/engine.go#L435-L438) indicates that locks are not held precisely to avoid inducing lengthy delays in concurrently executing code.

The problem is, however, that without some kind of synchronization construct, a CPU is free re-order the assignments so that the assignment to needsSort actually happens before the assignment to values - at least from the perspective of concurrently executing threads/goroutines.

One implication of this is that a query might obtain a reference to an undeduplicated values slice when it is expecting to receive a reference to a deduplicated slice. I can't actually predict how this would manifest itself in user level symptoms, but it might be worth considering whether some thread safety disciplines need to be applied to this code.

jonseymour added a commit to jonseymour/influxdb that referenced this issue Feb 16, 2016
…a#5699)

Cache.Deduplicate is designed to be execute while the caller is not holding
any locks.

The entry.deduplicate() method contains two statements:

e.values = e.values.Deduplicate()
e.needSort = false

which need to be executed in that order (as observed by all threads). However,
without a synchronization construct won't be.

This fix makes a clone of each entry before performing the deduplication,
then while briefly holding the lock, updates the store with the deduplicated clones


Signed-off-by: Jon Seymour <jon@wildducktheories.com>
@e-dard
Copy link
Contributor

e-dard commented Feb 16, 2016

/cc @jwilder

@jonseymour I think we may just need to add a cache lock to cache.Deduplicate. The engine lock referred to in WriteSnapshot is not the same lock as the Cache one.

e.deduplicate has two callers, c.Deduplicate and c.merged. c.merged already acquires the c.mu lock.

@jonseymour
Copy link
Contributor Author

Yep, I have added a lock to Cache.Deduplicate, but only around the updating of c.store. The de-duplication itself is still done outside of the lock, but on a clone of the original which won't be visible to another thread.

@jwilder
Copy link
Contributor

jwilder commented Feb 16, 2016

Not sure I agree that there is a problem here. Those statements should never be re-ordered because they are executed in the same goroutine and re-ordering them would change the behavior of the program. This would violate the Go memory model: https://golang.org/ref/mem#tmp_2

@jonseymour
Copy link
Contributor Author

The statement about "re-ordering them would change the behavior of the program" is true, but only for the gorourtine in which they occur. So yes, from the perspective of a single go-routine (the one executing the statements) the two statements must appear to execute in the specified order.

The key part of the Go memory model that applies in this case are these words;

"Within a single goroutine, there is no concurrency, so the two definitions are equivalent: a read r observes the value written by the most recent write w to v. When multiple goroutines access a shared variable v, they must use synchronization events to establish happens-before conditions that ensure reads observe the desired writes."

This is the problem here - all go routines accessing the same data structure "must use synchronization events to establish happens-before conditions that ensure that reads observe the desired writes" (per the memory model)

@e-dard
Copy link
Contributor

e-dard commented Feb 16, 2016

@jonseymour but all this seems to be a moot point if there is simply a lock put on cache.Deduplicate, no? Since at that point only one goroutine will be able to access c.store at any given point.

@jonseymour
Copy link
Contributor Author

The problem with doing that is that it locks out every other goroutine from access to that snapshot for the duration of the deduplication process which @jwilder has already determined (correctly, I believe) to causing severe performance issues in high load situations.

The reason is, if the lock goes around the body of the entire method then that lock is in the request path of virtually every write and query request. With the lock in the position I have placed it, the lock is held for the smallest possible amount of time and requests will not be impeded during the expensive phase of the deduplication.

(Sorry @e-dard - I may have misread what you were replying to. To be clear, I agree with you that a lock is required, but I think it needs be of limited scope because of the performance implications of doing otherwise).

@jonseymour
Copy link
Contributor Author

@jwilder I've created a test case that uses the go race detector to demonstrate the race condition I was concerned about and is the subject of this report.

greywedge2:tsm1 jonseymour (3)$ git checkout b8f7f3c
Note: checking out 'b8f7f3c'.

greywedge2:tsm1 jonseymour (3)$ GORACE=history=1000 go test -race cache_race_test.go
==================
WARNING: DATA RACE
Read by goroutine 65:
  github.com/influxdb/influxdb/tsdb/engine/tsm1.(*Cache).merged()
      /Users/jonseymour/ninja/go/src/github.com/influxdb/influxdb/tsdb/engine/tsm1/cache.go:253 +0x3bd
  github.com/influxdb/influxdb/tsdb/engine/tsm1.(*Cache).Values.func1()
      /Users/jonseymour/ninja/go/src/github.com/influxdb/influxdb/tsdb/engine/tsm1/cache.go:209 +0xa3
  github.com/influxdb/influxdb/tsdb/engine/tsm1.(*Cache).Values()
      /Users/jonseymour/ninja/go/src/github.com/influxdb/influxdb/tsdb/engine/tsm1/cache.go:210 +0x142
  command-line-arguments_test.TestCacheRace.func1()
      /Users/jonseymour/ninja/go/src/github.com/influxdb/influxdb/tsdb/engine/tsm1/cache_race_test.go:42 +0x90

Previous write by goroutine 114:
  [failed to restore the stack]

Goroutine 65 (running) created at:
  command-line-arguments_test.TestCacheRace()
      /Users/jonseymour/ninja/go/src/github.com/influxdb/influxdb/tsdb/engine/tsm1/cache_race_test.go:43 +0xaf8
  testing.tRunner()
      /usr/local/Cellar/go/1.5.3/libexec/src/testing/testing.go:456 +0xdc

Goroutine 114 (running) created at:
  command-line-arguments_test.TestCacheRace()
      /Users/jonseymour/ninja/go/src/github.com/influxdb/influxdb/tsdb/engine/tsm1/cache_race_test.go:52 +0xb7e
  testing.tRunner()
      /usr/local/Cellar/go/1.5.3/libexec/src/testing/testing.go:456 +0xdc
==================
==================
WARNING: DATA RACE
Read by goroutine 65:
  runtime.slicecopy()
      /usr/local/Cellar/go/1.5.3/libexec/src/runtime/slice.go:110 +0x0
  github.com/influxdb/influxdb/tsdb/engine/tsm1.(*Cache).merged()
      /Users/jonseymour/ninja/go/src/github.com/influxdb/influxdb/tsdb/engine/tsm1/cache.go:276 +0x86c
  github.com/influxdb/influxdb/tsdb/engine/tsm1.(*Cache).Values.func1()
      /Users/jonseymour/ninja/go/src/github.com/influxdb/influxdb/tsdb/engine/tsm1/cache.go:209 +0xa3
  github.com/influxdb/influxdb/tsdb/engine/tsm1.(*Cache).Values()
      /Users/jonseymour/ninja/go/src/github.com/influxdb/influxdb/tsdb/engine/tsm1/cache.go:210 +0x142
  command-line-arguments_test.TestCacheRace.func1()
      /Users/jonseymour/ninja/go/src/github.com/influxdb/influxdb/tsdb/engine/tsm1/cache_race_test.go:42 +0x90

Previous write by goroutine 114:
  [failed to restore the stack]

Goroutine 65 (running) created at:
  command-line-arguments_test.TestCacheRace()
      /Users/jonseymour/ninja/go/src/github.com/influxdb/influxdb/tsdb/engine/tsm1/cache_race_test.go:43 +0xaf8
  testing.tRunner()
      /usr/local/Cellar/go/1.5.3/libexec/src/testing/testing.go:456 +0xdc

Goroutine 114 (running) created at:
  command-line-arguments_test.TestCacheRace()
      /Users/jonseymour/ninja/go/src/github.com/influxdb/influxdb/tsdb/engine/tsm1/cache_race_test.go:52 +0xb7e
  testing.tRunner()
      /usr/local/Cellar/go/1.5.3/libexec/src/testing/testing.go:456 +0xdc
==================
==================
WARNING: DATA RACE
Write by goroutine 114:
  github.com/influxdb/influxdb/tsdb/engine/tsm1.(*entry).deduplicate()
      /Users/jonseymour/ninja/go/src/github.com/influxdb/influxdb/tsdb/engine/tsm1/cache.go:61 +0xbd
  github.com/influxdb/influxdb/tsdb/engine/tsm1.(*Cache).Deduplicate()
      /Users/jonseymour/ninja/go/src/github.com/influxdb/influxdb/tsdb/engine/tsm1/cache.go:155 +0xc5
  command-line-arguments_test.TestCacheRace.func2()
      /Users/jonseymour/ninja/go/src/github.com/influxdb/influxdb/tsdb/engine/tsm1/cache_race_test.go:50 +0x76

Previous read by goroutine 10:
  github.com/influxdb/influxdb/tsdb/engine/tsm1.(*Cache).merged()
      /Users/jonseymour/ninja/go/src/github.com/influxdb/influxdb/tsdb/engine/tsm1/cache.go:253 +0x3bd
  github.com/influxdb/influxdb/tsdb/engine/tsm1.(*Cache).Values.func1()
      /Users/jonseymour/ninja/go/src/github.com/influxdb/influxdb/tsdb/engine/tsm1/cache.go:209 +0xa3
  github.com/influxdb/influxdb/tsdb/engine/tsm1.(*Cache).Values()
      /Users/jonseymour/ninja/go/src/github.com/influxdb/influxdb/tsdb/engine/tsm1/cache.go:210 +0x142
  command-line-arguments_test.TestCacheRace.func1()
      /Users/jonseymour/ninja/go/src/github.com/influxdb/influxdb/tsdb/engine/tsm1/cache_race_test.go:42 +0x90

Goroutine 114 (running) created at:
  command-line-arguments_test.TestCacheRace()
      /Users/jonseymour/ninja/go/src/github.com/influxdb/influxdb/tsdb/engine/tsm1/cache_race_test.go:52 +0xb7e
  testing.tRunner()
      /usr/local/Cellar/go/1.5.3/libexec/src/testing/testing.go:456 +0xdc

Goroutine 10 (finished) created at:
  command-line-arguments_test.TestCacheRace()
      /Users/jonseymour/ninja/go/src/github.com/influxdb/influxdb/tsdb/engine/tsm1/cache_race_test.go:43 +0xaf8
  testing.tRunner()
      /usr/local/Cellar/go/1.5.3/libexec/src/testing/testing.go:456 +0xdc
==================
PASS
Found 3 data race(s)
FAIL    command-line-arguments  1.183s

To reproduce yourself, execute the following commands:

git checkout b8f7f3c
GORACE=history=1000 go test -race cache_race_test.go

jonseymour added a commit to jonseymour/influxdb that referenced this issue Feb 19, 2016
Fixes influxdata#5699.

Cache.Deduplicate is designed to be execute while the caller is not holding
any locks.

The entry.deduplicate() method contains two statements:

e.values = e.values.Deduplicate()
e.needSort = false

which need to be executed in that order (as observed by all threads). However,
without a synchronization construct won't be.

This fix makes a clone of each entry before performing the deduplication,
then while briefly holding the lock, updates the store with the deduplicated clones

A second fix guards access to snapshots with a readlock.

Also a small tweak to remove unnecessary deduping of single element lists.

Signed-off-by: Jon Seymour <jon@wildducktheories.com>
jonseymour added a commit to jonseymour/influxdb that referenced this issue Feb 22, 2016
Signed-off-by: Jon Seymour <jon@wildducktheories.com>
jwilder added a commit that referenced this issue Feb 25, 2016
tsm: cache: improve thread safety of Cache.Deduplicate (see #5699)
jonseymour added a commit to jonseymour/influxdb that referenced this issue Feb 29, 2016
This series re-rolls the fixes on influxdata#5719, influxdata#5699, influxdata#5832 without any other
changes from 0.11.0 onto 0.10.1 for the purpose of addressing
issue influxdata#5857.

Signed-off-by: Jon Seymour <jon@wildducktheories.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants