use lock in InMemoryNormalizedCache to avoid race conditions #552
Conversation
I realised that it's not the queue that fixes the race conditions, but the actual usage of Promises. Will open a smaller PR for that.
Actually, promises alone don't fix it ;/ Will open the PR again...
@@ -1,21 +1,37 @@
import Dispatch
I don't think you need to import this explicitly - it should be part of Foundation
This seems like a pretty significant change, although it could be a very helpful one (I know we've got some TSAN issues I'm going to be looking at with #600).
Is there a way to add tests for this change?
@designatednerd It should be possible. I can replicate the crash very easily (not sure if it's easy to test).
@RolandasRazma I mean add an automated test so we can make sure a) this is doing what it needs to and b) this doesn't regress in the future 😄
@designatednerd I got that :) I was just saying that it's easy to replicate by hand, but I'm not sure how easy it is to replicate in a test.
Maybe you could set up two clients that use the same store for the test and try to have them access/mutate things in a certain sequence?
@designatednerd So I tried to write a test to replicate the bug (I wanted to create a separate PR), but because this is a race condition the test doesn't always fail, it slows down all the other tests, and it crashes everything when it does fail. Not sure if it should be merged in together with all the other tests. I will drop it here for now; let me know what you want to do... (Without the changes in this PR I manage to replicate the crash with this test every 3rd run or so.)

func testThreadedCache() throws {
  let cache = InMemoryNormalizedCache()
  let networkTransport = MockNetworkTransport(body: [
    "data": [
      "hero": [
        "name": "Luke Skywalker",
        "__typename": "Human"
      ]
    ]
  ])

  // Two clients sharing the same underlying cache through separate stores.
  let loadingClient = ApolloClient(networkTransport: networkTransport, store: ApolloStore(cache: cache))
  let clearingClient = ApolloClient(networkTransport: networkTransport, store: ApolloStore(cache: cache))

  let group = DispatchGroup()
  for _ in 0...10000 {
    // Hammer the cache with concurrent fetches and clears.
    group.enter()
    DispatchQueue.global().async {
      loadingClient.fetch(query: HeroNameQuery(), cachePolicy: .returnCacheDataElseFetch) { (result, error) in
        group.leave()
      }
    }

    group.enter()
    DispatchQueue.global().async {
      _ = clearingClient.clearCache()
      group.leave()
    }
  }

  let expectation = self.expectation(description: "Fetching query")
  group.notify(queue: .main) {
    expectation.fulfill()
  }
  self.wait(for: [expectation], timeout: 5)
}
I'm really not sure why that would affect other tests - it certainly shouldn't. I'll try to take a look tomorrow.
@designatednerd It affects them in the sense that it has a tight loop, which will delay the other tests; also, when it crashes it takes everything down and the other tests won't get a chance to run.
Ah, I think I missed that the expectation was within the giant loop. It also doesn't seem like we're doing any asserting on the results - in theory we should be able to check after the client's been cleared that it wouldn't have anything in the cache.
I don't think it is about clearing; the problem is that the ivar can be accessed simultaneously. It usually breaks in
OK, coming back to this, I'm a bit confused: is the test passing when it's run with these changes in place, or is it still failing? Or are the failures happening on another branch that does not have these changes in place?
The test in the comments will crash on dev (without this PR), but it won't do it consistently (as this is a race condition). I manage to crash it every 3rd run or so.
Yeah, then I'd say add it to this PR - that'll at least let us know if we've borked anything in the future.
@designatednerd do you know any better way to test for race conditions? This doesn't feel right...
@designatednerd maybe just add a check for whether the ivar is accessed from multiple threads (as that will lead to a crash sooner or later)?
Race conditions are really hard to test for - the main thing I'd do is probably check the results of queries from both clients performed in order. It could be something where you just run everything in sequence - and theoretically you should get results back sequentially. This is the sequence I'd probably be checking: Set
public init(records: RecordSet = RecordSet()) {
  self.records = records
  self.queue = DispatchQueue(label: "com.apollographql.InMemoryNormalizedCache", attributes: .concurrent)
Shouldn't we be using a serial queue to ensure that we don't have two blocks accessing it at the same time? Wouldn't concurrent still allow for race conditions?
I was thinking the same, but a serial queue would slow things down, and I can't replicate the crash with a concurrent one... Maybe the fact that it is accessed on the same thread is enough...
We have had this PR in production for the past couple of months (since way before I created the PR) and have seen 0 crashes.
Okay, you're right that a serial queue is less performant, but this concurrent queue doesn't seem to necessarily address the issue. Of course it's up to @designatednerd in the end, and I'm glad that this workaround has been able to work for your team in production. But I wouldn't feel comfortable pulling in a workaround for an issue that we can't consistently reproduce and that doesn't seem to make perfect sense.
It seems that this may happen to avoid the race conditions present in your code base currently, but it doesn't completely avoid the possibility of them occurring.
I'd feel much more comfortable going with NSRecursiveLock as a solution here than dispatching to another queue.
Additionally, though I understand it may not be possible, it would be great if we could work a bit more on trying to create a test case that consistently crashes.
I'd be willing to spend some time looking into this and trying to reproduce a crashing test, but I'm not sure I'll get much further than you did, @RolandasRazma.
Completely agree it needs better testing. But the only thing I can think of is adding a check for whether the ivar is accessed from multiple threads.
Hello, I've been looking around to see all the work being done on this library recently and saw this PR and this thread, so I hope you don't mind me chiming in. The reason why concurrent works is because you're using it with barriers. If you only ever access a concurrent queue with barriers, you've essentially made it function exactly like a serial queue. A common approach to synchronizing with a queue is to use a concurrent queue, then use async barrier writes and sync non-barrier reads. That keeps the happy path fast and the write path safe.
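A minimal sketch of the reader/writer pattern described above, independent of the Apollo code (the type name and queue label here are invented for illustration): reads go through sync on a concurrent queue, writes go through an async barrier, so readers can overlap with each other but never with a write.

import Dispatch

final class SynchronizedRecords {
  private var storage: [String: Any] = [:]
  private let queue = DispatchQueue(label: "com.example.records", attributes: .concurrent)

  func record(forKey key: String) -> Any? {
    // Sync, non-barrier read: multiple readers may run at the same time.
    return queue.sync { storage[key] }
  }

  func setRecord(_ record: Any, forKey key: String) {
    // Async barrier write: waits for in-flight reads to finish and
    // blocks new work on the queue until the write completes.
    queue.async(flags: .barrier) {
      self.storage[key] = record
    }
  }
}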
IIRC, async reads would work with the pattern described too, but that would definitely need testing because I'm not 100% sure of any side effects inside the task blocks here... if any. Otherwise, you might as well take the first suggestion: if it's all "barrier'ed up", you might as well make it serial and remove the barriers.
Yeah, if you are always using barriers, then serial is the same! Thanks @jimissacs!
Just sent this as a message to @designatednerd, but I should probably put it here for discussion. It looks like we are using both a DispatchQueue with .barrier AND a ReadWrite cache lock. That seems like overkill. Adding an additional queue on the cache itself seems like a crazy amount of unneeded locking. If there are still crashes being reported currently, then it indicates to me that we aren't properly locking in the correct places. But if we are going to add new locks, I'd like to kill some of the old ones and just figure out the correct places to be locking, rather than continuing to add more and more locks on top of each other. Does this seem right, or am I missing something about the locks and queues that are currently in place?
Alright, so I've written a modified test. This one does not crash, but it does fail the equality assertion a random number of times, and it fails consistently. It's failing because the cache is being cleared between the write and the read-back of the record. When I implement the PR from @RolandasRazma, this test passes! It's not showing the crash, though.

func testMultipleStoresOneCacheConcurrentAccess() throws {
  let initialRecords: RecordSet = [
    "2001": ["name": "R2-D2", "__typename": "Droid", "primaryFunction": "Protocol"]
  ]
  let cache = InMemoryNormalizedCache(records: initialRecords)

  // Two stores sharing the same underlying cache.
  let store1 = ApolloStore(cache: cache)
  let store2 = ApolloStore(cache: cache)

  let expectation = self.expectation(description: "Read updated query")
  var data = HeroDetails(unsafeResultMap: ["name": "R2-D2",
                                           "__typename": "Droid",
                                           "primaryFunction": "Protocol"])

  for i in 1...1000 {
    DispatchQueue.global().async {
      _ = store1.withinReadWriteTransaction { transaction1 in
        data.name = "R4-D\(i)"
        try! transaction1.write(object: data, withKey: "2001")

        // Read the record straight back; the concurrent clear on the
        // other store can wipe it in between, which fails this assertion.
        let heroName = try? transaction1.loadRecords(forKeys: ["2001"]).await().first??["name"] as? String
        XCTAssertEqual(heroName, "R4-D\(i)")

        if i == 1000 {
          expectation.fulfill()
        }
      }
    }

    DispatchQueue.global().async {
      store2.clearCache()
    }
  }

  self.wait(for: [expectation], timeout: 5)
}

However, the more I think about this, the more I think that both the
If I do both the clearing of the cache and the read/write on the same store, this test passes without the new code from this PR.
I've just confirmed that in the test case @RolandasRazma provided, if you use one single store with both clients, the crash also goes away.
I think I can reproduce it even with one store and cache; I need to look into it.
Sorry I've been MIA on this for a while. I think the safest and most performant way to handle this is to handle locking as close to the source as possible. If certain types of caches don't even need the locking, then we shouldn't force them to use it. I strongly support locking at the cache level.
@AnthonyMDev Makes sense - what do you think about having the locking work in a protocol extension?
I'm not a fan of that, just because it limits the configuration of how the locking is implemented too much in my opinion. Writing locking code isn't fun, but it's not the most difficult thing in the world to do. When you are writing a caching layer, that's something that's expected. Locking for thread safety is too important and too particular to have a shared implementation for, I think.
I think Anthony makes some great points here - locking at the cache level rather than at any higher level is probably the least-risky way to go. I think he's also probably right that I'm trying too hard to throw Protocol Extensions at the problem. :) Is it worth continuing this within this particular PR? Do any of you have some time to work on this?
Haha, to be fair, I LOVE extracting as much logic as possible into protocol extensions. It's definitely the right mindset to be in by default. I think this is just a rare exception to that. :)
As for your questions: if either of you works on this, let me know so we don't work over each other. I'd also like to provide my test cases to you so that you can use them to help verify that the new locking works. I riffed off of @RolandasRazma's tests, and even though they still don't fail 100% of the time, they fail more consistently now. :)
Unfortunately this isn't my highest priority at the moment, but I'm happy to provide support for whoever has time to work on it!
Roger that. It's not my highest priority right now either haha ;) But I'll try to make it happen when I get a chance.
Should we keep this open until it can be addressed properly, so other people know about the problem?
We can either keep it open or open a new issue to summarize what we've figured out and what needs to happen - I'll leave that up to you @RolandasRazma.
I’m thinking I would keep the issue open, but you’re the repo maintainer, so up to you!
Was there any progress on this? I reverted to the stock implementation (latest from GitHub) and we're getting something like 80 crashes a day. The stack is not very useful, but it goes something like... __hidden
Looks like a simple lock fixes it. @designatednerd what do you think about such a fix? Can I get into a deadlock somehow?
I mean, you can always get into a deadlock, but I suspect that if it doesn't deadlock in the tests, which do a lot of different accesses of the code, it won't in practice. Probably worth adding a test which explicitly checks accessing from two different threads. Would love to see this update get over the line!
I was testing with the test @AnthonyMDev ended up with. Looks fine. I tried to use a mutex as well but didn't notice any speed difference, and NSLock is easier to use. We pushed it to production (live dangerously and stuff), but it's unlikely we'll hear anything from users, as people don't report deadlocks :) Will open a PR so we can continue the discussion.
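For context on that trade-off, a rough, illustrative sketch (not Apollo code; the counter types are invented): guarding a value with a pthread mutex requires explicit init/destroy and pointer management, while NSLock is just a Foundation object, which is consistent with the comment above about not noticing a speed difference but finding NSLock easier to use.

import Foundation

final class MutexGuardedCounter {
  private let mutex = UnsafeMutablePointer<pthread_mutex_t>.allocate(capacity: 1)
  private var value = 0

  init() { pthread_mutex_init(mutex, nil) }
  deinit {
    pthread_mutex_destroy(mutex)
    mutex.deallocate()
  }

  func increment() {
    pthread_mutex_lock(mutex)
    value += 1
    pthread_mutex_unlock(mutex)
  }
}

final class LockGuardedCounter {
  private let lock = NSLock()
  private var value = 0

  func increment() {
    lock.lock()
    defer { lock.unlock() }
    value += 1
  }
}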
Awesome, looking forward to it! Heads up that I'm on vacation for the next week and a half so responses may be slow, but I'll try to get a look as I can.
…t queue at the same time
@designatednerd pushed changes; also included @AnthonyMDev's test, as it now crashes 100% of the time without the lock (after the locks were removed from ApolloStore).
@@ -2,6 +2,7 @@ import Foundation

public final class InMemoryNormalizedCache: NormalizedCache {
  private var records: RecordSet
  private let recordsLock = NSLock()
Is there a reason why we aren't using NSRecursiveLock here? If the implementation of accessing the records ever changes to need to access the cache recursively or anything, then this is going to deadlock.
If there is a good reason I'm missing to use a regular NSLock, then great, but I prefer to default to recursive if possible for safety.
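To illustrate the concern, a standalone sketch using only Foundation (not the Apollo code): an NSLock deadlocks if the thread that already holds it tries to acquire it again, whereas an NSRecursiveLock lets the same thread re-acquire it.

import Foundation

let plainLock = NSLock()

func readWithPlainLock() {
  plainLock.lock()
  defer { plainLock.unlock() }
  // If this body ended up calling another method that also does
  // plainLock.lock(), the thread would block waiting on itself: deadlock.
}

let recursiveLock = NSRecursiveLock()

func outerRead() {
  recursiveLock.lock()
  defer { recursiveLock.unlock() }
  innerRead() // safe: the same thread may re-acquire a recursive lock
}

func innerRead() {
  recursiveLock.lock()
  defer { recursiveLock.unlock() }
  // ... access the shared state here ...
}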
no good reason, switched
Looks reasonable to me, and a far simpler change than earlier. I'm interested in eventually abstracting some of this logic into a property wrapper, but for now let's get this going!
FYI this won't ship as part of a release until I'm back on the ground to support it, but I'll merge it now if you want to use it off master.
@designatednerd do you have an ETA on when you'll be back? I'm wondering when we should expect a new release that would include this fix.
This shipped a year ago as part of 0.22.0. There have been substantial other changes around how we're handling this; I'd appreciate you filing a new issue with updated details if you're having problems. Thank you!
We see some crashes trying to access RecordSet while doing .clear() on InMemoryNormalizedCache (sentry.io log).
This PR wraps access to InMemoryNormalizedCache.records in an internal queue to avoid race conditions.