
use lock in InMemoryNormalizedCache to avoid race conditions #552

Merged (2 commits, Jan 25, 2020)

Conversation

@RolandasRazma (author) commented May 28, 2019

We see some crashes trying to access RecordSet while doing .clear() on InMemoryNormalizedCache:

* thread #35, queue = 'com.apollographql.DataLoader', stop reason = EXC_BAD_ACCESS (code=1, address=0x10)
  frame #0: 0x00000001c71bf530 libobjc.A.dylib`objc_msgSend + 16
  frame #1: 0x00000001f60ee364 libswiftCore.dylib`Swift.Dictionary._Variant.subscript.getter : (A) -> Swift.Optional<B> + 148
  frame #2: 0x0000000106044da8 Apollo`RecordSet.subscript.getter(key="User_12", self=Apollo.RecordSet @ 0x00000001701b20d8) at RecordSet.swift:24:19
  frame #3: 0x000000010602fc64 Apollo`closure #1 in InMemoryNormalizedCache.loadRecords($0="User_12", self=0x0000000282b1ef60) at InMemoryNormalizedCache.swift:9:42
  frame #4: 0x000000010602fce8 Apollo`thunk for @callee_guaranteed (@guaranteed String) -> (@owned Record?, @error @owned Error) at <compiler-generated>:0
  frame #5: 0x000000010602fdcc Apollo`partial apply for thunk for @callee_guaranteed (@guaranteed String) -> (@owned Record?, @error @owned Error) at <compiler-generated>:0
  frame #6: 0x00000001f6074ef4 libswiftCore.dylib`(extension in Swift):Swift.Collection.map<A>((A.Element) throws -> A1) throws -> Swift.Array<A1> + 748
  frame #7: 0x000000010602fb1c Apollo`InMemoryNormalizedCache.loadRecords(keys=1 value, self=0x0000000282b1ef60) at InMemoryNormalizedCache.swift:9:24
  frame #8: 0x0000000106030228 Apollo`protocol witness for NormalizedCache.loadRecords(forKeys:) in conformance InMemoryNormalizedCache at <compiler-generated>:0
  frame #9: 0x0000000106001c20 Apollo`partial apply at <compiler-generated>:0
  frame #10: 0x00000001060099c4 Apollo`closure #1 in DataLoader.dispatch(self=0x00000002831abd00) at DataLoader.swift:53:12
  frame #11: 0x0000000105ff0d50 Apollo`thunk for @escaping @callee_guaranteed () -> () at <compiler-generated>:0
  frame #12: 0x00000001084176f0 libdispatch.dylib`_dispatch_call_block_and_release + 24
  frame #13: 0x0000000108418c74 libdispatch.dylib`_dispatch_client_callout + 16
  frame #14: 0x0000000108420bf4 libdispatch.dylib`_dispatch_lane_serial_drain + 712
  frame #15: 0x00000001084218b4 libdispatch.dylib`_dispatch_lane_invoke + 456
  frame #16: 0x000000010842b77c libdispatch.dylib`_dispatch_workloop_worker_thread + 1148
  frame #17: 0x00000001c7bed114 libsystem_pthread.dylib`_pthread_wqthread + 304
  frame #18: 0x00000001c7befcd4 libsystem_pthread.dylib`start_wqthread + 4

sentry.io log (screenshot attached, 2019-07-29)

This PR wraps access to InMemoryNormalizedCache.records in an internal queue to avoid race conditions.
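
For illustration, the wrapping looks roughly like this (a sketch of the approach rather than the exact diff; SynchronizedRecords is a made-up name, and the RecordSet calls are the ones discussed in this thread):

  // Sketch: guard the RecordSet behind a private concurrent queue.
  // Reads go through queue.sync; writes use the .barrier flag so they
  // never overlap with reads or with other writes.
  final class SynchronizedRecords {
    private var records: RecordSet
    private let queue = DispatchQueue(label: "com.apollographql.InMemoryNormalizedCache",
                                      attributes: .concurrent)

    init(records: RecordSet = RecordSet()) {
      self.records = records
    }

    func record(forKey key: CacheKey) -> Record? {
      return queue.sync { records[key] }
    }

    func merge(records newRecords: RecordSet) -> Set<CacheKey> {
      return queue.sync(flags: .barrier) { records.merge(records: newRecords) }
    }

    func clear() {
      // Barrier write: replaces the whole RecordSet once in-flight reads finish.
      queue.async(flags: .barrier) { self.records = RecordSet() }
    }
  }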

@RolandasRazma (author)

I realised that it's not the queue that fixes the race conditions but the actual usage of Promises. Will open a smaller PR for that.

@RolandasRazma (author)

Actually, Promises alone don't fix it ;/ Will open the PR again...

@RolandasRazma reopened this May 28, 2019
@RolandasRazma deleted the InMemoryNormalizedCache_queue branch May 31, 2019 13:54
@RolandasRazma restored the InMemoryNormalizedCache_queue branch May 31, 2019 13:54
@RolandasRazma reopened this Jun 5, 2019
@@ -1,21 +1,37 @@
import Dispatch
Contributor

I don't think you need to import this explicitly - it should be part of Foundation

@designatednerd left a comment

This seems like a pretty significant change, although it could be a very helpful one (I know we've got some TSAN issues I'm going to be looking at with #600).

Is there a way to add tests for this change?

@RolandasRazma (author) commented Jul 12, 2019

@designatednerd It should be possible; I can replicate the crash very easily (not sure if it's easy to test).
We have a setup where multiple clients use the same store, and as soon as you start doing stuff on multiple threads it crashes. We could put everything on one thread just because we know about the bug, but other people might be facing the same problem without realising the reason.

@designatednerd

@RolandasRazma I mean add an automated test so we can make sure a) This is doing what it needs to and b) This doesn't regress in the future 😄

@RolandasRazma (author)

@designatednerd I got that :) Just saying that it's easy to replicate by hand, but I'm not sure how easy it is to replicate in a test.

@designatednerd

Maybe you could set up two clients that use the same store for the test and try to have them access/mutate things in a certain sequence?

@RolandasRazma (author)

@designatednerd So I tried to write a test to replicate the bug (wanted to create a separate PR), but because this is a race condition the test doesn't always fail, it slows down all the other tests, and when it does hit the race it crashes everything. Not sure if it should be merged together with all the other tests. I will drop it here for now; let me know what you want to do... (Without the changes in this PR I manage to reproduce the crash with this test every 3rd run or so.)

  func testThreadedCache() throws {
    let cache = InMemoryNormalizedCache()
    
    let networkTransport = MockNetworkTransport(body: [
      "data": [
        "hero": [
          "name": "Luke Skywalker",
          "__typename": "Human"
        ]
      ]
    ])
    
    let loadingClient = ApolloClient(networkTransport: networkTransport, store: ApolloStore(cache: cache))
    let clearingClient = ApolloClient(networkTransport: networkTransport, store: ApolloStore(cache: cache))
    
    let group = DispatchGroup()
    
    for _ in 0...10000 {
      
      group.enter()
      DispatchQueue.global().async {
        loadingClient.fetch(query: HeroNameQuery(), cachePolicy: .returnCacheDataElseFetch) { (result, error) in
          group.leave()
        }
      }
      
      group.enter()
      DispatchQueue.global().async {
        _ = clearingClient.clearCache()
        group.leave()
      }
      
    }
    
    let expectation = self.expectation(description: "Fetching query")
    group.notify(queue: .main) {
      expectation.fulfill()
    }
    self.wait(for: [expectation], timeout: 5)
    
  }

@designatednerd

I'm really not sure why that would affect other tests, it certainly shouldn't. I'll try to take a look tomorrow.

@RolandasRazma (author)

@designatednerd it affects them in the sense that it has a tight loop, which will delay the other tests; also, when it crashes it takes everything down and the other tests won't get a chance to run.

@designatednerd

Ah I think I missed that the expectation was within the giant for loop when I was looking at this last night.

It also doesn't seem like we're doing any asserting on the results - in theory we should be able to check after the client's been cleared that it wouldn't have anything in the cache.

@RolandasRazma (author) commented Jul 16, 2019

I don't think it is about clearing; the problem is that the ivar can be accessed simultaneously. It usually breaks in records.merge, as that takes longer than clear.

@designatednerd

OK, coming back to this, I'm a bit confused: Is the test passing when it's run with these changes in place, or is it still failing?

Or are the failures happening on another branch that does not have these changes in place?

@RolandasRazma (author)

The test in the comments will crash on dev (without this PR), but it won't do it consistently (as this is a race condition); I manage to crash it every 3rd run or so.
As far as I can see, it will not crash if this PR is applied.

@designatednerd

Yeah then I'd say add it to this PR - that'll at least let us know if we've borked anything in the future.

@RolandasRazma (author)

@designatednerd do you know any better way to test for race conditions? This doesn't feel right...

@RolandasRazma (author)

@designatednerd maybe just add a check for whether the ivar is accessed from multiple threads (as that will lead to a crash sooner or later)?
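
Something like this, purely as an illustration (the helper name is made up, and the check itself is unsynchronized, so it's a heuristic rather than a guarantee):

  #if DEBUG
  // Hypothetical debug-only guard: remember the first thread that touches
  // `records` and assert if a later access comes from a different thread.
  // The check itself isn't thread-safe, so it can miss some races, but it
  // catches the common case of two threads hitting the ivar.
  private var recordsOwningThread: Thread?

  private func assertRecordsAccessedFromSingleThread() {
    if let owner = recordsOwningThread {
      assert(owner === Thread.current, "records accessed from multiple threads")
    } else {
      recordsOwningThread = Thread.current
    }
  }
  #endif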

@designatednerd

Race conditions are really hard to test for - the main thing I'd do is probably check the results of queries from both clients performed in order. It could be something where you just run everything in sequence - and theoretically you should get results back sequentially. This is the sequence I'd probably be checking:

1. Set Luke Skywalker with the loading client
2. Fetch data with the clearing client -> should be Luke Skywalker, because the set should finish before the fetch starts
3. Clear with the clearing client
4. Fetch data with the loading client -> should be cleared, because the clear should finish before the fetch starts

  public init(records: RecordSet = RecordSet()) {
    self.records = records
    self.queue = DispatchQueue(label: "com.apollographql.InMemoryNormalizedCache", attributes: .concurrent)
Contributor

Shouldn't we be using a serial queue to ensure that we don't have two blocks accessing it at the same time? Wouldn't concurrent still allow for race conditions?

@RolandasRazma (author)

I was thinking the same, but a serial queue would slow things down, and I can't replicate the crash with the concurrent one... Maybe the fact that it is accessed on the same thread is enough...
We have had this change in production for the past couple of months (since way before I created the PR) and have seen 0 crashes.

Contributor

Okay, you're right that a serial queue is less performant, but this concurrent queue doesn't seem to necessarily address the issue. Of course it's up to @designatednerd in the end, and I'm glad that this workaround has been working for your team in production. But I wouldn't feel comfortable pulling in a workaround for an issue that we can't consistently reproduce and that doesn't seem to make perfect sense.

It seems that this may happen to avoid the race conditions present in your code base currently, but it doesn't completely avoid the possibility of them occurring.

I'd feel much more comfortable going with NSRecursiveLock as a solution here than dispatching to another queue.

Additionally, though I understand it may not be possible, it would be great if we could work a bit more on trying to create a test case that consistently crashes.

I'd be willing to spend some time looking into this and trying to reproduce a crashing test, but I'm not sure I'll get much further than you did, @RolandasRazma.

@RolandasRazma (author)

Completely agree it needs better testing. But the only thing I can think of is adding a check for whether the ivar is accessed from multiple threads.

@jimisaacs commented Jul 27, 2019

Hello, I've been looking around to see all the work being done on this library recently and saw this PR and this thread, so I hope you don't mind me chiming in. The reason why concurrent works is because you're using it with barriers. If you only ever access a concurrent queue with barriers, you've essentially made it function exactly like a serial queue. A common approach to synchronizing with a queue is to use a concurrent queue, then use async barrier writes and sync non-barrier reads. That keeps the happy path fast and the write path safe.

Contributor

IIRC, I think async reads would work with the pattern described too, but that would definitely need testing because I'm not 100% sure of any side effects inside the task blocks here... if any. Otherwise, might as well take the first suggestion: if it's all "barrier'ed up", you might as well make it serial and remove the barriers.
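
i.e. roughly this instead of the barrier flags (just an illustration):

  // With every access already serialized, a plain serial queue behaves the
  // same: drop `attributes: .concurrent` and the `.barrier` flags entirely.
  private let queue = DispatchQueue(label: "com.apollographql.InMemoryNormalizedCache")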

Contributor

Yeah, if you are always using barriers, then serial is the same! Thanks @jimisaacs!

@AnthonyMDev

Just sent this as a message to @designatednerd, but probably should just put it here for discussion.

It looks like we are using both a DispatchQueue with .barrier AND a ReadWrite cache lock. That seems like overkill. Adding an additional queue on the cache itself seems like a crazy amount of unneeded locking.

If there are still crashes being reported currently, then it indicates to me that we aren't locking in the correct places. But if we are going to add new locks, I'd like to kill some of the old ones and just figure out the correct places to be locking, rather than continuing to add more and more locks on top of each other.

Does this seem to be right, or am I missing something about the locks and queues that are currently in place?

@AnthonyMDev

Alright, so I've written a modified test. This one does not crash, but it does fail the equality assertion a random number of times, and it fails consistently. It's failing because the cache is being cleared between the write and the loadRecords() calls.

When I implement the PR from @RolandasRazma this test passes! It's not showing the EXC_BAD_ACCESS crash, but I think I'm getting to the same underlying behavior.

func testMultipleStoresOneCacheConcurrentAccess() throws {
    let initialRecords: RecordSet = [
      "2001": ["name": "R2-D2", "__typename": "Droid", "primaryFunction": "Protocol"]
    ]

    let cache = InMemoryNormalizedCache(records: initialRecords)

    let store1 = ApolloStore(cache: cache)
    let store2 = ApolloStore(cache: cache)

    let expectation = self.expectation(description: "Read updated query")

    var data = HeroDetails(unsafeResultMap: ["name": "R2-D2",
                                             "__typename": "Droid",
                                             "primaryFunction": "Protocol"])

    for i in 1...1000 {
      DispatchQueue.global().async {
        _ = store1.withinReadWriteTransaction { transaction1 in
          data.name = "R4-D\(i)"
          try! transaction1.write(object: data, withKey: "2001")

          let heroName = try? transaction1.loadRecords(forKeys: ["2001"]).await().first??["name"] as? String

          XCTAssertEqual(heroName, "R4-D\(i)")
          if i == 1000 {
            expectation.fulfill()
          }

        }
      }

      DispatchQueue.global().async {
        store2.clearCache()
      }
    }

    self.wait(for: [expectation], timeout: 5)
  }

However, the more I think about this, the more I think that the ApolloStore and the cache itself shouldn't both have their own locks. It seems like a lot of overkill and causes performance issues. Also, every future implementation of a new cache type will need to handle its own locking.
I understand using the same cache with multiple clients, but I think that in that case you should always use one shared ApolloStore. Since the store is already handling thread safety here, using the same cache in different stores should be considered a developer error and not supported.

If I do both the clearing of the cache and the read/write on the same store, this test passes without the new code from this PR.
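
To make the one-shared-ApolloStore setup concrete, it would look something like this (transport just stands in for whatever network transport is already in use):

  let cache = InMemoryNormalizedCache()
  let sharedStore = ApolloStore(cache: cache)

  // Both clients go through the same store, so the store's existing
  // read/write locking serializes access to the single underlying cache.
  let loadingClient = ApolloClient(networkTransport: transport, store: sharedStore)
  let clearingClient = ApolloClient(networkTransport: transport, store: sharedStore)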

@AnthonyMDev

I've just confirmed that in the test case @RolandasRazma provided, if you use one single store with both clients, the crash also goes away.

@RolandasRazma (author)

I think I can reproduce it even with one store and cache; need to look into it.

@AnthonyMDev

Sorry I've been MIA on this for a while. I think the safest and most performant way to handle this is to handle locking as close to the source as possible. If certain types of caches don't even need the locking then we shouldn't force them to use it.
Having the lock at any higher level than the cache itself lends itself to more issues like we are currently seeing where multiple stores are accessing the same cache.
Different caches may be able to implement locking differently to improve performance based on their implementation details as well.

I strongly support locking at the cache level.

@designatednerd

@AnthonyMDev Makes sense - what do you think about having the locking work in a protocol extension?

@AnthonyMDev

I'm not a fan of that, just because it limits the configuration of how the locking is implemented too much, in my opinion. Writing locking code isn't fun, but it's not the most difficult thing in the world to do. When you are writing a caching layer, that's something that's expected. Locking for thread safety is too important and too particular to have a shared implementation for, I think.

@designatednerd

I think Anthony makes some great points here - locking at the cache level rather than at any higher level is probably the least-risky way to go. I think he's also probably right that I'm trying too hard to throw Protocol Extensions at the problem. :)

Is it worth continuing this within this particular PR? Do any of you have some time to work on this?

@AnthonyMDev

Haha, to be fair, I LOVE extracting as much logic as possible into protocol extensions. It's definitely the right mindset to be in by default. I think this is just a rare exception to that. :)

@AnthonyMDev

As for your questions:
I think the direction this PR took isn't what we are going to go with, so I don't think we need to continue with it. Thanks so, so, so MUCH @RolandasRazma for bringing this up, researching it, coming up with the test case, and providing so much information! If you or @designatednerd have time to implement the locking in the cache and kill the locking in the store, go for it.
I'm not sure when I'll be able to find time to take this on, but if no one else can, I will try to do it next week.

If either of you work on this, let me know so we don't work over each other. I'd also like to provide my test cases to you so that you can use them to help verify that the new locking works. I riffed off of @RolandasRazma's tests, and even though they still don't fail 100% of the time, they fail more consistently now. :)

@designatednerd

Unfortunately this isn't my highest priority at the moment but I'm happy to provide support for whoever has time to work on it!

@AnthonyMDev

Roger that. It's not my highest priority right now either haha ;) But I'll try to make it happen when I get a chance.

@RolandasRazma (author)

Should we keep this open until it can be addressed properly, so other people know about the problem?

@designatednerd

We can either keep it open or open a new issue to summarize what we've figured out and what needs to happen - I'll leave that up to you @RolandasRazma.

@AnthonyMDev commented Sep 25, 2019 via email

@RolandasRazma (author) commented Jan 13, 2020

Was there any progress on this? I reverted to the stock implementation (latest from GitHub) and we're getting something like 80 crashes a day. The stack is not very useful, but it goes something like...

__hidden
__hidden
RecordSet.subscript.getter
redacted
_hidden
_hidden
InMemoryNormalizedCache.loadRecords

@RolandasRazma (author)

Looks like a simple NSLock around self.records access is enough to stop the crashing with the latest changes in ApolloStore and InMemoryNormalizedCache.

@designatednerd what do you think about such a fix? Can I get into a deadlock somehow?
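
The pattern is just lock / defer unlock around every touch of records, something like this (sketch only; withRecords is a made-up helper, not what is in the diff):

  private let recordsLock = NSLock()

  // Every read or write of `records` funnels through this helper, so two
  // threads can never touch the RecordSet at the same time.
  private func withRecords<T>(_ body: (inout RecordSet) throws -> T) rethrows -> T {
    recordsLock.lock()
    defer { recordsLock.unlock() }
    return try body(&records)
  }

  // e.g. a read:  let loaded = withRecords { records in keys.map { records[$0] } }
  // e.g. a write: withRecords { records in _ = records.merge(records: newRecords) }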

@designatednerd

I mean, you can always get into a deadlock, but I suspect that if it doesn't deadlock in the tests, which do a lot of different accesses of the code, it won't in practice. Probably worth adding a test which explicitly checks access from two different threads.

Would love to see this update get over the line!

@RolandasRazma (author)

I was testing with the test @AnthonyMDev ended up with. Looks fine. Tried to use a mutex as well but didn't notice any speed difference, and NSLock is easier to use. We pushed it to production (live dangerously and stuff), but it's unlikely we'll hear anything from users, as people don't report deadlocks :) Will open a PR so we can continue the discussion.

@designatednerd

Awesome, looking forward to it! Heads up that I'm on vacation for the next week and a half so responses may be slow, but I'll try to get a look as I can.

@RolandasRazma (author)

@designatednerd pushed changes, and also included @AnthonyMDev's test, as it crashes 100% of the time now without the lock (after the locks were removed from ApolloStore).

@@ -2,6 +2,7 @@ import Foundation

public final class InMemoryNormalizedCache: NormalizedCache {
  private var records: RecordSet
  private let recordsLock = NSLock()
Contributor

Is there a reason why we aren't using NSRecursiveLock here? If the implementation of accessing the records ever changes to need to access the cache recursively or anything, then this is going to deadlock.

If there is a good reason I'm missing to use a regular NSLock, then great, but I prefer to default to recursive if possible for safety.

@RolandasRazma (author)

no good reason, switched

@RolandasRazma changed the title from "use queue in InMemoryNormalizedCache to avoid race conditions" to "use lock in InMemoryNormalizedCache to avoid race conditions" on Jan 21, 2020
@designatednerd left a comment

Looks reasonable to me, and a far simpler change than earlier. I'm interested in eventually abstracting some of this logic into a property wrapper, but for now let's get this going!

FYI this won't ship as part of a release until I'm back on the ground to support it, but I'll merge it now if you want to use it off master.

@lukelabonte

> Looks reasonable to me, and a far simpler change than earlier. I'm interested in eventually abstracting some of this logic into a property wrapper, but for now let's get this going!
>
> FYI this won't ship as part of a release until I'm back on the ground to support it, but I'll merge it now if you want to use it off master.

@designatednerd do you have an ETA on when you'll be back? I'm wondering when we should expect a new release that would include this fix.

@designatednerd

This shipped a year ago as part of 0.22.0. There have been substantial other changes around how we're handling this; I'd appreciate you filing a new issue with updated details if you're having problems. Thank you!
