
Handoff often fails in production, perhaps due to a race condition #396

Closed
itamarst opened this issue Jul 30, 2014 · 11 comments · Fixed by #410

Comments

@itamarst
Contributor

Much of the time when doing a handoff, the following output is printed:

cannot open 'flocker/ca753bf2-f674-40e7-a0bd-532a3684c75d.mongodb-volume-example15': dataset does not exist
main function encountered error
Traceback (most recent call last):
Failure: twisted.internet.defer.FirstError: FirstError[#0, [Failure instance: Traceback: <class 'flocker.volume.filesystems.interfaces.FilesystemAlreadyExists'>: 
/usr/lib64/python2.7/site-packages/twisted/internet/endpoints.py:362:processEnded
/usr/lib/python2.7/site-packages/flocker/volume/filesystems/zfs.py:60:connectionLost
/usr/lib64/python2.7/site-packages/twisted/internet/defer.py:423:errback
/usr/lib64/python2.7/site-packages/twisted/internet/defer.py:490:_startRunCallbacks
--- <exception caught here> ---
/usr/lib64/python2.7/site-packages/twisted/internet/defer.py:577:_runCallbacks
/usr/lib/python2.7/site-packages/flocker/volume/filesystems/zfs.py:253:rename_failed
]]
Unhandled error in Deferred:
Unhandled Error
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 783, in __bootstrap
    self.__bootstrap_inner()
  File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
--- <exception caught here> ---
  File "/tmp/flocker-tutorial/local/lib/python2.7/site-packages/twisted/python/threadpool.py", line 196, in _worker
    result = context.call(ctx, function, *args, **kwargs)
  File "/tmp/flocker-tutorial/local/lib/python2.7/site-packages/twisted/python/context.py", line 118, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "/tmp/flocker-tutorial/local/lib/python2.7/site-packages/twisted/python/context.py", line 81, in callWithContext
    return func(*args,**kw)
  File "/tmp/flocker-tutorial/local/lib/python2.7/site-packages/flocker/common/_ipc.py", line 90, in get_output
    raise IOError("Bad exit", remote_command, e.returncode, e.output)
exceptions.IOError: ('Bad exit', ['flocker-changestate', '"version": 1\n"nodes":\n  "172.16.255.250": []\n  "172.16.255.251": ["mongodb-volume-example15"]\n', '"version": 1\n"applications":\n  "mongodb-volume-example15":\n    "image": "mongodb-tutorial"\n    "ports":\n    - "internal": 27017\n      "external": 27017\n    "volume":\n      # The location within the container where the data volume will be\n      # mounted:\n      "mountpoint": "/data"\n', '172.16.255.250:\n  applications:\n    mongodb-volume-example15:\n      image: unknown\n      ports: []\n      volume: {mountpoint: null}\n  version: 1\n172.16.255.251:\n  applications: {}\n  version: 1\n', '172.16.255.250'], 1, '')

Sometimes this error does not occur, however. My hypothesis is that there's a race condition: the code thinks the push has finished when it hasn't quite, so the acquire fails. (The FilesystemAlreadyExists error is bogus; it is raised no matter why the "zfs rename" failed.)

@itamarst itamarst added this to the Release 0.1 milestone Jul 30, 2014
@itamarst
Contributor Author

First pass for reproducing this might be trying to do a functional test for handoff.

@itamarst
Contributor Author

Adding a 1-second sleep between push and acquire didn't prevent the issue, so my hypothesis may be wrong. In general, more logging would be very helpful at this point...

@itamarst
Contributor Author

Note also that in the transcript above it's not even clear which machine is hitting the error, the origin or the destination (both of them do zfs renames).

@itamarst
Contributor Author

The origin did do the rename, but the mountpoint was not changed, so plausibly the origin is the one that somehow failed during the rename. It's not clear why it would say "dataset does not exist", though, so maybe not. I will debug some more tomorrow unless someone takes this on and reproduces it with a functional test.

@wallrj wallrj self-assigned this Jul 31, 2014
@wallrj wallrj assigned itamarst and unassigned wallrj Jul 31, 2014
@itamarst
Contributor Author

I still have no idea what's going on.

@itamarst
Contributor Author

I found an error from the zfs command-line program in the logs for flocker-volume acquire, essentially saying "can't change the mountpoint because the directory is not empty".

My hypothesis:
In tests, the ZFS pool is set up with a mountpoint different from the one where we actually mount for the purposes of our higher-level API. We have extra logic in change_owner() that sets the mountpoint to the public value after the rename.

In the real code, the pool's mountpoint and the mountpoint used by the higher-level code are the same. That extra logic tries to change the mountpoint to the value it already has; ZFS doesn't treat this as a no-op and complains about the directory having contents (since the filesystem is already mounted there).
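If that hypothesis is right, one fix would be to skip the redundant mountpoint change. This is a minimal sketch, not Flocker's actual code: the function name and the `zfs_get`/`zfs_set` callables are hypothetical stand-ins for running `zfs get -H -o value mountpoint <dataset>` and `zfs set mountpoint=... <dataset>`.

```python
def set_mountpoint_if_different(zfs_get, zfs_set, dataset, desired_mountpoint):
    """Set the dataset's mountpoint only when it actually differs.

    Hypothetical sketch: ``zfs_get(dataset)`` returns the current
    mountpoint, ``zfs_set(dataset, path)`` changes it. Returns True if a
    change was issued, False if it was skipped as redundant.
    """
    current = zfs_get(dataset)
    if current == desired_mountpoint:
        # Already mounted at the desired path; issuing "zfs set
        # mountpoint=..." anyway can fail with "directory is not empty"
        # because the filesystem is mounted there.
        return False
    zfs_set(dataset, desired_mountpoint)
    return True
```

The guard makes the test setup (distinct mountpoints) and production (identical mountpoints) take the same code path without the failing redundant `zfs set`.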

@itamarst
Contributor Author

The actual error:

2014-07-31 11:12:09-0700 [-] Process 'zfs' wrote stderr unhandled by <flocker.volume.filesystems.zfs._AccumulatingProtocol instance at 0x15e71b8>: cannot mount '/flocker/30a125b5-0208-4112-b332-242f11b5d907.mongodb-volume-example20': directory is not empty
        property may be set but unable to remount filesystem

@itamarst
Contributor Author

My hypothesis above may be wrong; perhaps the issue is a race condition where the container starts up, and the mount change fails for that reason.

@wallrj
Contributor

wallrj commented Aug 1, 2014

I turned on logging in both flocker-reportstate and flocker-deploy and got the same error.

2014-08-01 04:20:17-0700 [-] Process 'zfs' wrote stderr unhandled by <flocker.volume.filesystems.zfs._AccumulatingProtocol instance at 0x3307f80>: cannot mount '/flocker/b482eb78-9936-4cde-8d94-179371803faf.mongodb-volume-example': directory is not empty

2014-08-01 04:20:17-0700 [-] Process 'zfs' wrote stderr unhandled by <flocker.volume.filesystems.zfs._AccumulatingProtocol instance at 0x3307f80>: filesystem successfully created, but not mounted

2014-08-01 04:20:17-0700 [-] main function encountered error
    Traceback (most recent call last):
    Failure: twisted.internet.defer.FirstError: FirstError[#0, [Failure instance: Traceback (failure with no frames): <class 'flocker.volume.filesystems.zfs.CommandFailed'>: 
    ]]

@itamarst
Contributor Author

itamarst commented Aug 1, 2014

OK, now I'm seeing the failure happening on the sending side, i.e. the one that is relinquishing control.

@itamarst
Contributor Author

itamarst commented Aug 1, 2014

I think I finally got it. I added a sleep of 20 seconds after _deploy.StopApplication.run and that seems to fix it.

It takes mongod about 10 seconds to stop, but this happens asynchronously: our command to gear returns long before that. So we tell it to stop, do a push, and then try to rename the ZFS dataset, but mongod is still running and using the files, so the rename fails. Worse, we've pushed data from a non-quiescent server.

Either GearClient, or more plausibly StopApplication/StartApplication, should poll geard until the desired action has actually finished.
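The proposed polling can be sketched as follows. This is a hypothetical illustration, not the actual Flocker or geard API: `get_unit_state` stands in for whatever status query geard exposes, and the "inactive" state name is assumed.

```python
import time

def wait_for_unit_stopped(get_unit_state, unit_name,
                          timeout=60.0, interval=0.5, sleep=time.sleep):
    """Poll until the unit reports it has really stopped.

    Hypothetical sketch: ``get_unit_state(unit_name)`` returns the
    unit's current state string. Raises RuntimeError if the unit is
    still not "inactive" after ``timeout`` seconds.
    """
    waited = 0.0
    while waited < timeout:
        if get_unit_state(unit_name) == "inactive":
            # The stop has genuinely completed; it is now safe to push
            # and rename the ZFS dataset.
            return
        sleep(interval)
        waited += interval
    raise RuntimeError("unit %r did not stop within %s seconds"
                       % (unit_name, timeout))
```

In a Twisted code base like Flocker's this would naturally be written with deferreds rather than blocking sleeps, but the shape is the same: the stop operation's result must not fire until the unit is actually gone.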

wallrj added a commit that referenced this issue Aug 5, 2014
Gear.remove now fires its deferred after the unit has been removed

Author: @itamarst
Reviewer: @wallrj

Fixes #396