Scheduler lockfile for RunDeck task not being released #26

hellais · 2018-09-13T08:38:48Z

@hellais commented on Mon Sep 26 2016

It seems like there is an issue in the releasing of the lockfile for the scheduler of the RunDeck task.

This was found while looking into the issue related to https://github.com/TheTorProject/ooni-probe/issues/606.

As a result of this decks are only run once and then they are not re-triggered daily due to the lockfile still existing on disk.

The relevant code is: https://github.com/TheTorProject/ooni-probe/blob/master/ooni/agent/scheduler.py#L24.

In debugging this I noticed that the last run time is being updated properly (https://github.com/TheTorProject/ooni-probe/blob/master/ooni/agent/scheduler.py#L129), but for some reason the lockfile is never being released.

I would suggest as a way of debugging this to write some unittests that try to trigger a long running task and then queue a bunch of periodic tasks that acquire the lock on the filesystem and then see if at the end the lockfile is released.

@darkk can you look into this and see if you are able to see why this is happening?

@darkk commented on Fri Sep 30 2016

Everything seems to be OK with locking itself (well, DeferredFilesystemLock is hackish), but:

the RunDeck is stuck due to unhandled error
Scheduler has a tiny fiber-bomb as run-deck-web-full-MD5(@daily) is started every 30 seconds and added to the waiting queue
cloudfront bouncer can't be configured, failing with AttributeError: 'module' object has no attribute 'CANONICAL_BOUNCER_CLOUDFRONTED'
I see a lot of false positives, e.g. http://www.metasploit.org
retries are not accounted in task completion check that looks like bug
Full web test deck and Web test deck without HTTP Invalid Request Line are enabled with checkboxes, but they look like mutually exclusive, so it should be radiobutton in the Web UI
Web UI startup (time to load meaningful webpage) is suspiciously slow, ~10-15s

@hellais commented on Fri Sep 30 2016

Thanks for the good work on looking into this.

Let me address your feedback line by line:

the RunDeck is stuck due to unhandled error (1)

Do you have more information about what the unhandled error could be? Something else worth looking into that could be related is that I have noticed sometimes that doneNetTest (https://github.com/TheTorProject/ooni-probe/blob/master/ooni/nettest.py#L515) is called while some measurements are still running (you notice this because you see the summary and after that you also see the results of individual tests).

This is a sign that the logic for managing if a certain set of tests of measurements is probably a bit broken. This may also be related to (5).

Scheduler has a tiny fiber-bomb as run-deck-web-full-MD5(@daily) is started every 30 seconds and added to the waiting queue (2)

Yes you are right, it's documented inside of some comment if I recall correctly. If we want to be a bit smarter about this we could skip adding to the queue if it already has an element in it. There may be some twisted construct that abstracts this sort of logic already.

cloudfront bouncer can't be configured, failing with AttributeError: 'module' object has no attribute 'CANONICAL_BOUNCER_CLOUDFRONTED' (3)

It's interesting that you see this exception. It's probably an issue with setting the settings of the client to be type: cloudfronted instead of type: cloudfront (see: https://github.com/TheTorProject/ooni-probe/blob/master/ooni/backend_client.py#L287)

Do you have some traceback of this exception?

I see a lot of false positives, e.g. http://www.metasploit.org (4)

What is the reason for censorship being detected? The web_connectivity does produce some false positives, it's a bit better than the situation with the older http_requests test, but there are for sure improvements that can be made to improve it's accuracy.

If you have some example reports of it can you attach it here?

retries are not accounted in task completion check that looks like bug (5)

Yes, this seems like a bug and in general I was hoping that this would end up being addressed as part of the refactor documented in https://github.com/TheTorProject/ooni-probe/issues/563.

That ticket actually may contain some useful information in looking into the refactoring of the scheduler.

Full web test deck and Web test deck without HTTP Invalid Request Line are enabled with checkboxes, but they look like mutually exclusive, so it should be radiobutton in the Web UI

This is a very good point, though it's actually another deck and we will be adding new decks in the future. I will think a bit about how we can support this use case, but make it abstract enough to work also when we have more decks.

Web UI startup (time to load meaningful webpage) is suspiciously slow, ~10-15s

Yes, I have also noticed that sometimes the web UI will start blocking (like HTTP requests will be pending and the UI becomes irresponsive). I fear this is due to some code being blocking and the twisted event loop being stuck.

Is this what you are talking about or are you talking about the startup time?

@darkk commented on Fri Oct 07 2016

the RunDeck is stuck due to unhandled error

ONE of reliable ways to reproduce the issue was to throw an exception in task generator, like

2016-09-30 11:56:32+0000 [DNSDatagramProtocol (UDP)] Running <<class 'ooni.nettest.NetTestCaseWithLocalOptions'> inputs=<generator object inputProcessor at 0x7fe206cbb1e0>> test_web_connectivity
2016-09-30 11:56:32+0000 [-] Unhandled error in Deferred:
2016-09-30 11:56:32+0000 [-] Unhandled Error
        Traceback (most recent call last):
          File "/usr/local/lib/python2.7/dist-packages/ooniprobe-2.0.0rc4.dev0-py2.7.egg/ooni/managers.py", line 117, in schedule
            self._fillSlots()
          File "/usr/local/lib/python2.7/dist-packages/ooniprobe-2.0.0rc4.dev0-py2.7.egg/ooni/managers.py", line 61, in _fillSlots
            d.addCallback(lambda _: self._scheduleNextTask())
          File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 317, in addCallback
            callbackKeywords=kw)
          File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 306, in addCallbacks
            self._runCallbacks()
        --- <exception caught here> ---
          File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 587, in _runCallbacks
            current.result = callback(current.result, *args, **kw)
          File "/usr/local/lib/python2.7/dist-packages/ooniprobe-2.0.0rc4.dev0-py2.7.egg/ooni/managers.py", line 61, in <lambda>
            d.addCallback(lambda _: self._scheduleNextTask())
          File "/usr/local/lib/python2.7/dist-packages/ooniprobe-2.0.0rc4.dev0-py2.7.egg/ooni/managers.py", line 66, in _scheduleNextTask
            task = next(self._tasks)
          File "/usr/local/lib/python2.7/dist-packages/ooniprobe-2.0.0rc4.dev0-py2.7.egg/ooni/nettest.py", line 598, in generateMeasurements
            test_input)
          File "/usr/local/lib/python2.7/dist-packages/ooniprobe-2.0.0rc4.dev0-py2.7.egg/ooni/nettest.py", line 554, in makeMeasurement
            measurement = Measurement(test_instance, test_method, test_input)
          File "/usr/local/lib/python2.7/dist-packages/ooniprobe-2.0.0rc4.dev0-py2.7.egg/ooni/tasks.py", line 114, in __init__
            self.testInstance.setUp()
          File "/usr/local/lib/python2.7/dist-packages/ooniprobe-2.0.0rc4.dev0-py2.7.egg/ooni/nettests/blocking/web_connectivity.py", line 178, in setUp
            raise Exception("Invalid input")
        exceptions.Exception: Invalid input

If the exception is risen iside of generateMeasurements() then start_net_test_loader never progresses as net_test.done is never triggered. I'm unsure than I'm familiar with the code enough to refactor generateMeasurements() in-place, so I decided skip broken measurements emitted by makeMeasurement().

That's not the only reason as I still can reproduce the issue after the hotfix in non-reliable way. I'm still investigating it as it does not give so clean clues like tracebacks :)

doneNetTest is called while some measurements are still running

The easy way to trigger that behaviour is to run a test against TARPIT. twisted has huge timeout to trigger ResponseNeverReceived in this case, something like ~1100 seconds. So the Measurement is not running as TaskWithTimeout tried to commit a suicide, but some offspring deferreds doing retries were kept alive. www.wallpapergate.com works like a TARPIT for me.

It looks like a bug, but that's not a blocker, I'm forking it to separate ticket to document it, but I'm unsure that we'll ever fix it due to upcomming MeasurementKit integration.

If we want to be a bit smarter about fiber-bomb in scheduler we could skip adding to the queue if it already has an element in it.

The smartest thing I can imagine is if we're unable to run a scheduled task for two sequent timeslots, run os.kill(os.getpid(), signal.SIGTERM), if that's three timeslots, run SIGKILL. That's ugly, but that's the best thing we can do against occasional deadlocks. //Let it fail.//

CANONICAL_BOUNCER_CLOUDFRONTED

It looks like miscommunication between ooni-wui & ooni-probe that does not validate submitted form.
You can easily reproduce that running ooni-probe in a fresh VM :) Traceback is boring:

Starting ooniprobe agent.
To view the GUI go to http://127.0.0.1:8842
An error has occurred: b"AttributeError: 'module' object has no attribute 'CANONICAL_BOUNCER_CLOUDFRONTED'"
Please look at log file for more information.

2016-09-30 10:27:48+0000 [-] Traceback (most recent call last):
2016-09-30 10:27:48+0000 [-]   File "/usr/local/bin/ooniprobe-agent", line 11, in <module>
2016-09-30 10:27:48+0000 [-]     load_entry_point('ooniprobe==2.0.0rc4.dev0', 'console_scripts', 'ooniprobe-agent')()
2016-09-30 10:27:48+0000 [-]   File "/usr/local/lib/python2.7/dist-packages/ooniprobe-2.0.0rc4.dev0-py2.7.egg/ooni/scripts/ooniprobe_agent.py", line 217, in run
2016-09-30 10:27:48+0000 [-]     return start_agent(options)
2016-09-30 10:27:48+0000 [-]   File "/usr/local/lib/python2.7/dist-packages/ooniprobe-2.0.0rc4.dev0-py2.7.egg/ooni/scripts/ooniprobe_agent.py", line 101, in start_agent
2016-09-30 10:27:48+0000 [-]     twistd.runApp(twistd_config)
2016-09-30 10:27:48+0000 [-]   File "/usr/local/lib/python2.7/dist-packages/twisted/scripts/twistd.py", line 25, in runApp
2016-09-30 10:27:48+0000 [-]     _SomeApplicationRunner(config).run()
2016-09-30 10:27:48+0000 [-]   File "/usr/local/lib/python2.7/dist-packages/twisted/application/app.py", line 383, in run
2016-09-30 10:27:48+0000 [-]     self.postApplication()
2016-09-30 10:27:48+0000 [-]   File "/usr/local/lib/python2.7/dist-packages/twisted/scripts/_twistd_unix.py", line 248, in postApplication
2016-09-30 10:27:48+0000 [-]     self.startApplication(self.application)
2016-09-30 10:27:48+0000 [-]   File "/usr/local/lib/python2.7/dist-packages/twisted/scripts/_twistd_unix.py", line 444, in startApplication
2016-09-30 10:27:48+0000 [-]     app.startApplication(application, not self.config['no_save'])
2016-09-30 10:27:48+0000 [-]   File "/usr/local/lib/python2.7/dist-packages/twisted/application/app.py", line 664, in startApplication
2016-09-30 10:27:48+0000 [-]     service.IService(application).startService()
2016-09-30 10:27:48+0000 [-]   File "/usr/local/lib/python2.7/dist-packages/twisted/application/service.py", line 283, in startService
2016-09-30 10:27:48+0000 [-]     service.startService()
2016-09-30 10:27:48+0000 [-]   File "/usr/local/lib/python2.7/dist-packages/ooniprobe-2.0.0rc4.dev0-py2.7.egg/ooni/agent/agent.py", line 25, in startService
2016-09-30 10:27:48+0000 [-]     service.MultiService.startService(self)
2016-09-30 10:27:48+0000 [-]   File "/usr/local/lib/python2.7/dist-packages/twisted/application/service.py", line 283, in startService
2016-09-30 10:27:48+0000 [-]     service.startService()
2016-09-30 10:27:48+0000 [-]   File "/usr/local/lib/python2.7/dist-packages/ooniprobe-2.0.0rc4.dev0-py2.7.egg/ooni/agent/scheduler.py", line 435, in startService
2016-09-30 10:27:48+0000 [-]     self.refresh_deck_list()
2016-09-30 10:27:48+0000 [-]   File "/usr/local/lib/python2.7/dist-packages/ooniprobe-2.0.0rc4.dev0-py2.7.egg/ooni/agent/scheduler.py", line 370, in refresh_deck_list
2016-09-30 10:27:48+0000 [-]     for deck_id, deck in deck_store.list_enabled():
2016-09-30 10:27:48+0000 [-]   File "/usr/local/lib/python2.7/dist-packages/ooniprobe-2.0.0rc4.dev0-py2.7.egg/ooni/deck/store.py", line 153, in list_enabled
2016-09-30 10:27:48+0000 [-]     for deck_id, deck in self._list():
2016-09-30 10:27:48+0000 [-]   File "/usr/local/lib/python2.7/dist-packages/ooniprobe-2.0.0rc4.dev0-py2.7.egg/ooni/deck/store.py", line 141, in _list
2016-09-30 10:27:48+0000 [-]     self._update_cache()
2016-09-30 10:27:48+0000 [-]   File "/usr/local/lib/python2.7/dist-packages/ooniprobe-2.0.0rc4.dev0-py2.7.egg/ooni/deck/store.py", line 186, in _update_cache
2016-09-30 10:27:48+0000 [-]     deck_path=self.available_directory.child(deck_path).path
2016-09-30 10:27:48+0000 [-]   File "/usr/local/lib/python2.7/dist-packages/ooniprobe-2.0.0rc4.dev0-py2.7.egg/ooni/deck/deck.py", line 97, in __init__
2016-09-30 10:27:48+0000 [-]     self.open(deck_path)
2016-09-30 10:27:48+0000 [-]   File "/usr/local/lib/python2.7/dist-packages/ooniprobe-2.0.0rc4.dev0-py2.7.egg/ooni/deck/deck.py", line 105, in open
2016-09-30 10:27:48+0000 [-]     self.load(deck_data, global_options)
2016-09-30 10:27:48+0000 [-]   File "/usr/local/lib/python2.7/dist-packages/ooniprobe-2.0.0rc4.dev0-py2.7.egg/ooni/deck/deck.py", line 120, in load
2016-09-30 10:27:48+0000 [-]     self.bouncer = get_preferred_bouncer()
2016-09-30 10:27:48+0000 [-]   File "/usr/local/lib/python2.7/dist-packages/ooniprobe-2.0.0rc4.dev0-py2.7.egg/ooni/backend_client.py", line 284, in get_preferred_bouncer
2016-09-30 10:27:48+0000 [-]     preferred_backend.upper()
2016-09-30 10:27:48+0000 [-] AttributeError: 'module' object has no attribute 'CANONICAL_BOUNCER_CLOUDFRONTED'

It's actually another deck and we will be adding new decks in the future.

IMHO, making decks "orthogonal" without any intersection between them is a way to go. I would love to hear a reasoning to avoid that in ooni-probe :)

sometimes the web UI will start blocking

OK, forking that to separate ticket.

@hellais commented on Fri Oct 07 2016

so I decided skip broken measurements emitted by makeMeasurement().

Yes this makes sense. I suggested a slightly better fix for this in the PR.

That's not the only reason as I still can reproduce the issue after the hotfix in non-reliable way. I'm still investigating it as it does not give so clean clues like tracebacks :)

Damn, what steps are you following to reproduce this bug?

It looks like a bug, but that's not a blocker, I'm forking it to separate ticket to document it, but I'm unsure that we'll ever fix it due to upcomming MeasurementKit integration.

Yes that seems like a wise thing to do. I agree it's not a blocker and we probably want to fix it as part of another issue (https://github.com/TheTorProject/ooni-probe/issues/618).

The smartest thing I can imagine is if we're unable to run a scheduled task for two sequent timeslots, run os.kill(os.getpid(), signal.SIGTERM), if that's three timeslots, run SIGKILL. That's ugly, but that's the best thing we can do against occasional deadlocks. //Let it fail.//

Hum, I'm unsure this is a good behaviour. The problem here is that actually the queue of scheduled tasks will in normal cases accumulate a very big backlog of tasks waiting on the lock (that is because the normal deck takes 1 hour to run and the scheduler will try to reschedule them and wait on the lock every 30 seconds).

So I think the problems here are actually 2:

How do we handle the situation where we have a deadlock and we will never get back from it. I think here maybe setting an upper-bound based on time (maybe the period of the scheduled task) and if by that time we are still waiting kill the process with fire, as you suggest above.
How can we avoid having a huge backlog of tasks waiting on the lock. This could be addressed with what I was describing above.

It looks like miscommunication between ooni-wui & ooni-probe that does not validate submitted form.

Damn you are right.

Filed this issue about it: OpenObservatory/ooni-wui#9

IMHO, making decks "orthogonal" without any intersection between them is a way to go.

Yes this makes sense. It does however make the configuration of decks slightly harder, but I think in general this is a good approach. Maybe what we can do is put the "risky" test (http-invalid-request-line) inside of it's own deck by itself and the web-connectivity one on it's own.

OK, forking that to separate ticket.

https://github.com/TheTorProject/ooni-probe/issues/619

@darkk commented on Tue Oct 11 2016

I decided to avoid suicide-on-deadlock as there are some task deferred to other threads and I'm unsure if ooni-probe is robust enough to handle interruptions and damaged data, so I don't consider wise to take a risk introducing such a questionable feature during RC stage.

@hellais commented on Sat Oct 15 2016

So it seems like issues related to this are not entirely solved.

When going through the process of updating from a lepidopter 1.6.1 to the latest 2.0.0 I was able to reproduce the permanent locking of the scheduler again...

Here is a relevant traceback:

2016-10-15 20:43:55,874081+0000 [Uninitialized] [!] Failed to run run-deck-http-invalid-8af7c3a6d7aec94e9ecaf3e329fb20ca
2016-10-15 20:43:55,875412+0000 [Uninitialized] Unhandled Error
        Traceback (most recent call last):
          File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 434, in errback
            self._startRunCallbacks(fail)
          File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 501, in _startRunCallbacks
            self._runCallbacks()
          File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 587, in _runCallbacks
            current.result = callback(current.result, *args, **kw)
          File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1241, in gotResult
            _inlineCallbacks(r, g, deferred)
        --- <exception caught here> ---
          File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1183, in _inlineCallbacks
            result = result.throwExceptionIntoGenerator(g)
          File "/usr/local/lib/python2.7/dist-packages/twisted/python/failure.py", line 389, in throwExceptionIntoGenerator
            return g.throw(self.type, self.value, self.tb)
          File "/usr/local/lib/python2.7/dist-packages/ooniprobe-2.0.0-py2.7.egg/ooni/agent/scheduler.py", line 139, in run
            yield self.task()
          File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1183, in _inlineCallbacks
            result = result.throwExceptionIntoGenerator(g)
          File "/usr/local/lib/python2.7/dist-packages/twisted/python/failure.py", line 389, in throwExceptionIntoGenerator
            return g.throw(self.type, self.value, self.tb)
          File "/usr/local/lib/python2.7/dist-packages/ooniprobe-2.0.0-py2.7.egg/ooni/agent/scheduler.py", line 281, in task
            yield deck.run(self.director)
          File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1183, in _inlineCallbacks
            result = result.throwExceptionIntoGenerator(g)
          File "/usr/local/lib/python2.7/dist-packages/twisted/python/failure.py", line 389, in throwExceptionIntoGenerator
            return g.throw(self.type, self.value, self.tb)
          File "/usr/local/lib/python2.7/dist-packages/ooniprobe-2.0.0-py2.7.egg/ooni/deck/deck.py", line 276, in run
            yield self.query_bouncer()
          File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1183, in _inlineCallbacks
            result = result.throwExceptionIntoGenerator(g)
          File "/usr/local/lib/python2.7/dist-packages/twisted/python/failure.py", line 389, in throwExceptionIntoGenerator
            return g.throw(self.type, self.value, self.tb)
          File "/usr/local/lib/python2.7/dist-packages/ooniprobe-2.0.0-py2.7.egg/ooni/deck/deck.py", line 193, in query_bouncer
            self.no_collector
          File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1183, in _inlineCallbacks
            result = result.throwExceptionIntoGenerator(g)
          File "/usr/local/lib/python2.7/dist-packages/twisted/python/failure.py", line 389, in throwExceptionIntoGenerator
            return g.throw(self.type, self.value, self.tb)
          File "/usr/local/lib/python2.7/dist-packages/ooniprobe-2.0.0-py2.7.egg/ooni/deck/backend.py", line 150, in lookup_collector_and_test_helpers
            response = yield bouncer.lookupTestCollector(required_nettests)
          File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1183, in _inlineCallbacks
            result = result.throwExceptionIntoGenerator(g)
          File "/usr/local/lib/python2.7/dist-packages/twisted/python/failure.py", line 389, in throwExceptionIntoGenerator
            return g.throw(self.type, self.value, self.tb)
          File "/usr/local/lib/python2.7/dist-packages/ooniprobe-2.0.0-py2.7.egg/ooni/backend_client.py", line 179, in lookupTestCollector
            raise e.CouldNotFindTestCollector
        ooni.errors.CouldNotFindTestCollector:

When this happens all the lockfiles are permanently locked leading to not scheduling of measurements.

This doesn't happen when I install it fresh on other systems, so I am not yet sure exactly why it's happening on this pi.

Edit:
deleting the lockfiles doesn't trigger re-running of the decks meaning that it's actually waiting on the mutex lock and not the filesystem lock.
Restarting the ooniprobe service resolves the issue.

@anadahz commented on Sun Oct 16 2016

Shall we introduce a newer update recipe to fix https://github.com/TheTorProject/ooni-probe/issues/612#issuecomment-254012251 ?

@hellais Have you tried reproducing this to other non Pi non fresh systems (upgrade, instead of fresh install)?

@hellais commented on Sun Oct 16 2016

LOL.

I realised that instead of commenting I edited your comment.

Posting it here are restoring the original.

Shall we introduce a newer update recipe to fix https://github.com/TheTorProject/ooni-probe/issues/612#issuecomment-254012251 ?

I am going to monitor how many lepidopters update today and if it seems like they all went offline I will push a new updated that sleeps a bit restarts the service.

@hellais Have you tried reproducing this to other non Pi non fresh systems (upgrade, instead of fresh install)?

No luck so far reproducing it on non raspberry pi deployments. It could also have been a temporary issue.

@darkk commented on Wed Oct 19 2016

are not entirely solved

Yep, AFAIR, we discussed that the scheduler is not exception-safe still. I'm going to fix it.

@hellais commented on Mon Dec 05 2016

What is the status of this? Is this still an issue? I have not seen this happening anymore since more recent releases.

The text was updated successfully, but these errors were encountered:

hellais mentioned this issue Sep 13, 2018

Scheduler lockfile for RunDeck task not being released ooni/probe#612

Closed

hellais added the bug Something isn't working label Nov 13, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Scheduler lockfile for RunDeck task not being released #26

Scheduler lockfile for RunDeck task not being released #26

hellais commented Sep 13, 2018

Scheduler lockfile for RunDeck task not being released #26

Scheduler lockfile for RunDeck task not being released #26

Comments

hellais commented Sep 13, 2018