You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I would suggest as a way of debugging this to write some unittests that try to trigger a long running task and then queue a bunch of periodic tasks that acquire the lock on the filesystem and then see if at the end the lockfile is released.
@darkk can you look into this and see if you are able to see why this is happening?
Everything seems to be OK with locking itself (well, DeferredFilesystemLock is hackish), but:
the RunDeck is stuck due to unhandled error
Scheduler has a tiny fiber-bomb as run-deck-web-full-MD5(@daily) is started every 30 seconds and added to the waiting queue
cloudfront bouncer can't be configured, failing with AttributeError: 'module' object has no attribute 'CANONICAL_BOUNCER_CLOUDFRONTED'
I see a lot of false positives, e.g. http://www.metasploit.org
retries are not accounted in task completion check that looks like bug
Full web test deck and Web test deck without HTTP Invalid Request Line are enabled with checkboxes, but they look like mutually exclusive, so it should be radiobutton in the Web UI
Web UI startup (time to load meaningful webpage) is suspiciously slow, ~10-15s
Do you have more information about what the unhandled error could be? Something else worth looking into that could be related is that I have noticed sometimes that doneNetTest (https://github.com/TheTorProject/ooni-probe/blob/master/ooni/nettest.py#L515) is called while some measurements are still running (you notice this because you see the summary and after that you also see the results of individual tests).
This is a sign that the logic for managing if a certain set of tests of measurements is probably a bit broken. This may also be related to (5).
Scheduler has a tiny fiber-bomb as run-deck-web-full-MD5(@daily) is started every 30 seconds and added to the waiting queue (2)
Yes you are right, it's documented inside of some comment if I recall correctly. If we want to be a bit smarter about this we could skip adding to the queue if it already has an element in it. There may be some twisted construct that abstracts this sort of logic already.
cloudfront bouncer can't be configured, failing with AttributeError: 'module' object has no attribute 'CANONICAL_BOUNCER_CLOUDFRONTED' (3)
What is the reason for censorship being detected? The web_connectivity does produce some false positives, it's a bit better than the situation with the older http_requests test, but there are for sure improvements that can be made to improve it's accuracy.
If you have some example reports of it can you attach it here?
retries are not accounted in task completion check that looks like bug (5)
That ticket actually may contain some useful information in looking into the refactoring of the scheduler.
Full web test deck and Web test deck without HTTP Invalid Request Line are enabled with checkboxes, but they look like mutually exclusive, so it should be radiobutton in the Web UI
This is a very good point, though it's actually another deck and we will be adding new decks in the future. I will think a bit about how we can support this use case, but make it abstract enough to work also when we have more decks.
Web UI startup (time to load meaningful webpage) is suspiciously slow, ~10-15s
Yes, I have also noticed that sometimes the web UI will start blocking (like HTTP requests will be pending and the UI becomes irresponsive). I fear this is due to some code being blocking and the twisted event loop being stuck.
Is this what you are talking about or are you talking about the startup time?
ONE of reliable ways to reproduce the issue was to throw an exception in task generator, like
2016-09-30 11:56:32+0000 [DNSDatagramProtocol (UDP)] Running <<class 'ooni.nettest.NetTestCaseWithLocalOptions'> inputs=<generator object inputProcessor at 0x7fe206cbb1e0>> test_web_connectivity
2016-09-30 11:56:32+0000 [-] Unhandled error in Deferred:
2016-09-30 11:56:32+0000 [-] Unhandled Error
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/ooniprobe-2.0.0rc4.dev0-py2.7.egg/ooni/managers.py", line 117, in schedule
self._fillSlots()
File "/usr/local/lib/python2.7/dist-packages/ooniprobe-2.0.0rc4.dev0-py2.7.egg/ooni/managers.py", line 61, in _fillSlots
d.addCallback(lambda _: self._scheduleNextTask())
File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 317, in addCallback
callbackKeywords=kw)
File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 306, in addCallbacks
self._runCallbacks()
--- <exception caught here> ---
File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 587, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/usr/local/lib/python2.7/dist-packages/ooniprobe-2.0.0rc4.dev0-py2.7.egg/ooni/managers.py", line 61, in <lambda>
d.addCallback(lambda _: self._scheduleNextTask())
File "/usr/local/lib/python2.7/dist-packages/ooniprobe-2.0.0rc4.dev0-py2.7.egg/ooni/managers.py", line 66, in _scheduleNextTask
task = next(self._tasks)
File "/usr/local/lib/python2.7/dist-packages/ooniprobe-2.0.0rc4.dev0-py2.7.egg/ooni/nettest.py", line 598, in generateMeasurements
test_input)
File "/usr/local/lib/python2.7/dist-packages/ooniprobe-2.0.0rc4.dev0-py2.7.egg/ooni/nettest.py", line 554, in makeMeasurement
measurement = Measurement(test_instance, test_method, test_input)
File "/usr/local/lib/python2.7/dist-packages/ooniprobe-2.0.0rc4.dev0-py2.7.egg/ooni/tasks.py", line 114, in __init__
self.testInstance.setUp()
File "/usr/local/lib/python2.7/dist-packages/ooniprobe-2.0.0rc4.dev0-py2.7.egg/ooni/nettests/blocking/web_connectivity.py", line 178, in setUp
raise Exception("Invalid input")
exceptions.Exception: Invalid input
If the exception is risen iside of generateMeasurements() then start_net_test_loader never progresses as net_test.done is never triggered. I'm unsure than I'm familiar with the code enough to refactor generateMeasurements() in-place, so I decided skip broken measurements emitted by makeMeasurement().
That's not the only reason as I still can reproduce the issue after the hotfix in non-reliable way. I'm still investigating it as it does not give so clean clues like tracebacks :)
doneNetTest is called while some measurements are still running
The easy way to trigger that behaviour is to run a test against TARPIT. twisted has huge timeout to trigger ResponseNeverReceived in this case, something like ~1100 seconds. So the Measurement is not running as TaskWithTimeout tried to commit a suicide, but some offspring deferreds doing retries were kept alive. www.wallpapergate.com works like a TARPIT for me.
It looks like a bug, but that's not a blocker, I'm forking it to separate ticket to document it, but I'm unsure that we'll ever fix it due to upcomming MeasurementKit integration.
If we want to be a bit smarter about fiber-bomb in scheduler we could skip adding to the queue if it already has an element in it.
The smartest thing I can imagine is if we're unable to run a scheduled task for two sequent timeslots, run os.kill(os.getpid(), signal.SIGTERM), if that's three timeslots, run SIGKILL. That's ugly, but that's the best thing we can do against occasional deadlocks. //Let it fail.//
CANONICAL_BOUNCER_CLOUDFRONTED
It looks like miscommunication between ooni-wui & ooni-probe that does not validate submitted form.
You can easily reproduce that running ooni-probe in a fresh VM :) Traceback is boring:
Starting ooniprobe agent.
To view the GUI go to http://127.0.0.1:8842
An error has occurred: b"AttributeError: 'module' object has no attribute 'CANONICAL_BOUNCER_CLOUDFRONTED'"
Please look at log file for more information.
2016-09-30 10:27:48+0000 [-] Traceback (most recent call last):
2016-09-30 10:27:48+0000 [-] File "/usr/local/bin/ooniprobe-agent", line 11, in <module>
2016-09-30 10:27:48+0000 [-] load_entry_point('ooniprobe==2.0.0rc4.dev0', 'console_scripts', 'ooniprobe-agent')()
2016-09-30 10:27:48+0000 [-] File "/usr/local/lib/python2.7/dist-packages/ooniprobe-2.0.0rc4.dev0-py2.7.egg/ooni/scripts/ooniprobe_agent.py", line 217, in run
2016-09-30 10:27:48+0000 [-] return start_agent(options)
2016-09-30 10:27:48+0000 [-] File "/usr/local/lib/python2.7/dist-packages/ooniprobe-2.0.0rc4.dev0-py2.7.egg/ooni/scripts/ooniprobe_agent.py", line 101, in start_agent
2016-09-30 10:27:48+0000 [-] twistd.runApp(twistd_config)
2016-09-30 10:27:48+0000 [-] File "/usr/local/lib/python2.7/dist-packages/twisted/scripts/twistd.py", line 25, in runApp
2016-09-30 10:27:48+0000 [-] _SomeApplicationRunner(config).run()
2016-09-30 10:27:48+0000 [-] File "/usr/local/lib/python2.7/dist-packages/twisted/application/app.py", line 383, in run
2016-09-30 10:27:48+0000 [-] self.postApplication()
2016-09-30 10:27:48+0000 [-] File "/usr/local/lib/python2.7/dist-packages/twisted/scripts/_twistd_unix.py", line 248, in postApplication
2016-09-30 10:27:48+0000 [-] self.startApplication(self.application)
2016-09-30 10:27:48+0000 [-] File "/usr/local/lib/python2.7/dist-packages/twisted/scripts/_twistd_unix.py", line 444, in startApplication
2016-09-30 10:27:48+0000 [-] app.startApplication(application, not self.config['no_save'])
2016-09-30 10:27:48+0000 [-] File "/usr/local/lib/python2.7/dist-packages/twisted/application/app.py", line 664, in startApplication
2016-09-30 10:27:48+0000 [-] service.IService(application).startService()
2016-09-30 10:27:48+0000 [-] File "/usr/local/lib/python2.7/dist-packages/twisted/application/service.py", line 283, in startService
2016-09-30 10:27:48+0000 [-] service.startService()
2016-09-30 10:27:48+0000 [-] File "/usr/local/lib/python2.7/dist-packages/ooniprobe-2.0.0rc4.dev0-py2.7.egg/ooni/agent/agent.py", line 25, in startService
2016-09-30 10:27:48+0000 [-] service.MultiService.startService(self)
2016-09-30 10:27:48+0000 [-] File "/usr/local/lib/python2.7/dist-packages/twisted/application/service.py", line 283, in startService
2016-09-30 10:27:48+0000 [-] service.startService()
2016-09-30 10:27:48+0000 [-] File "/usr/local/lib/python2.7/dist-packages/ooniprobe-2.0.0rc4.dev0-py2.7.egg/ooni/agent/scheduler.py", line 435, in startService
2016-09-30 10:27:48+0000 [-] self.refresh_deck_list()
2016-09-30 10:27:48+0000 [-] File "/usr/local/lib/python2.7/dist-packages/ooniprobe-2.0.0rc4.dev0-py2.7.egg/ooni/agent/scheduler.py", line 370, in refresh_deck_list
2016-09-30 10:27:48+0000 [-] for deck_id, deck in deck_store.list_enabled():
2016-09-30 10:27:48+0000 [-] File "/usr/local/lib/python2.7/dist-packages/ooniprobe-2.0.0rc4.dev0-py2.7.egg/ooni/deck/store.py", line 153, in list_enabled
2016-09-30 10:27:48+0000 [-] for deck_id, deck in self._list():
2016-09-30 10:27:48+0000 [-] File "/usr/local/lib/python2.7/dist-packages/ooniprobe-2.0.0rc4.dev0-py2.7.egg/ooni/deck/store.py", line 141, in _list
2016-09-30 10:27:48+0000 [-] self._update_cache()
2016-09-30 10:27:48+0000 [-] File "/usr/local/lib/python2.7/dist-packages/ooniprobe-2.0.0rc4.dev0-py2.7.egg/ooni/deck/store.py", line 186, in _update_cache
2016-09-30 10:27:48+0000 [-] deck_path=self.available_directory.child(deck_path).path
2016-09-30 10:27:48+0000 [-] File "/usr/local/lib/python2.7/dist-packages/ooniprobe-2.0.0rc4.dev0-py2.7.egg/ooni/deck/deck.py", line 97, in __init__
2016-09-30 10:27:48+0000 [-] self.open(deck_path)
2016-09-30 10:27:48+0000 [-] File "/usr/local/lib/python2.7/dist-packages/ooniprobe-2.0.0rc4.dev0-py2.7.egg/ooni/deck/deck.py", line 105, in open
2016-09-30 10:27:48+0000 [-] self.load(deck_data, global_options)
2016-09-30 10:27:48+0000 [-] File "/usr/local/lib/python2.7/dist-packages/ooniprobe-2.0.0rc4.dev0-py2.7.egg/ooni/deck/deck.py", line 120, in load
2016-09-30 10:27:48+0000 [-] self.bouncer = get_preferred_bouncer()
2016-09-30 10:27:48+0000 [-] File "/usr/local/lib/python2.7/dist-packages/ooniprobe-2.0.0rc4.dev0-py2.7.egg/ooni/backend_client.py", line 284, in get_preferred_bouncer
2016-09-30 10:27:48+0000 [-] preferred_backend.upper()
2016-09-30 10:27:48+0000 [-] AttributeError: 'module' object has no attribute 'CANONICAL_BOUNCER_CLOUDFRONTED'
It's actually another deck and we will be adding new decks in the future.
IMHO, making decks "orthogonal" without any intersection between them is a way to go. I would love to hear a reasoning to avoid that in ooni-probe :)
so I decided skip broken measurements emitted by makeMeasurement().
Yes this makes sense. I suggested a slightly better fix for this in the PR.
That's not the only reason as I still can reproduce the issue after the hotfix in non-reliable way. I'm still investigating it as it does not give so clean clues like tracebacks :)
Damn, what steps are you following to reproduce this bug?
It looks like a bug, but that's not a blocker, I'm forking it to separate ticket to document it, but I'm unsure that we'll ever fix it due to upcomming MeasurementKit integration.
The smartest thing I can imagine is if we're unable to run a scheduled task for two sequent timeslots, run os.kill(os.getpid(), signal.SIGTERM), if that's three timeslots, run SIGKILL. That's ugly, but that's the best thing we can do against occasional deadlocks. //Let it fail.//
Hum, I'm unsure this is a good behaviour. The problem here is that actually the queue of scheduled tasks will in normal cases accumulate a very big backlog of tasks waiting on the lock (that is because the normal deck takes 1 hour to run and the scheduler will try to reschedule them and wait on the lock every 30 seconds).
So I think the problems here are actually 2:
How do we handle the situation where we have a deadlock and we will never get back from it. I think here maybe setting an upper-bound based on time (maybe the period of the scheduled task) and if by that time we are still waiting kill the process with fire, as you suggest above.
How can we avoid having a huge backlog of tasks waiting on the lock. This could be addressed with what I was describing above.
It looks like miscommunication between ooni-wui & ooni-probe that does not validate submitted form.
IMHO, making decks "orthogonal" without any intersection between them is a way to go.
Yes this makes sense. It does however make the configuration of decks slightly harder, but I think in general this is a good approach. Maybe what we can do is put the "risky" test (http-invalid-request-line) inside of it's own deck by itself and the web-connectivity one on it's own.
I decided to avoid suicide-on-deadlock as there are some task deferred to other threads and I'm unsure if ooni-probe is robust enough to handle interruptions and damaged data, so I don't consider wise to take a risk introducing such a questionable feature during RC stage.
So it seems like issues related to this are not entirely solved.
When going through the process of updating from a lepidopter 1.6.1 to the latest 2.0.0 I was able to reproduce the permanent locking of the scheduler again...
Here is a relevant traceback:
2016-10-15 20:43:55,874081+0000 [Uninitialized] [!] Failed to run run-deck-http-invalid-8af7c3a6d7aec94e9ecaf3e329fb20ca
2016-10-15 20:43:55,875412+0000 [Uninitialized] Unhandled Error
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 434, in errback
self._startRunCallbacks(fail)
File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 501, in _startRunCallbacks
self._runCallbacks()
File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 587, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1241, in gotResult
_inlineCallbacks(r, g, deferred)
--- <exception caught here> ---
File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1183, in _inlineCallbacks
result = result.throwExceptionIntoGenerator(g)
File "/usr/local/lib/python2.7/dist-packages/twisted/python/failure.py", line 389, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "/usr/local/lib/python2.7/dist-packages/ooniprobe-2.0.0-py2.7.egg/ooni/agent/scheduler.py", line 139, in run
yield self.task()
File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1183, in _inlineCallbacks
result = result.throwExceptionIntoGenerator(g)
File "/usr/local/lib/python2.7/dist-packages/twisted/python/failure.py", line 389, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "/usr/local/lib/python2.7/dist-packages/ooniprobe-2.0.0-py2.7.egg/ooni/agent/scheduler.py", line 281, in task
yield deck.run(self.director)
File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1183, in _inlineCallbacks
result = result.throwExceptionIntoGenerator(g)
File "/usr/local/lib/python2.7/dist-packages/twisted/python/failure.py", line 389, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "/usr/local/lib/python2.7/dist-packages/ooniprobe-2.0.0-py2.7.egg/ooni/deck/deck.py", line 276, in run
yield self.query_bouncer()
File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1183, in _inlineCallbacks
result = result.throwExceptionIntoGenerator(g)
File "/usr/local/lib/python2.7/dist-packages/twisted/python/failure.py", line 389, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "/usr/local/lib/python2.7/dist-packages/ooniprobe-2.0.0-py2.7.egg/ooni/deck/deck.py", line 193, in query_bouncer
self.no_collector
File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1183, in _inlineCallbacks
result = result.throwExceptionIntoGenerator(g)
File "/usr/local/lib/python2.7/dist-packages/twisted/python/failure.py", line 389, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "/usr/local/lib/python2.7/dist-packages/ooniprobe-2.0.0-py2.7.egg/ooni/deck/backend.py", line 150, in lookup_collector_and_test_helpers
response = yield bouncer.lookupTestCollector(required_nettests)
File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1183, in _inlineCallbacks
result = result.throwExceptionIntoGenerator(g)
File "/usr/local/lib/python2.7/dist-packages/twisted/python/failure.py", line 389, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "/usr/local/lib/python2.7/dist-packages/ooniprobe-2.0.0-py2.7.egg/ooni/backend_client.py", line 179, in lookupTestCollector
raise e.CouldNotFindTestCollector
ooni.errors.CouldNotFindTestCollector:
When this happens all the lockfiles are permanently locked leading to not scheduling of measurements.
This doesn't happen when I install it fresh on other systems, so I am not yet sure exactly why it's happening on this pi.
Edit:
deleting the lockfiles doesn't trigger re-running of the decks meaning that it's actually waiting on the mutex lock and not the filesystem lock.
Restarting the ooniprobe service resolves the issue.
I am going to monitor how many lepidopters update today and if it seems like they all went offline I will push a new updated that sleeps a bit restarts the service.
@hellais Have you tried reproducing this to other non Pi non fresh systems (upgrade, instead of fresh install)?
No luck so far reproducing it on non raspberry pi deployments. It could also have been a temporary issue.
@hellais commented on Mon Sep 26 2016
It seems like there is an issue in the releasing of the lockfile for the scheduler of the
RunDeck
task.This was found while looking into the issue related to https://github.com/TheTorProject/ooni-probe/issues/606.
As a result of this decks are only run once and then they are not re-triggered daily due to the lockfile still existing on disk.
The relevant code is: https://github.com/TheTorProject/ooni-probe/blob/master/ooni/agent/scheduler.py#L24.
In debugging this I noticed that the last run time is being updated properly (https://github.com/TheTorProject/ooni-probe/blob/master/ooni/agent/scheduler.py#L129), but for some reason the lockfile is never being released.
I would suggest as a way of debugging this to write some unittests that try to trigger a long running task and then queue a bunch of periodic tasks that acquire the lock on the filesystem and then see if at the end the lockfile is released.
@darkk can you look into this and see if you are able to see why this is happening?
@darkk commented on Fri Sep 30 2016
Everything seems to be OK with locking itself (well,
DeferredFilesystemLock
is hackish), but:run-deck-web-full-MD5(@daily)
is started every 30 seconds and added to the waiting queueAttributeError: 'module' object has no attribute 'CANONICAL_BOUNCER_CLOUDFRONTED'
http://www.metasploit.org
@hellais commented on Fri Sep 30 2016
Thanks for the good work on looking into this.
Let me address your feedback line by line:
Do you have more information about what the unhandled error could be? Something else worth looking into that could be related is that I have noticed sometimes that
doneNetTest
(https://github.com/TheTorProject/ooni-probe/blob/master/ooni/nettest.py#L515) is called while some measurements are still running (you notice this because you see the summary and after that you also see the results of individual tests).This is a sign that the logic for managing if a certain set of tests of measurements is probably a bit broken. This may also be related to (5).
Yes you are right, it's documented inside of some comment if I recall correctly. If we want to be a bit smarter about this we could skip adding to the queue if it already has an element in it. There may be some twisted construct that abstracts this sort of logic already.
It's interesting that you see this exception. It's probably an issue with setting the
settings
of the client to betype: cloudfronted
instead oftype: cloudfront
(see: https://github.com/TheTorProject/ooni-probe/blob/master/ooni/backend_client.py#L287)Do you have some traceback of this exception?
What is the reason for censorship being detected? The web_connectivity does produce some false positives, it's a bit better than the situation with the older http_requests test, but there are for sure improvements that can be made to improve it's accuracy.
If you have some example reports of it can you attach it here?
Yes, this seems like a bug and in general I was hoping that this would end up being addressed as part of the refactor documented in https://github.com/TheTorProject/ooni-probe/issues/563.
That ticket actually may contain some useful information in looking into the refactoring of the scheduler.
This is a very good point, though it's actually another deck and we will be adding new decks in the future. I will think a bit about how we can support this use case, but make it abstract enough to work also when we have more decks.
Yes, I have also noticed that sometimes the web UI will start blocking (like HTTP requests will be pending and the UI becomes irresponsive). I fear this is due to some code being blocking and the twisted event loop being stuck.
Is this what you are talking about or are you talking about the startup time?
@darkk commented on Fri Oct 07 2016
ONE of reliable ways to reproduce the issue was to throw an exception in task generator, like
If the exception is risen iside of
generateMeasurements()
thenstart_net_test_loader
never progresses asnet_test.done
is never triggered. I'm unsure than I'm familiar with the code enough to refactorgenerateMeasurements()
in-place, so I decided skip broken measurements emitted bymakeMeasurement()
.That's not the only reason as I still can reproduce the issue after the hotfix in non-reliable way. I'm still investigating it as it does not give so clean clues like tracebacks :)
The easy way to trigger that behaviour is to run a test against TARPIT. twisted has huge timeout to trigger
ResponseNeverReceived
in this case, something like ~1100 seconds. So theMeasurement
is not running asTaskWithTimeout
tried to commit a suicide, but some offspring deferreds doing retries were kept alive.www.wallpapergate.com
works like a TARPIT for me.It looks like a bug, but that's not a blocker, I'm forking it to separate ticket to document it, but I'm unsure that we'll ever fix it due to upcomming MeasurementKit integration.
The smartest thing I can imagine is
if we're unable to run a scheduled task for two sequent timeslots, run os.kill(os.getpid(), signal.SIGTERM), if that's three timeslots, run SIGKILL
. That's ugly, but that's the best thing we can do against occasional deadlocks. //Let it fail.//It looks like miscommunication between ooni-wui & ooni-probe that does not validate submitted form.
You can easily reproduce that running ooni-probe in a fresh VM :) Traceback is boring:
IMHO, making decks "orthogonal" without any intersection between them is a way to go. I would love to hear a reasoning to avoid that in ooni-probe :)
OK, forking that to separate ticket.
@hellais commented on Fri Oct 07 2016
Yes this makes sense. I suggested a slightly better fix for this in the PR.
Damn, what steps are you following to reproduce this bug?
Yes that seems like a wise thing to do. I agree it's not a blocker and we probably want to fix it as part of another issue (https://github.com/TheTorProject/ooni-probe/issues/618).
Hum, I'm unsure this is a good behaviour. The problem here is that actually the queue of scheduled tasks will in normal cases accumulate a very big backlog of tasks waiting on the lock (that is because the normal deck takes 1 hour to run and the scheduler will try to reschedule them and wait on the lock every 30 seconds).
So I think the problems here are actually 2:
Damn you are right.
Filed this issue about it: OpenObservatory/ooni-wui#9
Yes this makes sense. It does however make the configuration of decks slightly harder, but I think in general this is a good approach. Maybe what we can do is put the "risky" test (http-invalid-request-line) inside of it's own deck by itself and the web-connectivity one on it's own.
https://github.com/TheTorProject/ooni-probe/issues/619
@darkk commented on Tue Oct 11 2016
I decided to avoid suicide-on-deadlock as there are some task deferred to other threads and I'm unsure if ooni-probe is robust enough to handle interruptions and damaged data, so I don't consider wise to take a risk introducing such a questionable feature during RC stage.
@hellais commented on Sat Oct 15 2016
So it seems like issues related to this are not entirely solved.
When going through the process of updating from a lepidopter 1.6.1 to the latest 2.0.0 I was able to reproduce the permanent locking of the scheduler again...
Here is a relevant traceback:
When this happens all the lockfiles are permanently locked leading to not scheduling of measurements.
This doesn't happen when I install it fresh on other systems, so I am not yet sure exactly why it's happening on this pi.
Edit:
deleting the lockfiles doesn't trigger re-running of the decks meaning that it's actually waiting on the mutex lock and not the filesystem lock.
Restarting the ooniprobe service resolves the issue.
@anadahz commented on Sun Oct 16 2016
Shall we introduce a newer update recipe to fix https://github.com/TheTorProject/ooni-probe/issues/612#issuecomment-254012251 ?
@hellais Have you tried reproducing this to other non Pi non fresh systems (upgrade, instead of fresh install)?
@hellais commented on Sun Oct 16 2016
LOL.
I realised that instead of commenting I edited your comment.
Posting it here are restoring the original.
I am going to monitor how many lepidopters update today and if it seems like they all went offline I will push a new updated that sleeps a bit restarts the service.
No luck so far reproducing it on non raspberry pi deployments. It could also have been a temporary issue.
@darkk commented on Wed Oct 19 2016
Yep, AFAIR, we discussed that the scheduler is not exception-safe still. I'm going to fix it.
@hellais commented on Mon Dec 05 2016
What is the status of this? Is this still an issue? I have not seen this happening anymore since more recent releases.
The text was updated successfully, but these errors were encountered: