test_dont_request_on_many_short_tasks CI failure #41
To summarize the error above, it expects there to be fewer than 2 workers, but there are exactly 2.
Seems this is not really reproducible. 😕 Restarting made this error go away. Edit: That said, it has repeated a few times. So it is not perfectly reproducible, but is still some kind of issue.
Saw a very similar error on Python 2.7 without the […]

Test failure snippet:

```
____________________ test_dont_request_on_many_short_tasks _____________________

loop = <tornado.platform.epoll.EPollIOLoop object at 0x7fbf88ed9470>

    def test_dont_request_on_many_short_tasks(loop):
        with SGECluster(scheduler_port=0) as cluster:
            adapt = Adaptive(cluster, interval=50, startup_cost=10)
            with Client(cluster, loop=loop) as client:
                cluster.scheduler.task_duration['slowinc'] = 0.001
                futures = client.map(slowinc, range(1000), delay=0.001)
                while not cluster.scheduler.workers:
                    sleep(0.01)
                for i in range(20):
                    sleep(0.1)
>                   assert len(cluster.workers) < 2
E                   AssertionError: assert 2 < 2
E                    +  where 2 = len({'15.1': WorkerSpec(job_id='15.1', kwargs={'nativeSpecification': '', 'cpus': 1, 'memory': None, 'memory_fraction': 0...., 'memory': None, 'memory_fraction': 0.5}, stdout='/dask-drmaa/worker.16.1.out', stderr='/dask-drmaa/worker.16.1.err')})
E                    +  where {'15.1': WorkerSpec(job_id='15.1', kwargs={'nativeSpecification': '', 'cpus': 1, 'memory': None, 'memory_fraction': 0...., 'memory': None, 'memory_fraction': 0.5}, stdout='/dask-drmaa/worker.16.1.out', stderr='/dask-drmaa/worker.16.1.err')} = <SGECluster: 2 workers>.workers

dask_drmaa/tests/test_adaptive.py:105: AssertionError
----------------------------- Captured stderr call -----------------------------
tornado.application - ERROR - Future <tornado.concurrent.Future object at 0x7fbf89076f60> exception was never retrieved: Traceback (most recent call last):
  File "/opt/anaconda/lib/python3.6/site-packages/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/opt/anaconda/lib/python3.6/site-packages/distributed/comm/tcp.py", line 309, in connect
    **kwargs)
  File "/opt/anaconda/lib/python3.6/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/opt/anaconda/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/opt/anaconda/lib/python3.6/site-packages/tornado/gen.py", line 1069, in run
    yielded = self.gen.send(value)
  File "/opt/anaconda/lib/python3.6/site-packages/tornado/tcpclient.py", line 183, in connect
    af, addr, stream = yield connector.start()
  File "/opt/anaconda/lib/python3.6/site-packages/tornado/tcpclient.py", line 82, in start
    self.try_connect(iter(self.primary_addrs))
  File "/opt/anaconda/lib/python3.6/site-packages/tornado/tcpclient.py", line 97, in try_connect
    future = self.connect(af, addr)
  File "/opt/anaconda/lib/python3.6/site-packages/tornado/tcpclient.py", line 224, in _create_stream
    return stream.connect(addr)
  File "/opt/anaconda/lib/python3.6/site-packages/tornado/iostream.py", line 1128, in connect
    self._add_io_state(self.io_loop.WRITE)
  File "/opt/anaconda/lib/python3.6/site-packages/tornado/iostream.py", line 971, in _add_io_state
    self.fileno(), self._handle_events, self._state)
  File "/opt/anaconda/lib/python3.6/site-packages/tornado/ioloop.py", line 728, in add_handler
    self._impl.register(fd, events | self.ERROR)
ValueError: I/O operation on closed epoll object
```

ref: https://travis-ci.org/dask/dask-drmaa/jobs/305489948#L1929
Any ideas on this test, @mrocklin? I'm kind of at a loss. 😕 It seems to expect there always to be only 1 worker, but sometimes there are 2. Is this a reasonable expectation? Given how much trouble it is causing, I'm leaning towards making it a known fail.
My guess is that this is caused when workers are started but do not arrive quickly, so we start another one. Finally both arrive and cause the test to fail. At first glance I don't see a nice solution.
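For concreteness, a sketch of the suspected sequence (an illustrative timeline only, not actual dask-drmaa code; the `interval=50` value comes from the test above):

```python
# Suspected race, as a hypothetical timeline:
# t=0.00s  Adaptive sees queued tasks and no workers -> requests worker A
# t=0.05s  Adaptive runs again (interval=50 ms); A has been submitted to
#          SGE but has not registered with the scheduler yet, so the
#          scheduler still reports zero workers -> worker B is requested too
# t=0.50s  A and B both register -> len(cluster.workers) == 2, and the
#          test's `assert len(cluster.workers) < 2` fires
```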
That sounds plausible. Should we just up the sleep time before checking the number of workers then?
I'm not sure where that sleep would be. We're probably calling […]
Guess before entering this […]
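For the record, a minimal sketch of that idea, assuming a hypothetical helper (not part of dask-drmaa) that waits until the worker count stops changing before the assertion loop starts:

```python
from time import sleep, time

def wait_for_stable_worker_count(cluster, settle=1.0, timeout=30.0):
    # Hypothetical test helper, sketched for this discussion only:
    # poll until len(cluster.workers) has been unchanged for `settle`
    # seconds, giving slow SGE jobs time to register before we assert.
    deadline = time() + timeout
    count, last_change = len(cluster.workers), time()
    while time() < deadline:
        current = len(cluster.workers)
        if current != count:
            count, last_change = current, time()
        elif time() - last_change >= settle:
            break
        sleep(0.1)
    return count
```

As noted in the next comment, though, sleeping through the window where a second worker might appear would also hide the very behavior the test is meant to catch.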
Yeah, this test is trying to verify that we don't make more workers than we need. It turns out that sometimes we do. I think that this test genuinely fails. I think that adding a sleep during the period where we might have more than one worker destroys the spirit of the test, as you suggest.
Alright, I'm proposing we mark it as an xfail in PR (#50). If we can figure out how to fix it, we can always drop the xfail.
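For reference, a minimal sketch of what such a marker could look like (the actual change lives in #50; the `reason` text here is illustrative):

```python
import pytest

@pytest.mark.xfail(reason="Adaptive sometimes starts a second worker before "
                          "the first registers; see issue #41")
def test_dont_request_on_many_short_tasks(loop):
    ...
```

With a plain (non-strict) xfail, the test still runs on CI and is reported as XFAIL when it trips, or XPASS when it happens to succeed, without failing the build.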
While the test has been marked as a known failure, the fact that it fails at all is still problematic. So I am reopening this issue so that it can be tracked.
Seeing this test failure on the CI.

Test failure snippet: […]

ref: https://travis-ci.org/dask/dask-drmaa/jobs/296660993