test_dont_request_on_many_short_tasks CI failure #41
To summarize the error above, it expects there to be fewer than 2 workers, but there are exactly 2.
Seems this is not really reproducible. 😕 Restarting made this error go away. Edit: That said, it has repeated a few times. So it is not perfectly reproducible, but is still some kind of issue.
Saw a very similar error on Python 2.7 without the […]

Test failure snippet:

```
____________________ test_dont_request_on_many_short_tasks _____________________

loop = <tornado.platform.epoll.EPollIOLoop object at 0x7fbf88ed9470>

    def test_dont_request_on_many_short_tasks(loop):
        with SGECluster(scheduler_port=0) as cluster:
            adapt = Adaptive(cluster, interval=50, startup_cost=10)
            with Client(cluster, loop=loop) as client:
                cluster.scheduler.task_duration['slowinc'] = 0.001
                futures = client.map(slowinc, range(1000), delay=0.001)
                while not cluster.scheduler.workers:
                    sleep(0.01)
                for i in range(20):
                    sleep(0.1)
>                   assert len(cluster.workers) < 2
E                   AssertionError: assert 2 < 2
E                    +  where 2 = len({'15.1': WorkerSpec(job_id='15.1', kwargs={'nativeSpecification': '', 'cpus': 1, 'memory': None, 'memory_fraction': 0...., 'memory': None, 'memory_fraction': 0.5}, stdout='/dask-drmaa/worker.16.1.out', stderr='/dask-drmaa/worker.16.1.err')})
E                    +  where {'15.1': WorkerSpec(job_id='15.1', kwargs={'nativeSpecification': '', 'cpus': 1, 'memory': None, 'memory_fraction': 0...., 'memory': None, 'memory_fraction': 0.5}, stdout='/dask-drmaa/worker.16.1.out', stderr='/dask-drmaa/worker.16.1.err')} = <SGECluster: 2 workers>.workers

dask_drmaa/tests/test_adaptive.py:105: AssertionError
----------------------------- Captured stderr call -----------------------------
tornado.application - ERROR - Future <tornado.concurrent.Future object at 0x7fbf89076f60> exception was never retrieved: Traceback (most recent call last):
  File "/opt/anaconda/lib/python3.6/site-packages/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/opt/anaconda/lib/python3.6/site-packages/distributed/comm/tcp.py", line 309, in connect
    **kwargs)
  File "/opt/anaconda/lib/python3.6/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/opt/anaconda/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/opt/anaconda/lib/python3.6/site-packages/tornado/gen.py", line 1069, in run
    yielded = self.gen.send(value)
  File "/opt/anaconda/lib/python3.6/site-packages/tornado/tcpclient.py", line 183, in connect
    af, addr, stream = yield connector.start()
  File "/opt/anaconda/lib/python3.6/site-packages/tornado/tcpclient.py", line 82, in start
    self.try_connect(iter(self.primary_addrs))
  File "/opt/anaconda/lib/python3.6/site-packages/tornado/tcpclient.py", line 97, in try_connect
    future = self.connect(af, addr)
  File "/opt/anaconda/lib/python3.6/site-packages/tornado/tcpclient.py", line 224, in _create_stream
    return stream.connect(addr)
  File "/opt/anaconda/lib/python3.6/site-packages/tornado/iostream.py", line 1128, in connect
    self._add_io_state(self.io_loop.WRITE)
  File "/opt/anaconda/lib/python3.6/site-packages/tornado/iostream.py", line 971, in _add_io_state
    self.fileno(), self._handle_events, self._state)
  File "/opt/anaconda/lib/python3.6/site-packages/tornado/ioloop.py", line 728, in add_handler
    self._impl.register(fd, events | self.ERROR)
ValueError: I/O operation on closed epoll object
```

ref: https://travis-ci.org/dask/dask-drmaa/jobs/305489948#L1929
Any ideas on this test, @mrocklin? I'm kind of at a loss. 😕 It seems to expect there always to be only 1 worker, but sometimes there are 2. Is this a reasonable expectation? Given how much trouble it is causing, I'm leaning towards making it a known fail.
My guess is that this is caused when workers are started but do not arrive quickly, so we start another one. Finally both arrive and cause the test to fail. At first glance I don't see a nice solution.
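For concreteness, a sketch of the suspected sequence (an illustrative timeline only, not actual dask-drmaa code; the `interval=50` value comes from the test above):

```python
# Suspected race, as a hypothetical timeline:
# t=0.00s  Adaptive sees queued tasks and no workers -> requests worker A
# t=0.05s  Adaptive runs again (interval=50 ms); A has been submitted to
#          SGE but has not registered with the scheduler yet, so the
#          scheduler still reports zero workers -> worker B is requested too
# t=0.50s  A and B both register -> len(cluster.workers) == 2, and the
#          test's `assert len(cluster.workers) < 2` fires
```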
That sounds plausible. Should we just up the sleep time before checking the number of workers then?
I'm not sure where that sleep would be. We're probably calling […]
Guess before entering this […]
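For the record, a minimal sketch of that idea, assuming a hypothetical helper (not part of dask-drmaa) that waits until the worker count stops changing before the assertion loop starts:

```python
from time import sleep, time

def wait_for_stable_worker_count(cluster, settle=1.0, timeout=30.0):
    # Hypothetical test helper, sketched for this discussion only:
    # poll until len(cluster.workers) has been unchanged for `settle`
    # seconds, giving slow SGE jobs time to register before we assert.
    deadline = time() + timeout
    count, last_change = len(cluster.workers), time()
    while time() < deadline:
        current = len(cluster.workers)
        if current != count:
            count, last_change = current, time()
        elif time() - last_change >= settle:
            break
        sleep(0.1)
    return count
```

As noted in the next comment, though, sleeping through the window where a second worker might appear would also hide the very behavior the test is meant to catch.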
Yeah, this test is trying to verify that we don't make more workers than we need. It turns out that sometimes we do. I think that this test genuinely fails. I think that adding a sleep during the period where we might have more than one worker destroys the spirit of the test, as you suggest.
Alright, I'm proposing we mark it as an xfail in PR (#50). If we can figure out how to fix it, we can always drop the xfail.
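For reference, a minimal sketch of what such a marker could look like (the actual change lives in #50; the `reason` text here is illustrative):

```python
import pytest

@pytest.mark.xfail(reason="Adaptive sometimes starts a second worker before "
                          "the first registers; see issue #41")
def test_dont_request_on_many_short_tasks(loop):
    ...
```

With a plain (non-strict) xfail, the test still runs on CI and is reported as XFAIL when it trips, or XPASS when it happens to succeed, without failing the build.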
While the test has been marked as a known failure, the fact that it fails at all is still problematic. So I am reopening this issue so that it can be tracked.
Seeing this test failure on the CI.

Test failure snippet: […]

ref: https://travis-ci.org/dask/dask-drmaa/jobs/296660993