transition flight to missing if no who_has #5653

fjetter · 2022-01-12T17:11:21Z

There is an edge case connected to our select_keys_for_gather optimization for tasks which do no longer have a who_has. Letting the fetch->flight transition redirect this is the most straightforward solution to the problem.

A different approach would be to transition tasks to missing as soon as they no longer have a who_has. It is similar in complexity and I might change this again. I first want to write a test reproducing the mentioned edge case first.

#5381 (comment)

@crusaderky you may cherry-pick this commit in the meantime to test your branch. the changes will be rather minimal either way

distributed/worker.py

fjetter · 2022-01-25T16:34:45Z

This transition is only happening because there are duplicates in Worker.pending_data_per_worker. The pretty horrific test below can uncover this condition but I am a bit concerned about adding this to our test suite. At least for the time I am developing a fix this will be useful. The ultimate thing is a bug in Worker.update_who_has and we should be able to test this in isolation.

The entire mocking is there because we cannot hook into any worker internals naturally or intercept any stimuli, etc. I think if we were able to structure the internals a bit such that we can intercept/stop the worker at a few key points this would go a long way (e.g. the mocking is only there to disable the every_cycle callbackallowing me better control over / assert the state right after a handler )

@gen_cluster(client=True)
async def test_dups_in_pending_data_per_worker(c, s, a, b):

    # We need to fetch data which is reliably in the select_from_gather
    # if it is somehow prioriticed it will immediately be flagged as missing

    # pending_data_per_worker is outdated because removal is not easy
    futs = c.map(inc, range(100), workers=[a.address])
    missing_fut = c.submit(inc, -1, workers=[a.address], key="culprit")
    await c.gather(futs)
    await missing_fut
    patcher = mock.patch.object(Worker, "ensure_communicating", return_value=None)
    with mock.patch.object(
        Worker, "ensure_communicating", return_value=None
    ) as comm_mock:
        with mock.patch.object(
            Worker, "ensure_computing", return_value=None
        ) as comp_mock:
            x = await Worker(s.address, name=2, validate=True)
            f1 = c.submit(sum, [*futs[:50]], workers=[x.address], key="f1", priority=1)
            f2 = c.submit(inc, missing_fut, workers=[x.address], key="f2", priority=2)
            f3 = c.submit(sum, [*futs[50:]], workers=[x.address], key="f3", priority=3)

            while not len(x.data_needed) == len(futs) + 1:
                await asyncio.sleep(0.01)
            assert missing_fut.key in x.tasks
            assert missing_fut.key in x.pending_data_per_worker[a.address]

    for _ in range(3):
        await x.query_who_has(missing_fut.key, stimulus_id="foo")

    assert (
        sum(1 for key in x.pending_data_per_worker[a.address] if key == "culprit") > 1
    )
    with mock.patch.object(Worker, "query_who_has", return_value=None) as who_has_mock:
        # Now create a copy on B of culprit such that it isn't rescheduled once we
        # release it
        f_copy = c.submit(
            inc, missing_fut, key="copy-intention-culprit", workers=[b.address]
        )
        await f_copy
        a.handle_remove_replicas([missing_fut.key], stimulus_id="test")

        x.total_out_connections = 1
        x.target_message_size = sum(x.tasks[f.key].get_nbytes() for f in futs[:52])
        x.ensure_communicating()

        with mock.patch.object(
            Worker, "ensure_communicating", return_value=None
        ) as comm_mock:
            with mock.patch.object(
                Worker, "ensure_computing", return_value=None
            ) as comp_mock:
                while not x.data:
                    await asyncio.sleep(0.01)

        assert not x.tasks["culprit"].who_has
        assert x.tasks["culprit"].state == "fetch"
        x.target_message_size = 1000000000

        # Of course we do not want the assertionerror but it does get swallowed
        # if executed as part of a tornado coro and we will only see the timeout
        # below
        with pytest.raises(AssertionError):
            x.ensure_communicating()
        # await f1
        # await f2
        # await f3

crusaderky · 2022-01-26T14:43:20Z

@fjetter I'm not sure I understand:

in real life, the 3 calls to query_who_has are performed by gather_dep. Couldn't you invoke gather_dep instead?
Why is ensure_communicating invoking gather_dep on the same key multiple times?
In your test, what's the purpose of the 100 tasks that you seem to use as padding?
mock.patch alters all existing instances of a class. Couldn't you just init 3 workers from gen_cluster?

fjetter · 2022-01-27T14:21:39Z

this thing escalated a bit out of control. There were multiple issues and my fixes are a little more involved than I was hoping. I do believe that the code is in a much better state now.
Notable changes:

I introduced the class _UniqueTaskHeap primarily for pending_data_per_worker. The uniqueness is not absolutely necessary and can be handled by introducing proper guards instead. However, uniqueness is a much safer guarantee and allows us to avoid a bunch of edge cases. The total runtime of this custom class is mostly the same as for the ordinary heap (with the exception of a constant offset). It requires a bit more memory but I consider this additional set to track uniqueness negligible
There is now one place and one place only which removes items from who_has. This is after an unsuccessful gather_dep. I believe this was the case before but now it is much more explicit. In particular, this means that there is only one -> missing transition and we do not need to add more ts.who_has guards all over the place
The transition flight->fetch lost information about who_has since it transitioned the task back to released. This is not good since that may be just introduced by a "busy" reply. Instead, we're now more explicit in this transition which should overall reduce requests on the scheduler.