tests/integration: data store mgr tests unstable #4175

Open
oliver-sanders opened this issue Apr 15, 2021 · 10 comments
@oliver-sanders (Member)

The data store mgr tests (or a subset thereof) sometimes fail and sometimes pass.

An example traceback:

    def test_update_data_structure(harness):
        """Test update_data_structure. This method will generate and
        apply adeltas/updates given."""
        schd, data = harness
        w_id = schd.data_store_mgr.workflow_id
        schd.data_store_mgr.data[w_id] = data
        assert TASK_STATUS_FAILED not in set(collect_states(data, TASK_PROXIES))
        assert TASK_STATUS_FAILED not in set(collect_states(data, FAMILY_PROXIES))
        assert TASK_STATUS_FAILED not in data[WORKFLOW].state_totals
>       assert len({t.is_held for t in data[TASK_PROXIES].values()}) == 2
E       assert 0 == 2
E         +0
E         -2

The test file:

tests/integration/test_data_store_mgr.py
@oliver-sanders oliver-sanders added this to the cylc-8.0.0 milestone Apr 15, 2021
@dwsutherland dwsutherland self-assigned this Apr 19, 2021
@dwsutherland (Member)

I've run it twenty times locally with no failure... I'll see if there's a more robust way to test the update.

@dwsutherland (Member) commented Apr 21, 2021

I'd have to understand how the test harness works at a deeper level to know why it's inconsistent..

i.e. why does this:

@pytest.mark.asyncio
@pytest.fixture(scope='module')
async def harness(mod_flow, mod_scheduler, mod_run):
    flow_def = {
        'scheduler': {
            'allow implicit tasks': True
        },
        'scheduling': {
            'graph': {
                'R1': 'foo => bar'
            }
        }
    }
    reg: str = mod_flow(flow_def)
    schd: 'Scheduler' = mod_scheduler(reg)
    async with mod_run(schd):
        schd.pool.hold_tasks('*')
        schd.resume_workflow()
        # Think this is needed to save the data state at first start (?)
        # Fails without it.. and a test needs to overwrite schd data with this.
        data = schd.data_store_mgr.data[schd.data_store_mgr.workflow_id]
        yield schd, data

Sometimes yield the scheduler with no task proxies? Did the tasks not exist when the hold command was issued?

Hard to tell unless I can reproduce the issue somehow.. (the test in question still worked when I commented out the hold)

@dwsutherland (Member) commented Apr 21, 2021

I think the same harness yield gets shared amongst the tests: whenever I change the order of the tests, it appears to affect what data the subsequent tests are working with.

So perhaps create another harness object (if possible), as in the sketch below, and see if it occurs again... Again, this would be easier if I could reproduce the failure.
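For example (just a sketch, assuming the function-scoped flow, scheduler and run counterparts of the module-scoped fixtures exist), each test could get its own harness instead of sharing one:

import pytest


@pytest.fixture
async def fresh_harness(flow, scheduler, run):
    """Function-scoped harness: a new scheduler and data snapshot per test."""
    schd = scheduler(flow({
        'scheduler': {'allow implicit tasks': True},
        'scheduling': {'graph': {'R1': 'foo => bar'}},
    }))
    async with run(schd):
        schd.pool.hold_tasks('*')
        schd.resume_workflow()
        data = schd.data_store_mgr.data[schd.data_store_mgr.workflow_id]
        yield schd, data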

@dwsutherland (Member) commented Apr 21, 2021

On a side note, this also works:

(flow) sutherlander@cortex-vbox:cylc-flow$ git diff
diff --git a/tests/integration/test_data_store_mgr.py b/tests/integration/test_data_store_mgr.py
index 2d6d584cc..533c30a60 100644
--- a/tests/integration/test_data_store_mgr.py
+++ b/tests/integration/test_data_store_mgr.py
@@ -259,8 +259,8 @@ def test_update_data_structure(harness):
     schd.data_store_mgr.data[w_id] = data
     assert TASK_STATUS_FAILED not in set(collect_states(data, TASK_PROXIES))
     assert TASK_STATUS_FAILED not in set(collect_states(data, FAMILY_PROXIES))
-    assert TASK_STATUS_FAILED not in data[WORKFLOW].state_totals
-    assert len({t.is_held for t in data[TASK_PROXIES].values()}) == 2
+    assert data[WORKFLOW].state_totals[TASK_STATUS_FAILED] == 0
+    assert sum(data[WORKFLOW].state_totals.values()) == 2
     for itask in schd.pool.get_all_tasks():
         itask.state.reset(TASK_STATUS_FAILED)
         schd.data_store_mgr.delta_task_state(itask)

So I don't think the key lookup is working...
I made this change because the manager was changed to reset the state totals for families being updated:

            # Use all states to clean up pruned counts
            for state in TASK_STATUSES_ORDERED:
                fp_delta.state_totals[state] = state_counter.get(state, 0)
            fp_updated.setdefault(fp_id, PbFamilyProxy()).MergeFrom(fp_delta)

@oliver-sanders (Member, Author)

Sometimes yield the scheduler with no task proxies?

Ah, perhaps we have to explicitly get the pool to release tasks, otherwise there is a race condition between the async with and the async def main_loop?
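A minimal sketch of that idea, flushing things explicitly before the fixture yields (release_runahead_tasks and update_data_structure are assumptions about the TaskPool/Scheduler API here, not verified calls):

import pytest


@pytest.fixture(scope='module')
async def harness(mod_flow, mod_scheduler, mod_run):
    schd = mod_scheduler(mod_flow({
        'scheduler': {'allow implicit tasks': True},
        'scheduling': {'graph': {'R1': 'foo => bar'}},
    }))
    async with mod_run(schd):
        # Hypothetical: make sure the task proxies exist before holding them,
        # rather than racing the main loop to create them.
        schd.pool.release_runahead_tasks()
        schd.pool.hold_tasks('*')
        schd.resume_workflow()
        # Hypothetical: flush the pending deltas into the data store now, so
        # the snapshot below is complete before the tests start.
        await schd.update_data_structure()
        data = schd.data_store_mgr.data[schd.data_store_mgr.workflow_id]
        yield schd, data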

@dwsutherland (Member)

I assume this is still an issue?

@oliver-sanders (Member, Author)

I still run into this one; it seems to happen more often with a higher level of parallelism in the tests. I still have no idea why it's happening.

@oliver-sanders (Member, Author)

I was playing with a small change to the integration tests which caused these tests to break badly, so I took a look.

I think the problem is:

  1. The main loop is running while the tests are running, which means the scheduler can do things in between tests (or not, depending on how asyncio schedules its tasks), so the data store may or may not be updated between tests. This would explain why I see these failures a lot whereas we don't hit them much on GHA (because I parallelise the hell out of the tests!).
  2. The tests all share the same workflow, but they also modify it and don't undo their changes, which means the tests can interact.

I've addressed (1) on my branch by adding a new fixture that lets us start the scheduler without running the main loop, to give these tests a fighting chance (sketch below).
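For illustration, a minimal sketch of what such a fixture might look like, assuming a mod_start counterpart to mod_run that starts the scheduler without launching the main loop (the name and exact behaviour are assumptions, not the code on my branch):

import pytest


@pytest.fixture(scope='module')
async def harness(mod_flow, mod_scheduler, mod_start):
    schd = mod_scheduler(mod_flow({
        'scheduling': {'graph': {'R1': 'foo => bar'}},
    }))
    # mod_start (assumed) initialises the scheduler and the data store but
    # never schedules the main loop coroutine, so nothing can mutate the
    # data store in between test functions.
    async with mod_start(schd):
        data = schd.data_store_mgr.data[schd.data_store_mgr.workflow_id]
        yield schd, data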

I've fixed the resulting failures bar one, in test_update_data_structure, which I don't understand:

# Shows pruning worked
assert len({t.is_held for t in data[TASK_PROXIES].values()}) == 1

The test resets a task state to failed, then expects the number of is_held tasks to change as a result? Commented out for now.

@oliver-sanders (Member, Author) commented Jan 31, 2022

Done (1) in #4620, which may have helped; however, I still get flaky failures ~50% of the time for:

  • test_delta_task_held
  • test_update_data_structure

With high parallelism.

@dwsutherland (Member)

Well, I'm not sure what to do about (2) other than have the tests operate on different workflows, as it's not something that would happen in the wild..

Also, I'm not sure the tests could "undo" their changes:
a) wouldn't parallelism still cause issues while undoing?
b) the cheapest way to undo would be to reload/reinitialize the data-store, assuming other tests aren't running at the time (see the sketch after this list).
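If (b) were the route taken, a hedged sketch might look like this; it leans on initiate_data_model, which I'm assuming can safely rebuild the store from the task pool once a test has finished:

import pytest


@pytest.fixture(autouse=True)
def reset_data_store(harness):
    """Rebuild the data store after each test so tests can't interact."""
    schd, _ = harness
    yield
    # Assumption: re-running the data model generation restores a clean
    # store (and no other test is touching it at this point).
    schd.data_store_mgr.initiate_data_model(reloaded=True)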
