Replace Redis locking with a query state machine #434

greenape · 2019-02-28T10:06:47Z

Closes #283

I have:

Formatted any Python files with black
Brought the branch up to date with master
Added any relevant Github labels
Added tests for any new additions
Added or updated any relevant documentation
Added an Architectural Decision Record (ADR), if appropriate
Added an MPLv2 License Header if appropriate
Updated the Changelog

Description

Previously we used a quasi-reentrant lock to prevent executing a query on FlowDB if it was already in the process of being executed. In the ancient times, this was implemented as a thread-based RLock, but in living memory it has used the Redis locking algorithm.

This doesn't quite suffice, because the lock is only acquired once a thread is available to start running the query, which, combined with the 1:1 relationship between the number of threads and the number of database connections, means this would happen fairly quickly.

In FlowMachine library terms, this is as shown in #283, but it could equally result in a query composing a subquery re-executing the subquery when it should actually wait for the subquery to be written to cache. In FlowMachine server terms, this means that with n_threads queries currently running, any additional query submitted to the server will appear to be AWOL until one currently running finishes.

This replaces the lock with a state machine, which actually runs atomically in redis. Queries can be in one of several states, some of which indicate that you can safely get their SQL, and some of which indicate that you should wait to do so because it is likely to change.
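As a rough illustration of the idea (not the code in this PR), here is a minimal in-memory sketch. The state names KNOWN, QUEUED, EXECUTING, RESETTING and COMPLETED appear in this PR's tests; the ERRORED name, the string values, the transition table, the `queue`/`finish`/`finish_reset` event names and the `InMemoryStateMachine` class are assumptions made for the example. A `threading.Lock` stands in for the atomicity that Redis provides in the real implementation.

```python
import threading
from enum import Enum


class QueryState(Enum):
    # ERRORED and all of the string values are assumptions for this sketch.
    KNOWN = "known"
    QUEUED = "queued"
    EXECUTING = "executing"
    COMPLETED = "completed"
    RESETTING = "resetting"
    ERRORED = "errored"


# Hypothetical transition table: event -> {state it may fire from: resulting state}
TRANSITIONS = {
    "queue": {QueryState.KNOWN: QueryState.QUEUED},
    "execute": {QueryState.QUEUED: QueryState.EXECUTING},
    "finish": {QueryState.EXECUTING: QueryState.COMPLETED},
    "raise_error": {QueryState.EXECUTING: QueryState.ERRORED},
    "reset": {
        QueryState.COMPLETED: QueryState.RESETTING,
        QueryState.ERRORED: QueryState.RESETTING,
    },
    "finish_reset": {QueryState.RESETTING: QueryState.KNOWN},
}


class InMemoryStateMachine:
    """Tracks one state per query id; the lock makes each transition atomic."""

    def __init__(self):
        self._states = {}
        self._lock = threading.Lock()

    def current_state(self, query_id):
        with self._lock:
            return self._states.get(query_id, QueryState.KNOWN)

    def trigger(self, query_id, event):
        """Return (state after the call, whether this caller made the transition)."""
        with self._lock:
            current = self._states.get(query_id, QueryState.KNOWN)
            target = TRANSITIONS[event].get(current)
            if target is None:
                # Event not valid from the current state: nothing changes.
                return current, False
            self._states[query_id] = target
            return target, True
```

Because the check and the update happen as one step, only one caller can ever move a given query from KNOWN to QUEUED; every other caller sees the new state and can decide to wait rather than re-run the query.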

#283 is hence mitigated, since nothing ever has to wait for a lock and hence marking a query as running can safely be done in the main thread. It also has some additional benefits, in that one can now get a more accurate view of a query's current state. This includes whether it is actively running, or just expected to at some point, whether it is being wiped from cache right now, if it failed to run successfully (even if somebody else last tried to run it), and so on.

(Looking forward to some debate on this one ;))

codecov · 2019-03-12T14:30:23Z

Codecov Report

Merging #434 into master will increase coverage by <.01%.
The diff coverage is 91.17%.

@@            Coverage Diff             @@
##           master     #434      +/-   ##
==========================================
+ Coverage   91.33%   91.34%   +<.01%     
==========================================
  Files          96       97       +1     
  Lines        5459     5638     +179     
  Branches      641      663      +22     
==========================================
+ Hits         4986     5150     +164     
- Misses        344      359      +15     
  Partials      129      129

Impacted Files	Coverage Δ
flowapi/flowapi/geography.py	`100% <100%> (ø)`	⬆️
flowmachine/flowmachine/core/table.py	`90.36% <100%> (+0.61%)`	⬆️
flowmachine/flowmachine/core/model_result.py	`81.81% <100%> (+4.43%)`	⬆️
flowmachine/flowmachine/core/server/server.py	`82.92% <100%> (+0.13%)`	⬆️
...ne/flowmachine/features/spatial/distance_matrix.py	`89.47% <100%> (+7.65%)`	⬆️
flowapi/flowapi/query_endpoints.py	`97.91% <100%> (+0.09%)`	⬆️
flowmachine/flowmachine/core/query_state.py	`100% <100%> (ø)`
flowmachine/flowmachine/utils.py	`50% <28.57%> (-13.05%)`	⬇️
flowclient/flowclient/client.py	`88.06% <77.77%> (-0.69%)`	⬇️
flowmachine/flowmachine/core/server/query_proxy.py	`82.74% <82.6%> (+0.55%)`	⬆️
... and 5 more

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a4f3787...74f5c20.

maxalbert

Great work! 👍 I'm happy to merge this now. This will make a lot of things easier to implement and reason about. 🎉

When looking at some of the tests (especially in test_query_state.py, but also elsewhere) I was sometimes confused about why they actually test what they purport to test. The reason for this confusion is that they make certain assumptions about how functionality is implemented internally in flowmachine which isn't visible in the actual test.

A simple example is that QueryStateMachine.wait_until_complete() uses the _sleep function internally (so that monkeypatching it allows us to test certain conditions), but there is nowhere this is visible in the test. There are a bunch more subtle and difficult-to-reason-about assumptions (e.g. how Query uses QueryStateMachine internally and how certain methods call each other under certain conditions). I'd be happier if there was a way to make these assumptions really obvious in the tests because it makes them somewhat brittle, but I can't see an easy way without reworking a bunch of other code and it's only tangentially related to this PR so let's just keep a mental note of it and aim to fix this in the future. But it's clear that the tests expose certain pains in the code base which often make it necessary to use monkeypatching or lengthy test setup code which ideally we shouldn't need. Anyway, not immediately related to this PR, just an observation from reviewing it.
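One possible way to make such an assumption visible, sketched here purely as an illustration: pass the sleep function in as an argument instead of monkeypatching a module-level `_sleep`. The `wait_until` helper and its parameters are hypothetical and are not part of flowmachine.

```python
import time
from typing import Callable


def wait_until(
    is_complete: Callable[[], bool],
    sleep: Callable[[float], None] = time.sleep,
    poll_interval: float = 0.1,
) -> int:
    """Poll until is_complete() returns True; return how many times we slept."""
    n_sleeps = 0
    while not is_complete():
        sleep(poll_interval)
        n_sleeps += 1
    return n_sleeps


# In a test, the dependency on sleeping is visible at the call site:
answers = iter([False, False, True])  # "not complete" twice, then "complete"
slept = []  # records each requested sleep instead of actually sleeping
n = wait_until(lambda: next(answers), sleep=slept.append)
```

The trade-off is a slightly wider signature in exchange for tests that no longer depend on knowing which internal name to patch.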

maxalbert · 2019-03-05T15:50:43Z

flowmachine/tests/server/test_query_proxy.py

-    mock_func_has_lock.return_value = True
-    assert "running" == query_proxy.poll()
+    qsm = QueryStateMachine(dummy_redis, query_id=q.md5)
+    dummy_redis._store[qsm.state_machine._name] = QueryState.EXECUTING.value.encode()


Suggested change

-    dummy_redis._store[qsm.state_machine._name] = QueryState.EXECUTING.value.encode()
+    dummy_redis.set(qsm.state_machine._name, QueryState.EXECUTING.value)

Why are we using the internal property ._name here and in the other tests? This doesn't seem to be used anywhere in the actual flowmachine code, so I'm not sure what the purpose in the tests is. If we do want to use it, can we wrap it up in a non-private attribute on QueryStateMachine itself and give it a meaningful name to reveal its purpose?
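A sketch of what such a wrapper could look like, under stated assumptions: `_InnerMachine` is a stand-in for whatever object currently exposes `._name`, and the `state_key` property name and the key format are made up for the example.

```python
class _InnerMachine:
    """Stand-in for the underlying state machine object that holds ._name."""

    def __init__(self, name: str):
        self._name = name


class QueryStateMachineSketch:
    """Hypothetical wrapper exposing the Redis key under a descriptive name."""

    def __init__(self, query_id: str):
        self.query_id = query_id
        # The key format here is an assumption, not the real one.
        self.state_machine = _InnerMachine(f"{query_id}-state")

    @property
    def state_key(self) -> str:
        """Key under which this query's state is stored."""
        return self.state_machine._name


# A plain dict stands in for the dummy redis fixture.
store = {}
qsm = QueryStateMachineSketch("abc123")
store[qsm.state_key] = "executing"
```

Tests would then refer to `qsm.state_key` and never touch the private attribute directly.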

maxalbert · 2019-03-05T15:52:34Z

flowmachine/tests/server/test_query_proxy.py


-    mock_func_has_lock.return_value = False
+    dummy_redis._store[qsm.state_machine._name] = QueryState.COMPLETED.value.encode()


Suggested change

-    dummy_redis._store[qsm.state_machine._name] = QueryState.COMPLETED.value.encode()
+    dummy_redis.set(qsm.state_machine._name, QueryState.COMPLETED.value)

maxalbert · 2019-03-05T15:53:10Z

flowmachine/tests/server/test_query_proxy.py

-    assert "running" == query_proxy.poll()
+    qsm = QueryStateMachine(dummy_redis, query_id=q.md5)
+    dummy_redis._store[qsm.state_machine._name] = QueryState.EXECUTING.value.encode()
+    assert QueryState.EXECUTING.value == query_proxy.poll()


Suggested change

-    assert QueryState.EXECUTING.value == query_proxy.poll()
+    assert QueryState.EXECUTING == query_proxy.poll()

maxalbert · 2019-03-05T16:01:39Z

flowmachine/tests/server/test_query_proxy.py

+    )
+    # Set query state
+    qsm = QueryStateMachine(dummy_redis, query_id=q.md5)
+    dummy_redis._store[qsm.state_machine._name] = current_state.encode()


Suggested change

-    dummy_redis._store[qsm.state_machine._name] = current_state.encode()
+    dummy_redis.set(qsm.state_machine._name, current_state)

maxalbert · 2019-03-05T16:03:33Z

flowmachine/tests/test_query_state.py

+    "blocking_state", [QueryState.EXECUTING, QueryState.RESETTING, QueryState.QUEUED]
+)
+def test_blocks(blocking_state, monkeypatch, dummy_redis):
+    """Test that states which alter the executing state of the query block."""


Incomplete sentence in docstring?

Whoops sorry, not incomplete but I didn't understand the grammatical structure. 😂

maxalbert · 2019-03-15T14:54:09Z

flowmachine/tests/test_query_state.py

+    """Test that resetting a query's cache will error if in a state where that isn't possible."""
+    q = DummyQuery(1, sleep_time=5)
+    qsm = QueryStateMachine(q.redis, q.md5)
+    # Mark the query as in the process of resetting


According to this comment, should there be a qsm.reset() after qsm.execute() below? The test passes either way, just wondering what the intent is...

maxalbert · 2019-03-15T14:56:11Z

flowmachine/flowmachine/core/cache.py

+    redis: StrictRedis,
+    query: "Query",
+    connection: "Connection",
+    ddl_ops_func: Callable[[str, str], List[str]],


Why are ddl_ops_func and write_func arguments that need to be passed in? Why shouldn't their functionality just be part of the internal responsibility of write_query_to_cache?

maxalbert · 2019-03-15T15:01:40Z

flowmachine/flowmachine/core/cache.py

+                q_state_machine.raise_error()
+                logger.error(f"Error executing SQL. Error was {e}")
+                raise e
+            if schema == "cache":


Is there a reason to only conditionally write cache metadata? I'm a little confused about the meaning of this schema argument. I assumed its purpose was to allow a different name for the cache schema (e.g. for testing), but it looks like it's passed in from Query.to_sql() and ModelResult.to_sql() and is basically the schema that the queries are stored in? I trust that it does the right thing but I'm not sure I fully follow the logic.

maxalbert · 2019-03-15T15:03:18Z

flowmachine/flowmachine/core/server/query_proxy.py

+                f"Got a bad state for '{query_id}'. Original exception was {e}"
+            )
+
+        if query_state == QueryState.EXECUTING:


Just as a thought, I wonder whether it would be useful to have the error messages live in QueryState directly and simply propagate them here? Would allow us to get rid of the if/elif chain. But haven't looked into it deeply, it's possible this wouldn't work out.
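To make the idea concrete, a minimal sketch of an enum whose members carry their own message. The string values, the message texts and the `describe` helper are hypothetical; only the state names come from this PR.

```python
from enum import Enum


class QueryState(Enum):
    """Each member carries a value plus a human-readable description."""

    def __new__(cls, value, description):
        obj = object.__new__(cls)
        obj._value_ = value  # lookup by value, e.g. QueryState("queued"), still works
        obj.description = description
        return obj

    # Values and message texts below are assumptions for this sketch.
    KNOWN = ("known", "is known, but has not yet been run")
    QUEUED = ("queued", "is queued for execution")
    EXECUTING = ("executing", "is currently being run")
    COMPLETED = ("completed", "has finished running")
    RESETTING = ("resetting", "is being removed from cache")


def describe(query_id: str, state: QueryState) -> str:
    """Build a message from the state itself; no if/elif chain needed."""
    return f"Query '{query_id}' {state.description}."
```

Whether this fits depends on whether different states need different exception types as well as different messages, which this sketch does not address.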

maxalbert · 2019-03-15T15:04:53Z

flowmachine/tests/server/test_query_proxy.py

+@pytest.mark.parametrize(
+    "current_state, expected_error",
+    [
+        (QueryState.KNOWN, MissingQueryError),


Similar to my comment above, I'm wondering whether it's worth always raising the same error. I'll keep that in mind for my re-working / removal of the QueryProxy class.

maxalbert · 2019-03-15T15:24:14Z

@greenape FYI, I have added a couple more commits on top of yours, but literally only cosmetics.

greenape and others added 30 commits February 25, 2019 12:12

Adding in more column names props

17801e1

Make column_names mandatory

1dd9387

Add most frequent location column_names attrib

e16a806

Add column_names attrib to SpatialAggregate and Displacement

d431d68

Add most missing column_name props

51fd5a7

Add column_names for locationarea

25fbca8

Enforce column names for get_dataframe & head, fix geojson

c703cbd

Fix a handful of glitches

4b73235

Update changelog, fix few more glitches revealed on CI

c668687

Enforce column_names in get_dataframe fixture and sql write, remove some duped fixtures

68e6373

Fix AggregateTotalNetworkObjects

48e23bb

Fix join

09723e9

Add col names list, missing Returns sections for feature_collection

bfbf66b

Should ref to self.table

0bbc4e1

s/method/function

7023745

Turn BaseLocation & MultiLocation into mixins instead of ABCs which inherit from Query

9517dbd

Update query hashes & include column names in custom query, also dedupe them

9a1ab32

Fix another couple of customquery usages

6ae4c7e

Remove pointless try except

3070254

Co-Authored-By: maxalbert <maxalbert@users.noreply.github.com>

Missed a newline

710b359

Co-Authored-By: maxalbert <maxalbert@users.noreply.github.com>

Add a state machine for queries to replace redis lock

bdd876c

Merge branch 'column-names' into query-state-machine

1c59275

Update blocking tests, special handling for Table which is instantly executed

1ea8c3e

Add state machine to model results as well

b96c557

Add dosctrings, fix force overwrite test to reflect this being removed

7088a7a

Add status reporting to server, api and client

fe7d263

Update integration tests lockfile

a17d31b

Remove redis lock import

67ee213

Relock docs, and change integration tests helper to reflect new status

8edd5a9

Fix a couple of statuses which are now wrong

8b7eebb

greenape and others added 14 commits March 6, 2019 10:31

Merge branch 'master' into query-state-machine

b7b4590

Remove notify

00d20bd

Merge branch 'master' into query-state-machine

41ee435

Merge branch 'master' into query-state-machine

3766a67

Merge branch 'master' into query-state-machine

024cb47

Merge branch 'master' into query-state-machine

02eca79

Merge branch 'master' into query-state-machine

1806981

Merge branch 'master' into query-state-machine

140552f

Unrelated fix for some docs bits that I missed in #456

0f0abcf

Exclude setup.py files

3c3ff07

Merge branch 'master' into query-state-machine

8585c3c

Add a test for _make_sql's behaviour with already stored queries

b1a21a2

Merge branch 'master' into query-state-machine

4ce3561

Add an exception class referred to but which didn't exist

5594595

Maximilian Albert added 8 commits March 14, 2019 18:22

Merge remote-tracking branch 'origin/master' into query-state-machine

40521f6

Tweak docstring

b968dfb

Use wait_until_complete() instead of manually sleeping

d5bf413

Extend docstring

74cda57

Replace explicit assignment of dummy_redis._store (and explicit encoding of the strings) with dummy_redis.set()

793de72

Rename arg for clarity and better readability of the tests

5670beb

Remove unused monkeypatch

266c179

Merge remote-tracking branch 'origin/master' into query-state-machine

4c02c40

maxalbert approved these changes Mar 15, 2019

maxalbert mentioned this pull request Mar 15, 2019

Document QueryStateMachine in developer docs #484

Open

maxalbert added the ready-to-merge label (Label indicating a PR is OK to automerge) Mar 15, 2019

Revert "Use wait_until_complete() instead of manually sleeping" because the logic seems to be subtly different and makes the test 'test_drop_query_blocks' hang. This reverts commit d5bf413.

74f5c20

mergify bot merged commit f413d1c into master Mar 15, 2019

mergify bot deleted the query-state-machine branch March 15, 2019 16:25
