Add bootstrap state machine #90

iand · 2023-08-11T09:00:12Z

This adds a state machine for running the bootstrap query described in #45. The state machine is very simple (it runs a query that attempts to find the self node), but it drives the design of the coordinator in a number of ways:

- The coordinator now manages two state machines: bootstrap and user queries. It enforces the constraint that no user queries can be progressed while the bootstrap is running. This establishes the pattern for managing a set of state machines.
- Priority is simple: the coordinator first attempts to advance the bootstrap state machine and only if it is idle, indicating the bootstrap has no further work, will it proceed to advance the query pool state machine.
- This changes the design of the state machine. Previously the state machine reacted to an incoming event passed to the Advance method. However, this causes complications in the presence of multiple state machines. What should happen if the bootstrap is waiting for a message response but the caller attempts to start a user query? The coordinator needs to remember the "add query" event until the bootstrap is complete, so that event would need to remain on the coordinator's queue. But the coordinator also needs to read from the queue to detect whether an incoming message response is available for the bootstrap, without removing the "add query" event. Thus we need separate queues for the two state machines. Rather than manage those in the coordinator, we give each state machine its own event queue. External callers enqueue events and the state machine dequeues the next one each time it attempts to advance state.
- ~~The above change leads to a simple interface for state machines: an Enqueue method for notifying a new event and an Advance method that returns the next state.~~
- update 2023-08-15: The above change leads to a simple interface for state machines: an Advance method that accepts an event and returns the next state.
- Coordinator methods like StartQuery and StopQuery now enqueue an event for the query pool.
- A new Bootstrap method enqueues an event for the bootstrap state machine.
- update 2023-08-15: the queues for the state machines are managed by the coordinator, which allows state machines to be more cleanly composed into hierarchies (for example, the state machine managing the routing table include queue will use a query pool state machine, and this change eliminates the need to manage event queues of child state machines).
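The priority rule and the coordinator-owned queues described above can be sketched roughly as follows. This is only an illustration, not the code in this PR: `Event`, `State`, `scripted`, `Coordinator`, the slice-backed queues and the use of the zero-value event as a "poll" are all assumptions made for the example, and the `context.Context` parameter is left out for brevity.

```go
package main

import "fmt"

// StateMachine follows the updated description: Advance accepts an event
// and returns the next state. (Context parameter omitted in this sketch.)
type StateMachine[S any, E any] interface {
	Advance(ev E) S
}

// Event and State are hypothetical placeholder types for this example.
type Event string
type State string

const (
	StateIdle    State = "idle"
	StateWaiting State = "waiting"
)

// scripted is a toy state machine that replays a fixed list of states,
// then reports idle, and records every event it is given.
type scripted struct {
	states []State
	seen   []Event
}

func (s *scripted) Advance(ev Event) State {
	s.seen = append(s.seen, ev)
	if len(s.states) == 0 {
		return StateIdle
	}
	st := s.states[0]
	s.states = s.states[1:]
	return st
}

// Coordinator owns one queue per state machine (an assumption of this sketch).
type Coordinator struct {
	bootstrap       StateMachine[State, Event]
	pool            StateMachine[State, Event]
	bootstrapEvents []Event
	poolEvents      []Event
}

// pop removes and returns the head of the queue, or the zero Event
// (treated here as a "poll") when the queue is empty.
func pop(q *[]Event) Event {
	var ev Event
	if len(*q) > 0 {
		ev, *q = (*q)[0], (*q)[1:]
	}
	return ev
}

// RunOne advances the bootstrap first. The pool is only advanced, and its
// queue only read, once the bootstrap reports idle.
func (c *Coordinator) RunOne() string {
	if st := c.bootstrap.Advance(pop(&c.bootstrapEvents)); st != StateIdle {
		return "bootstrap:" + string(st)
	}
	return "pool:" + string(c.pool.Advance(pop(&c.poolEvents)))
}

func main() {
	c := &Coordinator{
		bootstrap:  &scripted{states: []State{StateWaiting, StateWaiting}},
		pool:       &scripted{states: []State{StateWaiting}},
		poolEvents: []Event{"add query"},
	}
	for i := 0; i < 3; i++ {
		fmt.Println(c.RunOne())
	}
}
```

In this sketch the "add query" event stays on the pool queue for the first two calls and is only delivered on the third, once the bootstrap has gone idle.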

There are still some ugly parts which I may be able to fix within the scope of this PR:

- The coordinator implements a number of unused methods to conform to the scheduler.Scheduler interface. All that is needed is the RunOne method.
- ~~the name of the bootstrap query needs to be factored into a constant or remembered by the coordinator~~ - the coordinator now uses a separate callback to deal with the bootstrap query instead of checking the query id.
- Events are action.Action interfaces so they can use the standard queue interface. The Run method is unused. The queue could simply be a channel, or we could modify the queue interface to be parameterised by type, allowing us to have a queue of BootstrapEvents. (removed 2023-08-15)
- Currently the bootstrap method expects a function that generates a FindNode request for the given node. FindNode is such a fundamental DHT operation that I think it should be provided as a method by the Endpoint.

Fixes #47

dennis-tra · 2023-08-14T08:53:35Z

routing/bootstrap.go

+
+	if cfg.QueueCapacity < 1 {
+		return &kaderr.ConfigurationError{
+			Component: "PoolConfig",


Suggested change

-			Component: "PoolConfig",
+			Component: "BootstrapConfig",
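For context, a validation method along these lines might look like the sketch below. The `ConfigurationError` type here is a local stand-in for `kaderr.ConfigurationError` and its fields are an assumption, as are the `Validate` method name and the error message; only the `QueueCapacity < 1` check and the component name come from the diff above.

```go
package main

import (
	"errors"
	"fmt"
)

// ConfigurationError is a local stand-in for kaderr.ConfigurationError.
// The field set is assumed for this example.
type ConfigurationError struct {
	Component string
	Err       error
}

func (e *ConfigurationError) Error() string {
	return fmt.Sprintf("configuration error: %s: %v", e.Component, e.Err)
}

func (e *ConfigurationError) Unwrap() error { return e.Err }

// BootstrapConfig holds only the field visible in the reviewed hunk.
type BootstrapConfig struct {
	QueueCapacity int
}

// Validate reports the component that owns the invalid setting, which is
// why the component name should match the config type being checked.
func (cfg *BootstrapConfig) Validate() error {
	if cfg.QueueCapacity < 1 {
		return &ConfigurationError{
			Component: "BootstrapConfig",
			Err:       errors.New("queue capacity must be greater than zero"),
		}
	}
	return nil
}

func main() {
	cfgErr := (&BootstrapConfig{QueueCapacity: 0}).Validate()
	var ce *ConfigurationError
	if errors.As(cfgErr, &ce) {
		fmt.Println(ce.Component)
	}
}
```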

dennis-tra · 2023-08-14T09:02:34Z

coord/coordinator.go

+type StateMachine[S any, E event.Action] interface {
+	// Enqueue enqueues an event to be processed by the state machine.
+	Enqueue(context.Context, E)
+	// Advance advances the state of the state machine.
+	Advance(context.Context) S
+}


I think a StateMachine interface makes sense. I also think it's great that we can constrain the type of events that we pass into the state machine with the generic type parameters!

I'm not sure about the part where we require each state to maintain its own queue of events it needs to process. I would have expected that this is the responsibility of another component. What drove the requirement to have a separate queue in each state?

Okay, I just read the PR description

dennis-tra · 2023-08-14T09:03:36Z

coord/coordinator.go

+
+	bootstrap, err := routing.NewBootstrap(self, bootstrapCfg)
+	if err != nil {
+		return nil, fmt.Errorf("query pool: %w", err)


Suggested change

-		return nil, fmt.Errorf("query pool: %w", err)
+		return nil, fmt.Errorf("bootstrap: %w", err)
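To illustrate why the prefix matters: with `%w` the prefix is what tells the caller which component failed, while the wrapped error stays inspectable through `errors.Is`. The function names and the sentinel error below are hypothetical and only stand in for the real constructors.

```go
package main

import (
	"errors"
	"fmt"
)

// errInvalidConfig is a hypothetical sentinel used only in this sketch.
var errInvalidConfig = errors.New("invalid config")

// newBootstrap stands in for routing.NewBootstrap.
func newBootstrap(valid bool) error {
	if !valid {
		return errInvalidConfig
	}
	return nil
}

// newCoordinator wraps the failure with the name of the component that
// actually produced it.
func newCoordinator(valid bool) error {
	if err := newBootstrap(valid); err != nil {
		return fmt.Errorf("bootstrap: %w", err)
	}
	return nil
}

func main() {
	wrapped := newCoordinator(false)
	fmt.Println(wrapped)
	fmt.Println(errors.Is(wrapped, errInvalidConfig))
}
```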

dennis-tra · 2023-08-14T09:08:00Z

coord/coordinator.go

-	c.queue.Enqueue(ctx, ev)
-	// c.inboundEvents <- ev
-
+	c.pool.Enqueue(ctx, qev)


What if the queue is at its capacity? Enqueue is exported so anyone could put messages in the queue. If it was just internal we probably could make sure that this won't happen.

Besides in the coordinator, Enqueue is only called in tests, so we could make it a private method?

> What if the queue is at its capacity? Enqueue is exported so anyone could put messages in the queue. If it was just internal we probably could make sure that this won't happen.

The pool is only internal - the coordinator creates it - so no one else is adding events.

> Besides in the coordinator, Enqueue is only called in tests, so we could make it a private method?

Do you mean pool.Enqueue? That's called from synchronous methods exposed by the coordinator, not just tests.
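On the capacity question, one possible way to surface a full queue to the caller is a non-blocking enqueue on a buffered channel. This is not what the PR does; `boundedQueue`, `ErrQueueFull` and the method names are invented for the sketch.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrQueueFull is a hypothetical error returned when the queue is at capacity.
var ErrQueueFull = errors.New("queue full")

// boundedQueue is a fixed-capacity FIFO backed by a buffered channel.
type boundedQueue[E any] struct {
	ch chan E
}

func newBoundedQueue[E any](capacity int) *boundedQueue[E] {
	return &boundedQueue[E]{ch: make(chan E, capacity)}
}

// Enqueue never blocks: it reports ErrQueueFull instead of waiting.
func (q *boundedQueue[E]) Enqueue(ev E) error {
	select {
	case q.ch <- ev:
		return nil
	default:
		return ErrQueueFull
	}
}

// Dequeue never blocks: ok is false when the queue is empty.
func (q *boundedQueue[E]) Dequeue() (ev E, ok bool) {
	select {
	case ev = <-q.ch:
		return ev, true
	default:
		return ev, false
	}
}

func main() {
	q := newBoundedQueue[string](1)
	fmt.Println(q.Enqueue("start query"))
	fmt.Println(q.Enqueue("stop query"))
}
```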

dennis-tra · 2023-08-14T09:09:51Z

coord/coordinator.go

+	pool StateMachine[query.PoolState, query.PoolEvent]
+
+	// bootstrap is the bootstrap state machine, responsible for bootstrapping the routing table
+	bootstrap StateMachine[routing.BootstrapState, routing.BootstrapEvent]


We could define the StateMachine type in routing, e.g.:

type BootstrapStateMachine coord.StateMachine[BootstrapState, BootstrapEvent]

(cyclic dependency as the packages are structured now though)

I prefer to define interfaces where they are used. In this case the coordinator is using the StateMachine interface.

dennis-tra · 2023-08-14T09:44:12Z

> Priority is simple: the coordinator first attempts to advance the bootstrap state machine and only if it is idle, indicating the bootstrap has no further work, will it proceed to advance the query pool state machine.

I think it should be possible to already start queries as soon as we have connected to the first peer. It'll probably not be the fastest query but it'll resolve and will be faster than waiting for the whole bootstrap process to finish.

> What should happen if the bootstrap is waiting for a message response but the caller attempts to start a user query?

IMO this should result in an error for the caller. So the coordinator would process that message normally, determine that it cannot start the query, and return an error to the caller.

However, I see your point and that, in the general case, it might be necessary to delay events until a certain state is reached.

I think having separate queues undermines the hierarchy of the hierarchical state machine. Sub-components can enqueue events with sub-states, by-passing the parent state machines that might want to react to or track these events as well. I think it's easier to reason about the control flow if data always flows from the root to the leaves and cannot skip any intermediate states.

iand · 2023-08-14T10:38:39Z

> I think having separate queues undermines the hierarchy of the hierarchical state machine. Sub-components can enqueue events with sub-states, by-passing the parent state machines that might want to react to or track these events as well. I think it's easier to reason about the control flow if data always flows from the root to the leaves and cannot skip any intermediate states.

I don't understand this point. The coordinator is coordinating state machines and is translating asynchronous behaviour into synchronous. That async to sync interface needs to have some kind of queueing. Each hierarchical state machine is synchronous and there is no enqueuing of substates; only the coordinator is feeding async events to the top of each state machine hierarchy.

iand · 2023-08-14T10:42:17Z

> I think it should be possible to already start queries as soon as we have connected to the first peer. It'll probably not be the fastest query but it'll resolve and will be faster than waiting for the whole bootstrap process to finish.

I discuss this a bit in the design issue #45 and expand in this comment. If we're only connected to a bootstrap node then queries will hit that bootstrap node far more than is usual and the bootstrap node has to deal with that for every node that starts up. We should at least wait until the routing table is better populated before allowing queries (or we could throttle them based on bootstrap progress).

iand · 2023-08-14T10:53:31Z

> What should happen if the bootstrap is waiting for a message response but the caller attempts to start a user query?

> IMO this should result in an error for the caller. So the coordinator would process that message normally, determine that it cannot start the query, and return an error to the caller.

That only works in this specific case. A more general example is what happens if a non-bootstrap query response is received as an event while the bootstrap is waiting for its own query response? There is no place to return an error in that case.

In general the coordinator will also be advancing other state machines like the routing table probe/refresh, exploring etc. These will essentially be background tasks and should be given the chance to progress and potentially change state, which needs to be returned by the coordinator. Any incoming event needs to be queued in that case.

The rust-libp2p design handles this by calling methods to mutate state on receipt of an inbound event. So if a message response is received it calls the equivalent of onMessageReceived on pool, which looks up the query and calls onMessageReceived on the query which then calls onMessageReceived on the iterator which looks up the responding node and changes its state.
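The method-call style described here could be sketched in Go as below. This is only my reading of the comment, not rust-libp2p's actual API: every type and method name is hypothetical.

```go
package main

import "fmt"

type nodeState int

const (
	nodeWaiting nodeState = iota
	nodeResponded
)

// iterator tracks the state of each node contacted by a query.
type iterator struct {
	nodes map[string]nodeState
}

// onMessageReceived looks up the responding node and changes its state.
func (it *iterator) onMessageReceived(node string) {
	if _, ok := it.nodes[node]; ok {
		it.nodes[node] = nodeResponded
	}
}

type query struct {
	iter *iterator
}

// onMessageReceived forwards the response to the query's iterator.
func (q *query) onMessageReceived(node string) {
	q.iter.onMessageReceived(node)
}

type pool struct {
	queries map[string]*query
}

// onMessageReceived looks up the query and forwards the response to it.
// It reports false when no such query exists.
func (p *pool) onMessageReceived(queryID, node string) bool {
	q, ok := p.queries[queryID]
	if !ok {
		return false
	}
	q.onMessageReceived(node)
	return true
}

func main() {
	it := &iterator{nodes: map[string]nodeState{"n1": nodeWaiting}}
	p := &pool{queries: map[string]*query{"q1": {iter: it}}}
	fmt.Println(p.onMessageReceived("q1", "n1"), it.nodes["n1"] == nodeResponded)
}
```

Here state is mutated directly on receipt of the inbound event, so nothing needs to be queued, but every layer has to expose a method per event type.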

iand · 2023-08-15T09:17:23Z

@dennis-tra I moved the event queues out of the state machines and into the coordinator. After some thought I ended up agreeing with your point that this is the responsibility of the coordinator, and it allows state machines to be composed in a simple way.
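As a rough illustration of that composition: once state machines no longer own queues, a parent can drive a child by calling its Advance method synchronously, so every event passes through the parent first. The `include` and `pool` types, the event names and the state strings below are hypothetical, and the context parameter is omitted.

```go
package main

import "fmt"

// Event and State are placeholder types for this sketch.
type Event string
type State string

// pool is a toy child state machine with no queue of its own.
type pool struct {
	started int
}

func (p *pool) Advance(ev Event) State {
	if ev == "start query" {
		p.started++
		return "pool: query running"
	}
	return "pool: idle"
}

// include is a hypothetical parent state machine that owns a child pool.
type include struct {
	child *pool
	seen  int
}

// Advance lets the parent observe the event, then passes it down to the
// child synchronously. There is no child queue to manage.
func (i *include) Advance(ev Event) State {
	i.seen++
	return "include: " + i.child.Advance(ev)
}

func main() {
	sm := &include{child: &pool{}}
	fmt.Println(sm.Advance("start query"))
}
```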

Based on sm-bootstrap branch (#90)

This adds a state machine for running the include process described in #45. The state machine manages a queue of candidate nodes and processes them by checking whether they respond to a find node request. Candidates that respond with one or more closer nodes are considered live and added to the routing table. Nodes that do not respond or do not provide any suggested closer nodes are dropped from the queue. The number of concurrent checks in flight is configurable.

Not done yet:

- [ ] check timeouts
- [ ] removing nodes failing checks from routing table
- [ ] notifying of unroutable nodes

iand requested review from guillaumemichel and dennis-tra as code owners August 11, 2023 09:00

iand force-pushed the sm-bootstrap branch from 08a14ae to 29adb5e Compare August 11, 2023 14:22

dennis-tra reviewed Aug 14, 2023


iand mentioned this pull request Aug 16, 2023

Add include state machine #95

Merged

3 tasks

iand added 10 commits August 16, 2023 12:58

Add bootstrap state machine

1707f9f

Integrate bootstrap into coordinator

4b490a2

Remove unused test function

c827b72

Port some fixes from PR#88

101a08f

Coordinator sets bootstrap config

3f25b85

Fix up after event flattening

aa04975

Remove unused method

48c62fb

Use separate function for sending bootstrap messages

7862906

Move queues out of state machines

d8b917e

Remove routing.go, included by mistake

98506fa

iand force-pushed the sm-bootstrap branch from 52edf4e to 98506fa Compare August 16, 2023 11:58

iand merged commit accf5ea into main Aug 16, 2023

iand deleted the sm-bootstrap branch August 16, 2023 15:18
