
Randomised processing of objects #10

Open
FrederikSchaff opened this issue Oct 2, 2018 · 3 comments

@FrederikSchaff
Collaborator

In cases where current information is relevant for a process, LSD's fixed-order updating is problematic. Currently this is circumvented by writing an explicit cycle over the objects and processing them in random order. It would be good if one could instead flag objects as "randomised" (or if this were the default). A random-cycle macro would also be useful.

One way to implement this would be to change the current updating scheme as follows: first, create a queue of all the variables that are to be updated; next, randomise the queue; then process it as usual. Problems such as coping with deleted objects would have to be taken care of, e.g. by combining this with the delete flag suggested in issue #9 "Delete Self".
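The queue-shuffle-process idea above can be sketched independently of LSD's internals. Everything here (`PendingVar`, `randomized_step`) is a hypothetical stand-in, not LSD API; the point is only that seeding the shuffle keeps the order reproducible:

```cpp
#include <algorithm>
#include <random>
#include <string>
#include <vector>

// Hypothetical stand-in for a variable pending update this time step.
struct PendingVar {
    std::string label;
    bool updated = false;
};

// Build the update queue, shuffle it with a seeded RNG, then process it
// in the shuffled order. Deleted objects would still need the care
// discussed above (e.g. the flag from issue #9).
void randomized_step(std::vector<PendingVar> &queue, unsigned seed) {
    std::mt19937 rng(seed);                        // the seed controls the order
    std::shuffle(queue.begin(), queue.end(), rng);
    for (auto &v : queue)
        v.updated = true;                          // placeholder for the real update
}
```

Because the shuffle is driven by the run's seed, two runs with the same seed visit the variables in the same order, which is what keeps Monte Carlo experiments reproducible.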

@MCP1
Collaborator

MCP1 commented Oct 2, 2018

I understand your point. It can be useful, in certain situations, to randomize the order of evaluation of variables when more than one valid sequence is available.
But there is a big "if" to be answered first: are the possible alternative computation sequences equivalent (in the sense of representing the same model)? The general answer is no. Quite frequently, even a deterministic model (no stochastic components) produces different results when computed using different valid sequences. Currently, LSD picks the first "route" it finds and keeps it. Is this the "best" one? We don't know, but there is a simple strategy in place (see below) and one consistent result. Which are the real alternative routes? That depends heavily on the model structure itself.
How could you reproduce your results or perform a Monte Carlo experiment in this situation? Imagine tracking a bug when every execution can take a different route, or trying to understand a complex emergent result.
Last, but pragmatically most important, this would require a complete redesign of the LSD scheduler. Currently, we use a recursive scheduler: we pick the first variable at the topmost object and compute it entirely, including any dependencies it may have. Next, the second topmost variable is computed, if it was not already evaluated while the first one was being calculated. And so on, until the last variable in the last object at the lowest level is updated, directly (in the top-down order) or not (previously computed by an upstream update). In a typical model, more than 80% of the variables are updated by recursion, on the fly, and that order cannot be controlled programmatically. The other 20% are more likely statistics-gathering variables, for which the computation order is not relevant anyway (though they cannot be computed before their dependencies).
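The demand-driven recursion described above can be illustrated with a toy sketch (a minimal `Var` type and `eval` function, not LSD's actual classes): asking for a variable's value first evaluates its dependencies, and a variable already computed this step is never recomputed.

```cpp
#include <map>
#include <string>
#include <vector>

// Toy variable: its dependencies, a "computed this step" flag, and a value.
struct Var {
    std::vector<std::string> deps;
    bool computed = false;
    int value = 0;
};

// Demand-driven evaluation: recursion pulls in dependencies first,
// and the flag prevents recomputation within the same time step.
int eval(std::map<std::string, Var> &model, const std::string &name) {
    Var &v = model.at(name);
    if (v.computed)
        return v.value;                  // already updated this step
    int sum = 1;                         // placeholder "equation"
    for (const auto &d : v.deps)
        sum += eval(model, d);           // update dependencies on the fly
    v.computed = true;
    v.value = sum;
    return v.value;
}
```

In this sketch, the order in which dependencies are pulled in is fixed by the recursion itself, which mirrors why most of the update order in LSD cannot be controlled programmatically.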
For all those reasons, I guess such a change may be of limited value despite incurring significant coding and testing demands. On the other hand, designing the relevant (as per the model logic) multi-instance variables to be computed in random order is relatively simple. If you have a particular situation in mind, let us know so we can suggest an approach. Most of the time, randomly sorting the desired object chain every time step would do.
As a side note, maybe just allowing parallel computation of the desired multi-instance variable would produce the desired results in some situations. When computing these variables in parallel there is no strictly repeated updating sequence every time step.

@FrederikSchaff
Collaborator Author

I will try to respond to all issues raised.

  • MC experiments: This is only problematic if you want to re-run models that were built without randomisation of processing, or whenever we change the way random numbers are used in these processes. The deterministic structure itself is not changed.
  • Recursive/Endogenous updating: I also think this is a big strength of LSD and I would not change it. I would only change the default order of updating (outside of recursions).
  • SA argument: As you rightly point out, even deterministic models may depend on the updating scheme, i.e. the order in which actions happen. This is exactly the point: it is a theoretical aspect of the model. Obviously there needs to be a default, but I would argue it should not be a constant default but a randomised one, depending on the seed. Otherwise, if a user implements a model that depends on the sequence of computations of objects with the same label and does not manually control it (perhaps because the user simply doesn't recognise the dependency), the results are not robust at all.
  • Manual control: One can control this directly, but it is cumbersome. You need to define a master equation at the top level and then call the lower-level variables in random order in a loop.
  • Debugging/Validation: This could be helped with a simple switch that enables or disables the randomisation of the updating scheme.
  • Parallelisation: I think objects/variables for which the sequence is critical should NEVER be parallelised. It implies a loss of control, and deterministic reproduction also becomes impossible, because the seed does not control the parallelisation.
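The "master equation" workaround mentioned above could look roughly like the following sketch (hypothetical `Child` type and `master_update` function, not LSD code): the parent-level equation shuffles a copy of the child pointer list and triggers each child's update in that order.

```cpp
#include <algorithm>
#include <random>
#include <vector>

// Hypothetical lower-level object whose update order matters.
struct Child {
    int state = 0;
    void update(int t) { state = t; }   // placeholder equation
};

// Master equation at the top level: shuffle a COPY of the child chain
// (the original chain order is left untouched) and update in that order.
void master_update(std::vector<Child*> children, std::mt19937 &rng, int t) {
    std::shuffle(children.begin(), children.end(), rng);
    for (Child *c : children)
        c->update(t);
}
```

Passing the pointer vector by value is what makes this non-destructive: the model's object chain keeps its original order, only the visit order within this equation is randomised.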

The biggest point is the pragmatic issues. I think it is possible to find a solution. A first simple help would be a random-order cycle (and a cycle-safe one). I have implemented something like this in the GIS enhancement I am working on and could implement it right away; this would make manual control easier. But if you agree, I could also think about a complete randomisation of the default updating scheme (with a switch to turn it on/off).
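A random-order, cycle-safe loop could be built on a shuffled snapshot of the object chain. `CYCLE_RND` below is a hypothetical name in the spirit of LSD's `CYCLE` macros, not an existing LSD macro; the snapshot also makes the loop safe against additions or deletions during iteration:

```cpp
#include <algorithm>
#include <random>
#include <vector>

// Return a shuffled copy of the pointer chain. Taking a snapshot first
// is what makes the cycle "safe": mutating the original chain inside
// the loop body cannot invalidate the iteration.
template <typename T>
std::vector<T*> shuffled_copy(const std::vector<T*> &chain, std::mt19937 &rng) {
    std::vector<T*> copy = chain;
    std::shuffle(copy.begin(), copy.end(), rng);
    return copy;
}

// Hypothetical macro in the spirit of LSD's CYCLE family.
#define CYCLE_RND(cur, chain, rng) for (auto *cur : shuffled_copy(chain, rng))
```

Usage would mirror a normal cycle, e.g. `CYCLE_RND(cur, agents, rng) cur->step();`, with the seeded `rng` keeping the visit order reproducible across runs.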

I am sure that if we want to get more people to use LSD, this is a crucial point.

@MCP1
Collaborator

MCP1 commented Oct 3, 2018

Hi Frederik. Thanks for the detailed analysis.

I'm still not convinced and, anyway, I don't have the time to invest in this direction for now, as it is a significant effort. However, feel free to adapt LSD to your needs. If you come up with a solution that is end-user-ready, we'd be very happy to incorporate it into mainline LSD.

Just to provide more basis to my feelings, some comments on your points (no answer is necessary).

  1. Not many models control for that. I don't remember any. Mine, never.
  2. Ok, but you can't have your cake and eat it too. To control order you have to care about the possible orders, and the current scheduler model is no longer useful. Also, we would need algorithms to find the set of all possible orders before picking one randomly in a robust way, which I guess is not a problem we can solve in polynomial time. Randomizing over an incomplete set of possible orders looks like a bad idea from a sampling/experimental perspective.
  3. This is the point when you are studying the particular subject of alternative processing orders, as you are. Most models simply don't want to deal with it, as the complexity of controlling this kind of thing grows exponentially with the model size. The current recursive scheduler provides only a "good-enough" evaluation sequence for most applications, as practice shows, but it is flexible enough for modelers to control order when required.
  4. If it is cumbersome when you know what you want to randomize, imagine doing it for any code, even assuming it is feasible in the first place. The approach you suggest seems limited both in terms of performance for any larger model (it would randomize blindly over things nobody wants randomized) and sample representativeness (as it does not exhaust all the possible orders, why would the proposed ones be the good/representative ones?). It seems more efficient to me to take care of order when it is relevant and custom-design a solution for the problem at hand.
  5. True, but as soon as you turn the switch on, there is no way to find the remaining bugs not exercised by the default order. Bugs appearing only under particular execution orders of equations are quite common.
  6. First, you control whether to parallelize or not, at the object level. Second, knowing ex ante which sequences are critical in practice is very unlikely before facing real problems in larger models (and it is not a top question for most modelers). Third, you don't parallelize things that can't be parallelized, as LSD tells you (or crashes...). Fourth, the seed and the random number generators operate exactly the same under parallelization; you simply don't control the order in which object instances are updated (which is what you are proposing to do).

FrederikSchaff added a commit that referenced this issue Apr 3, 2021