
Adding support for OperationCode.StopAllExecution #1115

Merged · erikvansebille merged 7 commits into v3.0 on Oct 2, 2023

Conversation

erikvansebille (Member)

This PR fixes a bug where `OperationCode.StopExecution` was not propagated up to `pset.execute()`, so that a run did not stop as soon as the signal was triggered.
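
In outline, the propagation works as sketched below: the kernel-level execute returns the status code instead of swallowing it, and the particle-set loop breaks on it. This is a minimal sketch of the idea under discussion, not the actual diff; the class and method shapes are simplified, and the `OperationCode` value is illustrative.

```python
# Minimal sketch of the propagation this PR introduces; class and method
# shapes are simplified for illustration and are not the actual diff.
class OperationCode:
    StopExecution = 1  # illustrative value only

class Kernel:
    def execute(self, pset, endtime, dt):
        # Run the compiled or interpreted kernel loop and keep its status.
        res = self.execute_jit(pset, endtime, dt)  # or self.execute_python(...)
        return res  # previously the status code was swallowed here

class ParticleSet:
    def execute(self, kernel, endtime, dt):
        while self.time < endtime:
            res = kernel.execute(self, endtime, dt)
            if res == OperationCode.StopExecution:
                break  # stop the run as soon as the signal is raised
            self.time += dt
```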

@erikvansebille erikvansebille requested a review from CKehl January 4, 2022 07:57
@CKehl (Contributor) left a comment


This change makes little sense to me, to be honest.

  1. Neither the [Base]Kernel[SOA|AOS].execute(...) function nor the [Base]ParticleSet[SOA|AOS].execute(...) function is expected to return any value - why now?
  2. The error code is emitted by the execute_jit(...) or execute_python(...) function in order to be evaluated directly in the main execute(...) function of the Kernel - it needs to be treated there, not somewhere higher up.
  3. This PR propagates the error up to ParticleSet.execute(...), and the StopExecution error code is then apparently the only one treated in the BaseParticleSet - that looks inconsistent. Wouldn't it also be an option to just set the time attribute to the simulation endtime (or: return that value)? That approach would be more consistent with the current layout of the main while-loop in the BaseParticleSet.execute(...) function.
  4. The code presented here only works for SoA, but not for AoS, because its execute(...) function looks different - please correct that by at least implementing the same things for the AoS ParticleSet and Kernel.
  5. The change is not tested in MPI and I doubt it works in MPI, because only the worker where the Particle stops leaves the execute function, while the other workers will happily continue their work. If one wants that to work, one needs to first gather all res values in the BaseParticleSet.execute(...) with root=0 and then broadcast that gathered (maximal) value to all workers (see the sketch after this list). PS: that will slow down execution because that 'gather-and-broadcast' point acts as a (frequently occurring) synchronization barrier.

Please reconsider the approach here to make it work for all currently supported execution settings.
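
For point 5, here is a minimal mpi4py sketch of the gather-and-broadcast synchronization described above; the `max()` reduction follows the "gathered (maximal) value" wording, and all names (`res`, `synchronize_status`) are illustrative, not actual Parcels internals.

```python
# Minimal mpi4py sketch of the gather-and-broadcast from point 5.
# Names are illustrative, not actual Parcels internals.
from mpi4py import MPI

comm = MPI.COMM_WORLD

def synchronize_status(local_res):
    """Gather every worker's status code on rank 0, reduce to the
    maximal value, and broadcast it back so all workers agree."""
    all_res = comm.gather(local_res, root=0)  # list on rank 0, None elsewhere
    global_res = max(all_res) if comm.Get_rank() == 0 else None
    return comm.bcast(global_res, root=0)

# Inside the while-loop of BaseParticleSet.execute(...):
#   res = synchronize_status(res)   # frequently occurring synchronization barrier
#   if res == OperationCode.StopExecution:
#       break                       # now *all* workers leave the loop
```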

@erikvansebille (Member, Author)

Thanks for the feedback, @CKehl. I agree with your point 5 that this solution might not work well with MPI.

To respond to all your points separately:

  1. Well, this was my proposed solution to propagate the OperationCode all the way up to `ParticleSet.execute()`. We may choose to implement this for other OperationCodes as well, but it's not directly necessary (yet), because any function returns None by default.
  2. No, it needs to be treated in `pset.execute()` because that is the call that users make and that should be stopped. This was exactly the bug that this PR fixes;
  3. Hmm, I don't like the idea of changing the `time`. What if users want to use the `OperationCode.StopExecution` to change something in their `pset` and then continue with another `pset.execute()`, expecting to continue from where they left off? It would be a surprise if `time` had then changed without them expecting it (see the sketch after this list).
  4. No, I changed both `kernelaos.py` and `kernelsoa.py`. No changes are required in either `particlesetaos.py` or `particlesetsoa.py`, because `pset.execute()` is handled by `baseparticleset.py` (note that it may need to be changed for Nodes, I have not looked into that yet).
  5. As I said above, this is a valid point. But what would you propose then? We now have a broken feature (calling `OperationCode.StopExecution` does not actually stop execution), so we need a fix...
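
To illustrate point 3, a hypothetical user script: a kernel flags StopExecution, the user modifies the pset, then resumes. The import path, the `particle.state` attribute, and the kernel signature follow the usual Parcels kernel style, but treat them as assumptions rather than verified API.

```python
# Hypothetical user workflow from point 3; the import path, particle.state
# attribute, and kernel signature are assumptions, not verified Parcels API.
from parcels import OperationCode

def StopAtThreshold(particle, fieldset, time):
    if particle.lon > 20.0:  # made-up stopping condition
        particle.state = OperationCode.StopExecution

# pset.execute(StopAtThreshold, runtime=..., dt=...)  # returns early
# ...user inspects or modifies pset here...
# pset.execute(StopAtThreshold, runtime=..., dt=...)  # continues from the
# stored particle time - which is why execute() should not silently alter time
```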

@CKehl (Contributor)

CKehl commented Jan 4, 2022

Thanks for your response, @erikvansebille.

> Thanks for the feedback, @CKehl. I agree with your point 5 that this solution might not work well with MPI.
>
> To respond to all your points separately:
>
> 1. Well, this was my proposed solution to propagate the OperationCode all the way up to `ParticleSet.execute()`. We may choose to implement this for other OperationCodes as well, but it's not directly necessary (yet), because any function returns None by default.

That's an interesting point. If the new intention is to return "the" status code, then obviously the default return of that function needs to be StatusCode.Success and not None, otherwise we get an ambiguity about what those status codes actually represent.

Code design question: how would you define "the" return status code for a set of particles with multiple different status codes? Is there a priority hierarchy embedded within the codes? I think for now those status codes just arbitrarily map to some sequential number, but that number is not priority-ordered or representative of one status being more critical than another. That makes it undefined which one status code, out of all possible status codes of all particles, should be returned in the end. I guess it's up to the code designer here, but if I were the code author I'd certainly prefer a conclusive approach rather than a quick fix - otherwise this open question will inevitably return down the line. That said, I agree with the viewpoint of giving the user some feedback - it's certainly nice to have (a sketch of one possible approach follows below).
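
One way to make that reduction well-defined, sketched below: give the codes an explicit priority ordering and return the highest-priority code across all particles, defaulting to StatusCode.Success rather than None. The enum values here are made up for illustration.

```python
# Sketch of a priority-ordered status reduction; the enum values are
# made up for illustration and do not match the real Parcels codes.
from enum import IntEnum

class StatusCode(IntEnum):
    # Higher value = more critical; numbers chosen only for this sketch.
    Success = 0
    Repeat = 1
    Delete = 2
    StopExecution = 3
    Error = 4

def reduce_status(particle_states):
    """Return the most critical status across all particles,
    defaulting to Success (not None) for an empty or clean set."""
    return max(particle_states, default=StatusCode.Success)
```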

> 2. No, it needs to be treated in `pset.execute()` because that is the call that users make and that should be stopped. This was exactly the bug that this PR fixes;

I agree that the main loop should be stopped. I don't necessarily agree that this must be returned to the user, but as said before: it's nice to have.

> 3. Hmm, I don't like the idea of changing the `time`. What if users want to use the `OperationCode.StopExecution` to change something in their `pset` and then continue with another `pset.execute()`, expecting to continue from where they left off? It would be a surprise if `time` had then changed without them expecting it.

I see and agree with the response. This can still be done more conclusively. To my understanding, the (novel) definition is: if 1 particle sets status = OperationCode.StopExecution, then all further simulation is to be stopped. This can also be done internally in the kernel itself, by then setting the status of all particles to OperationCode.StopExecution (see the sketch below). Advection itself is only executed for particles with the status StatusCode.Evaluate, so the execute_[jit|python] function would then naturally skip further simulation steps. Another weak point right now: there is no orderly conclusion of the simulation. The way it is now, the function returns immediately, skipping the outfile.export(...) that comes after the while-loop in pset.execute(...). OperationCode.StopExecution is, to my understanding, a controlled simulation end, so the particles should be written and exported, right?
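
A sketch of that in-kernel approach: as soon as one particle carries OperationCode.StopExecution, flag every particle, so the execute_[jit|python] loop (which only advects StatusCode.Evaluate particles) skips all further steps naturally. All names and values here are illustrative, not the actual Kernel internals.

```python
# Sketch of the in-kernel approach: propagate one particle's stop request
# to all particles so further advection is skipped naturally.
class OperationCode:
    StopExecution = 1  # illustrative value only

def propagate_stop(particles):
    """If any particle requests StopExecution, flag all particles so the
    execute_[jit|python] loop (which only advects StatusCode.Evaluate
    particles) naturally skips every further simulation step."""
    if any(p.state == OperationCode.StopExecution for p in particles):
        for p in particles:
            p.state = OperationCode.StopExecution
```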

> 4. No, I changed both `kernelaos.py` and `kernelsoa.py`. No changes are required in either `particlesetaos.py` or `particlesetsoa.py`, because `pset.execute()` is handled by `baseparticleset.py` (note that it may need to be changed for Nodes, I have not looked into that yet).

I assumed we had already changed that in order to make the InteractionKernel less hacky - apparently we did not until now. So, in that case, your point is correct: both structures use the BaseParticleSet.execute(...) function definition.

> 5. As I said above, this is a valid point. But what would you propose then? We now have a broken feature (calling `OperationCode.StopExecution` does not actually stop execution), so we need a fix...

Well, I gave some points to reconsider in my comments, so I propose to think about the code design a bit more, account for the comments, and then also actually test this kind of work in MPI, with a proper simulation of >1024 particles, to verify that it all works correctly with the currently supported features. I do think this fix will take more time than has been spent on it so far, given the issues to consider. In short: I think it needs more than just "a fix".

Also, depending on the original definition, there is nothing broken: originally, this status is meant as an indicator for a single particle, and for that, this particle is not included in the simulation any further - that one particle is "stopped". What has apparently changed with this "issue" is the definition of what that status means, and that is where the discussion now arises, I believe.

@erikvansebille (Member, Author)

Good point about the ambiguity of whether StopExecution is intended to stop execution of a single particle or of all particles; I suggest that we pause this discussion and rethink what exactly is needed (and feasible to implement, also in light of @CKehl's points about MPI).
I will close this PR for now.

@erikvansebille (Member, Author)

Looking at the discussion above, I still think it's useful to have an OperationCode to stop the execution of all particles; but the solution would be to add a new StopAllExecution code. I will implement that soon(ish).
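
For reference, usage of the new code might look like the hypothetical kernel below; the exact import path, attribute name, and stopping condition are assumptions based on this thread, not verified against the merged implementation.

```python
# Hypothetical usage of the new all-particle stop code; the import path,
# attribute name, and condition are assumptions, not the verified merged API.
from parcels import OperationCode

def CheckStop(particle, fieldset, time):
    if time >= 86400:  # made-up condition: stop after one simulated day
        particle.state = OperationCode.StopAllExecution  # halts the whole run
```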

@erikvansebille erikvansebille changed the title Fixing a bug in OperationCode.StopExecution Adding support for OperationCode.StopAllExecution Jun 23, 2023
@erikvansebille erikvansebille mentioned this pull request Sep 28, 2023
@erikvansebille erikvansebille changed the base branch from master to v3.0 October 2, 2023 09:29
@erikvansebille erikvansebille merged commit 84120d9 into v3.0 Oct 2, 2023
@erikvansebille erikvansebille deleted the bugfix_OperationCodeStopExecution branch October 2, 2023 10:30