
Adding support for OperationCode.StopAllExecution #1115

Merged · erikvansebille merged 7 commits into v3.0 on Oct 2, 2023

Conversation

erikvansebille (Member)

This PR fixes a bug where `OperationCode.StopExecution` was not propagated up to `pset.execute()`, so that a run did not stop as soon as the signal was triggered.
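
In outline, the propagation works as sketched below: the kernel-level execute returns the status code instead of swallowing it, and the particle-set loop breaks on it. This is a minimal sketch of the idea under discussion, not the actual diff; the class and method shapes are simplified, and the `OperationCode` value is illustrative.

```python
# Minimal sketch of the propagation this PR introduces; class and method
# shapes are simplified for illustration and are not the actual diff.
class OperationCode:
    StopExecution = 1  # illustrative value only

class Kernel:
    def execute(self, pset, endtime, dt):
        # Run the compiled or interpreted kernel loop and keep its status.
        res = self.execute_jit(pset, endtime, dt)  # or self.execute_python(...)
        return res  # previously the status code was swallowed here

class ParticleSet:
    def execute(self, kernel, endtime, dt):
        while self.time < endtime:
            res = kernel.execute(self, endtime, dt)
            if res == OperationCode.StopExecution:
                break  # stop the run as soon as the signal is raised
            self.time += dt
```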

@erikvansebille erikvansebille requested a review from CKehl January 4, 2022 07:57
@CKehl (Contributor) left a comment


This change makes little sense to me, to be honest.

  1. Neither the [Base]Kernel[SOA|AOS].execute(...) function nor the [Base]ParticleSet[SOA|AOS].execute(...) function is expected to return any value - why now?
  2. The error code is emitted by the execute_jit(...) or execute_python(...) function in order to be evaluated directly in the main execute(...) function of the Kernel - it needs to be treated there, not somewhere higher up.
  3. This PR propagates the error up to ParticleSet.execute(...), and the StopExecution error code is then apparently the only one treated in the BaseParticleSet - that looks inconsistent. Wouldn't it also be an option to just set the time attribute to the simulation endtime (or: return that value)? That approach would be more consistent with the current layout of the main while-loop in the BaseParticleSet.execute(...) function.
  4. The code presented here only works for SoA, but not for AoS, because its execute(...) function looks different - please correct that by at least implementing the same things for the AoS ParticleSet and Kernel.
  5. The change is not tested in MPI and I doubt it works in MPI, because only the worker where the Particle stops leaves the execute function, while the other workers will happily continue their work. If one wants that to work, one needs to first gather all res values in the BaseParticleSet.execute(...) with root=0 and then broadcast that gathered (maximal) value to all workers (see the sketch after this list). PS: that will slow down execution because that 'gather-and-broadcast' point acts as a (frequently occurring) synchronization barrier.

Please reconsider the approach here to make it work for all currently supported execution settings.
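
For point 5, here is a minimal mpi4py sketch of the gather-and-broadcast synchronization described above; the `max()` reduction follows the "gathered (maximal) value" wording, and all names (`res`, `synchronize_status`) are illustrative, not actual Parcels internals.

```python
# Minimal mpi4py sketch of the gather-and-broadcast from point 5.
# Names are illustrative, not actual Parcels internals.
from mpi4py import MPI

comm = MPI.COMM_WORLD

def synchronize_status(local_res):
    """Gather every worker's status code on rank 0, reduce to the
    maximal value, and broadcast it back so all workers agree."""
    all_res = comm.gather(local_res, root=0)  # list on rank 0, None elsewhere
    global_res = max(all_res) if comm.Get_rank() == 0 else None
    return comm.bcast(global_res, root=0)

# Inside the while-loop of BaseParticleSet.execute(...):
#   res = synchronize_status(res)   # frequently occurring synchronization barrier
#   if res == OperationCode.StopExecution:
#       break                       # now *all* workers leave the loop
```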

@erikvansebille (Member, Author)

Thanks for the feedback, @CKehl. I agree with your point 5 that this solution might not work well with MPI.

To respond to all your points separately:

  1. Well, this was my proposed solution to propagate the OperationCode all the way up to `ParticleSet.execute()`. We may choose to implement this for other OperationCodes as well, but it's not directly necessary (yet), because any function returns None by default.
  2. No, it needs to be treated in `pset.execute()` because that is the call that users make and that should be stopped. This was exactly the bug that this PR fixes;
  3. Hmm, I don't like the idea of changing the `time`. What if users want to use the `OperationCode.StopExecution` to change something in their `pset` and then continue with another `pset.execute()`, expecting to continue from where they left off? It would be a surprise if `time` had then changed without them expecting it (see the sketch after this list).
  4. No, I changed both `kernelaos.py` and `kernelsoa.py`. No changes are required in either `particlesetaos.py` or `particlesetsoa.py`, because `pset.execute()` is handled by `baseparticleset.py` (note that it may need to be changed for Nodes, I have not looked into that yet).
  5. As I said above, this is a valid point. But what would you propose then? We now have a broken feature (calling `OperationCode.StopExecution` does not actually stop execution), so we need a fix...
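
To illustrate point 3, a hypothetical user script: a kernel flags StopExecution, the user modifies the pset, then resumes. The import path, the `particle.state` attribute, and the kernel signature follow the usual Parcels kernel style, but treat them as assumptions rather than verified API.

```python
# Hypothetical user workflow from point 3; the import path, particle.state
# attribute, and kernel signature are assumptions, not verified Parcels API.
from parcels import OperationCode

def StopAtThreshold(particle, fieldset, time):
    if particle.lon > 20.0:  # made-up stopping condition
        particle.state = OperationCode.StopExecution

# pset.execute(StopAtThreshold, runtime=..., dt=...)  # returns early
# ...user inspects or modifies pset here...
# pset.execute(StopAtThreshold, runtime=..., dt=...)  # continues from the
# stored particle time - which is why execute() should not silently alter time
```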

@CKehl (Contributor)

CKehl commented Jan 4, 2022

Thanks for your response, @erikvansebille.

> Thanks for the feedback, @CKehl. I agree with your point 5 that this solution might not work well with MPI.
>
> To respond to all your points separately:
>
> 1. Well, this was my proposed solution to propagate the OperationCode all the way up to `ParticleSet.execute()`. We may choose to implement this for other OperationCodes as well, but it's not directly necessary (yet), because any function returns None by default.

That's an interesting point. If the new intention is to return "the" status code, then obviously the default return of that function needs to be StatusCode.Success and not None, otherwise we get an ambiguity about what those status codes actually represent.

Code design question: how would you define "the" return status code for a set of particles with multiple different status codes? Is there a priority hierarchy embedded within the codes? I think for now those status codes just arbitrarily map to some sequential number, but that number is not priority-ordered or representative of one status being more critical than another. That makes it undefined which one status code, out of all possible status codes of all particles, should be returned in the end. I guess it's up to the code designer here, but if I were the code author I'd certainly prefer a conclusive approach rather than a quick fix - otherwise this open question will inevitably return down the line. That said, I agree with the viewpoint of giving the user some feedback - it's certainly nice to have (a sketch of one possible approach follows below).
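
One way to make that reduction well-defined, sketched below: give the codes an explicit priority ordering and return the highest-priority code across all particles, defaulting to StatusCode.Success rather than None. The enum values here are made up for illustration.

```python
# Sketch of a priority-ordered status reduction; the enum values are
# made up for illustration and do not match the real Parcels codes.
from enum import IntEnum

class StatusCode(IntEnum):
    # Higher value = more critical; numbers chosen only for this sketch.
    Success = 0
    Repeat = 1
    Delete = 2
    StopExecution = 3
    Error = 4

def reduce_status(particle_states):
    """Return the most critical status across all particles,
    defaulting to Success (not None) for an empty or clean set."""
    return max(particle_states, default=StatusCode.Success)
```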

> 2. No, it needs to be treated in `pset.execute()` because that is the call that users make and that should be stopped. This was exactly the bug that this PR fixes;

I agree that the main loop should be stopped. I don't necessarily agree that this must be returned to the user, but as said before: it's nice to have.

> 3. Hmm, I don't like the idea of changing the `time`. What if users want to use the `OperationCode.StopExecution` to change something in their `pset` and then continue with another `pset.execute()`, expecting to continue from where they left off? It would be a surprise if `time` had then changed without them expecting it.

I see and agree with the response. This can still be done more conclusively. To my understanding, the (novel) definition is: if 1 particle sets status = OperationCode.StopExecution, then all further simulation is to be stopped. This can also be done internally in the kernel itself, by then setting the status of all particles to OperationCode.StopExecution (see the sketch below). Advection itself is only executed for particles with the status StatusCode.Evaluate, so the execute_[jit|python] function would then naturally skip further simulation steps. Another weak point right now: there is no orderly conclusion of the simulation. The way it is now, the function returns immediately, skipping the outfile.export(...) that comes after the while-loop in pset.execute(...). OperationCode.StopExecution is, to my understanding, a controlled simulation end, so the particles should be written and exported, right?
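
A sketch of that in-kernel approach: as soon as one particle carries OperationCode.StopExecution, flag every particle, so the execute_[jit|python] loop (which only advects StatusCode.Evaluate particles) skips all further steps naturally. All names and values here are illustrative, not the actual Kernel internals.

```python
# Sketch of the in-kernel approach: propagate one particle's stop request
# to all particles so further advection is skipped naturally.
class OperationCode:
    StopExecution = 1  # illustrative value only

def propagate_stop(particles):
    """If any particle requests StopExecution, flag all particles so the
    execute_[jit|python] loop (which only advects StatusCode.Evaluate
    particles) naturally skips every further simulation step."""
    if any(p.state == OperationCode.StopExecution for p in particles):
        for p in particles:
            p.state = OperationCode.StopExecution
```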

> 4. No, I changed both `kernelaos.py` and `kernelsoa.py`. No changes are required in either `particlesetaos.py` or `particlesetsoa.py`, because `pset.execute()` is handled by `baseparticleset.py` (note that it may need to be changed for Nodes, I have not looked into that yet).

I assumed we had already changed that in order to make the InteractionKernel less hacky - apparently we did not until now. So, in that case, your point is correct: both structures use the BaseParticleSet.execute(...) function definition.

> 5. As I said above, this is a valid point. But what would you propose then? We now have a broken feature (calling `OperationCode.StopExecution` does not actually stop execution), so we need a fix...

Well, I gave some points to reconsider in my comments, so I propose to think about the code design a bit more, account for the comments, and then also actually test this kind of work in MPI, with a proper simulation of >1024 particles, to verify that it all works correctly with the currently supported features. I do think this fix will take more time than has been spent on it so far, given the issues to consider. In short: I think it needs more than just "a fix".

Also, depending on the original definition, there is nothing broken: originally, this status is meant as an indicator for a single particle, and for that, this particle is not included in the simulation any further - that one particle is "stopped". What has apparently changed with this "issue" is the definition of what that status means, and that is where the discussion now arises, I believe.

@erikvansebille (Member, Author)

Good point about the ambiguity of whether StopExecution is intended to stop execution of a single particle or of all particles; I suggest that we pause this discussion and rethink what exactly is needed (and feasible to implement, also in light of @CKehl's points about MPI).
I will close this PR for now.

@erikvansebille (Member, Author)

Looking at the discussion above, I still think it's useful to have an OperationCode to stop the execution of all particles; but the solution would be to add a new StopAllExecution code. I will implement that soon(ish).
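
For reference, usage of the new code might look like the hypothetical kernel below; the exact import path, attribute name, and stopping condition are assumptions based on this thread, not verified against the merged implementation.

```python
# Hypothetical usage of the new all-particle stop code; the import path,
# attribute name, and condition are assumptions, not the verified merged API.
from parcels import OperationCode

def CheckStop(particle, fieldset, time):
    if time >= 86400:  # made-up condition: stop after one simulated day
        particle.state = OperationCode.StopAllExecution  # halts the whole run
```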

@erikvansebille erikvansebille changed the title Fixing a bug in OperationCode.StopExecution Adding support for OperationCode.StopAllExecution Jun 23, 2023
@erikvansebille erikvansebille mentioned this pull request Sep 28, 2023
@erikvansebille erikvansebille changed the base branch from master to v3.0 October 2, 2023 09:29
@erikvansebille erikvansebille merged commit 84120d9 into v3.0 Oct 2, 2023
@erikvansebille erikvansebille deleted the bugfix_OperationCodeStopExecution branch October 2, 2023 10:30