Adding support for OperationCode.StopAllExecution #1115
This change makes little sense to me, to be honest.
- Neither the `[Base]Kernel[SOA|AOS].execute(...)` function nor the `[Base]ParticleSet[SOA|AOS]` execute function is expected to return any value - why now?
- The error code is emitted by the `execute_jit(...)` or `execute_python(...)` function in order to be evaluated directly in the main `execute(...)` function of the `Kernel` - it needs to be treated there, not some place higher up.
- This PR propagates the error up to `ParticleSet.execute(...)`, and then the `StopExecution` error code is apparently the only one treated in the `BaseParticleSet` - it looks inconsistent. Wouldn't it also be an option to just set the `time` attribute to the simulation `endtime` (or: return that value)? That approach would be more consistent with the current layout of the main while-loop in the `BaseParticleSet.execute(...)` function.
- The code presented here only works for SoA, but not for AoS, because its `execute(...)` function looks different - please correct that by at least implementing the same things for the AoS ParticleSet and Kernel.
- The change is not tested in MPI and I doubt it works in MPI, because only the worker where the `Particle` stops leaves the execute function, while the other workers will happily continue their work. If one wants that to work, one needs to first `gather` all `res` values in the `BaseParticleSet.execute(...)` with `root=0` and then `broadcast` that gathered (maximal) value to all workers. PS: that will slow down execution because that '`gather`-and-`broadcast`' point acts as a (frequently occurring) synchronization barrier.
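The gather-and-broadcast step from the last point could look roughly like the sketch below. It is an illustration only, not the Parcels implementation: `agree_on_status`, `SingleProcessComm` and the numeric status values are hypothetical. The `comm` argument is assumed to offer `Get_rank()`, `gather(obj, root=0)` and `bcast(obj, root=0)`, as mpi4py's `MPI.COMM_WORLD` does; the stand-in class only exists so that the sketch runs without MPI. Taking the maximum assumes that the stop code is the largest status value.

```python
class SingleProcessComm:
    """Hypothetical stand-in for an MPI communicator with a single worker."""

    def Get_rank(self):
        return 0

    def gather(self, obj, root=0):
        # With one worker, the root simply receives its own value.
        return [obj]

    def bcast(self, obj, root=0):
        return obj


def agree_on_status(comm, local_res):
    """Return the maximal status code over all workers, on every worker."""
    # On the root this is a list with one entry per worker; elsewhere it is None.
    all_res = comm.gather(local_res, root=0)
    global_res = max(all_res) if comm.Get_rank() == 0 else None
    # Every worker waits here until the root has sent the value: this is the
    # synchronization point that slows execution down.
    return comm.bcast(global_res, root=0)


SUCCESS, STOP_EXECUTION = 0, 4  # hypothetical numeric codes

comm = SingleProcessComm()  # with mpi4py this would be MPI.COMM_WORLD
res = agree_on_status(comm, STOP_EXECUTION)
```

Each worker would then compare `res` against the stop code and leave the main loop together with the others, at the cost of one collective call per loop iteration.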
Please reconsider the approach here to make it work for all currently supported execution settings.
Thanks for the feedback, @CKehl. I agree with your point 5 that this solution might not work well with MPI. To respond to all your points separately
Thanks for your response, @erikvansebille.
That's an interesting point. If that is now a new intention, to return "the" status code, then obviously the default return of that function needs to be
I agree that the main loop should be stopped. I don't necessarily agree that this must be returned to the user, but as said before: nice to have
I see and agree with the response. This still can be done more conclusively. To my understanding, the (novel) definition is: if 1 particle sets
I assumed we already changed that in order to make the
Well, I gave some points to reconsider in my comments, so I propose to think about the code design a bit more, account for the comments, and then actually test this kind of work in MPI as well, with a proper simulation of >1024 particles, to verify that it all works correctly with the currently supported features. I do think this fix will take more time than has been spent on it so far, though, given the issues to consider. In short: I think it needs more than just "a fix". Also, depending on the original definition, there is nothing broken: originally, this status was meant as an indicator for a single particle, and that particle is then not included in the simulation any further - that one particle is "stopped". What has apparently changed with this "issue" is the definition of what that status means, and that is where the discussion now arises, I believe.
Good point about the ambiguity whether |
Looking at the discussion above, I still think it's useful to have an |
This PR fixes a bug where the `OperationCode` to `StopExecution` was not propagated to `pset.execute()`, so that a run would not finish as soon as the signal was triggered.
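As a rough illustration of what propagating the signal means, the sketch below lets a kernel-level function report a stop request to the main loop, which then ends the run early. All names here (`Code`, `kernel_execute`, `pset_execute`, the `stop_at` key) are hypothetical and much simpler than the actual Parcels classes; it only shows the control flow under discussion.

```python
from enum import IntEnum


class Code(IntEnum):
    """Hypothetical status codes; not the Parcels definitions."""
    SUCCESS = 0
    STOP_EXECUTION = 1


def kernel_execute(particles, dt):
    """Advance every particle by dt and report whether any requested a stop."""
    stop_requested = False
    for p in particles:
        p["time"] += dt
        if p["time"] >= p.get("stop_at", float("inf")):
            stop_requested = True
    return Code.STOP_EXECUTION if stop_requested else Code.SUCCESS


def pset_execute(particles, endtime, dt):
    """Main loop: run until endtime, or until the kernel requests a stop."""
    time = 0.0
    while time < endtime:
        res = kernel_execute(particles, dt)
        time += dt
        if res == Code.STOP_EXECUTION:
            break  # finish the run as soon as the signal is raised
    return time


particles = [{"time": 0.0}, {"time": 0.0, "stop_at": 3.0}]
final_time = pset_execute(particles, endtime=10.0, dt=1.0)
```

In this toy setup the second particle requests a stop at time 3.0, so the loop ends well before `endtime`. The alternative raised in review, setting `time` to `endtime` instead of returning a code, would exit the same while-loop without changing any return type.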