-
Notifications
You must be signed in to change notification settings - Fork 89
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Updating the event manager for MPI communication #136
Conversation
@cssherman , is this ready for a review? |
I still need to check on a few things with sub-events, I'll let you know when I think its ready for review |
@@ -240,6 +240,13 @@ void EventBase::CheckEvents(real64 const time, | |||
{ | |||
subEvent->CheckEvents(time, dt, cycle, domain); | |||
}); | |||
|
|||
// Synchronize |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to occur here (instead of in event manager) because we need to know the forecast for all events and their children (to allow nesting)
@@ -377,4 +386,15 @@ void EventBase::Cleanup(real64 const& time_n, | |||
|
|||
|
|||
|
|||
integer EventBase::GetExitFlag() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We were not propagating any exit flags upwards to the main run loop.
@artv3 I think this is ready for you to look at. Also, it looks like the code is only working with 1-2 partitions along each dimension? |
@cssherman did you merge with a different branch? There seems to be a lot of commits unrelated to the event manager. |
|
||
#if USE_MPI | ||
real64 dt_global; | ||
MPI_Allreduce(&dt, &dt_global, 1, MPI_DOUBLE, MPI_MIN, MPI_COMM_WORLD); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can replace the two MPI_Allreduce calls with a single MPI_Gather for a buffer containing both variables, local reduction of each component that was gathered...then a single MPI_Bcast for a buffer containing both variables...then you can overwrite the local variables with the values in the buffer.
…for testing time-stepping behavior
} | ||
}); | ||
|
||
time += dt; | ||
++cycle; | ||
dt = nextDt; | ||
dt = (time + dt > maxTime) ? (maxTime - time) : dt; | ||
|
||
#if USE_MPI |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be more efficient than the two allreduce calls for the dt and exitFlag
@@ -86,6 +86,14 @@ void HaltEvent::EstimateEventTiming(real64 const time, | |||
m_realDt = currentTime - m_lastTime; | |||
m_lastTime = currentTime; | |||
integer forecast = static_cast<integer>((maxRuntime - (currentTime - m_startTime)) / m_realDt); | |||
|
|||
// The timing for the ranks may differ slightly, so synchronize |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This Allreduce can happen here, or we could store a common time-stamp for each of the ranks somewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How will these be different? It seems that currentTime and m_lastTime will always be in sync?? You could sync up m_startTime in the constructor when it is set....wait what is going on here? Why do you care about the real world runtime? Is this for a maximum wall time?
|
||
// Because the function applied to an object may differ by rank, synchronize | ||
// (Note: this shouldn't occur very often, since it is only called if the base forecast <= 0) | ||
#if USE_MPI |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This allreduce should be fairly uncommon, since it will only be applied when the following apply:
- The base forecast is 0
- The forecast is being modified via a function/threshold being applied to an object (not a simple time-series)
@@ -0,0 +1,107 @@ | |||
/* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This solver is meant to toy around with random time-step requests. I've been using it to make sure that the event manager, mpi communications are handling the time-stepping correctly.
#if USE_MPI | ||
MPI_Comm_rank(MPI_COMM_WORLD, &rank); | ||
MPI_Comm_size(MPI_COMM_WORLD, &comm_size); | ||
send_buffer = static_cast<real64 *>(malloc(2 * sizeof(real64))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not deallocated anywhere that I can see??
It is better just to do something like
real64 send_buffer[2];
array1d< real64 > receive_buffer( 2 * comm_size )
{ | ||
for (integer ii=0; ii<comm_size; ii++) | ||
{ | ||
send_buffer[0] = std::min(send_buffer[0], receive_buffer[2*ii]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just use dt instead of send_buffer[0] as the argument?
@@ -86,6 +86,14 @@ void HaltEvent::EstimateEventTiming(real64 const time, | |||
m_realDt = currentTime - m_lastTime; | |||
m_lastTime = currentTime; | |||
integer forecast = static_cast<integer>((maxRuntime - (currentTime - m_startTime)) / m_realDt); | |||
|
|||
// The timing for the ranks may differ slightly, so synchronize |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How will these be different? It seems that currentTime and m_lastTime will always be in sync?? You could sync up m_startTime in the constructor when it is set....wait what is going on here? Why do you care about the real world runtime? Is this for a maximum wall time?
@@ -86,6 +86,14 @@ void HaltEvent::EstimateEventTiming(real64 const time, | |||
m_realDt = currentTime - m_lastTime; | |||
m_lastTime = currentTime; | |||
integer forecast = static_cast<integer>((maxRuntime - (currentTime - m_startTime)) / m_realDt); | |||
|
|||
// The timing for the ranks may differ slightly, so synchronize |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rrsettgast - This event type is meant to handle max wall time. If the forecast for this type of event is <=0, then it will set the exitFlag. Because the currentTime and m_lastTime on each rank will be slightly different, we need to do this check to make sure that all of the ranks will agree.
If you would rather restrict this event from having a executable target, we could skip the allreduce here because the global exitFlag is synced in the main run loop.
No description provided.