WIP - NNFDM implementation in AXL #131
base: main
This basically compiles, but it cannot be fully tested until we have an operational server.
Thanks, @mcfadden8. Nicely done. I haven't checked things in this context, but it would be good to think through whether the HPE API supports SCR's scalable restart and scavenge operations. In the case of a scalable restart, we normally try to cancel any outstanding flush. Since there is no way to cancel, I think we'd need the restarted job to be able to resume and/or wait on any outstanding flush that was started by a previous run. In other words, I don't think we'd want the restarted job to initiate a new flush of files that are already being flushed by a prior run. For scavenge, is there a way for the job script to see the status of a flush started by the last run? If not, will there be problems if we try to copy the files again while a flush may still be ongoing?
Hi @adammoody, for scalable restart, I agree that we would need an API to cancel any outstanding flushes. For both scalable restart and scavenge, I think we will need a way to list the requests that are still in progress from any previous runs. I think these requests may already be documented, but we should discuss to be sure I am understanding things correctly.
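To make the restart concern above concrete, here is a minimal sketch of how a restarted job might split its files between "wait on the existing flush" and "start a new flush". It assumes some way to enumerate the requests still in flight from a prior run, which this thread has not yet confirmed. `RestartPlan`, `plan_restart_flush`, and the in-flight set are hypothetical names for illustration, not part of AXL or the HPE API.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical result type: which files to wait on and which to flush anew.
struct RestartPlan {
    std::vector<std::string> wait_on;    // already being flushed by a prior run
    std::vector<std::string> start_new;  // no outstanding flush found
};

// Assumes the caller obtained in_flight_from_prior_run from some enumeration
// call (hypothetical here). Files already in flight are never re-submitted.
inline RestartPlan plan_restart_flush(
    const std::vector<std::string>& files_to_flush,
    const std::set<std::string>& in_flight_from_prior_run)
{
    RestartPlan plan;
    for (const auto& file : files_to_flush) {
        if (in_flight_from_prior_run.count(file) > 0) {
            plan.wait_on.push_back(file);
        } else {
            plan.start_new.push_back(file);
        }
    }
    return plan;
}
```

The same split would apply to scavenge: anything in `wait_on` should not be copied again until its outstanding flush has finished.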
@adammoody - I've integrated with the latest C++ API provided by HPE. My next step will be to add their new API for canceling and enumerating old jobs, which should allow us to support scalable restart and scavenge.
IF(HAVE_NNFDM)
    LIST(APPEND libaxl_srcs nnfdm.cpp)
ENDIF(HAVE_NNFDM)
I think we'll also want this in src/dist/CMakeLists.txt.
src/nnfdm.cpp
Outdated
do {
    status = nnfdm_stat(uid, max_seconds_to_wait);
} while (status == AXL_STATUS_INPROG);
For future work, we may want to add a sleep() in here to avoid thrashing the server with poll requests.
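One way the suggested sleep could look is a poll loop that backs off between attempts. This is only a sketch: `PollStatus` and `wait_with_backoff` are hypothetical stand-ins, and the `poll` callback represents a call such as `nnfdm_stat(uid, max_seconds_to_wait)` mapped onto the in-progress/done/error states. The delay values are assumptions to be tuned against a real server.

```cpp
#include <algorithm>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical stand-in for the AXL status codes; only the in-progress
// state matters for the loop.
enum class PollStatus { InProgress, Done, Error };

// Calls poll() until it stops reporting InProgress, sleeping between
// attempts. The delay doubles after each attempt and is capped at max_delay.
// Returns the final status; *polls_out (if non-null) receives the poll count.
inline PollStatus wait_with_backoff(
    const std::function<PollStatus()>& poll,
    std::chrono::milliseconds initial_delay,
    std::chrono::milliseconds max_delay,
    int* polls_out = nullptr)
{
    std::chrono::milliseconds delay = initial_delay;
    int polls = 0;
    PollStatus status;
    for (;;) {
        status = poll();
        ++polls;
        if (status != PollStatus::InProgress) {
            break;
        }
        std::this_thread::sleep_for(delay);      // give the server a rest
        delay = std::min(delay * 2, max_delay);  // exponential backoff, capped
    }
    if (polls_out != nullptr) {
        *polls_out = polls;
    }
    return status;
}
```

A capped exponential delay keeps short transfers responsive while limiting the request rate for long ones; a fixed sleep would also work if the expected transfer time is well known.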
Nice. Thanks, @mcfadden8
src/axl.c
Outdated
#ifdef HAVE_NNFDM
    case AXL_XFER_ASYNC_NNFDM:
        break;
#endif /* HAVE_NNFDM */
Let's include a comment here about why we don't have to do anything.
…fdm library currently reports conflicting error messages that will be investigated next week