Cachepath work broken out into functional commits #108
base: devel
Previously, chosen_realized_cachepath was copied into set_intercept_readlink_cachepath(), and chosen_realized_cachepath and chosen_parsed_cachepath were copied into set_should_intercept_cachepath(). This PR removes both setter functions and makes the original pointers global.
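As a rough illustration of the change described above, the sketch below shows a consumer reading a global pointer directly instead of a private copy installed through a setter. The variable names come from the PR description; their types, the helper `path_is_under_cachepath()`, and its matching rule are assumptions made for illustration and are not taken from the Spindle source.

```c
#include <stddef.h>
#include <string.h>

/* Names from the PR description; the char* type is an assumption. */
char *chosen_realized_cachepath = NULL;
char *chosen_parsed_cachepath = NULL;

/* Hypothetical consumer: reads the global directly rather than a copy
 * handed over by a setter. Returns 1 if path is the cache directory
 * itself or lies beneath it, 0 otherwise. */
int path_is_under_cachepath(const char *path)
{
   size_t len;

   if (path == NULL || chosen_realized_cachepath == NULL)
      return 0;
   len = strlen(chosen_realized_cachepath);
   if (len == 0)
      return 0;
   if (strncmp(path, chosen_realized_cachepath, len) != 0)
      return 0;
   /* Require a component boundary so "/tmp/spindle.1" does not
    * match "/tmp/spindle.10". */
   return path[len] == '/' || path[len] == '\0';
}
```

With a single global there is one source of truth, at the cost of every reader depending on initialization order.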
Removes chosen_cachepath and cachepath_bitindex from spindle_launch.h. Updates initialization of matching variables in ldcs_process_data. determineValidCachePaths() moved from spindle_be.cc to ldcs_audit_server_process.c to get ldcs_process_data visibility. Added #include "parseloc.h" to ldcs_audit_server_process.c to get the declaration of determineValidCachePaths(). Relocated "parseloc.h" to src/util so ldcs_audit_server_process.c could find it. Trued up signedness mismatches exposed by making "parseloc.h" more visible, e.g., cachepath_bitidx is now uint64_t everywhere.
The three-message-reply response is now a single message with two strings. The symbolic version of the cachepath is no longer communicated as it was not being used.
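One way to carry two strings in a single message is to pack them back to back, each with its terminating NUL. The sketch below only illustrates that idea: the function names and the layout are assumptions, not the wire format Spindle actually uses.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: pack two NUL-terminated strings into one
 * malloc'd buffer, laid out as a '\0' b '\0'. Stores the total size
 * in *out_len. Returns NULL on allocation failure. */
char *pack_two_strings(const char *a, const char *b, size_t *out_len)
{
   size_t alen = strlen(a) + 1;
   size_t blen = strlen(b) + 1;
   char *buf = malloc(alen + blen);

   if (buf == NULL)
      return NULL;
   memcpy(buf, a, alen);
   memcpy(buf + alen, b, blen);
   *out_len = alen + blen;
   return buf;
}

/* Hypothetical helper: point *a and *b at the two strings inside buf.
 * Returns 0 on success, -1 if the buffer is malformed. */
int unpack_two_strings(const char *buf, size_t len,
                       const char **a, const char **b)
{
   const char *end_a = memchr(buf, '\0', len);
   size_t alen;

   if (end_a == NULL)
      return -1;                 /* first string is unterminated */
   alen = (size_t)(end_a - buf) + 1;
   if (alen >= len)
      return -1;                 /* no room for a second string */
   if (buf[len - 1] != '\0')
      return -1;                 /* second string is unterminated */
   *a = buf;
   *b = buf + alen;
   return 0;
}
```

The receiver needs only the total length to recover both strings, so no separate size messages are required.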
New name is ldcs_audit_server_md_allreduce_AND(). If we get to the point where we're using other allreduce operations we can solve the problem of duplicating the op list in md-land and cobo-land. For now, we're only using one op in md-land, so the op can go into the function name.
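For readers unfamiliar with an AND allreduce, the combining step can be sketched as a bitwise AND over one node's value and the values received from its children. The function below is hypothetical and is not the signature of ldcs_audit_server_md_allreduce_AND(). Treating each bit as "this candidate cache path is usable on this node" is also an assumption, suggested by cachepath_bitidx being a uint64_t.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reduction step: AND the local bitmask with every
 * child's bitmask. A bit survives only if it is set everywhere. */
uint64_t reduce_and_u64(uint64_t local, const uint64_t *children,
                        size_t nchildren)
{
   uint64_t result = local;
   size_t i;

   for (i = 0; i < nchildren; i++)
      result &= children[i];
   return result;
}
```

Under that assumption, the bits left set in the result mark the candidates that every participating node accepted.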
    char* str = getenv(envvar);
    if (str == NULL && type == ENV_REQUIRED) {
       err_printf("Missing required environment variable: %s\n", envvar);
       exit(1);
Was this commit intended to be part of the cachepath work? It looks more to be part of the network reliability work.
And if there's a reason to do this in the cachepath PR, we would need to take a more robust pass to make sure the various error returns don't just lead to crashes.
Oh, I see it's in response to my above comment.
I'd meant more that we shouldn't propagate them, not that we should clean them all up. We can move forward with this in the PR, but we'll want to come back and work through some of these new error return paths.
Sounds good.
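The trade-off in this thread, exiting on the spot versus returning an error the caller must handle, can be sketched as follows. The function and enum names are hypothetical, and fprintf(stderr, ...) stands in for Spindle's err_printf(); this is not the code in the PR.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical counterpart to the ENV_REQUIRED flag in the diff. */
typedef enum { ENV_KIND_OPTIONAL, ENV_KIND_REQUIRED } env_kind_t;

/* Hypothetical variant that reports a missing required variable to
 * the caller instead of calling exit(1). Returns 0 on success and
 * stores the value (possibly NULL for an optional variable) in *out.
 * Returns -1 and leaves *out untouched on error. */
int read_env_checked(const char *envvar, env_kind_t kind,
                     const char **out)
{
   const char *str = getenv(envvar);

   if (str == NULL && kind == ENV_KIND_REQUIRED) {
      fprintf(stderr, "Missing required environment variable: %s\n",
              envvar);
      return -1;
   }
   *out = str;
   return 0;
}
```

Every caller now has to check the return value, which is the "more robust pass" the review asks for: an unchecked -1 leaves the caller holding a stale or NULL pointer.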
mplegendre
left a comment
Looks like the CI is failing on this PR due to unset TMPDIR changes. Resolve that and the deletion in test_driver.c and I think we'll be ready to merge.
       return strdup(last_slash);
    }

    static int checkLinkForLeak(const char *path, const char *spindle_loc)
This shouldn't be deleted. It's checking that spindle's readlink interception is working and that we don't have symlinks from /proc/PID/... pointing into the spindle cache location that can be seen by the application (which has happened before).
We may need another way to get the cache path into the test, but we should keep the test.
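The core of such a leak check can be sketched as a substring test on a link's target. The helpers below are hypothetical and are not the body of checkLinkForLeak(). In the real test the target would presumably come from calling readlink() on entries under /proc, which is omitted here so the sketch stays portable.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return the final component of a path, e.g.
 * "spindle.9d2b000000" for "/tmp/rountree/spindle.9d2b000000". */
const char *last_component(const char *path)
{
   const char *slash = strrchr(path, '/');
   return slash ? slash + 1 : path;
}

/* Hypothetical check: a link target "leaks" if it mentions the
 * spindle location string anywhere. Returns 1 on a leak, else 0. */
int link_target_leaks(const char *target, const char *spindle_loc)
{
   if (target == NULL || spindle_loc == NULL || spindle_loc[0] == '\0')
      return 0;
   return strstr(target, spindle_loc) != NULL;
}
```

A plain substring match flags any path containing the location string, including files that live beside the cache rather than inside it.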
The cache path is available to the test, but the fifos are no longer in that path, so those files leak. If I restore that code, these are the kind of errors I get on the first of the runTests:
Error - [/p/vast1/rountree/repos/Spindle/testsuite/test_driver.c:1159] - Link at '/proc/self/fd/315' has path '/tmp/rountree/spindle.9d2b000000/spindle_comm/fifo-1379774-0', which leaks spindle path with 'spindle.9d2b000000'
I'll restore the checkLinkForLeak in the commpath PR and make sure that path gets into the TestVerifier class via another <internal> message (unless you'd rather I merge the commpath PR into this PR). Just to be clear, the above file should NOT trigger an error, correct?
Well, that was a good solution to a problem we don't have yet.
Here's the actual problem:
Spindle/testsuite/test_driver.c
Line 1241 in d8ac4c5
    spindle_loc = getCacheLocation("LDCS_LOCATION");
Should be a bit simpler to solve.
Fixes #61