Scalability improvements and a few bug fixes #785
Merged
Description
The overall theme for this PR is to improve scalability by reducing the number of client requests that end up generating an RPC to a file owner. Mostly, this is done by identifying when many clients on a node are generating a request for the same information, and making sure the node's local server only sends a single remote request to get the information from the owner. Similarly, when making updates to a file (e.g., new extents), this adds some batching of the updates for a given node. In general, this reduces the number of requests that reach the owner from O(# clients) to O(# nodes).
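As a rough illustration of the idea (not the actual UnifyFS server code), the sketch below models a node-local server that coalesces identical lookups from its clients and batches their extent updates, so the owner sees a fixed number of RPCs per node rather than one per client. All class, method, and path names here (`OwnerServer`, `LocalServer`, `simulate`, `/unifyfs/shared.dat`) are hypothetical, and the synchronous `flush()` stands in for whatever asynchronous mechanism the real server uses.

```python
from collections import defaultdict


class OwnerServer:
    """Hypothetical stand-in for the file owner; counts the RPCs it receives."""

    def __init__(self):
        self.rpc_count = 0
        self.extents = defaultdict(list)  # path -> list of (offset, length)

    def get_attr(self, path):
        self.rpc_count += 1
        size = sum(length for _, length in self.extents[path])
        return {"path": path, "size": size}

    def add_extents(self, path, extents):
        self.rpc_count += 1
        self.extents[path].extend(extents)


class LocalServer:
    """Hypothetical node-local server that coalesces and batches client requests."""

    def __init__(self, owner):
        self.owner = owner
        self.waiting = defaultdict(list)          # path -> clients awaiting the same lookup
        self.pending_extents = defaultdict(list)  # path -> extents not yet sent to the owner

    def request_attr(self, client_id, path):
        # Record the client; the remote lookup is deferred and shared.
        self.waiting[path].append(client_id)

    def queue_extent(self, path, offset, length):
        # Accumulate the update locally instead of sending it immediately.
        self.pending_extents[path].append((offset, length))

    def flush(self):
        # One batched update RPC per file for this node.
        for path, extents in self.pending_extents.items():
            self.owner.add_extents(path, extents)
        self.pending_extents.clear()

        # One lookup RPC per file; the reply is fanned out to every waiting client.
        replies = {}
        for path, clients in self.waiting.items():
            attr = self.owner.get_attr(path)
            for client_id in clients:
                replies[client_id] = attr
        self.waiting.clear()
        return replies


def simulate(num_nodes, clients_per_node, path="/unifyfs/shared.dat"):
    """Every client writes one 4 KiB extent and asks for the file's attributes."""
    owner = OwnerServer()
    replies = {}
    for node in range(num_nodes):
        server = LocalServer(owner)
        for rank in range(clients_per_node):
            global_rank = node * clients_per_node + rank
            server.queue_extent(path, offset=global_rank * 4096, length=4096)
            server.request_attr((node, rank), path)
        replies.update(server.flush())
    return owner.rpc_count, replies


if __name__ == "__main__":
    rpcs, replies = simulate(num_nodes=4, clients_per_node=8)
    print(f"{len(replies)} clients served with {rpcs} owner RPCs")
```

In this toy model, 32 clients on 4 nodes produce 8 owner RPCs (one batched update and one lookup per node) instead of the 64 that would result if every client request were forwarded individually.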
This PR also includes some code cleanup (removing the last vestiges of MPI and MDHIM from the server) and a few minor bug fixes.
Motivation and Context
At higher numbers of clients (above 2k) on Frontier, we were seeing client request timeouts due to the serialized processing of these requests at the owner server.
How Has This Been Tested?
With these changes, Unify examples with up to 8k clients (8 ppn @ 1k nodes, or 32 ppn @ 256 nodes) were passing more often. There is still more work to do on multithreading the service manager, which processes the file owner requests.