Scalability improvements and a few bug fixes #785

Merged 2 commits into LLNL:dev on Aug 14, 2023


MichaelBrim
Collaborator

Description

The overall theme for this PR is to improve scalability by reducing the number of client requests that end up generating an RPC to a file owner. Mostly, this is done by identifying when many clients on a node are generating a request for the same information, and making sure the node's local server only sends a single remote request to get the information from the owner. Similarly, when making updates to a file (e.g., new extents), this adds some batching of the updates for a given node. In general, this reduces the number of requests that reach the owner from O(# clients) to O(# nodes).
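
To illustrate the request coalescing described above, here is a minimal, single-threaded sketch. It is not the code in this PR: `pending_table`, `pending_join`, and `pending_complete` are hypothetical names, and the real server would need locking and an actual RPC layer. The idea is only that the first local client asking about a file triggers the remote request, and later clients asking about the same file attach to the in-flight request instead of generating another one.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PENDING 64

/* Hypothetical record of one in-flight remote request (illustrative only). */
typedef struct {
    int gfid;        /* file the request is about */
    int num_waiters; /* local clients waiting on the response */
} pending_req;

/* Hypothetical per-server table of in-flight requests. */
typedef struct {
    pending_req reqs[MAX_PENDING];
    size_t count;     /* number of in-flight requests */
    size_t rpcs_sent; /* remote requests issued so far */
} pending_table;

/* Register a client request for gfid.
 * Returns 1 if the caller must issue the remote RPC (first requester),
 * 0 if it joined an in-flight request, -1 if the table is full. */
int pending_join(pending_table* t, int gfid)
{
    assert(t != NULL);
    for (size_t i = 0; i < t->count; i++) {
        if (t->reqs[i].gfid == gfid) {
            t->reqs[i].num_waiters++;
            return 0;
        }
    }
    if (t->count == MAX_PENDING) {
        return -1;
    }
    t->reqs[t->count].gfid = gfid;
    t->reqs[t->count].num_waiters = 1;
    t->count++;
    t->rpcs_sent++;
    return 1;
}

/* Called when the owner's response arrives.
 * Removes the entry and returns how many local clients to notify. */
int pending_complete(pending_table* t, int gfid)
{
    assert(t != NULL);
    for (size_t i = 0; i < t->count; i++) {
        if (t->reqs[i].gfid == gfid) {
            int n = t->reqs[i].num_waiters;
            t->reqs[i] = t->reqs[t->count - 1]; /* swap-remove */
            t->count--;
            return n;
        }
    }
    return 0;
}
```

With 8 clients on a node all asking about the same file, only the first call to `pending_join` returns 1, so the owner sees one request from that node rather than eight.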

This PR also includes some code cleanup (removing last vestiges of MPI and MDHIM from the server) and a few minor bug fixes.

Motivation and Context

At higher numbers of clients (above 2k) on Frontier, we were seeing client request timeouts due to the serialized processing of these requests at the owner server.

How Has This Been Tested?

With these changes, Unify examples with up to 8k clients (8 ppn @ 1k nodes, or 32 ppn @ 256 nodes) were passing more often. There is still more work to do on multithreading the service manager, which processes the file owner requests.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Testing (addition of new tests or update to current tests)
  • Documentation (a change to man pages or other documentation)

Checklist:

  • My code follows the UnifyFS code style requirements.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.
  • All commit messages are properly formatted.

Also:
* fix premature heartbeat rpc attempt
This adds batching of client extent sync requests at
each server to try to limit the number of RPCs to the
owner to one-per-server, rather than one-per-client.
It also avoids doing metaget RPCs for the mountpoint.
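
As a rough sketch of the batching idea (not the code in this commit), the server can queue extents from its local clients and forward them to the owner in one message. All names here (`extent`, `extent_batch`, `batch_add`, `batch_flush`) are hypothetical, the owner RPC is replaced by a counter, and the fixed capacity is an assumption made to keep the example short.

```c
#include <assert.h>
#include <stddef.h>

#define BATCH_CAPACITY 128

/* Hypothetical extent record: bytes [offset, offset + length) from one client. */
typedef struct {
    size_t offset;
    size_t length;
    int client_id;
} extent;

/* Hypothetical per-file batch held by the local server. */
typedef struct {
    int gfid;                       /* file the extents belong to */
    extent extents[BATCH_CAPACITY]; /* extents waiting to be sent */
    size_t count;                   /* queued extents */
    size_t rpcs_sent;               /* owner RPCs issued so far */
    size_t extents_sent;            /* extents delivered so far */
} extent_batch;

/* Stand-in for the RPC to the owner: it only updates counters. */
static void send_batch_to_owner(extent_batch* b)
{
    b->rpcs_sent++;
    b->extents_sent += b->count;
    b->count = 0;
}

/* Send everything queued so far as a single request (no-op if empty). */
void batch_flush(extent_batch* b)
{
    assert(b != NULL);
    if (b->count > 0) {
        send_batch_to_owner(b);
    }
}

/* Queue one client's extent; flush first if the batch is full. */
void batch_add(extent_batch* b, extent e)
{
    assert(b != NULL);
    if (b->count == BATCH_CAPACITY) {
        batch_flush(b);
    }
    b->extents[b->count++] = e;
}
```

If 8 local clients each sync one extent and the server then flushes, the owner receives one RPC carrying 8 extents instead of 8 separate RPCs.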

Also:
* use ABT_rwlock instead of pthread_rwlock_t
* add comparison functions for int/float types to common
* fix misused return code
* fix free() of uninitialized margo bulk buf pointers
* remove last vestiges of MPI in server
* remove last vestiges of MDHIM in server
Comment on lines +39 to +40
/* a valid gfid generated via MD5 hash will never be zero */
#define INVALID_GFID (0)
Good idea!

@adammoody adammoody merged commit 12507a7 into LLNL:dev Aug 14, 2023
@MichaelBrim MichaelBrim deleted the scalable-svcmgr branch October 30, 2023 19:39