Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Job submission with binary protocol #2393

Merged
merged 13 commits into from
Mar 11, 2024

Conversation

yanchengnv
Copy link
Collaborator

@yanchengnv yanchengnv commented Mar 7, 2024

Fixes # .

Description

The current job submission uses text protocol, which is very inefficient. It consumes large amount of memory space, especially when the job contains large data files (e.g. pre-trained model weights):

  • The admin client zips the job folder into memory; and then base64-encodes it before sending to the admin server;
  • The admin server receives the base64-encoded data and then decodes it back to bytes;
  • The admin server then calls the submit_job processor, which processes the job data;
  • The submit_job processor then FOBS encodes the data again before writing to disk.

During this process, the job data gets converted from base64 to bytes to FOBS encoding, at least tripling the memory usage. When the job contains large data files and the user tries to submit multiple such jobs at the same time, the system can quickly run out of memory. Even if the server doesn't crash, the system becomes practically unusable - everything becomes extremely slow.

This PR tries to solve the problem by completely avoiding these unnecessary memory usage:

  • The admin client zips the job folder to a zip file;
  • The admin client sends the zip file with a binary streaming protocol that sends the file data in small blocks;
  • The admin server receives the data in blocks and writes received blocks into a temporary file;
  • The submit_job processor does all job validation based on the zip file directly (not loading into memory);
  • Once processed, the submit_job processor moves the file into the job store.

The same binary protocol is also used for downloading job from the server to admin client.

The binary exchange protocol has built-in checksum validation that ensures the data integrity during exchange.

Additional improvement:

  • Removed some dead code;
  • Improved clone_job to directly copy data file (instead of loading data into memory and then write out) from the source job;
  • Backward compatible with previous data storage format (FOBS encoded): support jobs created by previous version of FLARE.

Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).
  • Breaking change (fix or new feature that would cause existing functionality to change).
  • New tests added to cover the changes.
  • Quick tests passed locally by running ./runtest.sh.
  • In-line docstrings updated.
  • Documentation updated.

@YuanTingHsieh YuanTingHsieh added the enhancement New feature or request label Mar 8, 2024
Copy link
Collaborator

@YuanTingHsieh YuanTingHsieh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly LGTM, have questions regarding if we should handle the file being removed/closed by other people/process using the same machine.

nvflare/fuel/hci/file_transfer_defs.py Show resolved Hide resolved
nvflare/fuel/hci/proto.py Show resolved Hide resolved
nvflare/apis/job_def_manager_spec.py Show resolved Hide resolved
nvflare/apis/impl/job_def_manager.py Show resolved Hide resolved
nvflare/fuel/hci/binary_proto.py Show resolved Hide resolved
nvflare/fuel/hci/binary_proto.py Show resolved Hide resolved
nvflare/fuel/hci/binary_proto.py Outdated Show resolved Hide resolved
nvflare/fuel/hci/client/file_transfer.py Show resolved Hide resolved
nvflare/fuel/hci/client/api_spec.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@YuanTingHsieh YuanTingHsieh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, LGTM

@yanchengnv yanchengnv merged commit 2d0c609 into NVIDIA:main Mar 11, 2024
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants