In process Client API Executor Part 1 #2248

Merged
merged 13 commits into NVIDIA:main from client_api_memory on Feb 26, 2024

Conversation

@chesterxgchen (Collaborator) commented Dec 28, 2023

Description

The Client API currently only supports running the training code in a sub-process. This works well when third-party code needs to run from a command line.
Many data scientists, however, also want to run training directly in-process, since that makes debugging in an IDE debugger much easier. In-process training is also simpler to configure and set up, and it avoids inter-process communication issues.

The proposal

This PR proposes an additional, alternative in-process (in-memory) implementation of the executor for the Client API.

  • The Client API itself stays the same; nothing changes (only the configuration is different).
  • To call the user's training code directly, we require the user to provide a function that can be called directly, such as run(), train(), or sub_main(**kwargs). The user specifies which function to call by giving its path in the form 'module.function_name'. For example:

 fn_path = "cifar10_fl.sub_main"

indicates that the function "sub_main" is located in the cifar10_fl.py module. A hypothetical sketch of such a function follows.
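As an illustration only, a task function wired for the Client API might look like the sketch below. The module name (cifar10_fl), the function signature, and the exact nvflare.client calls are assumptions based on the existing sub-process examples, not code from this PR.

    # cifar10_fl.py -- hypothetical task module; API usage is illustrative only
    import nvflare.client as flare
    from nvflare.app_common.abstract.fl_model import FLModel, ParamsType

    def sub_main(batch_size=4, dataset_path="/tmp/nvflare/data/cifar10", num_workers=2):
        """Task function the in-process executor can call directly via the path "cifar10_fl.sub_main"."""
        flare.init()  # initialize the Client API, same as in the sub-process case
        while flare.is_running():
            input_model = flare.receive()   # global model delivered by the executor
            params = input_model.params     # start local training from these weights
            # ... run the local training loop using batch_size, dataset_path, num_workers ...
            flare.send(FLModel(params=params, params_type=ParamsType.FULL))  # return the local update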

User Experience

The user experience is the same as in the previous sub-process case: the code changes needed to go from ML/DL to FL are identical.
Only the configuration changes, to point at the in-memory executor:

 executor {
   path = "nvflare.app_opt.pt.client_api_mem_executor.PTClientAPIMemExecutor"
   args {
     task_fn_path = "cifar10_fl.sub_main"
     task_fn_args {
       batch_size = 6
       dataset_path = "/tmp/nvflare/data/cifar10"
       num_workers = 2
     }
     params_transfer_type = "DIFF"
     train_with_evaluation = true
     result_pull_interval = 0.5
     log_pull_interval = 0.1
   }
 }

None of the pipe and corresponding pipe-handling component configurations are needed anymore.
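To make the configuration concrete, here is a rough, hypothetical sketch of how a 'module.function_name' path plus task_fn_args could be resolved and invoked in-process; it is not the actual executor code.

    # Illustrative only: resolving "module.function_name" and calling it with task_fn_args.
    import importlib

    def resolve_and_run(task_fn_path: str, task_fn_args: dict):
        module_name, _, fn_name = task_fn_path.rpartition(".")  # "cifar10_fl.sub_main" -> ("cifar10_fl", "sub_main")
        module = importlib.import_module(module_name)           # import cifar10_fl
        task_fn = getattr(module, fn_name)                      # look up sub_main
        task_fn(**task_fn_args)                                 # sub_main(batch_size=6, dataset_path=..., num_workers=2)

    # Usage mirroring the configuration above:
    resolve_and_run(
        "cifar10_fl.sub_main",
        {"batch_size": 6, "dataset_path": "/tmp/nvflare/data/cifar10", "num_workers": 2},
    )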

The implementation

  • Main approach:
    A new InProcessClientAPIExecutor is implemented. It leverages the new DataBus and EventManager. The executor's main thread subscribes to both the LOCAL_RESULT and LOG_DATA topics to receive callbacks when results are available, and the Client API registers a callback on the GLOBAL_RESULT topic to receive the global model.
  • Threading model:
    One thread runs the client's training code, while the executor's main thread sends the global model as soon as the model Shareable is received and then waits for the local model update. A generic sketch of this hand-off is shown below.
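The following is a generic sketch of the publish/subscribe hand-off described above, written with a toy in-memory bus; it is not the actual NVFlare DataBus/EventManager API.

    # Toy illustration of the described hand-off (NOT the NVFlare DataBus API):
    # the executor's main thread publishes the global model and waits, while the
    # training thread receives it via a topic callback and publishes the local result.
    import queue
    import threading

    subscribers = {}  # topic -> list of callbacks

    def subscribe(topic, callback):
        subscribers.setdefault(topic, []).append(callback)

    def publish(topic, data):
        for cb in subscribers.get(topic, []):
            cb(data)

    received_global = queue.Queue()
    local_results = queue.Queue()

    # Register callbacks before anything is published:
    subscribe("GLOBAL_RESULT", received_global.put)  # Client API side
    subscribe("LOCAL_RESULT", local_results.put)     # executor side

    def training_thread():
        global_model = received_global.get()          # wait for the global model callback data
        local_update = {"round": global_model["round"], "weights": global_model["weights"]}
        publish("LOCAL_RESULT", local_update)         # hand the local update back to the executor

    threading.Thread(target=training_thread, daemon=True).start()

    publish("GLOBAL_RESULT", {"round": 1, "weights": [0.0]})  # executor: send the global model
    print(local_results.get())                                # executor: block until the local update arrives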

Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).
  • Breaking change (fix or new feature that would cause existing functionality to change).
  • New tests added to cover the changes.
  • Quick tests passed locally by running ./runtest.sh.
  • In-line docstrings updated.
  • Documentation updated.

@chesterxgchen marked this pull request as draft December 28, 2023 05:25
@chesterxgchen marked this pull request as ready for review December 28, 2023 05:44
@chesterxgchen marked this pull request as draft December 28, 2023 05:44
@yanchengnv (Collaborator) commented:
This PR is too big for this stage of 2.4. We could consider it for 2.4.1.

@chesterxgchen marked this pull request as ready for review December 28, 2023 18:46
@chesterxgchen changed the title from "Client API with MemPipe Executor" to "In memory Client API Executor" on Jan 3, 2024
@chesterxgchen changed the title from "In memory Client API Executor" to "In process Client API Executor" on Jan 12, 2024
@chesterxgchen force-pushed the client_api_memory branch 2 times, most recently from 1e64db4 to 3ba6a11 on January 13, 2024 18:43
@chesterxgchen (Collaborator, Author) commented:
/build

@chesterxgchen (Collaborator, Author) commented:
/build

@chesterxgchen changed the title from "In process Client API Executor" to "In process Client API Executor Part 1" on Jan 16, 2024
@chesterxgchen (Collaborator, Author) commented Jan 16, 2024:

To reduce the PR size, I have split the work into three parts:

  1. Message bus (the changes are still in this PR, but they duplicate the DataBus PR, #2285). The message bus PR is independent of the Client API PR; once it is merged, this PR can be rebased and its file count will shrink (28 - 8 = 20 files, of which 8 are test-related, so the real code changes besides tests are 12 files).
  2. The main Client API changes: Part 1, this PR.
  3. The job templates and related example changes: Part 2 of the Client API work.

@chesterxgchen marked this pull request as draft January 21, 2024 05:55
@chesterxgchen marked this pull request as ready for review January 22, 2024 20:13
2) fix example
code formatting
add queue.task_done()
1) add message bus
2) hide task func wrapper class
3) rename executor package
4) clean up some code
update meta info
remove unused code
optimize import
fix message_bus
import order change
rename the executor from ClientAPIMemExecutor to InProcessClientAPIExecutor
1) remove thread_pool
2) further loosen coupling between executor and client_api implementation
formatting
add unit tests
avoid duplicated constant TASK_NAME definition
split PR into two parts (besides message bus)
this is part 1: only remove the example and job template changes

1. Replace MemPipe (Queues) with callbacks via EventManager
2. Simplified overall logic
3. Note: the param conversion doesn't quite work yet (needs a fix later)
4. Removed some tests that are now invalid; more unit tests will need to be added later

fix task_name is None bug

add a few unit tests
code format
update to conform with new databus changes
@SYangster (Collaborator) commented:
Cleanup; add additional support for a main function with CLI args.

Will add and enhance examples in a follow-up PR.

@SYangster (Collaborator) commented:
/build

@YuanTingHsieh (Collaborator) left a comment:
Mostly good, we just need to make changes following recent PRs in main regarding api.py

@YuanTingHsieh (Collaborator) left a comment:
Mostly good, minor comments

@YuanTingHsieh (Collaborator) left a comment:
Thanks, LGTM; we can think about these logging levels later.

@SYangster (Collaborator) commented:
/build

@SYangster merged commit 81caf29 into NVIDIA:main on Feb 26, 2024
16 checks passed