Lr newton raphson (#2529)

* Implement federated logistic regression with second-order newton raphson. Update file headers. Update README. Update README. Fix README. Refine README. Update README. Added more logging for the job status changing. (#2480) * Added more logging for the job status changing. * Fixed a logging call error. --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> Fix update client status (#2508) * check workflow id before updating client status * change order of checks Add user guide on how to deploy to EKS (#2510) * Add user guide on how to deploy to EKS * Address comments Improve dead client handling (#2506) * dev * test dead client cmd * added more info for dead client tracing * remove unused imports * fix unit test * fix test case * address PR comments --------- Co-authored-by: Sean Yang <seany314@gmail.com> Enhance WFController (#2505) * set flmodel variables in basefedavg * make round info optional, fix inproc api bug temporarily disable preflight tests (#2521) Upgrade dependencies (#2516) Use full path for PSI components (#2437) (#2517) Multiple bug fixes from 2.4 (#2518) * [2.4] Support client custom code in simulator (#2447) * Support client custom code in simulator * Fix client custom code * Remove cancel_futures args (#2457) * Fix sub_worker_process shutdown (#2458) * Set GRPC_ENABLE_FORK_SUPPORT to False (#2474) Pythonic job creation (#2483) * WIP: constructed the FedJob. * WIP: server_app josn export. * generate the job app config. * fully functional pythonic job creation. * Added simulator_run for pythonic API. * reformat. * Added filters support for pythonic job creation. * handled the direct import case in fed_job. * refactor. * Added the resource_spec set function for FedJob. * refactored. * Moved the ClientApp and ServerApp into fed_app.py. * Refactored: removed the _FilterDef class. * refactored. * Rename job config classes (#3) * rename config related classes * add client api example * fix metric streaming * add to() routine * Enable obj in the constructor as paramenter. * Added support for the launcher script. * refactored. * reformat. * Update the comment. * re-arrange the package location. * Added add_ext_script() for BaseAppConfig. * codestyle fix. * Removed the client-api-pt example. * removed no used import. * fixed the in_time_accumulate_weighted_aggregator_test.py * Added Enum parameter support. * Added docstring. * Added ability to handle parameters from base class. * Move the parameter data format conversion to the START_RUN event for InProcessClientAPIExecutor. * Added params_exchange_format for PTInProcessClientAPIExecutor. * codestyle fix. * Fixed a custom code folder structure issue. * work for sub-folder custom files. * backed to handle parameters from base classes. * Support folder structure job config. * Added support for flat folder from '.XXX' import. * codestyle fix. * refactored and add docstring. * Address some of the PR reviews. --------- Co-authored-by: Holger Roth <6304754+holgerroth@users.noreply.github.com> Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Enhancements from 2.4 (#2519) * Starts heartbeat after task is pull and before task execution (#2415) * Starts pipe handler heartbeat send/check after task is pull before task execution (#2442) * [2.4] Improve cell pipe timeout handling (#2441) * improve cell pipe timeout handling * improved end and abort handling * improve timeout handling --------- Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> * [2.4] Enhance launcher executor (#2433) * Update LauncherExecutor logs and execution setup timeout * Change name * [2.4] Fire and forget for pipe handler control messages (#2413) * Fire and forget for pipe handler control messages * Add default timeout value * fix wait-for-reply (#2478) * Fix pipe handler timeout in task exchanger and launcher executor (#2495) * Fix metric relay pipe handler timeout (#2496) * Rely on launcher check_run_status to pause/resume hb (#2502) Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> --------- Co-authored-by: Yan Cheng <58191769+yanchengnv@users.noreply.github.com> Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Update ci cd from 2.4 (#2520) * Update github actions (#2450) * Fix premerge (#2467) * Fix issues on hello-world TF2 notebook * Fix tf integration test (#2504) * Add client api integration tests --------- Co-authored-by: Isaac Yang <isaacy@nvidia.com> Co-authored-by: Sean Yang <seany314@gmail.com> use controller name for stats (#2522) Simulator workspace re-design (#2492) * Redesign simulator workspace structure. * working, needs clean. * Changed the simulator workspacce structure to be consistent with POC. * Moved the logfile init to start_server_app(). * optimzed. * adjust the stats pool location. * Addressed the PR views. --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> Simulator end run for all clients (#2514) * Provide an option to run END_RUN for all clients. * Added end_run_all option for simulator to run END_RUN event for all clients. * Fixed a add_argument type, added help message. * Changed to use add_argument(() compatible with python 3.8. * reformat. * rewrite the _end_run_clients() and add docstring for easier understanding. * reformat. * adjusting the locking in the _end_run_clients. * Fixed a potential None pointer error. * renamed the clients_finished_end_run variable. --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Co-authored-by: Sean Yang <seany314@gmail.com> Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> Secure XGBoost Integration (#2512) * Updated FOBS readme to add DatumManager, added agrpcs as secure scheme * Refactoring * Refactored the secure version to histogram_based_v2 * Replaced Paillier with a mock encryptor * Added license header * Put mock back * Added metrics_writer back and fixed GRPC error reply simplify job simulator_run to take only one workspace parameter. (#2528) Fix README. Fix file links in README. Fix file links in README. Add comparison between centralized and federated training code. Add missing client api test jobs (#2535) Fixed the simulator server workspace root dir (#2533) * Fixed the simulator server root dir error. * Added unit test for SimulatorRunner start_server_app. --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Improve InProcessClientAPIExecutor (#2536) * 1. rename ExeTaskFnWrapper class to TaskScriptRunner 2. Replace implementation of the inprocess function exection from calling a main() function to user runpy.run_path() which reduce the user requirements to have main() function 3. redirect print() to logger.info() * 1. rename ExeTaskFnWrapper class to TaskScriptRunner 2. Replace implementation of the inprocess function exection from calling a main() function to user runpy.run_path() which reduce the user requirements to have main() function 3. redirect print() to logger.info() * make result check and result pull use the same configurable variable * rename exec_task_fn_wrapper to task_script_runner.py * fix typo Update README for launching python script. Modify tensorboard logdir. Link to environment setup instructions. expose aggregate_fn to users for overwriting (#2539) FIX MLFLow and Tensorboard Output to be consistent with new Workspace root changes (#2537) * 1) fix mlruns and tb_events dirs due to workspace directory changes 2) for MLFLow, add tracking_rui default to workspace_dir / <job_id>/mlruns instead current default <workspace_dir>/mlruns. This is a) consistent with Tensorboard 2) avoid job output oeverwrite the 1st job * 1) fix mlruns and tb_events dirs due to workspace directory changes 2) for MLFLow, add tracking_rui default to workspace_dir / <job_id>/mlruns instead current default <workspace_dir>/mlruns. This is a) consistent with Tensorboard 2) avoid job output oeverwrite the 1st job * 1) fix mlruns and tb_events dirs due to workspace directory changes 2) for MLFLow, add tracking_rui default to workspace_dir / <job_id>/mlruns instead current default <workspace_dir>/mlruns. This is a) consistent with Tensorboard 2) avoid job output oeverwrite the 1st job * 1. Remove the default code to use configuration 2. fix some broken notebook * rollback changes Fix decorator issue (#2542) Remove line number in code link. FLModel summary (#2544) * add FLModel Summary * format formatting Update KM example, add 2-stage solution without HE (#2541) * add KM without HE, update everything * fix license header * fix license header - update year to 2024 * fix format --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * update license --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Co-authored-by: Holger Roth <hroth@nvidia.com>
NVIDIA · May 2, 2024 · 775880f · 775880f
1 parent 3dd4476
commit 775880f
Show file tree

Hide file tree

Showing 12 changed files with 987 additions and 0 deletions.
diff --git a/examples/advanced/lr-newton-raphson/README.md b/examples/advanced/lr-newton-raphson/README.md
@@ -0,0 +1,246 @@
+# Federated Logistic Regression with Second-Order Newton-Raphson optimization
+
+This example shows how to implement a federated binary
+classification via logistic regression with second-order Newton-Raphson optimization.
+
+The [UCI Heart Disease
+dataset](https://archive.ics.uci.edu/dataset/45/heart+disease) is
+used in this example. Scripts are provided to download and process the
+dataset as described
+[here](https://github.com/owkin/FLamby/tree/main/flamby/datasets/fed_heart_disease). This
+dataset contains samples from 4 sites, splitted into training and
+testing sets as described below:
+|site         | sample split                          |
+|-------------|---------------------------------------|
+|Cleveland    | train: 199 samples, test: 104 samples |
+|Hungary      | train: 172 samples, test: 89 samples  |
+|Switzerland  | train: 30 samples, test: 16 samples   |
+|Long Beach V | train: 85 samples, test: 45 samples   |
+
+The number of features in each sample is 13.
+
+## Introduction
+
+The [Newton-Raphson
+optimization](https://en.wikipedia.org/wiki/Newton%27s_method) problem
+can be described as follows.
+
+In a binary classification task with logistic regression, the
+probability of a data sample $x$ classified as positive is formulated
+as:
+$$p(x) = \sigma(\beta \cdot x + \beta_{0})$$
+where $\sigma(.)$ denotes the sigmoid function. We can incorporate
+$\beta_{0}$ and $\beta$ into a single parameter vector $\theta =
+( \beta_{0},  \beta)$. Let $d$ be the number
+of features for each data sample $x$ and let $N$ be the number of data
+samples. We then have the matrix version of the above probability
+equation:
+$$p(X) = \sigma( X \theta )$$
+Here $X$ is the matrix of all samples, with shape $N \times (d+1)$,
+having it's first column filled with value 1 to account for the
+intercept $\theta_{0}$.
+
+The goal is to compute parameter vector $\theta$ that maximizes the
+below likelihood function:
+$$L_{\theta} = \prod_{i=1}^{N} p(x_i)^{y_i} (1 - p(x_i)^{1-y_i})$$
+
+The Newton-Raphson method optimizes the likelihood function via
+quadratic approximation. Omitting the maths, the theoretical update
+formula for parameter vector $\theta$ is:
+$$\theta^{n+1} = \theta^{n} - H_{\theta^{n}}^{-1} \nabla L_{\theta^{n}}$$
+where
+$$\nabla L_{\theta^{n}} = X^{T}(y - p(X))$$
+is the gradient of the likelihood function, with $y$ being the vector
+of ground truth for sample data matrix $X$,  and
+$$H_{\theta^{n}} = -X^{T} D X$$
+is the Hessian of the likelihood function, with $D$ a diagonal matrix
+where diagonal value at $(i,i)$ is $D(i,i) = p(x_i) (1 - p(x_i))$.
+
+In federated Newton-Raphson optimization, each client will compute its
+own gradient $\nabla L_{\theta^{n}}$ and Hessian $H_{\theta^{n}}$
+based on local training samples. A server will aggregate the gradients
+and Hessians computed from all clients, and perform the update of
+parameter $\theta$ based on the theoretical update formula described
+above.
+
+## Implementation
+
+Using `nvflare`, The federated logistic regression with Newton-Raphson
+optimization is implemented as follows.
+
+On the server side, all workflow logics are implemented in
+class `FedAvgNewtonRaphson`, which can be found
+[here](job/newton_raphson/app/custom/newton_raphson_workflow.py). The
+`FedAvgNewtonRaphson` class inherits from the
+[`BaseFedAvg`](https://github.com/NVIDIA/NVFlare/blob/main/nvflare/app_common/workflows/base_fedavg.py)
+class, which itself inherits from the **Workflow Controller**
+([`WFController`](https://github.com/NVIDIA/NVFlare/blob/main/nvflare/app_common/workflows/wf_controller.py))
+class. This is the preferrable approach to implement a custom
+workflow, since `WFController` decouples communication logic from
+actual workflow (training & validation) logic. The mandatory
+method to override in `WFController` is the
+[`run()`](https://github.com/NVIDIA/NVFlare/blob/main/nvflare/app_common/workflows/wf_controller.py#L37)
+method, where the orchestration of server-side workflow actually
+happens. The implementation of `run()` method in
+[`FedAvgNewtonRaphson`](job/newton_raphson/app/custom/newton_raphson_workflow.py)
+is similar to the classic
+[`FedAvg`](https://github.com/NVIDIA/NVFlare/blob/main/nvflare/app_common/workflows/fedavg.py#L44):
+- Initialize the global model, this is acheived through method `load_model()`
+  from base class
+  [`ModelController`](https://github.com/NVIDIA/NVFlare/blob/main/nvflare/app_common/workflows/model_controller.py#L292),
+  which relies on the
+  [`ModelPersistor`](https://nvflare.readthedocs.io/en/main/glossary.html#persistor). A
+  custom
+  [`NewtonRaphsonModelPersistor`](job/newton_raphson/app/custom/newton_raphson_persistor.py)
+  is implemented in this example, which is based on the
+  [`NPModelPersistor`](https://github.com/NVIDIA/NVFlare/blob/main/nvflare/app_common/np/np_model_persistor.py)
+  for numpy data, since the _model_ in the case of logistic regression
+  is just the parameter vector $\theta$ that can be represented by a
+  numpy array. Only the `__init__` method needs to be re-implemented
+  to provide a proper initialization for the global parameter vector
+  $\theta$.
+- During each training round, the global model will be sent to the
+  list of participating clients to perform a training task. This is
+  done using the
+  [`send_model_and_wait()`](https://github.com/NVIDIA/NVFlare/blob/main/nvflare/app_common/workflows/wf_controller.py#L41)
+  method. Once
+  the clients finish their local training, results will be collected
+  and sent back to server as
+  [`FLModel`](https://nvflare.readthedocs.io/en/main/programming_guide/fl_model.html#flmodel)s.
+- Results sent by clients contain their locally computed gradient and
+  Hessian. A [custom aggregation
+  function](job/newton_raphson/app/custom/newton_raphson_workflow.py)
+  is implemented to get the averaged gradient and Hessian, and compute
+  the Newton-Raphson update for the global parameter vector $\theta$,
+  based on the theoretical formula shown above. The averaging of
+  gradient and Hessian is based on the
+  [`WeightedAggregationHelper`](https://github.com/NVIDIA/NVFlare/blob/main/nvflare/app_common/aggregators/weighted_aggregation_helper.py#L20),
+  which weighs the contribution from each client based on the number
+  of local training samples. The aggregated Newton-Raphson update is
+  returned as an `FLModel`.
+- After getting the aggregated Newton-Raphson update, an
+  [`update_model()`](job/newton_raphson/app/custom/newton_raphson_workflow.py#L172)
+  method is implemented to actually apply the Newton-Raphson update to
+  the global model.
+- The last step is to save the updated global model, again through
+  the `NewtonRaphsonModelPersistor` using `save_model()`.
+
+
+On the client side, the local training logic is implemented
+[here](job/newton_raphson/app/custom/newton_raphson_train.py). The
+implementation is based on the [`Client
+API`](https://nvflare.readthedocs.io/en/main/programming_guide/execution_api_type.html#client-api). This
+allows user to add minimum `nvflare`-specific codes to turn a typical
+centralized training script to a federated client side local training
+script.
+- During local training, each client receives a copy of the global
+  model, sent by the server, using `flare.receive()` API. The received
+  global model is an instance of `FLModel`.
+- A local validation is first performed, where validation metrics
+  (accuracy and precision) are streamed to server using the
+  [`SummaryWriter`](https://nvflare.readthedocs.io/en/main/apidocs/nvflare.client.tracking.html#nvflare.client.tracking.SummaryWriter). The
+  streamed metrics can be loaded and visualized using tensorboard.
+- Then each client computes it's gradient and Hessian based on local
+  training data, using their respective theoretical formula described
+  above. This is implemented in the
+  [`train_newton_raphson()`](job/newton_raphson/app/custom/newton_raphson_train.py#L82)
+  method. Each client then sends the computed results (always in
+  `FLModel` format) to server for aggregation, using `flare.send()`
+  API.
+
+Each client site corresponds to a site listed in the data table above.
+
+A [centralized training script](./train_centralized.py) is also
+provided, which allows for comparing the federated Newton-Raphson
+optimization versus the centralized version. In the centralized
+version, training data samples from all 4 sites were concatenated into
+a single matrix, used to optimize the model parameters. The
+optimized model was then tested separately on testing data samples of
+the 4 sites, using accuracy and precision as metrics.
+
+Comparing the federated [client-side training
+code](job/newton_raphson/app/custom/newton_raphson_train.py) with the
+centralized [training code](./train_centralized.py), we can see that
+the training logic remains similar: load data, perform training
+(Newton-Raphson updates), and valid trained model. The only added
+differences in the federated code are related to interaction with the
+FL system, such as receiving and send `FLModel`.
+
+## Set Up Environment & Install Dependencies
+
+Follow instructions
+[here](https://github.com/NVIDIA/NVFlare/tree/main/examples#set-up-a-virtual-environment)
+to set up a virtual environment for `nvflare` examples and install
+dependencies for this example.
+
+## Download and prepare data
+
+Execute the following script
+```
+bash ./prepare_heart_disease_data.sh
+```
+This will download the heart disease dataset under
+`/tmp/flare/dataset/heart_disease_data/`
+
+## Centralized Logistic Regression
+
+Launch the following script:
+```
+python ./train_centralized.py --solver custom
+```
+
+Two implementations of logistic regression are provided in the
+centralized training script, which can be specified by the `--solver`
+argument:
+- One is using `sklearn.LogisticRegression` with `newton-cholesky`
+  solver
+- The other one is manually implemented using the theoretical update
+  formulas described above.
+
+Both implementations were tested to converge in 4 iterations and to
+give the same result.
+
+Example output:
+```
+using solver: custom
+loading training data.
+training data X loaded. shape: (486, 13)
+training data y loaded. shape: (486, 1)
+
+site - 1
+validation set n_samples:  104
+accuracy: 0.75
+precision: 0.7115384615384616
+
+site - 2
+validation set n_samples:  89
+accuracy: 0.7528089887640449
+precision: 0.6122448979591837
+
+site - 3
+validation set n_samples:  16
+accuracy: 0.75
+precision: 1.0
+
+site - 4
+validation set n_samples:  45
+accuracy: 0.6
+precision: 0.9047619047619048
+```
+
+## Federated Logistic Regression
+
+Execute the following command to launch federated logistic
+regression. This will run in `nvflare`'s simulator mode.
+```
+nvflare simulator -w ./workspace -n 4 -t 4 job/newton_raphson/
+```
+
+Accuracy and precision for each site can be viewed in Tensorboard:
+```
+tensorboard --logdir=./workspace/server/simulate_job/tb_events
+```
+As can be seen from the figure below, per-site evaluation metrics in
+federated logistic regression are on-par with the centralized version.
+
+<img src="./figs/tb-metrics.png" alt="Tensorboard metrics server"/>
diff --git a/examples/advanced/lr-newton-raphson/figs/tb-metrics.png b/examples/advanced/lr-newton-raphson/figs/tb-metrics.png
diff --git a/examples/advanced/lr-newton-raphson/job/newton_raphson/app/config/config_fed_client.json b/examples/advanced/lr-newton-raphson/job/newton_raphson/app/config/config_fed_client.json
@@ -0,0 +1,75 @@
+{
+  "format_version": 2,
+  "app_script": "newton_raphson_train.py",
+  "app_config": "--data_root /tmp/flare/dataset/heart_disease_data",
+  "executors": [
+    {
+      "tasks": [
+        "train"
+      ],
+      "executor": {
+        "path": "nvflare.app_common.executors.client_api_launcher_executor.ClientAPILauncherExecutor",
+        "args": {
+          "launcher_id": "launcher",
+          "pipe_id": "pipe",
+          "heartbeat_timeout": 60,
+          "params_exchange_format": "raw",
+          "params_transfer_type": "FULL",
+          "train_with_evaluation": false
+        }
+      }
+    }
+  ],
+  "task_data_filters": [],
+  "task_result_filters": [],
+  "components": [
+    {
+      "id": "launcher",
+      "path": "nvflare.app_common.launchers.subprocess_launcher.SubprocessLauncher",
+      "args":  {
+        "script": "python3 custom/{app_script} {app_config}",
+        "launch_once": true
+      }
+    },
+    {
+      "id": "pipe",
+      "path": "nvflare.fuel.utils.pipe.cell_pipe.CellPipe",
+      "args": {
+        "mode": "PASSIVE",
+        "site_name": "{SITE_NAME}",
+        "token": "{JOB_ID}",
+        "root_url": "{ROOT_URL}",
+        "secure_mode": "{SECURE_MODE}",
+        "workspace_dir": "{WORKSPACE}"
+      }
+    },
+    {
+      "id": "metrics_pipe",
+      "path": "nvflare.fuel.utils.pipe.cell_pipe.CellPipe",
+      "args": {
+        "mode": "PASSIVE",
+        "site_name": "{SITE_NAME}",
+        "token": "{JOB_ID}",
+        "root_url": "{ROOT_URL}",
+        "secure_mode": "{SECURE_MODE}",
+        "workspace_dir": "{WORKSPACE}"
+      }
+    },
+    {
+      "id": "metric_relay",
+      "path": "nvflare.app_common.widgets.metric_relay.MetricRelay",
+      "args": {
+        "pipe_id": "metrics_pipe",
+        "event_type": "fed.analytix_log_stats",
+        "read_interval": 0.1
+      }
+    },
+    {
+      "id": "client_api_config_preparer",
+      "path": "nvflare.app_common.widgets.external_configurator.ExternalConfigurator",
+      "args": {
+        "component_ids": ["metric_relay"]
+      }
+    }
+  ]
+}
diff --git a/examples/advanced/lr-newton-raphson/job/newton_raphson/app/config/config_fed_server.json b/examples/advanced/lr-newton-raphson/job/newton_raphson/app/config/config_fed_server.json
@@ -0,0 +1,34 @@
+{
+  "format_version": 2,
+  "server": {
+    "heart_beat_timeout": 600
+  },
+  "task_data_filters": [],
+  "task_result_filters": [],
+  "components": [
+    {
+      "id": "newton_raphson_persistor",
+      "path": "newton_raphson_persistor.NewtonRaphsonModelPersistor",
+      "args": {
+        "n_features": 13
+      }
+    },
+    {
+      "id": "tb_analytics_receiver",
+      "path": "nvflare.app_opt.tracking.tb.tb_receiver.TBAnalyticsReceiver",
+      "args.events": ["fed.analytix_log_stats"]
+    }
+  ],
+  "workflows": [
+    {
+      "id": "fedavg_newton_raphson",
+      "path": "newton_raphson_workflow.FedAvgNewtonRaphson",
+      "args": {
+        "min_clients": 4,
+        "num_rounds": 5,
+        "damping_factor": 0.8,
+        "persistor_id": "newton_raphson_persistor"
+      }
+    }
+  ]
+}