Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
* Implement federated logistic regression with second-order newton raphson. Update file headers. Update README. Update README. Fix README. Refine README. Update README. Added more logging for the job status changing. (#2480) * Added more logging for the job status changing. * Fixed a logging call error. --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> Fix update client status (#2508) * check workflow id before updating client status * change order of checks Add user guide on how to deploy to EKS (#2510) * Add user guide on how to deploy to EKS * Address comments Improve dead client handling (#2506) * dev * test dead client cmd * added more info for dead client tracing * remove unused imports * fix unit test * fix test case * address PR comments --------- Co-authored-by: Sean Yang <seany314@gmail.com> Enhance WFController (#2505) * set flmodel variables in basefedavg * make round info optional, fix inproc api bug temporarily disable preflight tests (#2521) Upgrade dependencies (#2516) Use full path for PSI components (#2437) (#2517) Multiple bug fixes from 2.4 (#2518) * [2.4] Support client custom code in simulator (#2447) * Support client custom code in simulator * Fix client custom code * Remove cancel_futures args (#2457) * Fix sub_worker_process shutdown (#2458) * Set GRPC_ENABLE_FORK_SUPPORT to False (#2474) Pythonic job creation (#2483) * WIP: constructed the FedJob. * WIP: server_app josn export. * generate the job app config. * fully functional pythonic job creation. * Added simulator_run for pythonic API. * reformat. * Added filters support for pythonic job creation. * handled the direct import case in fed_job. * refactor. * Added the resource_spec set function for FedJob. * refactored. * Moved the ClientApp and ServerApp into fed_app.py. * Refactored: removed the _FilterDef class. * refactored. * Rename job config classes (#3) * rename config related classes * add client api example * fix metric streaming * add to() routine * Enable obj in the constructor as paramenter. * Added support for the launcher script. * refactored. * reformat. * Update the comment. * re-arrange the package location. * Added add_ext_script() for BaseAppConfig. * codestyle fix. * Removed the client-api-pt example. * removed no used import. * fixed the in_time_accumulate_weighted_aggregator_test.py * Added Enum parameter support. * Added docstring. * Added ability to handle parameters from base class. * Move the parameter data format conversion to the START_RUN event for InProcessClientAPIExecutor. * Added params_exchange_format for PTInProcessClientAPIExecutor. * codestyle fix. * Fixed a custom code folder structure issue. * work for sub-folder custom files. * backed to handle parameters from base classes. * Support folder structure job config. * Added support for flat folder from '.XXX' import. * codestyle fix. * refactored and add docstring. * Address some of the PR reviews. --------- Co-authored-by: Holger Roth <6304754+holgerroth@users.noreply.github.com> Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Enhancements from 2.4 (#2519) * Starts heartbeat after task is pull and before task execution (#2415) * Starts pipe handler heartbeat send/check after task is pull before task execution (#2442) * [2.4] Improve cell pipe timeout handling (#2441) * improve cell pipe timeout handling * improved end and abort handling * improve timeout handling --------- Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> * [2.4] Enhance launcher executor (#2433) * Update LauncherExecutor logs and execution setup timeout * Change name * [2.4] Fire and forget for pipe handler control messages (#2413) * Fire and forget for pipe handler control messages * Add default timeout value * fix wait-for-reply (#2478) * Fix pipe handler timeout in task exchanger and launcher executor (#2495) * Fix metric relay pipe handler timeout (#2496) * Rely on launcher check_run_status to pause/resume hb (#2502) Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> --------- Co-authored-by: Yan Cheng <58191769+yanchengnv@users.noreply.github.com> Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Update ci cd from 2.4 (#2520) * Update github actions (#2450) * Fix premerge (#2467) * Fix issues on hello-world TF2 notebook * Fix tf integration test (#2504) * Add client api integration tests --------- Co-authored-by: Isaac Yang <isaacy@nvidia.com> Co-authored-by: Sean Yang <seany314@gmail.com> use controller name for stats (#2522) Simulator workspace re-design (#2492) * Redesign simulator workspace structure. * working, needs clean. * Changed the simulator workspacce structure to be consistent with POC. * Moved the logfile init to start_server_app(). * optimzed. * adjust the stats pool location. * Addressed the PR views. --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> Simulator end run for all clients (#2514) * Provide an option to run END_RUN for all clients. * Added end_run_all option for simulator to run END_RUN event for all clients. * Fixed a add_argument type, added help message. * Changed to use add_argument(() compatible with python 3.8. * reformat. * rewrite the _end_run_clients() and add docstring for easier understanding. * reformat. * adjusting the locking in the _end_run_clients. * Fixed a potential None pointer error. * renamed the clients_finished_end_run variable. --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Co-authored-by: Sean Yang <seany314@gmail.com> Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> Secure XGBoost Integration (#2512) * Updated FOBS readme to add DatumManager, added agrpcs as secure scheme * Refactoring * Refactored the secure version to histogram_based_v2 * Replaced Paillier with a mock encryptor * Added license header * Put mock back * Added metrics_writer back and fixed GRPC error reply simplify job simulator_run to take only one workspace parameter. (#2528) Fix README. Fix file links in README. Fix file links in README. Add comparison between centralized and federated training code. Add missing client api test jobs (#2535) Fixed the simulator server workspace root dir (#2533) * Fixed the simulator server root dir error. * Added unit test for SimulatorRunner start_server_app. --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Improve InProcessClientAPIExecutor (#2536) * 1. rename ExeTaskFnWrapper class to TaskScriptRunner 2. Replace implementation of the inprocess function exection from calling a main() function to user runpy.run_path() which reduce the user requirements to have main() function 3. redirect print() to logger.info() * 1. rename ExeTaskFnWrapper class to TaskScriptRunner 2. Replace implementation of the inprocess function exection from calling a main() function to user runpy.run_path() which reduce the user requirements to have main() function 3. redirect print() to logger.info() * make result check and result pull use the same configurable variable * rename exec_task_fn_wrapper to task_script_runner.py * fix typo Update README for launching python script. Modify tensorboard logdir. Link to environment setup instructions. expose aggregate_fn to users for overwriting (#2539) FIX MLFLow and Tensorboard Output to be consistent with new Workspace root changes (#2537) * 1) fix mlruns and tb_events dirs due to workspace directory changes 2) for MLFLow, add tracking_rui default to workspace_dir / <job_id>/mlruns instead current default <workspace_dir>/mlruns. This is a) consistent with Tensorboard 2) avoid job output oeverwrite the 1st job * 1) fix mlruns and tb_events dirs due to workspace directory changes 2) for MLFLow, add tracking_rui default to workspace_dir / <job_id>/mlruns instead current default <workspace_dir>/mlruns. This is a) consistent with Tensorboard 2) avoid job output oeverwrite the 1st job * 1) fix mlruns and tb_events dirs due to workspace directory changes 2) for MLFLow, add tracking_rui default to workspace_dir / <job_id>/mlruns instead current default <workspace_dir>/mlruns. This is a) consistent with Tensorboard 2) avoid job output oeverwrite the 1st job * 1. Remove the default code to use configuration 2. fix some broken notebook * rollback changes Fix decorator issue (#2542) Remove line number in code link. FLModel summary (#2544) * add FLModel Summary * format formatting Update KM example, add 2-stage solution without HE (#2541) * add KM without HE, update everything * fix license header * fix license header - update year to 2024 * fix format --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * update license --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Co-authored-by: Holger Roth <hroth@nvidia.com>
- Loading branch information