Fix MLflow and TensorBoard output to be consistent with new workspace root changes (#2537)

* 1) Fix the mlruns and tb_events directories after the workspace directory changes.
2) For MLflow, default tracking_uri to <workspace_dir>/<job_id>/mlruns instead of the current default <workspace_dir>/mlruns. This a) is consistent with TensorBoard and b) keeps a later job's output from overwriting the first job's.

* 1. Remove the default from code and use configuration instead
2. Fix some broken notebooks

* rollback changes
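
The per-job default described in this commit message can be sketched in a few lines of Python. This is only an illustration, not NVFlare's implementation: `default_tracking_uri` is a hypothetical helper, the workspace and first job ID are assumed from the simulator paths that appear in this commit's diffs, and the second job ID is made up.

```python
from pathlib import PurePosixPath


def default_tracking_uri(workspace_dir: str, job_id: str) -> str:
    """Build a file:// URI of the form <workspace_dir>/<job_id>/mlruns.

    Hypothetical helper for illustration; workspace_dir is assumed to be
    an absolute POSIX path such as /tmp/nvflare/server.
    """
    mlruns_dir = PurePosixPath(workspace_dir) / job_id / "mlruns"
    return "file://" + str(mlruns_dir)


# Values assumed from the simulator paths shown in this commit.
first = default_tracking_uri("/tmp/nvflare/server", "simulate_job")
# A made-up second job ID, to show that each job gets its own directory.
second = default_tracking_uri("/tmp/nvflare/server", "another_job")

print(first)
print(second)
```

Because the job ID is part of the path, two jobs run in the same workspace write to different `mlruns` directories, which is the overwrite problem the message refers to.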
chesterxgchen authored Apr 30, 2024
1 parent 593a11d commit 6bd0bf5
Showing 5 changed files with 147 additions and 84 deletions.
2 changes: 1 addition & 1 deletion examples/advanced/experiment-tracking/mlflow/README.md
@@ -42,7 +42,7 @@ app_server app_site-1 app_site-2 log.txt tb_events
By default, MLflow will create an experiment log directory under a directory named "mlruns" in the simulator's workspace. If you ran the simulator with "/tmp/nvflare" as the workspace, then you can launch the MLflow UI with:

```
mlflow ui --backend-store-uri /tmp/nvflare/mlruns/
mlflow ui --backend-store-uri /tmp/nvflare/server/simulate_job/mlruns/
```
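
If you would rather build that path in Python than type it out, a minimal sketch follows. It assumes the simulator layout `<workspace>/server/simulate_job/mlruns`; `mlflow_ui_command` is a hypothetical helper, not part of MLflow or NVFlare, and it only assembles the argument list without launching anything.

```python
from pathlib import PurePosixPath
from typing import List


def mlflow_ui_command(
    workspace: str, site: str = "server", job_id: str = "simulate_job"
) -> List[str]:
    """Return the argument list for `mlflow ui` pointing at a job's mlruns directory.

    Hypothetical helper; the site and job_id defaults are assumptions taken
    from the simulator paths used in this README.
    """
    backend = PurePosixPath(workspace) / site / job_id / "mlruns"
    return ["mlflow", "ui", "--backend-store-uri", str(backend)]


cmd = mlflow_ui_command("/tmp/nvflare")
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd)` to start the UI from a script.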

### 4. MLflow Streaming
@@ -88,14 +88,38 @@
"\n",
"```\n",
"nvflare simulator -w /tmp/nvflare/ -n 2 -t 2 ./jobs/hello-pt-tb-mlflow\n",
"```"
"```\n",
"\n",
"Alternatively, set PYTHONPATH programmatically:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "faeb00d2-fa1f-4a95-b2b3-0029d3a4a671",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"import os\n",
"\n",
"# Parent of the notebook's working directory\n",
"parent_directory = os.path.dirname(os.getcwd())\n",
"\n",
"# Get the current PYTHONPATH as a list of entries\n",
"current_paths = [p for p in os.environ.get('PYTHONPATH', '').split(os.pathsep) if p]\n",
"\n",
"# Add the path if it's not already there\n",
"if parent_directory not in current_paths:\n",
"    os.environ['PYTHONPATH'] = os.pathsep.join([parent_directory] + current_paths)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c8f08cef",
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"!nvflare simulator -w /tmp/nvflare/ -n 2 -t 2 ./jobs/hello-pt-tb-mlflow"
@@ -120,10 +144,31 @@
"To view training metrics that are being streamed to the server, run:\n",
"\n",
"```\n",
"tensorboard --logdir=/tmp/nvflare/simulate_job/tb_events\n",
"tensorboard --logdir=/tmp/nvflare/server/simulate_job/tb_events\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "40441575-95e6-47ec-907a-af93e1c77949",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"!tensorboard --logdir=/tmp/nvflare/server/simulate_job/tb_events"
]
},
{
"cell_type": "markdown",
"id": "c85b6330-9a99-4751-ac03-307c4da9f0c5",
"metadata": {},
"source": [
"> Note: remember to stop the cell above before running the next cell."
]
},
{
"cell_type": "markdown",
"id": "534d7879",
@@ -137,7 +182,7 @@
"To view training metrics that are being streamed to the server, run:\n",
"\n",
"```\n",
"mlflow ui --backend-store-uri=/tmp/nvflare/mlruns\n",
"mlflow ui --backend-store-uri=/tmp/nvflare/server/simulate_job/mlruns\n",
"```\n",
"\n",
"Then \n",
@@ -149,11 +194,29 @@
"cell_type": "code",
"execution_count": null,
"id": "da1e7952-c3e6-4e90-a42e-648a823ede78",
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"!mlflow ui --backend-store-uri=/tmp/nvflare/mlruns"
"!mlflow ui --backend-store-uri=/tmp/nvflare/server/simulate_job/mlruns"
]
},
{
"cell_type": "markdown",
"id": "278a1d7b-a71d-4b50-bbd8-61bc5122d423",
"metadata": {},
"source": [
"> Note: remember to stop the cell above."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "486046c6-0b74-4d95-925b-e175799df6f9",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
@@ -172,7 +235,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.17"
"version": "3.8.18"
}
},
"nbformat": 4,
@@ -49,6 +49,7 @@
"id": "mlflow_receiver_with_tracking_uri",
"path": "nvflare.app_opt.tracking.mlflow.mlflow_receiver.MLflowReceiver",
"args": {
"tracking_uri": "file:///{WORKSPACE}/{JOB_ID}/mlruns",
"kwargs": {
"experiment_name": "hello-pt-experiment",
"run_name": "hello-pt-with-mlflow",
115 changes: 45 additions & 70 deletions examples/hello-world/hello-numpy-sag/hello_numpy_sag.ipynb
@@ -55,26 +55,12 @@
},
{
"cell_type": "code",
"execution_count": 1,
"execution_count": null,
"id": "c3dbde69",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Using memory store\n",
"SystemInfo\n",
"server_info:\n",
"status: stopped, start_time: Thu Jan 26 11:52:39 2023\n",
"client_info:\n",
"site_a(last_connect_time: Thu Jan 26 14:56:33 2023)\n",
"site_b(last_connect_time: Thu Jan 26 14:56:33 2023)\n",
"job_info:\n",
"\n"
]
}
],
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"import os\n",
"from nvflare.fuel.flare_api.flare_api import new_secure_session\n",
@@ -98,25 +84,33 @@
"source": [
"### 4. Submit the Job with the FLARE API\n",
"\n",
"With a session successfully connected, you can use `submit_job()` to submit your job. You can change `path_to_example_job` to the location of the job you are submitting. If your session is not active, go back to the previous step and connect with a session."
"With a session successfully connected, you can use `submit_job()` to submit your job. You can change `path_to_example_job` to the location of the job you are submitting. If your session is not active, go back to the previous step and connect with a session.\n",
"\n",
"With the POC command, the examples are linked into the following directory: `/tmp/nvflare/poc/example_project/prod_00/admin@nvidia.com/transfer`"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b3589b60-434b-4b6d-97bc-74e95bbc7b52",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"!ls -l /tmp/nvflare/poc/example_project/prod_00/admin@nvidia.com/transfer\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"execution_count": null,
"id": "c8f08cef",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"cd3f69d4-c78b-47fa-8db0-bb6325cc7be5 was submitted\n"
]
}
],
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"path_to_example_job = \"/workspace/NVFlare/examples/hello-numpy-sag/jobs/hello-numpy-sag\"\n",
"path_to_example_job = \"hello-world/hello-numpy-sag/jobs/hello-numpy-sag\"\n",
"job_id = sess.submit_job(path_to_example_job)\n",
"print(job_id + \" was submitted\")"
]
@@ -137,32 +131,12 @@
},
{
"cell_type": "code",
"execution_count": 3,
"execution_count": null,
"id": "03fd93d0",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'name': 'hello-numpy-sag', 'job_folder_name': 'hello-numpy-sag', 'resource_spec': {}, 'deploy_map': {'hello-numpy-sag': ['@ALL']}, 'min_clients': 1, 'submitter_name': 'super@a.org', 'submitter_org': 'org_a', 'submitter_role': 'project_admin', 'job_id': 'cd3f69d4-c78b-47fa-8db0-bb6325cc7be5', 'submit_time': 1674763013.0868688, 'submit_time_iso': '2023-01-26T14:56:53.086869-05:00', 'start_time': '2023-01-26 14:56:54.702227', 'duration': 'N/A', 'status': 'RUNNING', 'job_deploy_detail': ['server: OK', 'site_a: OK', 'site_b: OK'], 'schedule_count': 1, 'last_schedule_time': 1674763013.6890023, 'schedule_history': ['2023-01-26 14:56:53: scheduled']}\n",
"{'name': 'hello-numpy-sag', 'job_folder_name': 'hello-numpy-sag', 'resource_spec': {}, 'deploy_map': {'hello-numpy-sag': ['@ALL']}, 'min_clients': 1, 'submitter_name': 'super@a.org', 'submitter_org': 'org_a', 'submitter_role': 'project_admin', 'job_id': 'cd3f69d4-c78b-47fa-8db0-bb6325cc7be5', 'submit_time': 1674763013.0868688, 'submit_time_iso': '2023-01-26T14:56:53.086869-05:00', 'start_time': '2023-01-26 14:56:54.702227', 'duration': 'N/A', 'status': 'RUNNING', 'job_deploy_detail': ['server: OK', 'site_a: OK', 'site_b: OK'], 'schedule_count': 1, 'last_schedule_time': 1674763013.6890023, 'schedule_history': ['2023-01-26 14:56:53: scheduled']}\n",
"{'name': 'hello-numpy-sag', 'job_folder_name': 'hello-numpy-sag', 'resource_spec': {}, 'deploy_map': {'hello-numpy-sag': ['@ALL']}, 'min_clients': 1, 'submitter_name': 'super@a.org', 'submitter_org': 'org_a', 'submitter_role': 'project_admin', 'job_id': 'cd3f69d4-c78b-47fa-8db0-bb6325cc7be5', 'submit_time': 1674763013.0868688, 'submit_time_iso': '2023-01-26T14:56:53.086869-05:00', 'start_time': '2023-01-26 14:56:54.702227', 'duration': 'N/A', 'status': 'RUNNING', 'job_deploy_detail': ['server: OK', 'site_a: OK', 'site_b: OK'], 'schedule_count': 1, 'last_schedule_time': 1674763013.6890023, 'schedule_history': ['2023-01-26 14:56:53: scheduled']}\n",
"....................\n",
"{'name': 'hello-numpy-sag', 'job_folder_name': 'hello-numpy-sag', 'resource_spec': {}, 'deploy_map': {'hello-numpy-sag': ['@ALL']}, 'min_clients': 1, 'submitter_name': 'super@a.org', 'submitter_org': 'org_a', 'submitter_role': 'project_admin', 'job_id': 'cd3f69d4-c78b-47fa-8db0-bb6325cc7be5', 'submit_time': 1674763013.0868688, 'submit_time_iso': '2023-01-26T14:56:53.086869-05:00', 'start_time': '2023-01-26 14:56:54.702227', 'duration': '0:00:46.957211', 'status': 'FINISHED:COMPLETED', 'job_deploy_detail': ['server: OK', 'site_a: OK', 'site_b: OK'], 'schedule_count': 1, 'last_schedule_time': 1674763013.6890023, 'schedule_history': ['2023-01-26 14:56:53: scheduled']}\n"
]
},
{
"data": {
"text/plain": [
"<MonitorReturnCode.JOB_FINISHED: 0>"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from nvflare.fuel.flare_api.flare_api import Session, basic_cb_with_print\n",
"\n",
@@ -182,25 +156,26 @@
},
{
"cell_type": "code",
"execution_count": 4,
"execution_count": null,
"id": "b0d8aa9c",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'time': '2023-01-26 14:57:45.968111', 'data': [{'type': 'table', 'rows': [['CLIENT', 'RESPONSE'], ['site_a', 'Shutdown the client...'], ['site_b', 'Shutdown the client...']]}, {'type': 'success', 'data': ''}], 'meta': {'status': 'ok', 'info': ''}, 'status': <APIStatus.SUCCESS: 'SUCCESS'>}\n",
"{'time': '2023-01-26 14:57:46.882510', 'data': [{'type': 'string', 'data': 'FL app has been shutdown.'}, {'type': 'shutdown', 'data': 'Bye bye'}, {'type': 'success', 'data': ''}], 'meta': {'status': 'ok', 'info': ''}, 'status': <APIStatus.SUCCESS: 'SUCCESS'>}\n"
]
}
],
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"print(sess.api.do_command(\"shutdown client\"))\n",
"print(sess.api.do_command(\"shutdown server\"))\n",
"\n",
"sess.close()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "331c0ba2-8abe-47a3-a864-18dcb7489a44",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
@@ -219,7 +194,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.17"
"version": "3.8.18"
},
"vscode": {
"interpreter": {
@@ -71,13 +71,15 @@
"cell_type": "code",
"execution_count": null,
"id": "de430380",
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"! nvflare job create -j /tmp/nvflare/jobs/cifar10_sag_pt_mlflow -w sag_pt_mlflow \\\n",
"-f meta.conf min_clients=2 \\\n",
"-f config_fed_client.conf app_script=train_with_mlflow.py app_config=\"--batch_size 6 --dataset_path /tmp/nvflare/data/cifar10 --num_workers 2\" \\\n",
"-f config_fed_server.conf num_rounds=5 experiment_name=\"nvflare-sag-pt-experiment\" run_name=\"nvflare-sag-pt-with-mlflow\" tracking_uri=\\\"\\\" \\\n",
"-f config_fed_server.conf num_rounds=5 experiment_name=\"nvflare-sag-pt-experiment\" run_name=\"nvflare-sag-pt-with-mlflow\" tracking_uri=\\\"file:///{WORKSPACE}/{JOB_ID}/mlruns\\\" \\\n",
"-sd ../code/fl \\\n",
"-force"
]
@@ -130,7 +132,9 @@
"cell_type": "code",
"execution_count": null,
"id": "17323f61",
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"! python ../data/download.py --dataset_path /tmp/nvflare/data/cifar10"
@@ -171,9 +175,9 @@
"\n",
"If the tracking_uri is specified, you can directly go to the tracking_uri to view the results\n",
"\n",
"If the tracking_uri is not specified, the results will be saved in `/tmp/nvflare/cifar10_sag_pt_mlflow/mlruns/`\n",
"If the tracking_uri is not specified, the results will be saved in `/tmp/nvflare/cifar10_sag_pt_mlflow/server/simulate_job/mlruns/`\n",
"\n",
"You can then run the mlflow command: `mlflow ui --port 5000` inside the directory `/tmp/nvflare/cifar10_sag_pt_mlflow/`\n",
"You can then run the mlflow command: `mlflow ui --port 5000` inside the directory `/tmp/nvflare/cifar10_sag_pt_mlflow/server/simulate_job`\n",
"\n",
"You should then see something similar to the following screenshot:\n",
"\n",
@@ -182,6 +186,26 @@
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a86a4ab4-00d0-4907-b770-71969ffb15ac",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"!mlflow ui --port 5000 --backend-store-uri /tmp/nvflare/cifar10_sag_pt_mlflow/server/simulate_job/mlruns\n"
]
},
{
"cell_type": "markdown",
"id": "0211edd5-35e3-4af5-bc81-bd906325a4c4",
"metadata": {},
"source": [
"Make sure you stop the cell above when you are done reviewing the MLflow results."
]
},
{
"cell_type": "markdown",
"id": "58037d1e",
@@ -207,7 +231,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.2"
"version": "3.8.18"
}
},
"nbformat": 4,