[BUG]: DFP DUO integrated training pipeline fails with dask error in 24.10 #1916

Closed
2 tasks done
dagardner-nv opened this issue Sep 30, 2024 · 0 comments · Fixed by #1931

dagardner-nv commented Sep 30, 2024

Version

24.10

Which installation method(s) does this occur on?

Source

Describe the bug.

Creating dask cluster...
Creating dask cluster... Done. Dashboard: http://192.168.4.51:8787/status
S3 objects to DF complete. Rows: 86, Cache: miss, Duration: 791.4469242095947 ms, Rate: 108.66174012349194 rows/s
Stopping dask cluster...
Stopping dask cluster... Done.
Failed to download logs. Error: 
Traceback (most recent call last):
  File "/home/dagardner/work/m2/python/morpheus/morpheus/controllers/file_to_df_controller.py", line 166, in _get_or_create_dataframe_from_batch
    dfs = self._downloader.download(download_buckets, download_method_func)
  File "/home/dagardner/work/m2/python/morpheus/morpheus/utils/downloader.py", line 160, in download
    with self.get_dask_client() as dist:
  File "/home/dagardner/work/m2/python/morpheus/morpheus/utils/downloader.py", line 125, in get_dask_client
    return dask.distributed.Client(self.get_dask_cluster())
  File "/home/dagardner/work/conda/envs/m2/lib/python3.10/site-packages/distributed/client.py", line 914, in __init__
    raise RuntimeError(
RuntimeError: Trying to connect to an already closed or closing Cluster LocalCluster(8ada266f, 'inproc://192.168.4.51/156737/1', workers=0, threads=0, memory=0 B).

Minimum reproducible example

Terminal 1:

cd examples/digital_fingerprinting/production
docker compose build
docker compose up mlflow

Terminal 2:

cd examples/digital_fingerprinting/production/morpheus
python dfp_integrated_training_batch_pipeline.py --tracking_uri="http://localhost:5000" \
        --log_level DEBUG \
        --silence_monitors \
        --source duo \
        --start_time "2022-08-01" \
        --duration "60d" \
        --train_users generic \
        --input_file "./control_messages/duo_payload_training.json"

Relevant log output

Creating dask cluster...
Creating dask cluster... Done. Dashboard: http://192.168.4.51:8787/status
S3 objects to DF complete. Rows: 86, Cache: miss, Duration: 791.4469242095947 ms, Rate: 108.66174012349194 rows/s
Stopping dask cluster...
Stopping dask cluster... Done.
Failed to download logs. Error:
Traceback (most recent call last):
  File "/home/dagardner/work/m2/python/morpheus/morpheus/controllers/file_to_df_controller.py", line 166, in _get_or_create_dataframe_from_batch
    dfs = self._downloader.download(download_buckets, download_method_func)
  File "/home/dagardner/work/m2/python/morpheus/morpheus/utils/downloader.py", line 160, in download
    with self.get_dask_client() as dist:
  File "/home/dagardner/work/m2/python/morpheus/morpheus/utils/downloader.py", line 125, in get_dask_client
    return dask.distributed.Client(self.get_dask_cluster())
  File "/home/dagardner/work/conda/envs/m2/lib/python3.10/site-packages/distributed/client.py", line 914, in __init__
    raise RuntimeError(
RuntimeError: Trying to connect to an already closed or closing Cluster LocalCluster(8ada266f, 'inproc://192.168.4.51/156737/1', workers=0, threads=0, memory=0 B).
Error while converting S3 buckets to DF.

Full env printout

[Paste the results of print_env.sh here, it will be hidden by default]

Other/Misc.

No response

Code of Conduct

  • I agree to follow Morpheus' Code of Conduct
  • I have searched the open bugs and have found no duplicates for this bug report
@dagardner-nv dagardner-nv added the bug Something isn't working label Sep 30, 2024
@dagardner-nv dagardner-nv self-assigned this Oct 2, 2024
@morpheus-bot-test morpheus-bot-test bot moved this from Todo to Review - Ready for Review in Morpheus Boards Oct 2, 2024
@mdemoret-nv mdemoret-nv added this to the 24.10 - Release milestone Oct 16, 2024
rapids-bot bot pushed a commit that referenced this issue Oct 16, 2024
* `FileToDFController` now exposes a `download_method` constructor argument, giving the caller control over the `Downloader` class's download method.
* Since the `file_to_df_loader` module creates and closes a `FileToDFController` instance on a per-call basis, it now sets `download_method=SINGLE_THREAD`. This fixes an issue where dask was being shut down early, after the first message.
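
The lifecycle problem described above can be illustrated with a small, dependency-free sketch. It is not the Morpheus implementation: `FakeCluster`, `FakeDownloader`, `DownloadMethod`, `handle_message`, and `run` are hypothetical stand-ins, and the idea that the cluster is shared across instances is an assumption drawn from the traceback, where a new client is handed a cluster that has already been closed.

```python
from enum import Enum


class DownloadMethod(Enum):
    """Hypothetical stand-in for the downloader's method selector."""
    SINGLE_THREAD = "single_thread"
    DASK = "dask"


class FakeCluster:
    """Stand-in for a dask cluster; it only tracks open/closed state."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


class FakeDownloader:
    """Hypothetical downloader whose cluster outlives any one instance."""

    _shared_cluster = None  # assumption: shared across instances

    def __init__(self, download_method):
        self.download_method = download_method

    def download(self, items):
        if self.download_method is DownloadMethod.SINGLE_THREAD:
            # No cluster is ever created on this path.
            return [item.upper() for item in items]

        if FakeDownloader._shared_cluster is None:
            FakeDownloader._shared_cluster = FakeCluster()
        if FakeDownloader._shared_cluster.closed:
            raise RuntimeError("Trying to connect to an already closed or closing Cluster")
        return [item.upper() for item in items]

    def close(self):
        cluster = FakeDownloader._shared_cluster
        if cluster is not None:
            cluster.close()


def handle_message(items, download_method):
    """Mimics a loader that builds and closes its helper on every call."""
    downloader = FakeDownloader(download_method)
    try:
        return downloader.download(items)
    finally:
        downloader.close()


def run(download_method):
    """Process two messages in a row and collect results or errors."""
    FakeDownloader._shared_cluster = None  # start from a clean state
    results = []
    for batch in (["a"], ["b"]):
        try:
            results.append(handle_message(batch, download_method))
        except RuntimeError as err:
            results.append(f"error: {err}")
    return results


if __name__ == "__main__":
    print(run(DownloadMethod.DASK))           # second message fails
    print(run(DownloadMethod.SINGLE_THREAD))  # both messages succeed
```

Under these assumptions, the per-call close tears down the cluster after the first message, so the second message fails in the `DASK` case, while the `SINGLE_THREAD` path never touches a cluster and is unaffected by the repeated create/close cycle.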

Closes [#1916](#1916)

## By Submitting this PR I confirm:
- I am familiar with the [Contributing Guidelines](https://github.com/nv-morpheus/Morpheus/blob/main/docs/source/developer_guide/contributing.md).
- When the PR is ready for review, new or existing tests cover these changes.
- When the PR is ready for review, the documentation is up to date with these changes.

Authors:
  - David Gardner (https://github.com/dagardner-nv)

Approvers:
  - Michael Demoret (https://github.com/mdemoret-nv)

URL: #1931
@github-project-automation github-project-automation bot moved this from Review - Ready for Review to Done in Morpheus Boards Oct 16, 2024