Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CI] Ensure that mlflow callback cleans up background-saving threads on trainer teardown. #3683

Merged
merged 13 commits into from
Oct 2, 2023

Conversation

justinxzhao
Copy link
Contributor

@justinxzhao justinxzhao commented Oct 2, 2023

#3670 de-duplicates logged metrics to MLFlow by first checking if the metrics for a given training step have already been logged, and skipping if so.

This stateful mechanic ensures that if training completes at the same step as evaluation (default configuration), the on_eval_end callback logs metrics and the on_trainer_teardown callback skips self.save_fn().

However, the save_fn block in on_trainer_teardown also takes care of two cleanup steps:

  1. self.save_thread.join() to join the background-saving thread.
  2. self.save_fn((..., False)) to break the infinite background loop, _log_mlflow_loop, which is started in on_trainer_train_setup if background saving is enabled.

These were improperly skipped by the new stateful mechanic, causing pytest to hang and reach timeout.

In on_trainer_teardown, the fix is to 1) always join the save_thread and 2) always call self.save_fn((None, None, None, False)) to ensure that the background saving loop breaks.

This PR also fixes the recent integration_test_e timeouts.

Credit to @jeffkinnison who pinpointed the culprit and issue, and worked with me on this solution.

CC: @Infernaught @geoffreyangus

Copy link
Contributor

@jeffkinnison jeffkinnison left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM for a quick fix, just a couple of suggestions for how we could potentially iterate on this fix.

ludwig/contribs/mlflow/__init__.py Outdated Show resolved Hide resolved
ludwig/contribs/mlflow/__init__.py Show resolved Hide resolved
@github-actions
Copy link

github-actions bot commented Oct 2, 2023

Unit Test Results

  6 files  ±0    6 suites  ±0   40m 50s ⏱️ - 1m 25s
31 tests ±0  26 ✔️ ±0    5 💤 ±0  0 ±0 
82 runs  ±0  66 ✔️ ±0  16 💤 ±0  0 ±0 

Results for commit 364c39f. ± Comparison against base commit 3a0e2b2.

♻️ This comment has been updated with latest results.

@justinxzhao
Copy link
Contributor Author

test_mlflow.py is working with the timeout removal.

@jeffkinnison
Copy link
Contributor

test_mlflow.py is working with the timeout removal.

Oh, interesting. Do we know why the timeout was causing problems?

@justinxzhao justinxzhao merged commit e46a989 into master Oct 2, 2023
@justinxzhao justinxzhao deleted the fix_mlflow_callback branch October 2, 2023 22:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants