[CI] Ensure that mlflow callback cleans up background-saving threads on trainer teardown. #3683

justinxzhao · 2023-10-02T17:07:33Z

#3670 de-duplicates the metrics logged to MLflow by first checking whether the metrics for a given training step have already been logged, and skipping that step if so.

This stateful mechanic ensures that if training completes at the same step as evaluation (default configuration), the on_eval_end callback logs metrics and the on_trainer_teardown callback skips self.save_fn().
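For readers unfamiliar with #3670, the step check can be sketched roughly as follows. This is a minimal illustration and not the actual Ludwig code: `DedupLogger`, `logged_steps`, `sent`, and `log_once` are hypothetical names, and a plain list stands in for the MLflow client.

```python
class DedupLogger:
    """Hypothetical stand-in for the callback's step-tracking logic."""

    def __init__(self):
        self.logged_steps = set()  # steps whose metrics were already logged
        self.sent = []             # stand-in for what would be sent to MLflow

    def log_once(self, step, metrics):
        """Log metrics for `step` unless that step was already logged."""
        if step in self.logged_steps:
            return False  # duplicate: skip
        self.logged_steps.add(step)
        self.sent.append((step, metrics))
        return True


logger = DedupLogger()
first = logger.log_once(100, {"loss": 0.5})   # evaluation ends at step 100
second = logger.log_once(100, {"loss": 0.5})  # teardown at the same step
```

Here the second call returns without logging, which mirrors on_trainer_teardown skipping self.save_fn() when evaluation already logged the final step.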

However, the save_fn block in on_trainer_teardown also takes care of two cleanup steps:

- self.save_thread.join() to join the background-saving thread.
- self.save_fn((..., False)) to break the infinite background loop, _log_mlflow_loop, which is started in on_trainer_train_setup if background saving is enabled.

The new stateful mechanic improperly skipped both steps, so the background thread never exited and pytest hung until it hit the timeout.

In on_trainer_teardown, the fix is to 1) always join the save_thread and 2) always call self.save_fn((None, None, None, False)) to ensure that the background saving loop breaks.
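The shape of the fix can be sketched with the standard library alone. This is a simplified illustration under stated assumptions, not the code in ludwig/contribs/mlflow/__init__.py: `BackgroundLogger`, `log_loop`, `sink`, and the method signatures are hypothetical, and the tuple layout `(metrics, step, path, should_continue)` is only assumed to match the four-element tuple mentioned above. In this sketch the sentinel is enqueued before the join, because joining first would block forever on a loop that has not been told to stop.

```python
import queue
import threading


def log_loop(q, sink):
    """Consume (metrics, step, path, should_continue) tuples until told to stop."""
    should_continue = True
    while should_continue:
        metrics, step, path, should_continue = q.get()
        if metrics is not None:
            sink.append((step, metrics))  # stand-in for the real MLflow call


class BackgroundLogger:
    """Hypothetical callback that logs through a background thread."""

    def __init__(self):
        self.sink = []
        self.logged_steps = set()
        self.queue = queue.Queue()
        self.save_fn = self.queue.put
        self.save_thread = threading.Thread(
            target=log_loop, args=(self.queue, self.sink), daemon=True
        )
        self.save_thread.start()

    def on_eval_end(self, step, metrics):
        if step not in self.logged_steps:
            self.logged_steps.add(step)
            self.save_fn((metrics, step, None, True))

    def on_trainer_teardown(self, step, metrics):
        # Logging stays conditional on the de-duplication check.
        if step not in self.logged_steps:
            self.logged_steps.add(step)
            self.save_fn((metrics, step, None, True))
        # Cleanup is unconditional: stop the loop, then wait for the thread.
        self.save_fn((None, None, None, False))
        self.save_thread.join()


cb = BackgroundLogger()
cb.on_eval_end(100, {"loss": 0.5})
cb.on_trainer_teardown(100, {"loss": 0.5})  # same step: logging is skipped
thread_alive = cb.save_thread.is_alive()
```

Because the queue is first-in, first-out, any metrics enqueued earlier are processed before the sentinel, so no pending log is lost when the loop exits.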

This PR also fixes the recent integration_test_e timeouts.

Credit to @jeffkinnison who pinpointed the culprit and issue, and worked with me on this solution.

CC: @Infernaught @geoffreyangus

jeffkinnison

LGTM for a quick fix. Just a couple of suggestions for how we could potentially iterate on it.

ludwig/contribs/mlflow/__init__.py

github-actions · 2023-10-02T17:57:40Z

Unit Test Results

6 files ±0, 6 suites ±0, duration 40m 50s ⏱️ (-1m 25s)
31 tests ±0: 26 ✔️ ±0, 5 💤 ±0, 0 ❌ ±0
82 runs ±0: 66 ✔️ ±0, 16 💤 ±0, 0 ❌ ±0

Results for commit 364c39f. ± Comparison against base commit 3a0e2b2.

♻️ This comment has been updated with latest results.

justinxzhao · 2023-10-02T20:16:28Z

test_mlflow.py is working with the timeout removal.

jeffkinnison · 2023-10-02T21:07:54Z

> test_mlflow.py is working with the timeout removal.

Oh, interesting. Do we know why the timeout was causing problems?

justinxzhao added 6 commits October 2, 2023 12:05

Ensure threads get closed.

12989f4

Remove print.

7154774

Back out preprocessing change.

26b9a69

Spurious whitespace.

a5936d6

Add docstrings.

1d9bc35

Comments.

b351b13

justinxzhao requested review from jeffkinnison, geoffreyangus and Infernaught October 2, 2023 17:07

jeffkinnison approved these changes Oct 2, 2023

justinxzhao added 5 commits October 2, 2023 19:22

Remove timeout.

ab9c940

Temporarily comment out tests except for mlflow.

307ee68

Fix

ba38bb4

test_mlflow.

1f66e2c

Comment out regular tests.

7b4dd8d

Full tests.

a5584bb

Skip Aim test.

364c39f

justinxzhao merged commit e46a989 into master Oct 2, 2023

justinxzhao deleted the fix_mlflow_callback branch October 2, 2023 22:50
