[CI] Ensure that mlflow callback cleans up background-saving threads on trainer teardown. #3683
Merged
jeffkinnison approved these changes on Oct 2, 2023
LGTM for a quick fix, just a couple of suggestions for how we could potentially iterate on this fix.
test_mlflow.py is working with the timeout removal.
Oh, interesting. Do we know why the timeout was causing problems?
#3670 de-duplicates logged metrics to MLFlow by first checking if the metrics for a given training step have already been logged, and skipping if so.

This stateful mechanic ensures that if training completes at the same step as evaluation (default configuration), the `on_eval_end` callback logs metrics and the `on_trainer_teardown` callback skips `self.save_fn()`.

However, the `save_fn` block in `on_trainer_teardown` also takes care of two cleanup steps:

- `self.save_thread.join()` to join the background-saving thread.
- `self.save_fn((..., False))` to break the infinite background loop, `_log_mlflow_loop`, which is started in `on_trainer_train_setup` if background saving is enabled.

These were improperly skipped by the new stateful mechanic, causing `pytest` to hang and reach timeout.

In `on_trainer_teardown`, the fix is to 1) always join the `save_thread` and 2) always call `self.save_fn((None, None, None, False))` to ensure that the background saving loop breaks.

This PR also fixes the recent `integration_test_e` timeouts.

Credit to @jeffkinnison who pinpointed the culprit and issue, and worked with me on this solution.
CC: @Infernaught @geoffreyangus
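
For readers who want to see the shape of the bug and the fix in isolation, here is a minimal, self-contained sketch. It is not the actual callback code: the class name `BackgroundLogger`, the `logged` list standing in for MLflow calls, the `last_logged_step` attribute, and the meaning assigned to the first three tuple slots are all assumptions made for illustration. Only the four-element sentinel ending in `False`, the `save_fn` / `save_thread` names, and the callback names come from the description above.

```python
import queue
import threading


class BackgroundLogger:
    """Hypothetical stand-in for the MLflow callback (illustration only)."""

    def __init__(self):
        self.logged = []              # stands in for calls to the MLflow client
        self.last_logged_step = None  # assumed de-duplication state
        self.save_queue = queue.Queue()
        self.save_fn = self.save_queue.put
        self.save_thread = None

    def on_trainer_train_setup(self):
        # Start the background loop, as when background saving is enabled.
        self.save_thread = threading.Thread(target=self._log_loop, daemon=True)
        self.save_thread.start()

    def _log_loop(self):
        # Runs until it receives a tuple whose last element is False.
        while True:
            step, metrics, _artifacts, keep_going = self.save_queue.get()
            if step is not None:
                self.logged.append((step, metrics))
            if not keep_going:
                break

    def _log_once(self, step, metrics):
        # Skip metrics that were already logged for this step.
        if step != self.last_logged_step:
            self.last_logged_step = step
            self.save_fn((step, metrics, None, True))

    def on_eval_end(self, step, metrics):
        self._log_once(step, metrics)

    def on_trainer_teardown(self, step, metrics):
        self._log_once(step, metrics)
        # Cleanup happens unconditionally, outside the de-duplication check.
        if self.save_thread is not None:
            self.save_fn((None, None, None, False))  # break the loop
            self.save_thread.join()                  # then wait for the thread
            self.save_thread = None
```

In this sketch the sentinel is enqueued before `join()` so that the join can return. If the two cleanup calls were nested inside the de-duplication check, a teardown at an already-logged step would leave the loop blocked on the queue, which is the kind of hang described above.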