Fix Prediction Job Failures (Aug 2024) #50

Merged
merged 1 commit into from
Aug 23, 2024


lcjohnso
Member

Changes the wait_for_success parameter on the Azure batchmodels JobPreparationTask from False to True. The False setting was meant to save $$$ in case the job prep code failed and left a node running indefinitely. With True, the prediction task waits for file copying to complete before it runs.

This change is an attempt to fix prediction job failures that appear tied to missing files (e.g., RuntimeError: DataLoader timed out after 600 seconds). The fix fits the symptoms: rerunning failed jobs leads to success, which suggests that simply waiting resolves the underlying problem. Perhaps some of the data files had not finished copying over via the job preparation task when the prediction task started, causing the timeout.
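
The suspected race can be illustrated with a toy model. This is only a sketch of the hypothesis above, not the project's code: simulate_node, NodeOutcome, and the one-file-per-tick copy rate are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class NodeOutcome:
    task_start_tick: int
    files_present_at_start: int
    files_needed: int

    @property
    def task_sees_all_files(self) -> bool:
        return self.files_present_at_start >= self.files_needed


def simulate_node(wait_for_success: bool, files_needed: int = 5) -> NodeOutcome:
    """Toy model: the prep task copies one file per tick, starting at tick 0."""
    # With wait_for_success=True the prediction task is scheduled only after
    # the prep task finishes. With False it may start while copying is still
    # in progress; this model uses tick 0, the worst case.
    task_start_tick = files_needed if wait_for_success else 0
    files_present = min(task_start_tick, files_needed)
    return NodeOutcome(task_start_tick, files_present, files_needed)


for flag in (False, True):
    outcome = simulate_node(wait_for_success=flag)
    print(f"wait_for_success={flag}: all files present = {outcome.task_sees_all_files}")
```

In this model a rerun on a node where copying has already finished would also see all files, which matches the observation that rerunning jobs leads to success.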

@lcjohnso lcjohnso requested a review from Tooyosi August 23, 2024 16:07
@lcjohnso lcjohnso merged commit a74e2d2 into main Aug 23, 2024
1 check passed
@lcjohnso lcjohnso deleted the fixfailures-Aug2024 branch August 23, 2024 16:12
2 participants