Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Mlflow 403 PL UserError #1623

Merged
merged 11 commits into from
Oct 31, 2024
Merged

Add Mlflow 403 PL UserError #1623

merged 11 commits into from
Oct 31, 2024

Conversation

mattyding
Copy link
Contributor

@mattyding mattyding commented Oct 28, 2024

This PR

Adds networking UserError to Foundry.

Abandoning PFC change in favor of this. This try-catch should sufficiently cover the 403 networking errors we are seeing. Additionally, change is safer as it avoids duplicating logic in composer.

Testing

Was gonna add a unit test that mocked a MLflow 500 exception but the test doesn't really fit anywhere in the testing suite and has strong assumptions, so not much benefit. Idk if others have thoughts

Rollout (including migrations)

DLE release

@mattyding mattyding changed the title Matt/pl preflight Add PL UserError Oct 28, 2024
@mattyding mattyding marked this pull request as ready for review October 30, 2024 01:41
@mattyding mattyding requested a review from a team as a code owner October 30, 2024 01:41
llmfoundry/callbacks/hf_checkpointer.py Outdated Show resolved Hide resolved
mattyding and others added 2 commits October 30, 2024 15:00
Co-authored-by: Daniel King <43149077+dakinggg@users.noreply.github.com>
Copy link
Contributor

@nancyhung nancyhung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also add a try/catch to the read operation when we download data via uc.files.download(...)? @mattyding

@mattyding
Copy link
Contributor Author

mattyding commented Oct 30, 2024

Can we also add a try/catch to the read operation when we download data via uc.files.download(...)? @mattyding

I thought motivation for the PFC->wrapping change is that we have no examples of customer runs with data read errors and can't validate the PFC check. So any try/catch wrapping we do for read would be a best guess.

I say we proceed with just the 403 artifacts error for now because we have ample evidence for that. If another data read error comes up (still waiting on Rohit's logs), then can add another wrap

Copy link
Contributor

@nancyhung nancyhung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

llmfoundry/callbacks/hf_checkpointer.py Show resolved Hide resolved
@mattyding mattyding changed the title Add PL UserError Add Mlflow 403 PL UserError Oct 31, 2024
@mattyding mattyding merged commit c91416e into main Oct 31, 2024
9 checks passed
@mattyding mattyding deleted the matt/pl-preflight branch October 31, 2024 19:01
dakinggg added a commit that referenced this pull request Nov 1, 2024
Co-authored-by: Daniel King <43149077+dakinggg@users.noreply.github.com>
dakinggg added a commit that referenced this pull request Nov 1, 2024
Co-authored-by: Daniel King <43149077+dakinggg@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants