[ML] Catch exceptions during pytorch_inference startup #103873

Merged

Conversation

droberts195
Contributor

Two test failures (#103808 and #103868) reveal that an exception thrown while connecting to the pytorch_inference process can go uncaught and therefore stop the whole node.

This change does not fix the underlying failure to connect to the process, which is what those issues are about. It does, however, downgrade the error from one that crashes the whole node to one that fails only the affected model deployment.
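
The diff itself is not shown in this conversation, so the following is only a minimal sketch of the general pattern described above, not the actual Elasticsearch change. The names `StartupFailureSketch`, `ProcessConnector`, `Listener`, `startDeployment` and `describeOutcome` are all hypothetical. The idea is that an exception raised while connecting is caught at the call site and routed to the deployment's failure callback instead of escaping the thread.

```java
// Hypothetical illustration only: none of these types exist in Elasticsearch.
final class StartupFailureSketch {

    /** Assumed callback with separate success and failure paths. */
    interface Listener {
        void onResponse(String handle);
        void onFailure(Exception e);
    }

    /** Assumed stand-in for the step that connects to the native process. */
    interface ProcessConnector {
        String connect() throws Exception;
    }

    /**
     * Connects to the process. Any exception thrown during the connection
     * is reported through the listener rather than propagating further up.
     */
    static void startDeployment(ProcessConnector connector, Listener listener) {
        String handle;
        try {
            handle = connector.connect();
        } catch (Exception e) {
            // Fail only this deployment; do not let the exception escape.
            listener.onFailure(e);
            return;
        }
        // Called outside the try block so that an exception thrown by the
        // success callback is not misreported as a connection failure.
        listener.onResponse(handle);
    }

    /** Small helper that turns the callback result into a string. */
    static String describeOutcome(ProcessConnector connector) {
        StringBuilder outcome = new StringBuilder();
        startDeployment(connector, new Listener() {
            @Override
            public void onResponse(String handle) {
                outcome.append("started: ").append(handle);
            }

            @Override
            public void onFailure(Exception e) {
                outcome.append("failed: ").append(e.getMessage());
            }
        });
        return outcome.toString();
    }

    public static void main(String[] args) {
        System.out.println(describeOutcome(() -> "pytorch_inference-1"));
        System.out.println(describeOutcome(() -> {
            throw new IllegalStateException("could not connect");
        }));
    }
}
```

In this sketch the caller of `startDeployment` always receives exactly one callback, so a connection failure surfaces as a failed deployment rather than as an uncaught exception.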

@elasticsearchmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

@elasticsearchmachine added the Team:ML (Meta label for the ML team) label on Jan 3, 2024
@elasticsearchmachine
Collaborator

Hi @droberts195, I've created a changelog YAML for you.

Member

@davidkyle left a comment

LGTM

@droberts195
Contributor Author

@elasticsearchmachine run elasticsearch-ci/bwc-snapshots

@droberts195 merged commit 65d597c into elastic:main on Jan 4, 2024
15 checks passed
@droberts195 deleted the catch_pytorch_startup_exceptions branch on January 4, 2024 13:37
@elasticsearchmachine
Collaborator

💚 Backport successful

Status Branch Result
8.12

droberts195 added a commit to droberts195/elasticsearch that referenced this pull request Jan 4, 2024
elasticsearchmachine pushed a commit that referenced this pull request Jan 4, 2024
…3911)

Labels
>bug :ml Machine learning Team:ML Meta label for the ML team v8.12.0 v8.13.0
3 participants