* Catch TensorFlow FailedPreconditionErrors and force the process restart.

  * Attempt to fix kubeflow#88
  * We see some predictions succeed but then subsequent ones fail.

* Try to deal with workload identity issues by testing for a service account
  on startup.
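
(Illustrative only, not part of this commit's diff: a minimal sketch of what such a
startup check could look like, assuming the worker resolves Application Default
Credentials via the google-auth library. The helper name check_service_account and
the exit-on-failure behavior are assumptions, not the committed code.)

import logging
import sys

import google.auth
from google.auth.exceptions import DefaultCredentialsError


def check_service_account():
    """Fail fast if no service account credentials can be resolved."""
    try:
        # Under Workload Identity this should return credentials for the
        # Google service account bound to the pod's Kubernetes service account.
        _, project = google.auth.default()
    except DefaultCredentialsError as e:
        logging.fatal(f"Could not obtain service account credentials: {e}. "
                      f"Workload Identity may be misconfigured; exiting so the "
                      f"pod restarts instead of failing on its first request.")
        sys.exit(1)
    logging.info(f"Obtained default credentials for project {project}")
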
Jeremy Lewi committed Jan 18, 2020
1 parent aeea157 commit 11d7656
Showing 1 changed file with 16 additions and 0 deletions.
py/label_microservice/worker.py: 16 additions, 0 deletions
@@ -187,6 +187,22 @@ def callback(message):
f"The program will restart to try to recover.")
sys.exit(1)

# TODO(jlewi): I observed cases where some of the initial inferences
# would succeed but on subsequent ones it started failing
# see: https://github.com/kubeflow/code-intelligence/issues/70#issuecomment-570491289
# Restarting is a bit of a hack. We should try to figure out
# why it's happening and fix it.
except tf_errors.FailedPreconditionError as e:
logging.fatal(f"Exception occurred while handling issue "
f"{repo_owner}/{repo_name}#{issue_num}. \n"
f"Exception: {e}\n"
f"{traceback.format_exc()}\n."
f"This usually indicates an issue with "
f"trying to use the model in a thread different "
f"from the one it was created in. "
f"The program will restart to try to recover.")
sys.exit(1)

#TODO(jlewi): We should catch a more narrow exception.
# On exception if we don't ack the message then we risk problems
# caused by poison pills repeatedly crashing our workers
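
(Illustrative only: the TODO above notes that not acking the message risks a
poison pill repeatedly crashing workers. A rough sketch of that idea, assuming
the standard google-cloud-pubsub streaming-pull callback; handle_issue, the
tensorflow errors import, and the ack-before-exit ordering are assumptions,
not the committed change.)

import logging
import sys
import traceback

from tensorflow import errors as tf_errors


def handle_issue(message):
    """Hypothetical stand-in for running the label model on the issue payload."""
    pass


def callback(message):
    try:
        handle_issue(message)
        message.ack()
    except tf_errors.FailedPreconditionError as e:
        # Ack first so the message is not redelivered as a poison pill, then
        # exit and let the pod's restart policy bring up a fresh worker, as the
        # committed change above does (but without the ack).
        message.ack()
        logging.fatal(f"FailedPreconditionError while handling a message: {e}\n"
                      f"{traceback.format_exc()}")
        sys.exit(1)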
