* Catch TensorFlow FailedPreconditionErrors and force the process restart.

  * Attempt to fix kubeflow#88
  * We see some predictions succeed but then subsequent ones fail.

* Try to deal with workload identity issues by testing for a service account
  on startup.
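
(Illustrative only, not part of this commit's diff: a minimal sketch of what such a
startup check could look like, assuming the worker resolves Application Default
Credentials via the google-auth library. The helper name check_service_account and
the exit-on-failure behavior are assumptions, not the committed code.)

import logging
import sys

import google.auth
from google.auth.exceptions import DefaultCredentialsError


def check_service_account():
    """Fail fast if no service account credentials can be resolved."""
    try:
        # Under Workload Identity this should return credentials for the
        # Google service account bound to the pod's Kubernetes service account.
        _, project = google.auth.default()
    except DefaultCredentialsError as e:
        logging.fatal(f"Could not obtain service account credentials: {e}. "
                      f"Workload Identity may be misconfigured; exiting so the "
                      f"pod restarts instead of failing on its first request.")
        sys.exit(1)
    logging.info(f"Obtained default credentials for project {project}")
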
Jeremy Lewi committed Jan 18, 2020
1 parent aeea157 commit 11d7656
Showing 1 changed file with 16 additions and 0 deletions.
py/label_microservice/worker.py: 16 additions, 0 deletions
@@ -187,6 +187,22 @@ def callback(message):
f"The program will restart to try to recover.")
sys.exit(1)

# TODO(jlewi): I observed cases where some of the initial inferences
# would succeed but on subsequent ones it started failing
# see: https://github.com/kubeflow/code-intelligence/issues/70#issuecomment-570491289
# Restarting is a bit of a hack. We should try to figure out
# why it's happening and fix it.
except tf_errors.FailedPreconditionError as e:
logging.fatal(f"Exception occurred while handling issue "
f"{repo_owner}/{repo_name}#{issue_num}. \n"
f"Exception: {e}\n"
f"{traceback.format_exc()}\n."
f"This usually indicates an issue with "
f"trying to use the model in a thread different "
f"from the one it was created in. "
f"The program will restart to try to recover.")
sys.exit(1)

#TODO(jlewi): We should catch a more narrow exception.
# On exception if we don't ack the message then we risk problems
# caused by poison pills repeatedly crashing our workers
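
(Illustrative only: the TODO above notes that not acking the message risks a
poison pill repeatedly crashing workers. A rough sketch of that idea, assuming
the standard google-cloud-pubsub streaming-pull callback; handle_issue, the
tensorflow errors import, and the ack-before-exit ordering are assumptions,
not the committed change.)

import logging
import sys
import traceback

from tensorflow import errors as tf_errors


def handle_issue(message):
    """Hypothetical stand-in for running the label model on the issue payload."""
    pass


def callback(message):
    try:
        handle_issue(message)
        message.ack()
    except tf_errors.FailedPreconditionError as e:
        # Ack first so the message is not redelivered as a poison pill, then
        # exit and let the pod's restart policy bring up a fresh worker, as the
        # committed change above does (but without the ack).
        message.ack()
        logging.fatal(f"FailedPreconditionError while handling a message: {e}\n"
                      f"{traceback.format_exc()}")
        sys.exit(1)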
