[bug] Deadline Exceeded in TFEvent Metrics Collector #877
Comments
Not sure if it is a bug in my dev env now. /cc @hougangliu
/kind bug
It looks like the error occurs when katib-manager persists metrics into the DB through the ReportObservationLog API. Can you check whether https://github.com/kubeflow/katib/blob/master/examples/v1alpha3/tfjob-example.yaml works in your env?
I will have a look. BTW, what's the expected behavior when there is an error in the metrics collector?
We need to have a retry. Also, should we call ReportObservationLog in batches rather than in a single call?
@johnugeorge It could be a feature or enhancement for the metrics collector. I will open an issue for it. This issue will be fixed by #881. We should have a longer timeout in the TF metrics collector.
SGTM
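The retry, batching, and longer-timeout ideas discussed above could look roughly like the sketch below. It is only an illustration, not the Katib implementation: `report_with_retry`, `batched`, and the `report` callable (a stand-in for whatever client call wraps ReportObservationLog) are hypothetical names, and the batch size, timeout, and backoff values are arbitrary assumptions.

```python
import time
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive chunks of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def report_with_retry(
    report: Callable[[List[T], float], None],  # hypothetical wrapper around the RPC
    logs: Sequence[T],
    batch_size: int = 100,      # assumed value, tune for your DB
    timeout_s: float = 30.0,    # assumed "longer timeout" passed to each call
    max_attempts: int = 3,
    backoff_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Send logs in batches, retrying a failed batch with exponential backoff.

    Returns the total number of calls made to `report`.
    Re-raises the last error once a batch has failed `max_attempts` times.
    """
    calls = 0
    for batch in batched(logs, batch_size):
        for attempt in range(1, max_attempts + 1):
            calls += 1
            try:
                report(batch, timeout_s)
                break  # batch delivered, move on to the next one
            except Exception:
                # Broad catch for the sketch only; a real gRPC client would
                # more likely catch grpc.RpcError and inspect the status code.
                if attempt == max_attempts:
                    raise
                sleep(backoff_s * 2 ** (attempt - 1))
    return calls
```

Retrying per batch means a timeout only re-sends the chunk that failed, not every metric collected so far. Whether re-sending a batch is safe depends on how the server handles duplicate entries, which this sketch does not address.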
When I run the TFJob example, I get the error in the metrics collector sidecar during training.