
[bug] Deadline Exceeded in TFEvent Metrics Collector #877

Closed
gaocegege opened this issue Oct 14, 2019 · 7 comments · Fixed by #881

@gaocegege
Member

/train/metrics/test/events.out.tfevents.1571041915.quick-start-example-jrth8kpc-zj5r4 will be parsed.
/train/metrics/train/events.out.tfevents.1571041914.quick-start-example-jrth8kpc-zj5r4 will be parsed.
In quick-start-example-jrth8kpc 900 metrics will be reported.
Traceback (most recent call last):
  File "main.py", line 45, in <module>
    ), 10)
  File "/usr/local/lib/python2.7/dist-packages/grpc/beta/_client_adaptations.py", line 309, in __call__
    self._request_serializer, self._response_deserializer)
  File "/usr/local/lib/python2.7/dist-packages/grpc/beta/_client_adaptations.py", line 195, in _blocking_unary_unary
    raise _abortion_error(rpc_error_call)
grpc.framework.interfaces.face.face.ExpirationError: ExpirationError(code=StatusCode.DEADLINE_EXCEEDED, details="Deadline Exceeded")

When I run the TFJob example, I get this error in the metrics collector sidecar during training.
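For context on the error above: the `10` passed at `main.py` line 45 is presumably a per-call timeout in seconds, and `StatusCode.DEADLINE_EXCEEDED` means the server did not answer within that window. Here is a stdlib-only analogy (not actual gRPC code; the function names are hypothetical stand-ins) of what a blocking unary call with a deadline does:

```python
# Stdlib analogy of a blocking unary RPC with a deadline: wait up to
# `timeout_s` for the result, then fail. Not real gRPC; names are hypothetical.
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as DeadlineExceeded

def report_observation_log(request):
    """Stand-in for the RPC; sleeps to simulate a slow katib-manager."""
    time.sleep(0.5)
    return "ok"

def blocking_call(fn, request, timeout_s):
    """Mimic a blocking unary-unary call: block until result or deadline."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(fn, request).result(timeout=timeout_s)

status = None
try:
    # Deadline (0.1s) shorter than the "server" latency (0.5s) -> timeout,
    # analogous to reporting 900 metrics in one call within 10 seconds.
    blocking_call(report_observation_log, {"metrics": 900}, timeout_s=0.1)
except DeadlineExceeded:
    status = "DEADLINE_EXCEEDED"
```

With a real gRPC stub the analogous failure surfaces as an `RpcError` whose code is `StatusCode.DEADLINE_EXCEEDED`, as in the traceback above.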

@gaocegege
Member Author

Not sure if it is a bug in my dev env now.

/cc @hougangliu

@gaocegege
Member Author

/kind bug

@hougangliu
Member

It looks like the error occurs when katib-manager persists metrics into the DB via the ReportObservationLog API. Can you check whether https://github.com/kubeflow/katib/blob/master/examples/v1alpha3/tfjob-example.yaml works in your env?

@gaocegege
Member Author

I will have a look. BTW, what's the expected behavior when there is an error in the metrics collector?

@johnugeorge
Member

We need to have a retry. Also, should we call ReportObservationLog in batches rather than in a single call?
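The retry-plus-batching idea suggested here can be sketched as follows. This is a hedged illustration only: `report_in_batches`, `batch_size`, `max_retries`, and `flaky_report` are hypothetical names, not Katib's actual API.

```python
# Sketch: send metrics in fixed-size batches, retrying each batch with
# exponential backoff instead of one huge call that may hit the deadline.
import time

def report_in_batches(metrics, report_fn, batch_size=100,
                      max_retries=3, backoff_s=0.01):
    """Report `metrics` via `report_fn` in batches, retrying on timeout."""
    for start in range(0, len(metrics), batch_size):
        batch = metrics[start:start + batch_size]
        for attempt in range(max_retries):
            try:
                report_fn(batch)
                break  # batch succeeded
            except TimeoutError:
                if attempt == max_retries - 1:
                    raise  # exhausted retries; surface the error
                time.sleep(backoff_s * (2 ** attempt))

# Fake transport that fails once, then succeeds, to exercise the retry path.
calls = {"n": 0}
def flaky_report(batch):
    calls["n"] += 1
    if calls["n"] == 1:
        raise TimeoutError("Deadline Exceeded")

# 900 metrics (as in the log above) in 3 batches; the first batch is retried
# once, so the transport is invoked 4 times in total.
report_in_batches(list(range(900)), flaky_report, batch_size=300)
```

Batching bounds the work each RPC must finish before its deadline, and retries absorb transient slowness in katib-manager.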

@gaocegege
Member Author

@johnugeorge It could be a feature or enhancement for the metrics collector. I will open an issue for it.

This issue will be fixed by #881. We should have a longer timeout in the TF metrics collector.
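The thread only says the timeout should be longer; one way to pick "longer" is to scale the deadline with the number of metrics to report, bounded below and above. This heuristic is purely illustrative (not the actual change in #881), and all names and constants here are assumptions:

```python
# Hypothetical heuristic: grow the RPC deadline with the metric count,
# with a floor (the old 10s default) and a ceiling to avoid hanging forever.
def rpc_timeout_s(num_metrics, base_s=10, per_metric_s=0.05, cap_s=120):
    """Return a deadline in seconds for reporting `num_metrics` metrics."""
    return min(cap_s, max(base_s, num_metrics * per_metric_s))
```

For the 900 metrics in the log above this would allow 45 seconds instead of 10, while small reports keep the original deadline.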

@johnugeorge
Member

SGTM
