UI could help in troubleshooting any pods stuck in ContainerCreating state #1711

Closed
mattnworb opened this issue Jul 31, 2019 · 4 comments · Fixed by #3304

Labels
area/frontend, area/pipelines, area/troubleshoot, help wanted, kind/feature, priority/p1, status/triaged

Comments

@mattnworb
Contributor

What happened:

If a pipeline is run where the pod for one of the steps is stuck in ContainerCreating or any other non-running state, the Pipeline UI shows what state the pod is in, but not why it is in that state or what the user running the experiment should do to resolve it.

Additionally, an error message is shown about not being able to view the logs. This happens because there are no logs to display for a pod that has not yet run, but the phrasing is somewhat confusing.

[Screenshot: image-2019-07-31-09-47-47-750]

What did you expect to happen:

It would be really helpful to ML engineers running an experiment if the UI attempted to diagnose the problem, or at least displayed the events from the Kubernetes pod. In the case of the screenshot above, the events would show that the pod was attempting to mount a Secret that does not actually exist.
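
As a rough illustration of the "display the events" idea, the sketch below filters a pod's events down to warnings and formats them for display. The field names (`type`, `reason`, `message`, `lastTimestamp`) follow the Kubernetes core/v1 Event object, but the `PodEvent` interface, the `summarizeWarnings` helper, and the sample data are hypothetical and not part of the Pipelines codebase.

```typescript
// Minimal subset of the Kubernetes core/v1 Event fields used in this sketch.
interface PodEvent {
  type: string; // "Normal" or "Warning"
  reason: string; // e.g. "FailedMount"
  message: string;
  lastTimestamp?: string; // RFC 3339 timestamp, may be absent
}

// Hypothetical helper: keep only Warning events, newest first,
// and format each one as "<reason>: <message>".
function summarizeWarnings(events: PodEvent[]): string[] {
  return events
    .filter(e => e.type === 'Warning')
    .sort((a, b) => (b.lastTimestamp || '').localeCompare(a.lastTimestamp || ''))
    .map(e => `${e.reason}: ${e.message}`);
}

// Illustrative sample data (assumed, not taken from the issue).
const sampleEvents: PodEvent[] = [
  {
    type: 'Normal',
    reason: 'Scheduled',
    message: 'Successfully assigned pod to node',
    lastTimestamp: '2019-07-31T09:40:00Z',
  },
  {
    type: 'Warning',
    reason: 'FailedMount',
    message: 'MountVolume.SetUp failed for volume "creds" : secret "my-secret" not found',
    lastTimestamp: '2019-07-31T09:41:00Z',
  },
];

const warnings = summarizeWarnings(sampleEvents);
console.log(warnings.join('\n'));
```

Showing only Warning events keeps the output focused on what is blocking the pod; a real UI might also want the Normal events for context.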

As-is, the user running the experiment has to inspect the state of the Kubernetes cluster to troubleshoot the problem, and we have found that the engineer running the experiment often does not have the experience or background to do so effectively.

Anything else you would like to add:
For what it is worth, this is less of a bug than feedback and a feature request.

@paveldournov
Contributor

@mattnworb do the Stackdriver logs provide any info if you navigate to them using the links in the UI?

@mattnworb
Contributor Author

In the case where a pod can't be created because its PodSpec tries to mount a volume/secret that doesn't exist, I don't think the Stackdriver logs help, because there are no container logs to show if none of the containers have run yet.

@mattnworb
Contributor Author

To add more context to the above comment: when a pod is in ContainerCreating, the UI shows two links:

Logs can still be viewed in either [Legacy Stackdriver] or in [Stackdriver Kubernetes Monitoring]

The former links to a Stackdriver Logging query like:

resource.type="container"
resource.labels.cluster_name:...
resource.labels.pod_id:...

The latter is a query for:

resource.type="k8s_container"
resource.labels.cluster_name:...
resource.labels.pod_name:...

Since the container has not run, there are no logs from the container.

However, an additional link to view the Kubernetes Events for the pod could be helpful:

resource.type="k8s_pod"
resource.labels.cluster_name=...
resource.labels.pod_name=...

This would help because when a volume cannot be mounted (for example, because the referenced Secret doesn't exist), an Event is generated and attached to the Pod.
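
A minimal sketch of how such a link could be built is below. `buildPodEventsFilter` reproduces the `k8s_pod` query above, with the values quoted. `buildPodEventsUrl` is hypothetical: the base URL is supplied by the caller, and the `advancedFilter` query parameter name is an assumption about the log viewer, not something confirmed in this thread.

```typescript
// Builds the Stackdriver Logging filter for a pod's Kubernetes Events,
// mirroring the three-line k8s_pod query shown above.
function buildPodEventsFilter(clusterName: string, podName: string): string {
  return [
    'resource.type="k8s_pod"',
    `resource.labels.cluster_name="${clusterName}"`,
    `resource.labels.pod_name="${podName}"`,
  ].join('\n');
}

// Hypothetical helper: attach the URL-encoded filter to a caller-supplied
// log viewer URL. The "advancedFilter" parameter name is an assumption.
function buildPodEventsUrl(viewerBaseUrl: string, clusterName: string, podName: string): string {
  const filter = encodeURIComponent(buildPodEventsFilter(clusterName, podName));
  const separator = viewerBaseUrl.indexOf('?') >= 0 ? '&' : '?';
  return `${viewerBaseUrl}${separator}advancedFilter=${filter}`;
}

// Example with placeholder values.
console.log(buildPodEventsFilter('my-cluster', 'my-pod'));
```

Keeping the filter builder separate from the URL builder means the same filter text can be shown to the user for copy and paste if the link format changes.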

@Bobgy added the area/frontend, status/triaged, and priority/p1 labels on Feb 25, 2020
@Bobgy
Contributor

Bobgy commented Feb 25, 2020

I think this would be very useful for debugging; it also aligns with #3112.
