Training Operator pod failed to start on OCP 4.10.30 with error "memory limit too low" #1661

vtlrazin · 2022-09-14T11:48:55Z

Hi,
The training operator pod fails to start on an OpenShift v4.10.30 cluster with the following error:

# oc -n kubeflow get pod
NAME                                 READY   STATUS                 RESTARTS   AGE
training-operator-5cc8cdfdd6-fpthk   0/1     CreateContainerError   0          2m29s

Warning  Failed  113s   kubelet  Error: container create failed: time="2022-09-14T11:35:13Z" level=error msg="runc create failed: unable to start container process: container init was OOM-killed (memory limit too low?)"

System info:
OCP - v4.10.30
Worker node - NVIDIA DGX A100

The proposed solution is to increase the resource limits in the training operator deployment to:

        resources:
          limits:
            cpu: 500m
            memory: 300Mi
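
For context, the limits above would sit under the operator container in the Deployment's pod template. The fragment below is only a sketch of where they go, written as a patch rather than a complete manifest. The Deployment and container name `training-operator` are assumptions inferred from the pod name in the output above; the namespace `kubeflow` is taken from the `oc` command. Memory requests and other fields are left out because the issue does not state them.

```yaml
# Patch-style fragment, not a complete Deployment manifest.
# Assumed (not confirmed in this issue): the Deployment and its container
# are both named "training-operator".
apiVersion: apps/v1
kind: Deployment
metadata:
  name: training-operator   # assumed name
  namespace: kubeflow       # namespace from the oc output above
spec:
  template:
    spec:
      containers:
        - name: training-operator   # assumed container name
          resources:
            limits:
              cpu: 500m
              memory: 300Mi   # raised limit proposed in this issue
```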

Best regards,
Vitaliy

omrishiv mentioned this issue Sep 22, 2022: Update deployment.yaml #1668 (Merged)

google-oss-prow bot closed this as completed in #1668 Sep 27, 2022
