Bug Description
When executing KServe integration tests with an additional server, the tests started to fail intermittently. In 90% of the cases the tests could not complete.
This PR exposed the issue.
Failed run:
https://github.com/canonical/kserve-operators/actions/runs/6187990896
Successful run:
https://github.com/canonical/kserve-operators/actions/runs/6190038075
From the initial investigation it looks like there are not enough resources in the GH runner to complete the tests:
test_charm:test_charm.py:308 mlserver-sklearn-iris is not ready {'lastTransitionTime': '2023-09-14T19:56:19Z', 'message': 'Revision "mlserver-sklearn-iris-predictor-default-00001" failed with message: 0/1 nodes are available: 1 Insufficient cpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod..', 'reason': 'RevisionFailed', 'severity': 'Info', 'status': 'False', 'type': 'PredictorConfigurationReady'}
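To confirm that failures like this come from CPU pressure on the runner, a quick check is to compare the node's allocatable CPU with the sum of CPU requests already scheduled on it. Below is a minimal diagnostic sketch, assuming lightkube is available in the test environment; the helper names are illustrative, not part of the test suite.

```python
# Diagnostic sketch (assumption: lightkube installed) comparing the node's
# allocatable CPU with the total CPU requested by scheduled pods.
from lightkube import Client
from lightkube.resources.core_v1 import Node, Pod


def cpu_to_cores(quantity: str) -> float:
    """Convert a Kubernetes CPU quantity such as '250m' or '2' to cores."""
    return float(quantity[:-1]) / 1000 if quantity.endswith("m") else float(quantity)


def summarize_cpu_pressure(node_name: str) -> None:
    """Print allocatable CPU on the node versus the sum of pod CPU requests."""
    client = Client()
    allocatable = cpu_to_cores(client.get(Node, name=node_name).status.allocatable["cpu"])

    requested = 0.0
    for pod in client.list(Pod, namespace="*"):  # "*" lists pods across all namespaces
        if pod.spec.nodeName != node_name:
            continue
        for container in pod.spec.containers:
            requests = (container.resources.requests or {}) if container.resources else {}
            requested += cpu_to_cores(requests.get("cpu", "0"))

    print(f"node {node_name}: allocatable={allocatable:.2f} cores, requested={requested:.2f} cores")


# Example (node name taken from the events below):
# summarize_cpu_pressure("fv-az42-917")
```

When the requested total approaches the allocatable total, new predictor pods stay Pending with exactly the "0/1 nodes are available: 1 Insufficient cpu" message shown above.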
Setting limits.cpu: 250m and deleting (and asserting deletion of) the test resources solved the issue (see the PR above).
The resource limitation does not affect just KServe; it has been observed in other repositories as well.
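For reference, here is a minimal sketch of what the two mitigations could look like in an integration test, assuming lightkube and tenacity are available; the CRD coordinates, resource names, and timeout are assumptions for illustration and not the exact code from the PR above.

```python
# Sketch only (assumptions: lightkube + tenacity); not the actual test code
# from the linked PR.
from lightkube import ApiError, Client
from lightkube.generic_resource import create_namespaced_resource
from tenacity import retry, retry_if_exception_type, stop_after_delay, wait_fixed

InferenceService = create_namespaced_resource(
    "serving.kserve.io", "v1beta1", "InferenceService", "inferenceservices"
)

# Keeping the predictor CPU request/limit small leaves room on the single
# GH-runner node; this would go under spec.predictor.<framework>.resources
# in the InferenceService manifest used by the test.
PREDICTOR_RESOURCES = {"requests": {"cpu": "250m"}, "limits": {"cpu": "250m"}}


@retry(
    retry=retry_if_exception_type(AssertionError),
    stop=stop_after_delay(300),  # illustrative timeout
    wait=wait_fixed(5),
    reraise=True,
)
def assert_deleted(client: Client, name: str, namespace: str) -> None:
    """Assert that the InferenceService is gone, retrying until the timeout."""
    try:
        client.get(InferenceService, name, namespace=namespace)
    except ApiError as err:
        if err.status.code == 404:
            return  # resource is really gone
        raise
    raise AssertionError(f"{name} still exists in {namespace}")


def cleanup(client: Client, name: str, namespace: str) -> None:
    """Delete the test InferenceService and block until it has disappeared."""
    client.delete(InferenceService, name, namespace=namespace)
    assert_deleted(client, name, namespace)
```

Asserting deletion (rather than only issuing the delete) matters here because the next test otherwise starts while the previous predictor pods still hold their CPU requests on the node.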
To Reproduce
Trigger pull request workflow.
Environment
GH runners
Relevant Log Output
# K8S logs:
CONTROLLER_NAME: github-pr-29756-microk8s
NAMESPACE LAST SEEN TYPE REASON OBJECT MESSAGE
default 5m14s Warning FreeDiskSpaceFailed node/fv-az42-917 failed to garbage collect required amount of images. Wanted to free 11716064051 bytes, but freed 317164 bytes
default 5m14s Warning ImageGCFailed node/fv-az42-917 failed to garbage collect required amount of images. Wanted to free 11716064051 bytes, but freed 317164 bytes
knative-serving 3m30s Warning FailedGetResourceMetric horizontalpodautoscaler/activator failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server could not find the requested resource (get pods.metrics.k8s.io)
knative-serving 3m27s Warning FailedGetResourceMetric horizontalpodautoscaler/webhook failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server could not find the requested resource (get pods.metrics.k8s.io)
test-charm-sa6s 2m9s Warning FailedGetResourceMetric horizontalpodautoscaler/istiod failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server could not find the requested resource (get pods.metrics.k8s.io)
default 2s Warning FreeDiskSpaceFailed node/fv-az42-917 failed to garbage collect required amount of images. Wanted to free 16135041843 bytes, but freed 293648095 bytes
default 1s Warning ImageGCFailed node/fv-az42-917 failed to garbage collect required amount of images. Wanted to free 16135041843 bytes, but freed 293648095 bytes
Additional Context
N/A