Bug Description
When executing KServe integration tests with an additional server, the tests started to fail intermittently. In 90% of the cases the tests could not complete.
This PR exposed the issue.
Failed run:
https://github.com/canonical/kserve-operators/actions/runs/6187990896
Successful run:
https://github.com/canonical/kserve-operators/actions/runs/6190038075
From the initial investigation it looks like there are not enough resources in the GH runner to complete the tests:
test_charm:test_charm.py:308 mlserver-sklearn-iris is not ready {'lastTransitionTime': '2023-09-14T19:56:19Z', 'message': 'Revision "mlserver-sklearn-iris-predictor-default-00001" failed with message: 0/1 nodes are available: 1 Insufficient cpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod..', 'reason': 'RevisionFailed', 'severity': 'Info', 'status': 'False', 'type': 'PredictorConfigurationReady'}
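To confirm that failures like this come from CPU pressure on the runner, a quick check is to compare the node's allocatable CPU with the sum of CPU requests already scheduled on it. Below is a minimal diagnostic sketch, assuming lightkube is available in the test environment; the helper names are illustrative, not part of the test suite.

```python
# Diagnostic sketch (assumption: lightkube installed) comparing the node's
# allocatable CPU with the total CPU requested by scheduled pods.
from lightkube import Client
from lightkube.resources.core_v1 import Node, Pod


def cpu_to_cores(quantity: str) -> float:
    """Convert a Kubernetes CPU quantity such as '250m' or '2' to cores."""
    return float(quantity[:-1]) / 1000 if quantity.endswith("m") else float(quantity)


def summarize_cpu_pressure(node_name: str) -> None:
    """Print allocatable CPU on the node versus the sum of pod CPU requests."""
    client = Client()
    allocatable = cpu_to_cores(client.get(Node, name=node_name).status.allocatable["cpu"])

    requested = 0.0
    for pod in client.list(Pod, namespace="*"):  # "*" lists pods across all namespaces
        if pod.spec.nodeName != node_name:
            continue
        for container in pod.spec.containers:
            requests = (container.resources.requests or {}) if container.resources else {}
            requested += cpu_to_cores(requests.get("cpu", "0"))

    print(f"node {node_name}: allocatable={allocatable:.2f} cores, requested={requested:.2f} cores")


# Example (node name taken from the events below):
# summarize_cpu_pressure("fv-az42-917")
```

When the requested total approaches the allocatable total, new predictor pods stay Pending with exactly the "0/1 nodes are available: 1 Insufficient cpu" message shown above.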
Setting limits.cpu: 250m and deleting (and asserting deletion of) the test resources solved the issue (see the PR above).
The resource limitation does not affect just KServe; it has been observed in other repositories as well.
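For reference, here is a minimal sketch of what the two mitigations could look like in an integration test, assuming lightkube and tenacity are available; the CRD coordinates, resource names, and timeout are assumptions for illustration and not the exact code from the PR above.

```python
# Sketch only (assumptions: lightkube + tenacity); not the actual test code
# from the linked PR.
from lightkube import ApiError, Client
from lightkube.generic_resource import create_namespaced_resource
from tenacity import retry, retry_if_exception_type, stop_after_delay, wait_fixed

InferenceService = create_namespaced_resource(
    "serving.kserve.io", "v1beta1", "InferenceService", "inferenceservices"
)

# Keeping the predictor CPU request/limit small leaves room on the single
# GH-runner node; this would go under spec.predictor.<framework>.resources
# in the InferenceService manifest used by the test.
PREDICTOR_RESOURCES = {"requests": {"cpu": "250m"}, "limits": {"cpu": "250m"}}


@retry(
    retry=retry_if_exception_type(AssertionError),
    stop=stop_after_delay(300),  # illustrative timeout
    wait=wait_fixed(5),
    reraise=True,
)
def assert_deleted(client: Client, name: str, namespace: str) -> None:
    """Assert that the InferenceService is gone, retrying until the timeout."""
    try:
        client.get(InferenceService, name, namespace=namespace)
    except ApiError as err:
        if err.status.code == 404:
            return  # resource is really gone
        raise
    raise AssertionError(f"{name} still exists in {namespace}")


def cleanup(client: Client, name: str, namespace: str) -> None:
    """Delete the test InferenceService and block until it has disappeared."""
    client.delete(InferenceService, name, namespace=namespace)
    assert_deleted(client, name, namespace)
```

Asserting deletion (rather than only issuing the delete) matters here because the next test otherwise starts while the previous predictor pods still hold their CPU requests on the node.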
To Reproduce
Trigger pull request workflow.
Environment
GH runners
Relevant Log Output
# K8S logs:
CONTROLLER_NAME: github-pr-29756-microk8s
NAMESPACE LAST SEEN TYPE REASON OBJECT MESSAGE
default 5m14s Warning FreeDiskSpaceFailed node/fv-az42-917 failed to garbage collect required amount of images. Wanted to free 11716064051 bytes, but freed 317164 bytes
default 5m14s Warning ImageGCFailed node/fv-az42-917 failed to garbage collect required amount of images. Wanted to free 11716064051 bytes, but freed 317164 bytes
knative-serving 3m30s Warning FailedGetResourceMetric horizontalpodautoscaler/activator failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server could not find the requested resource (get pods.metrics.k8s.io)
knative-serving 3m27s Warning FailedGetResourceMetric horizontalpodautoscaler/webhook failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server could not find the requested resource (get pods.metrics.k8s.io)
test-charm-sa6s 2m9s Warning FailedGetResourceMetric horizontalpodautoscaler/istiod failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server could not find the requested resource (get pods.metrics.k8s.io)
default 2s Warning FreeDiskSpaceFailed node/fv-az42-917 failed to garbage collect required amount of images. Wanted to free 16135041843 bytes, but freed 293648095 bytes
default 1s Warning ImageGCFailed node/fv-az42-917 failed to garbage collect required amount of images. Wanted to free 16135041843 bytes, but freed 293648095 bytes
Additional Context
N/A