Deploy model but pod is evicted many times before running #515

Closed
dyoung23 opened this issue Apr 18, 2019 · 12 comments

@dyoung23

Sometimes the pod is evicted many, many times before it finally runs.

serve0315-serve0315-a4e34ed-84fb7dbbd7-rlgj8          0/2     Evicted            0          3h    <none>            192.168.13.135
serve0315-serve0315-a4e34ed-84fb7dbbd7-slxmn          0/2     Evicted            0          3h    <none>            192.168.13.135
serve0315-serve0315-a4e34ed-84fb7dbbd7-xsqhd          2/2     Running            0          3h    192.168.234.125   192.168.13.124

The events for the evicted pod are as follows:

...
Events:
  Type     Reason                 Age                From                     Message
  ----     ------                 ----               ----                     -------
  Normal   Scheduled              13m                default-scheduler        Successfully assigned serve0315-serve0315-a4e34ed-84fb7dbbd7-rlgj8 to 192.168.13.135
  Normal   SuccessfulMountVolume  13m                kubelet, 192.168.13.135  MountVolume.SetUp succeeded for volume "podinfo"
  Normal   SuccessfulMountVolume  13m                kubelet, 192.168.13.135  MountVolume.SetUp succeeded for volume "default-token-vkdmn"
  Normal   SuccessfulMountVolume  13m (x2 over 13m)  kubelet, 192.168.13.135  MountVolume.SetUp succeeded for volume "315-ceph-pv"
  Warning  Evicted                12m (x2 over 13m)  kubelet, 192.168.13.135  The node was low on resource: memory.
  Warning  ExceededGracePeriod    12m (x2 over 13m)  kubelet, 192.168.13.135  Container runtime did not kill the pod within specified grace period.
  Normal   Pulling                12m                kubelet, 192.168.13.135  pulling image "192.168.12.41:5000/seldon-mock-classifier:1.0"
  Normal   Pulled                 12m                kubelet, 192.168.13.135  Successfully pulled image "192.168.12.41:5000/seldon-mock-classifier:1.0"
  Normal   Created                12m                kubelet, 192.168.13.135  Created container
  Normal   Started                12m                kubelet, 192.168.13.135  Started container
  Normal   Pulled                 12m                kubelet, 192.168.13.135  Container image "192.168.12.41:5000/engine:0.2.7-SNAPSHOT" already present on machine
  Normal   Created                11m                kubelet, 192.168.13.135  Created container
  Normal   Started                10m                kubelet, 192.168.13.135  Started container
  Warning  FailedPreStopHook      10m                kubelet, 192.168.13.135  Exec lifecycle hook ([/bin/sh -c curl 127.0.0.1:8000/pause && /bin/sleep 10]) for Container "seldon-container-engine" in Pod "serve0315-serve0315-a4e34ed-84fb7dbbd7-rlgj8_kf(a9527db7-6186-11e9-9f44-fa163e446fda)" failed - error: command '/bin/sh -c curl 127.0.0.1:8000/pause && /bin/sleep 10' exited with 7:   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (7) Failed to connect to 127.0.0.1 port 8000: Connection refused
, message: "  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\n                                 Dload  Upload   Total   Spent    Left  Speed\n\r  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (7) Failed to connect to 127.0.0.1 port 8000: Connection refused\n"
  Normal  Killing  10m  kubelet, 192.168.13.135  Killing container with id docker://seldon-container-engine:Need to kill Pod
  Normal  Killing  10m  kubelet, 192.168.13.135  Killing container with id docker://serve0315:Need to kill Pod

@ukclivecox
Contributor

I think the pertinent line is:

 Warning  Evicted                12m (x2 over 13m)  kubelet, 192.168.13.135  The node was low on resource: memory.

Can you try with a larger cluster? Are you using minikube? If so, maybe increase its memory, e.g.:

minikube start --memory 4096 --cpus 6
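
To confirm it really is memory pressure on that node, something like this should show it (a rough sketch - the node name is taken from your output above, and kubectl top needs metrics-server or heapster installed):

kubectl describe node 192.168.13.135 | grep -A8 'Conditions:'   # look for MemoryPressure=True
kubectl top node 192.168.13.135                                 # current memory usage on the node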

@dyoung23
Author

@cliveseldon Thanks for your reply.
I'm using a large cluster shared with my colleagues, but I don't know why the pod is always scheduled onto this node.

And I noticed that line, but last time I deployed a model on this same node and it did end up running.

admin-1-10-admin-1-10-bea2a16-5dbbb9bfb6-v8zxr        2/2     Running            1          14h   192.168.53.177    192.168.13.135
admin-1-10-admin-1-10-bea2a16-5dbbb9bfb6-v9q5d        0/2     Evicted            0          15h   <none>            192.168.13.135
admin-1-10-admin-1-10-bea2a16-5dbbb9bfb6-vfnpc        0/2     Evicted            0          16h   <none>            192.168.13.135
admin-1-10-admin-1-10-bea2a16-5dbbb9bfb6-vgrf7        0/2     Evicted            0          14h   <none>            192.168.13.135
admin-1-10-admin-1-10-bea2a16-5dbbb9bfb6-vgwqm        0/2     Evicted            0          15h   <none>            192.168.13.135
admin-1-10-admin-1-10-bea2a16-5dbbb9bfb6-vxcbk        0/2     Evicted            0          15h   <none>            192.168.13.135
admin-1-10-admin-1-10-bea2a16-5dbbb9bfb6-w9ddb        0/2     Evicted            0          16h   <none>            192.168.13.135

And I'm just curious what the curl to 127.0.0.1:8000/pause is doing, and why it then needs to kill the pod.

Warning  FailedPreStopHook      10m                kubelet, 192.168.13.135  Exec lifecycle hook ([/bin/sh -c curl 127.0.0.1:8000/pause && /bin/sleep 10]) for Container "seldon-container-engine" in Pod "serve0315-serve0315-a4e34ed-84fb7dbbd7-rlgj8_kf(a9527db7-6186-11e9-9f44-fa163e446fda)" failed - error: command '/bin/sh -c curl 127.0.0.1:8000/pause && /bin/sleep 10' exited with 7:   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (7) Failed to connect to 127.0.0.1 port 8000: Connection refused
, message: "  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\n                                 Dload  Upload   Total   Spent    Left  Speed\n\r  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (7) Failed to connect to 127.0.0.1 port 8000: Connection refused\n"
  Normal  Killing  10m  kubelet, 192.168.13.135  Killing container with id docker://seldon-container-engine:Need to kill Pod
  Normal  Killing  10m  kubelet, 192.168.13.135  Killing container with id docker://serve0315:Need to kill Pod

@ukclivecox
Contributor

The curl to /pause is a preStop handler that tells the svc-orchestrator to do a graceful shutdown. It is only called once Kubernetes has sent a termination signal to the pod, so I don't think it is connected to the issue. It sounds more like the resource requests aren't right - maybe add more memory to them?
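
For example, something like this shows whether the pod currently has any memory request at all (a rough sketch - pod and namespace names are taken from the output earlier in this thread):

# A pod with no requests runs as BestEffort and is the first to be evicted under memory pressure.
kubectl -n kf get pod serve0315-serve0315-a4e34ed-84fb7dbbd7-xsqhd -o jsonpath='{.status.qosClass}'
kubectl -n kf get pod serve0315-serve0315-a4e34ed-84fb7dbbd7-xsqhd -o jsonpath='{.spec.containers[*].resources}'

Setting resources.requests.memory on the model container in the SeldonDeployment's componentSpecs should both guide the scheduler and lift the pod out of BestEffort.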

@dyoung23
Author

Ah, I see. Thanks for your help.
I just hit another error when requesting predictions following this example.

curl -v 0.0.0.0:8003/seldon/mymodel/api/v0.1/predictions -d '{"data":{"names":["a","b"],"tensor":{"shape":[2,2],"values":[0,0,1,1]}}}' -H "Content-Type: application/json"

But I get 404 Not Found or 500 Internal Server Error randomly.

* About to connect() to 192.168.13.50 port 32300 (#0)
*   Trying 192.168.13.50...
* Connected to 192.168.13.50 (192.168.13.50) port 32300 (#0)
> POST /seldon/admin-1-10/api/v0.1/predictions HTTP/1.1
> User-Agent: curl/7.29.0
> Host: 192.168.13.50:32300
> Accept: */*
> Content-Type: application/json
> Content-Length: 72
> 
* upload completely sent off: 72 out of 72 bytes
< HTTP/1.1 404 Not Found
< content-type: text/plain; charset=utf-8
< x-content-type-options: nosniff
< date: Thu, 18 Apr 2019 08:20:41 GMT
< content-length: 19
< x-envoy-upstream-service-time: 0
< server: envoy
< 
404 page not found
* About to connect() to 192.168.13.50 port 32300 (#0)
*   Trying 192.168.13.50...
* Connected to 192.168.13.50 (192.168.13.50) port 32300 (#0)
> POST /seldon/admin-1-10/api/v0.1/predictions HTTP/1.1
> User-Agent: curl/7.29.0
> Host: 192.168.13.50:32300
> Accept: */*
> Content-Type: application/json
> Content-Length: 72
> 
* upload completely sent off: 72 out of 72 bytes
< HTTP/1.1 500 Internal Server Error
< x-application-context: application:8081
< content-type: application/json;charset=utf-8
< content-length: 170
< date: Thu, 18 Apr 2019 08:20:45 GMT
< x-envoy-upstream-service-time: 12
< server: envoy
< 
{
  "code": 203,
  "info": "org.springframework.web.client.HttpServerErrorException: 500 INTERNAL SERVER ERROR",
  "reason": "Microservice error",
  "status": "FAILURE"
* Connection #0 to host 192.168.13.50 left intact

From the model pod's logs, I can see it hits an internal error that leads to the 500 Internal Server Error; but in the 404 case the pod doesn't even seem to receive the request before the 404 Not Found comes back.

2019-04-18 08:20:36,972 - werkzeug:_log:88 - INFO:  127.0.0.1 - - [18/Apr/2019 08:20:36] "POST /predict HTTP/1.1" 500 -
2019-04-18 08:20:45,882 - flask.app:log_exception:1761 - ERROR:  Exception on /predict [POST]
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/flask/app.py", line 2292, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python2.7/site-packages/flask/app.py", line 1815, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python2.7/site-packages/flask_cors/extension.py", line 161, in wrapped_function
    return cors_after_request(app.make_response(f(*args, **kwargs)))
  File "/usr/local/lib/python2.7/site-packages/flask/app.py", line 1718, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/lib/python2.7/site-packages/flask/app.py", line 1813, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python2.7/site-packages/flask/app.py", line 1799, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/usr/local/lib/python2.7/site-packages/seldon_core/model_microservice.py", line 81, in Predict
    predictions = predict(user_model, features, names)
  File "/usr/local/lib/python2.7/site-packages/seldon_core/model_microservice.py", line 33, in predict
    return user_model.predict(features, feature_names)
  File "/microservice/DeepMnist.py", line 18, in predict
    predictions = self.sess.run(self.y,feed_dict={self.x:X})
  File "/usr/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 767, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 944, in _run
    % (np_val.shape, subfeed_t.name, str(subfeed_t.get_shape())))
ValueError: Cannot feed value of shape (2, 2) for Tensor u'x:0', which has shape '(?, 784)'
2019-04-18 08:20:45,883 - werkzeug:_log:88 - INFO:  127.0.0.1 - - [18/Apr/2019 08:20:45] "POST /predict HTTP/1.1" 500 -

@ukclivecox
Contributor

Looks like you are sending a tensor of the wrong size:

ValueError: Cannot feed value of shape (2, 2) for Tensor u'x:0', which has shape '(?, 784)'
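
For this DeepMnist model each row needs 784 values, so a request along these lines should avoid the 500 (a rough sketch - host, port and path are copied from your curl output; the zero vector is only a placeholder, not a real MNIST image):

# Build a (1, 784) tensor of zeros and post it to the same endpoint.
PAYLOAD=$(python -c 'import json; print(json.dumps({"data": {"tensor": {"shape": [1, 784], "values": [0.0]*784}}}))')
curl -s http://192.168.13.50:32300/seldon/admin-1-10/api/v0.1/predictions \
     -H "Content-Type: application/json" -d "$PAYLOAD"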

@dyoung23
Author

But would a wrong request result in a 404?

@ukclivecox
Contributor

The request resulted in a 500 because the Python code failed to predict with that input. A 404 would mean the URL points at a path that doesn't exist.

@dyoung23
Author

But I'm requesting the same URL and get 500 or 404 at random.
I found the tensor size is wrong, which explains the 500; I just don't know why I sometimes get a 404.

@ukclivecox
Contributor

Yes, a 404 is strange for the same path. A 404 could only happen if the model were removed, so that the Ambassador path no longer exists.
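
One way to check that while the 404s are happening (a rough sketch - the namespace is taken from the events above, and the Ambassador deployment name and namespace assume a default install):

# Is the SeldonDeployment (and therefore its Ambassador route) still present?
kubectl -n kf get seldondeployments
# Ambassador's admin port (8877) lists the route prefixes it currently knows about.
kubectl port-forward deployment/ambassador 8877:8877 &
curl -s 'http://localhost:8877/ambassador/v0/diag/?json=true' | grep -o '/seldon/[^"]*' | sort -u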

@dyoung23
Author

Yeah, it is. So it's really strange that I sometimes get the 500 and sometimes the 404. Do you have any ideas on how to track down this problem?

@ukclivecox
Contributor

I would solve the 500 first so that you are always sending correct payloads.
Then it would be good to check whether anything is changing in the namespace where this is running - are any evictions still happening, for example?
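
For example (a rough sketch - the namespace is taken from the pod name in the events above):

# Any recent evictions or kills in the namespace that line up with the 404s?
kubectl -n kf get events --sort-by=.lastTimestamp | tail -20
kubectl -n kf get pods | grep -c Evicted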

@ukclivecox
Contributor

Assuming this is fixed. Please reopen if the issue persists on the latest Seldon (0.4.0).
