Issue in "seldon-container-engine" with MLFLOW_SERVER #1922

Closed
Nithinbs18 opened this issue Jun 8, 2020 · 24 comments

@Nithinbs18

Dear team,

Greetings!!
I have been trying to deploy a model locally on my laptop using the MLFLOW server, and I created the appropriate credentials as described in Prepackaged Model Servers --> Handling credentials --> Create a secret containing the environment variables (https://docs.seldon.io/projects/seldon-core/en/v1.1.0/servers/overview.html#create-a-secret-containing-the-environment-variables).
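For reference, my secret is shaped roughly like the sketch below (all values are placeholders; the variable names are the ones I took from that docs page, pointing at my MinIO/S3 endpoint):

apiVersion: v1
kind: Secret
metadata:
  name: seldon-init-container-secret
type: Opaque
stringData:
  AWS_ACCESS_KEY_ID: minioadmin          # placeholder credentials
  AWS_SECRET_ACCESS_KEY: minioadmin      # placeholder credentials
  AWS_ENDPOINT_URL: http://minio.minio-system.svc.cluster.local:9000   # placeholder endpoint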
Here is my deployment YAML:

apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
  name: model-a
spec:
  name: model-a
  predictors:
  - graph:
      children: []
      implementation: MLFLOW_SERVER
      modelUri: s3://seldon/
      envSecretRefName: seldon-init-container-secret
      name: classifier
    name: default
    replicas: 1

It always ends up with an error in the initContainer, i.e. "classifier-model-initializer", with the error below:

Traceback (most recent call last):
  File "/storage-initializer/scripts/initializer-entrypoint", line 14, in <module>
    kfserving.Storage.download(src_uri, dest_path)
  File "/usr/local/lib/python3.7/site-packages/kfserving/storage.py", line 50, in download
    Storage._download_s3(uri, out_dir)
  File "/usr/local/lib/python3.7/site-packages/kfserving/storage.py", line 65, in _download_s3
    client = Storage._create_minio_client()
  File "/usr/local/lib/python3.7/site-packages/kfserving/storage.py", line 217, in _create_minio_client
    secure=use_ssl)
  File "/usr/local/lib/python3.7/site-packages/minio/api.py", line 150, in __init__
    is_valid_endpoint(endpoint)
  File "/usr/local/lib/python3.7/site-packages/minio/helpers.py", line 301, in is_valid_endpoint
    if hostname[-1] == '.':
IndexError: string index out of range

Could you please suggest anything that might be of help to me with this issue?
Thank you very much in advance.

Regards,
Nithin Bhardwaj

Nithinbs18 added the bug and triage labels on Jun 8, 2020
@Nithinbs18

Hello,

I figured it out: I had an extra quote in the secret I created, my bad (the stray quote broke the S3 endpoint value, hence the IndexError during endpoint validation). The initContainer classifier-model-initializer and the classifier now execute successfully. However, I now have a new issue in the "seldon-container-engine":

{"level":"info","ts":1591611950.8325408,"logger":"SeldonRestApi","msg":"Listening","Address":"0.0.0.0:8000"}
{"level":"error","ts":1591611965.4901059,"logger":"SeldonRestApi","msg":"Ready check failed","error":"dial tcp 127.0.0.1:9000: connect: connection refused","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/zapr@v0.1.0/zapr.go:128\ngithub.com/seldonio/seldon-core/executor/api/rest.(*SeldonRestApi).checkReady\n\t/workspace/api/rest/server.go:198\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2007\ngithub.com/seldonio/seldon-core/executor/api/rest.(*CloudeventHeaderMiddleware).Middleware.func1\n\t/workspace/api/rest/server.go:176\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2007\ngithub.com/seldonio/seldon-core/executor/api/rest.puidHeader.func1\n\t/workspace/api/rest/server.go:191\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2007\ngithub.com/gorilla/mux.(*Router).ServeHTTP\n\t/go/pkg/mod/github.com/gorilla/mux@v1.7.3/mux.go:212\nnet/http.serverHandler.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2802\nnet/http.(*conn).serve\n\t/usr/local/go/src/net/http/server.go:1890"}

I followed the MLFLOW_SERVER guide exactly; can anybody tell me what this issue is, please?
Thank you very much in advance.

Regards,
Nithin

Nithinbs18 changed the title from "Unable to download content from remote Minio bucket with MLFLOW_SERVER" to "Issue in "seldon-container-engine" with MLFLOW_SERVER" on Jun 8, 2020
@adriangonz

@Nithinbs18 those logs seem to suggest that the classifier node is not coming up (and thus the orchestrator / seldon-container-engine is not able to reach it).

Have you checked if there is anything in the classifier logs?

Something worth mentioning is that, if the environment is too large, creating it from scratch may take longer than the readiness / liveness probes allow. This is a problem with how the MLFLOW_SERVER creates the environment dynamically on pod startup. You can find more details and some potential solutions here: https://docs.seldon.io/projects/seldon-core/en/latest/servers/mlflow.html#conda-environment-creation

adriangonz self-assigned this on Jun 9, 2020
adriangonz removed the triage label on Jun 9, 2020
@Nithinbs18

Hi @adriangonz ,

Thank you for your response.
You are right, the issue is with the classifier: I use a proxy in my network and it is not able to create the Conda environment. Is there any way that I can set env variables via the deployment YAML?

Thank you very much in advance.

@adriangonz

You can use the componentSpecs key to override anything in your pod / container specs. For example, on the SeldonDeployment you shared above you could do:

apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
  name: model-a
spec:
  name: model-a
  predictors:
  - componentSpecs:
    - spec:
        containers:
        - env:
          - name: FOO
            value: bar
          # Note that the name matches your graph node's name
          name: classifier
    graph:
      children: []
      implementation: MLFLOW_SERVER
      modelUri: s3://seldon/
      envSecretRefName: seldon-init-container-secret
      name: classifier
    name: default
    replicas: 1
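In your case the entries would presumably be your proxy settings rather than FOO, so that Conda can reach its channels through the proxy (the proxy address below is just a placeholder):

        containers:
        - name: classifier
          env:
          - name: HTTP_PROXY
            value: http://proxy.example.com:3128   # placeholder proxy address
          - name: HTTPS_PROXY
            value: http://proxy.example.com:3128
          - name: NO_PROXY
            value: localhost,127.0.0.1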

@Nithinbs18

Hi @adriangonz ,

Greetings!!
I tried this approach but I am still unable to set the appropriate ENV variables; could you please share any documents that might help? I searched the official documentation but was not able to find anything concrete.
Thank you.

BR,
Nithin

@adriangonz

Hey @Nithinbs18 , is the problem that you aren’t able to set the environment variable? Or is it that the proxy is still blocking access to Conda after setting it?

@Nithinbs18

Hey @adriangonz ,
As a workaround I am using a PodPreset for the ENV variables now, but I realised it's a very bad approach since the labels are dynamically assigned every time.
Back to the problem: no, I could not set the ENV variables with the below YAML:

apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
  name: mlflow
spec:
  name: mlflow
  componentSpecs:
    - spec:
        containers:
          - env:
              - name: FOO
                value: bar
            name: classifier
  predictors:
    - graph:
        children: []
        implementation: MLFLOW_SERVER
        modelUri: s3://seldon
        envSecretRefName: seldon-init-container-secret
        name: classifier
      name: default
      replicas: 1

Also, a new issue I currently face: while the Conda environment is being created, the classifier container keeps crashing and the downloads start all over again and again.
Does this have anything to do with my hardware resources, as I am running the tests on Docker for Desktop?
Thank you very much again for your time.

Regards,
Nithin

@adriangonz

@Nithinbs18 I believe that the problem may be related to what's described in here: https://docs.seldon.io/projects/seldon-core/en/latest/servers/mlflow.html#conda-environment-creation

The pre-packaged MLFLOW_SERVER can be used to download any model with any environment. Thus, the environment needs to be created dynamically for each different model. The inference server only knows what this environment is during runtime, therefore it has to create it during start up.

This slows down start-up very aggressively, which can be a blocker when you take into account that Kubernetes has a timeout limit for pods to come up. If the pod exceeds this timeout (e.g. because it's creating the Conda environment), Kubernetes will kill it.

The immediate solution is to build your own inference server specifying your Conda environment at image build time. In other words, creating your own re-usable inference server, with your particular dependencies pre-installed. You can find more info on that here: https://docs.seldon.io/projects/seldon-core/en/latest/servers/custom.html

Alternatively, you can also increase the timeouts for Kubernetes' liveness and readiness probes, thus giving more time to your pod to create the environment. You can find an example here on how those can be specified on your SeldonDeployment: https://docs.seldon.io/projects/seldon-core/en/latest/python/python_component.html?highlight=liveness#rest-health-endpoint

@Nithinbs18

Hi @adriangonz,

As suggested, I increased the timeouts and it works absolutely fine now. I could also set the ENV variables using the format below:

apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
  name: test
spec:
  name: test
  predictors:
    - graph:
        children: []
        implementation: MLFLOW_SERVER
        modelUri: s3://seldon
        envSecretRefName: seldon-init-container-secret
        name: classifier
      name: default
      labels:
        test-nithin: nithin
      replicas: 1
      componentSpecs:
       - spec:
          containers:
          - name: classifier
            env:
              - name: FOO
                value: bar
            livenessProbe:
                failureThreshold: 3
                initialDelaySeconds: 150
                periodSeconds: 5
                successThreshold: 1
                tcpSocket:
                  port: http
                timeoutSeconds: 1

The deployment works fine now, but I have issues accessing it: I cannot make any predictions against my deployment. I have installed everything, i.e. seldon-core-operator, ambassador and the model deployment, in the default namespace. Below are the errors I get when I try to access the model:

ACCESS [2020-06-21T19:53:34.391Z] "POST /seldon/test/api/v0.1/predictions HTTP/1.1" 404 NR 0 0 0 - "10.1.1.214" "HTTPie/0.9.8" "8f859f30-e6a9-4bc6-9294-6154238bc9f0" "localhost:8003" "-"
2020-06-21 19:53:56 diagd 1.5.3 [P62TThreadPoolExecutor-0_2] INFO: 0AC9E726-8854-44AC-8DE7-61FDF37BE8F4: 127.0.0.1 "GET /ambassador/v0/diag/" 15ms 200 success
time="2020-06-21 19:53:56" level=error msg="Bad HTTP response" func=github.com/datawire/apro/cmd/amb-sidecar/devportal/server.HTTPGet.func1 file="github.com/datawire/apro@/cmd/amb-sidecar/devportal/server/fetcher.go:165" status_code=404 subsystem=fetcher url="https://127.0.0.1:8443/seldon/default/test/.ambassador-internal/openapi-docs"
time="2020-06-21 19:53:56" level=error msg="HTTP error 404 from https://127.0.0.1:8443/seldon/default/test/.ambassador-internal/openapi-docs" func=github.com/datawire/apro/cmd/amb-sidecar/devportal/server.HTTPGet file="github.com/datawire/apro@/cmd/amb-sidecar/devportal/server/fetcher.go:172" subsystem=fetcher url="https://127.0.0.1:8443/seldon/default/test/.ambassador-internal/openapi-docs"
time="2020-06-21 19:53:56" level=error msg="Bad HTTP response" func=github.com/datawire/apro/cmd/amb-sidecar/devportal/server.HTTPGet.func1 file="github.com/datawire/apro@/cmd/amb-sidecar/devportal/server/fetcher.go:165" status_code=404 subsystem=fetcher url="https://127.0.0.1:8443/%28seldon.protos.%2A%7Ctensorflow.serving.%2A%29/.%2A/.ambassador-internal/openapi-docs"
time="2020-06-21 19:53:56" level=error msg="HTTP error 404 from https://127.0.0.1:8443/%28seldon.protos.%2A%7Ctensorflow.serving.%2A%29/.%2A/.ambassador-internal/openapi-docs" func=github.com/datawire/apro/cmd/amb-sidecar/devportal/server.HTTPGet file="github.com/datawire/apro@/cmd/amb-sidecar/devportal/server/fetcher.go:172" subsystem=fetcher url="https://127.0.0.1:8443/%28seldon.protos.%2A%7Ctensorflow.serving.%2A%29/.%2A/.ambassador-internal/openapi-docs"
ACCESS [2020-06-21T19:53:56.546Z] "GET /seldon/default/test/.ambassador-internal/openapi-docs HTTP/1.1" 404 - 0 19 1 0 "10.1.1.214" "Go-http-client/1.1" "aa91c271-3f39-4edf-93e3-d40c21e16232" "127.0.0.1:8443" "10.106.2.53:8000"
ACCESS [2020-06-21T19:53:56.550Z] "GET /%28seldon.protos.%2A%7Ctensorflow.serving.%2A%29/.%2A/.ambassador-internal/openapi-docs HTTP/1.1" 404 NR 0 0 0 - "10.1.1.214" "Go-http-client/1.1" "3d96ed75-5eef-4f1d-a4d8-0ee5b0e7ad0e" "127.0.0.1:8443" "-"

Could you please advise on this? I have been stuck on it for some time now.
Thank you very much in advance.

@adriangonz

adriangonz commented Jun 22, 2020

Hey @Nithinbs18 ,

The inference URL format is something like:

POST /seldon/<namespace>/<model-name>/api/v1.0/predictions

Therefore, if your model is named test and is running on the default namespace, the URL to use should look like:

POST /seldon/default/test/api/v1.0/predictions

Could you try that one and see if it works? You can read more details on how to test your inference endpoints in the docs: https://docs.seldon.io/projects/seldon-core/en/latest/workflow/serving.html

@Nithinbs18

Hi @adriangonz ,

Thank you very much for your response, but I have no luck yet :(
My resources:

PS C:\Users\Nithin> kubectl get sdep -n default
NAME   AGE
test   17h
PS C:\Users\Nithin> kubectl get pods -n default
NAME                                         READY   STATUS    RESTARTS   AGE
ambassador-654d65d7f8-5gt64                  1/1     Running   7          18h
ambassador-654d65d7f8-5t6g9                  1/1     Running   7          18h
ambassador-654d65d7f8-qfq84                  1/1     Running   7          18h
ambassador-redis-8556cbb4c6-x5cnv            1/1     Running   1          18h
seldon-controller-manager-6d75c6b8-95wmv     1/1     Running   5          11d
test-default-0-classifier-854699785b-bd76h   2/2     Running   1          17h
PS C:\Users\Nithin> kubectl port-forward $(kubectl get pods -l app.kubernetes.io/name=ambassador -o jsonpath='{.items[0].metadata.name}') 8003:8080
Forwarding from 127.0.0.1:8003 -> 8080
Forwarding from [::1]:8003 -> 8080
Handling connection for 8003

Curl request and response:

(base) root@Nithin:/home/nithin# curl -v -X POST \
>   http://localhost:8003/seldon/default/test/api/v1.0/predictions \
>   -H 'content-type: application/json' \
>   -d '{"data":{"names": ["fixed acidity","volatile acidity","citric acid","residual sugar","chlorides","free sulfur dioxide","total sulfur dioxide","density","pH","sulphates","alcohol"], "ndarray": [[7,0.27,0.36,20.7,0.045,45,170,1.001,3,0.45,8.8]]}}'
Note: Unnecessary use of -X or --request, POST is already inferred.
*   Trying 127.0.0.1:8003...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 8003 (#0)
> POST /seldon/default/test/api/v1.0/predictions HTTP/1.1
> Host: localhost:8003
> User-Agent: curl/7.68.0
> Accept: */*
> content-type: application/json
> Content-Length: 244
>
* upload completely sent off: 244 out of 244 bytes
* Mark bundle as not supporting multiuse
< HTTP/1.1 301 Moved Permanently
< location: https://localhost:8003/seldon/default/test/api/v1.0/predictions
< date: Mon, 22 Jun 2020 10:59:47 GMT
< server: envoy
< connection: close
< content-length: 0
<
* Closing connection 0

Ambassador logs:

ACCESS [2020-06-22T11:05:03.704Z] "POST /seldon/default/test/api/v1.0/predictions HTTP/1.1" 301 - 0 0 0 - "10.1.1.231" "curl/7.68.0" "6190dc4c-51b8-4b90-b61d-c843ca51b82e" "localhost:8003" "-"
ACCESS [2020-06-22T11:05:09.320Z] "POST /seldon/default/test/api/v1.0/predictions HTTP/1.1" 301 - 0 0 0 - "10.1.1.231" "curl/7.68.0" "856dcaf1-cccc-4111-afd1-507e1364269c" "localhost:8003" "-"

I followed the guide exactly, but still no luck.

@Nithinbs18

Hi @adriangonz,

I also tried using the Seldon client, i.e.:

import pandas as pd
import numpy as np
from seldon_core.seldon_client import SeldonClient

sc = SeldonClient(gateway="ambassador",namespace="default",deployment_name='test')
df = pd.read_csv("./wine-quality.csv")
def _get_reward(y, y_pred):
    if y == y_pred:
        return 500
    return 1 / np.square(y - y_pred)

def _test_row(row):
    input_features = row[:-1]
    feature_names = input_features.index.to_list()
    X = input_features.values.reshape(1, -1)
    y = row[-1].reshape(1, -1)
    r = sc.predict(
        data=X,
        names=feature_names)

    y_pred = r.response.data.tensor.values
    reward = _get_reward(y, y_pred)
    sc.feedback(
        prediction_request=r.request,
        prediction_response=r.response,
        reward=reward)
    return reward[0]

df.apply(_test_row, axis=1)

I got the error below:

 raise SSLError(e, request=request)
requests.exceptions.SSLError: (MaxRetryError("HTTPSConnectionPool(host='localhost', port=8003): Max retries exceeded with url: /seldon/default/test/api/v1.0/predictions (Caused by SSLError(SSLError(1, '[SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:1108)')))"), 'occurred at index 0')

@adriangonz

Your Ambassador ingress seems to be redirecting the request to an SSL endpoint. You can see on the logs from curl that it's returning a 301 response with Location set to:

https://localhost:8003/seldon/default/test/api/v1.0/predictions

I'm not sure what could be causing this. Is there any chance you've installed Ambassador in a different way or that you've tweaked any setting? It may also have to do with how your environment is set up.

@adriangonz

@Nithinbs18 have you been able to identify why your ingress layer is redirecting to an SSL endpoint?

I will be closing this for now, since the original issue seems to have been resolved. Please re-open if that's still a problem.

@Nithinbs18

Hi @adriangonz ,
Good morning!
Yes, after a lot of debugging I got it working a few hours ago. I had followed the official guide and installed the Ambassador Edge Stack, and that was the issue: I removed it, reinstalled the stable version from the helm chart repo, and it started working.

Thank you very much for your time, it means a lot.
Regards,
Nithin

@adriangonz

That's amazing! I'm really glad to hear that @Nithinbs18!

It would actually be really useful if you could share your learnings about Ambassador in #2007, where we are exploring adding support for Ambassador's Edge Stack.

@Utkagr

Utkagr commented Jul 8, 2020

Hi @adriangonz @Nithinbs18, I'm facing the same issue, but it looks like the conda environment is being set up properly.

Here's the issue.

{"level":"error","ts":1594187819.2373793,"logger":"SeldonRestApi","msg":"Ready check failed","error":"dial tcp 127.0.0.1:9000: connect: connection refused","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/zapr@v0.1.0/zapr.go:128\ngithub.com/seldonio/seldon-core/executor/api/rest.(*SeldonRestApi).checkReady\n\t/workspace/api/rest/server.go:164\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2007\ngithub.com/seldonio/seldon-core/executor/api/rest.handleCORSRequests.func1\n\t/workspace/api/rest/middlewares.go:64\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2007\ngithub.com/gorilla/mux.CORSMethodMiddleware.func1.1\n\t/go/pkg/mod/github.com/gorilla/mux@v1.7.4/middleware.go:51\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2007\ngithub.com/seldonio/seldon-core/executor/api/rest.xssMiddleware.func1\n\t/workspace/api/rest/middlewares.go:87\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2007\ngithub.com/seldonio/seldon-core/executor/api/rest.(*CloudeventHeaderMiddleware).Middleware.func1\n\t/workspace/api/rest/middlewares.go:47\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2007\ngithub.com/seldonio/seldon-core/executor/api/rest.puidHeader.func1\n\t/workspace/api/rest/middlewares.go:79\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2007\ngithub.com/gorilla/mux.(*Router).ServeHTTP\n\t/go/pkg/mod/github.com/gorilla/mux@v1.7.4/mux.go:210\nnet/http.serverHandler.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2802\nnet/http.(*conn).serve\n\t/usr/local/go/src/net/http/server.go:1890"}

Here are the logs for the classifier container:

...
Activating Conda environment 'mlflow'
starting microservice
2020-07-08 05:58:16,207 - seldon_core.microservice:main:205 - INFO:  Starting microservice.py:main
2020-07-08 05:58:16,207 - seldon_core.microservice:main:206 - INFO:  Seldon Core version: 1.2.1
2020-07-08 05:58:16,208 - seldon_core.microservice:main:268 - INFO:  Parse JAEGER_EXTRA_TAGS []
2020-07-08 05:58:16,209 - seldon_core.microservice:load_annotations:129 - INFO:  Found annotation kubernetes.io/config.seen:2020-07-08T05:52:42.059133935Z
2020-07-08 05:58:16,209 - seldon_core.microservice:load_annotations:129 - INFO:  Found annotation kubernetes.io/config.source:api
2020-07-08 05:58:16,209 - seldon_core.microservice:load_annotations:129 - INFO:  Found annotation prometheus.io/path:/prometheus
2020-07-08 05:58:16,209 - seldon_core.microservice:load_annotations:129 - INFO:  Found annotation prometheus.io/scrape:true
2020-07-08 05:58:16,209 - seldon_core.microservice:main:279 - INFO:  Annotations: {'kubernetes.io/config.seen': '2020-07-08T05:52:42.059133935Z', 'kubernetes.io/config.source': 'api', 'prometheus.io/path': '/prometheus', 'prometheus.io/scrape': 'true'}
2020-07-08 05:58:16,209 - seldon_core.microservice:main:283 - INFO:  Importing MLFlowServer
2020-07-08 05:58:16,464 - root:__init__:19 - INFO:  Creating MLFLow server with URI /mnt/models
2020-07-08 05:58:16,478 - seldon_core.microservice:main:362 - INFO:  REST microservice running on port 9000 single-threaded=0
2020-07-08 05:58:16,478 - seldon_core.microservice:main:410 - INFO:  REST metrics microservice running on port 6000
2020-07-08 05:58:16,479 - seldon_core.microservice:main:420 - INFO:  Starting servers
 * Serving Flask app "seldon_core.wrapper" (lazy loading)
 * Environment: production
   WARNING: This is a development server. Do not use it in a production deployment.
   Use a production WSGI server instead.
 * Debug mode: off
2020-07-08 05:58:16,493 - werkzeug:_log:113 - INFO:   * Running on http://0.0.0.0:6000/ (Press CTRL+C to quit)
/microservice/MLFlowServer.py:46: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  return yaml.load(f.read())
2020-07-08 05:58:16,500 - root:load:24 - INFO:  Downloading model from /mnt/models
2020-07-08 05:58:16,500 - root:download:44 - INFO:  Copying contents of /mnt/models to local
 * Serving Flask app "seldon_core.wrapper" (lazy loading)
 * Environment: production
   WARNING: This is a development server. Do not use it in a production deployment.
   Use a production WSGI server instead.
 * Debug mode: off
2020-07-08 05:58:16,790 - werkzeug:_log:113 - INFO:   * Running on http://0.0.0.0:9000/ (Press CTRL+C to quit)

These logs suggest that the conda env is being created properly, right?

When I use the Python client to send a request, I get this in the ambassador logs:

ACCESS [2020-07-08T05:59:28.162Z] "POST /seldon/default/mlflow-default-0-classifier/api/v1.0/predictions HTTP/1.1" 404 NR 0 0 0 - "172.17.0.6" "python-requests/2.24.0" "f4cbdf89-c695-4581-9df8-fb2f49d5e66a" "localhost:8003" "-"

Python client

from seldon_core.seldon_client import SeldonClient
sc = SeldonClient(deployment_name="mlflow-default-0-classifier",namespace="default")

r = sc.predict(gateway="ambassador",transport="rest",shape=(1,11))
print(r)
assert(r.success==True)

returns

Success:False message:404:Not Found
Request:
meta {
}
data {
  tensor {
    shape: 1
    shape: 11
    values: 0.9713260595640437
    values: 0.7596965188754176
    values: 0.8646927959022535
    values: 0.8383797034554457
    values: 0.6236100927150203
    values: 0.11923213987864845
    values: 0.9632190560758449
    values: 0.3112095637650232
    values: 0.4377568452403575
    values: 0.5467027394131142
    values: 0.024684886012952045
  }
}

Response:
None
Traceback (most recent call last):
  File "wines_client.py", line 9, in <module>
    assert(r.success==True)

All pods are in a running state and I have port-forwarded 8003->8080 as well. Can you throw some light on what I could be doing wrong?

@Nithinbs18

Hi @Utkagr

Try with:

sc = SeldonClient(deployment_name="mlflow", namespace="default")

or, with curl, use /seldon/default/mlflow/api/v1.0/predictions as the endpoint.

It should work; if it still does not, please share the k8s manifest you used to create the sdep.

@Utkagr

Utkagr commented Jul 8, 2020

@Nithinbs18 Thanks for helping me out. It worked!

But I'm not sure why?

> kubectl get deployments
NAME                          READY   UP-TO-DATE   AVAILABLE   AGE
ambassador                    3/3     3            3           97m
mlflow-default-0-classifier   1/1     1            1           52m

The Deployment above is named mlflow-default-0-classifier, while my SeldonDeployment manifest is:

apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
  name: mlflow
spec:
  name: wines
  predictors:
    - graph:
        children: []
        implementation: MLFLOW_SERVER
        modelUri: gs://seldon-models/mlflow/elasticnet_wine
        name: classifier
      name: default
      replicas: 1

Is it because the metadata name in the manifest is just mlflow?

@Nithinbs18

Hi @Utkagr

It's simple: Deployment != SeldonDeployment.
You are creating a SeldonDeployment, not a plain K8s Deployment, so the name to use with the client is the SeldonDeployment's name (mlflow), not the generated Deployment's name.
Try using kubectl get sdep.

@Utkagr

Utkagr commented Jul 8, 2020

Got it! Thanks a lot for helping out.

@akshay2490

Hi @adriangonz @Nithinbs18 ,

I am also facing the same issue with Seldon (on an AWS EKS cluster) + Ambassador when deploying a model using the MLFLOW server.
My deployment file is as follows:

apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
  name: mlflow
spec:
  predictors:
  - graph:
      children: []
      endpoint:
        type: REST
      implementation: MLFLOW_SERVER
      modelUri: s3://model_uri
      envSecretRefName: seldon-init-container-secret
      name: tradebot
    name: rf
    traffic: 100
    componentSpecs:
    - spec:
        containers:
        - name: tradebot
          livenessProbe:
                failureThreshold: 3
                initialDelaySeconds: 150
                periodSeconds: 5
                successThreshold: 1
                tcpSocket:
                  port: http
                timeoutSeconds: 1
          readinessProbe:
                failureThreshold: 6
                initialDelaySeconds: 150
                periodSeconds: 5
                successThreshold: 1
                tcpSocket:
                  port: http
                timeoutSeconds: 1
          resources:
            requests:
              memory: "4Gi"
              cpu: 1
            limits:
              memory: "8Gi"
              cpu: 4

My pods are running fine:

NAME                                       READY   STATUS      RESTARTS   AGE
pod/ambassador-869646876-5rl7m             1/1     Running     0          11m
pod/ambassador-869646876-6sjbb             1/1     Running     0          11m
pod/ambassador-869646876-72ctq             1/1     Running     0          11m
pod/ambassador-agent-f5579c74b-bkhqw       1/1     Running     0          11m
pod/ambassador-crd-cleanup-ghznn           0/1     Completed   0          7m35s
pod/mlflow-rf-0-tradebot-d74fc7ffb-9gw7x   2/2     Running     1          63m
NAME                                   READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/ambassador             3/3     3            3           11m
deployment.apps/ambassador-agent       1/1     1            1           11m
deployment.apps/mlflow-rf-0-tradebot   1/1     1            1           63m
NAME                                             DESIRED   CURRENT   READY   AGE
replicaset.apps/ambassador-869646876             3         3         3       11m
replicaset.apps/ambassador-agent-f5579c74b       1         1         1       11m
replicaset.apps/mlflow-rf-0-tradebot-d74fc7ffb   1         1         1       63m

My mlflow pod logs tail is as follows:

[2021-05-13 14:14:57 +0000] [2001] [INFO] Booting worker with pid: 2001
2021-05-13 14:14:57,738 - seldon_core.gunicorn_utils:load:88 - INFO:  Tracing branch is active
2021-05-13 14:14:57,749 - seldon_core.utils:setup_tracing:724 - INFO:  Initializing tracing
2021-05-13 14:14:57,837 - seldon_core.utils:setup_tracing:731 - INFO:  Using default tracing config
2021-05-13 14:14:57,837 - jaeger_tracing:_create_local_agent_channel:446 - INFO:  Initializing Jaeger Tracer with UDP reporter
2021-05-13 14:14:57,842 - jaeger_tracing:new_tracer:384 - INFO:  Using sampler ConstSampler(True)
2021-05-13 14:14:57,843 - jaeger_tracing:_initialize_global_tracer:435 - INFO:  opentracing.tracer initialized to <jaeger_client.tracer.Tracer object at 0x7feeea728160>[app_name=MLFlowServer]
2021-05-13 14:14:57,843 - seldon_core.gunicorn_utils:load:93 - INFO:  Set JAEGER_EXTRA_TAGS []
2021-05-13 14:14:57,844 - root:load:28 - INFO:  Downloading model from /mnt/models
2021-05-13 14:14:57,844 - root:download:31 - INFO:  Copying contents of /mnt/models to local
[2021-05-13 14:15:37 +0000] [2003] [INFO] Booting worker with pid: 2003
2021-05-13 14:15:37,181 - seldon_core.gunicorn_utils:load:88 - INFO:  Tracing branch is active
2021-05-13 14:15:37,191 - seldon_core.utils:setup_tracing:724 - INFO:  Initializing tracing
2021-05-13 14:15:37,285 - seldon_core.utils:setup_tracing:731 - INFO:  Using default tracing config
2021-05-13 14:15:37,285 - jaeger_tracing:_create_local_agent_channel:446 - INFO:  Initializing Jaeger Tracer with UDP reporter
2021-05-13 14:15:37,290 - jaeger_tracing:new_tracer:384 - INFO:  Using sampler ConstSampler(True)
2021-05-13 14:15:37,291 - jaeger_tracing:_initialize_global_tracer:435 - INFO:  opentracing.tracer initialized to <jaeger_client.tracer.Tracer object at 0x7feeea728160>[app_name=MLFlowServer]
2021-05-13 14:15:37,291 - seldon_core.gunicorn_utils:load:93 - INFO:  Set JAEGER_EXTRA_TAGS []
2021-05-13 14:15:37,292 - root:load:28 - INFO:  Downloading model from /mnt/models
2021-05-13 14:15:37,292 - root:download:31 - INFO:  Copying contents of /mnt/models to local

I installed the Ambassador API gateway with the link:

Got the service URL / load-balancer URL with:
export SERVICE_IP=$(kubectl get svc --namespace seldon ambassador -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')

Running the command kubectl logs mlflow-rf-0-tradebot-d74fc7ffb-9gw7x -c seldon-container-engine gives the following logs for the seldon-container-engine:

{"level":"info","ts":1620910834.7631397,"logger":"SeldonRestApi","msg":"Listening","Address":"0.0.0.0:8000"}
{"level":"error","ts":1620910854.4270034,"logger":"SeldonRestApi","msg":"Ready check failed","error":"dial tcp 127.0.0.1:9000: connect: connection refused","stacktrace":"github.com/seldonio/seldon-core/executor/api/rest.(*SeldonRestApi).checkReady\n\t/workspace/api/rest/server.go:188\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2036\ngithub.com/seldonio/seldon-core/executor/api/rest.handleCORSRequests.func1\n\t/workspace/api/rest/middlewares.go:64\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2036\ngithub.com/gorilla/mux.CORSMethodMiddleware.func1.1\n\t/go/pkg/mod/github.com/gorilla/mux@v1.8.0/middleware.go:51\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2036\ngithub.com/seldonio/seldon-core/executor/api/rest.xssMiddleware.func1\n\t/workspace/api/rest/middlewares.go:87\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2036\ngithub.com/seldonio/seldon-core/executor/api/rest.(*CloudeventHeaderMiddleware).Middleware.func1\n\t/workspace/api/rest/middlewares.go:47\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2036\ngithub.com/seldonio/seldon-core/executor/api/rest.puidHeader.func1\n\t/workspace/api/rest/middlewares.go:79\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2036\ngithub.com/gorilla/mux.(*Router).ServeHTTP\n\t/go/pkg/mod/github.com/gorilla/mux@v1.8.0/mux.go:210\nnet/http.serverHandler.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2831\nnet/http.(*conn).serve\n\t/usr/local/go/src/net/http/server.go:1919"}

I am able to access the ambassador diagnostic dashboard at <load-balancer-external-ip/ambassador/v0/diag/> and the swagger seldon ui at <load-balancer-external-ip/seldon/seldon/mlflow/api/v1.0/doc/>.

But the prediction API returns "ERROR 504 Gateway Timeout" with response body "upstream request timeout".
I was able to get my predictions once, but after that I have been unable to reproduce it using the above settings.

Any help would be welcome,
thanks in advance

@Nithinbs18

Hi @akshay2490

Can you increase the "initialDelaySeconds" value to 450, and try once again?
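i.e. something roughly like this under your tradebot container (the same probe fields you already have; only the initialDelaySeconds values change, so that Conda has enough time to create the environment):

          livenessProbe:
            initialDelaySeconds: 450   # was 150; give Conda time to create the env
            periodSeconds: 5
            failureThreshold: 3
            successThreshold: 1
            timeoutSeconds: 1
            tcpSocket:
              port: http
          readinessProbe:
            initialDelaySeconds: 450   # was 150
            periodSeconds: 5
            failureThreshold: 6
            successThreshold: 1
            timeoutSeconds: 1
            tcpSocket:
              port: http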

@akshay2490

Hi @Nithinbs18,

No luck.

[screenshot attached: 2021-05-13 22-15-37]
