routing metadata missing in new executor #1823

Closed
RafalSkolasinski opened this issue May 13, 2020 · 15 comments · Fixed by #2723

Comments

@RafalSkolasinski
Contributor

Reported on the community Slack by a user.

Because the new orchestrator (the executor) does not inspect the payloads passed around the graph, it currently does not add the routing metadata, e.g.

"meta" : {
    "puid" : "...",
    "tags" : {},
    "routing" : {
        "ab_test" : 1,
        "combiner1" : -1,
        "mypostprocess1" : -1,
        "router1" : 1
    }
}

This may be important in the context of multi-armed bandits and A/B testing.
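
For anyone consuming this block downstream, here is a minimal sketch of reading it on the client side. It assumes the response has already been parsed into a Python dict shaped like the example above, and that a value of -1 marks a node that did not pick a single branch. `chosen_routes` is a hypothetical helper, not part of Seldon Core.

```python
from typing import Any, Dict


def chosen_routes(response: Dict[str, Any]) -> Dict[str, int]:
    """Return only the nodes that selected a specific child branch.

    Assumption: a value of -1 means the node did not choose a single branch.
    """
    routing = response.get("meta", {}).get("routing", {})
    return {node: branch for node, branch in routing.items() if branch >= 0}


# Same shape as the "meta" example in this issue.
response = {
    "meta": {
        "puid": "...",
        "tags": {},
        "routing": {
            "ab_test": 1,
            "combiner1": -1,
            "mypostprocess1": -1,
            "router1": 1,
        },
    }
}

print(chosen_routes(response))
```

With the executor, the `routing` key is missing, so a helper like this would return an empty dict.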

@RafalSkolasinski added the bug and triage labels on May 13, 2020
@ukclivecox added the priority/p1 label and removed the triage label on May 14, 2020
@ukclivecox added this to the 1.2 milestone on Jun 25, 2020
@RafalSkolasinski
Contributor Author

It may be possible to do this in the wrapper itself.
The node name should be available in one of the environment variables (need to check which one), and the route integer is the actual output of the wrapper, so it should also be known there.
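
A rough sketch of what that could look like inside a wrapper, under stated assumptions: `NODE_NAME` is a placeholder for whichever environment variable actually carries the node name (not verified here), and `add_routing` is a hypothetical helper rather than existing wrapper code.

```python
import os
from typing import Any, Dict

# Assumption: placeholder variable name; the real one still needs to be checked.
NODE_NAME_ENV = "NODE_NAME"


def add_routing(meta: Dict[str, Any], route: int) -> Dict[str, Any]:
    """Return a copy of `meta` with this node's route recorded under "routing"."""
    node_name = os.environ.get(NODE_NAME_ENV, "unknown-node")
    routing = dict(meta.get("routing", {}))
    routing[node_name] = route
    # Build a new dict so the caller's meta is left untouched.
    return {**meta, "routing": routing}


# Simulate the environment of a router container for the example.
os.environ[NODE_NAME_ENV] = "router1"
print(add_routing({"puid": "...", "tags": {}}, route=1))
```

This only covers the bookkeeping inside one node; whether the result survives the trip through the orchestrator is a separate question.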

@ukclivecox modified the milestones: 1.2, 1.3 on Jul 9, 2020
@amirclam

Hi @cliveseldon,
when can we expect this to be added? For now we are using seldon-core 1.1 with the old Java engine,
since we need the routing/tags information in our system.

@ukclivecox
Contributor

We have not had a chance to look at it yet, but we plan to discuss whether it can be done in the language wrapper and get it prioritized.

@axsaucedo modified the milestones: 1.3, 1.2 on Jul 15, 2020
@RafalSkolasinski
Contributor Author

RafalSkolasinski commented Jul 15, 2020

After reviewing this, it does not seem possible to reconstruct the current behaviour in the wrapper alone.

This is because tags set on ROUTER nodes are disregarded by the executor: it only uses the route information to send the original payload down the graph.

Changing that behaviour would mean the executor has to modify the original payloads, which could mean a performance hit for all graphs with a ROUTER component.

An alternative solution may be to record the route information in the request headers.
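
To make the header idea concrete, here is a minimal sketch, not a proposal for the actual implementation. The header name `X-Routing` and both helper functions are hypothetical, chosen only for illustration; the routing map is carried as a JSON string so the payload itself is never touched.

```python
import json
from typing import Dict

# Hypothetical header name, chosen only for this sketch.
ROUTING_HEADER = "X-Routing"


def record_route(headers: Dict[str, str], node: str, route: int) -> Dict[str, str]:
    """Return a copy of `headers` with `node -> route` merged into the routing header."""
    routing = json.loads(headers.get(ROUTING_HEADER, "{}"))
    routing[node] = route
    updated = dict(headers)
    updated[ROUTING_HEADER] = json.dumps(routing, sort_keys=True)
    return updated


def read_routes(headers: Dict[str, str]) -> Dict[str, int]:
    """Decode the routing header; an absent header means no routes were recorded."""
    return json.loads(headers.get(ROUTING_HEADER, "{}"))


headers = record_route({}, "ab_test", 1)
headers = record_route(headers, "router1", 1)
print(read_routes(headers))
```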


@amirclam I would love to hear more about how you are using these in your system. I am mainly interested in whether you need to know the integer representing the chosen route, or whether, for example, plain information about which nodes were visited would be sufficient: knowing the graph, one could work out the route.
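
As a sketch of that last point, the route could be recovered from the set of visited nodes roughly like this. The graph is assumed to be a nested dict with `name`, `type` and `children` keys, and `infer_routes` is a hypothetical helper; the node names are only examples.

```python
from typing import Any, Dict, Set


def infer_routes(node: Dict[str, Any], visited: Set[str]) -> Dict[str, int]:
    """For each visited ROUTER node, report the index of its visited child."""
    routes: Dict[str, int] = {}
    if node["name"] not in visited:
        return routes
    children = node.get("children", [])
    if node.get("type") == "ROUTER":
        hits = [i for i, child in enumerate(children) if child["name"] in visited]
        # Exactly one visited child identifies the branch; otherwise record -1.
        routes[node["name"]] = hits[0] if len(hits) == 1 else -1
    for child in children:
        routes.update(infer_routes(child, visited))
    return routes


# Assumed example graph: one router in front of two models.
graph = {
    "name": "router1",
    "type": "ROUTER",
    "children": [
        {"name": "model-a", "type": "MODEL", "children": []},
        {"name": "model-b", "type": "MODEL", "children": []},
    ],
}

print(infer_routes(graph, {"router1", "model-b"}))
```

This only works when node names are unique within the graph and the list of visited nodes is complete.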

@ukclivecox modified the milestones: 1.2, 1.3 on Jul 16, 2020
@ukclivecox modified the milestones: 1.3, 1.4 on Sep 17, 2020
@theone4ever

@RafalSkolasinski @cliveseldon We are experimenting with Seldon Core and are mostly interested in the multi-armed bandit solution. Currently we are on Seldon 1.1 and the feedback method is available. What is your suggestion? Is the fix included in 1.3, or is there some other workaround on 1.1?

@RafalSkolasinski
Contributor Author

No, unfortunately not in 1.3.0.
This one is next on my task list, though.

@theone4ever

OK, thanks for the quick feedback. And there is no workaround on 1.1, right?

@RafalSkolasinski
Contributor Author

RafalSkolasinski commented Sep 30, 2020

You could always run with the old orchestrator, the engine, for the time being. There is a doc entry on how to set it up.

@theone4ever

@RafalSkolasinski thanks for the suggestion. I followed the doc you pointed to above and re-installed with the following command:
helm install seldon-core seldon-core-operator --repo https://storage.googleapis.com/seldon-charts --set istio.enabled=true --set usageMetrics.enabled=true --set executor.enabled=false --namespace seldon-system
However, I can't even deploy a prediction service anymore. The seldon-container-engine container is always in an error state; see the info below:

seldon-container-engine:
    Container ID:   docker://6c5abf378419f26335da3c361e2c459642613ccf3808dbb90654da9d3d7f0e1d
    Image:          docker.io/seldonio/engine:1.3.0
    Image ID:       docker-pullable://seldonio/engine@sha256:25dd6b71a641b5d18c7f36e9282c7e69322b6be0c17db43f27b55f4e596816e8
    Ports:          8000/TCP, 8000/TCP, 5001/TCP, 8082/TCP, 9090/TCP
    Host Ports:     0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Thu, 01 Oct 2020 22:18:19 +0200
      Finished:     Thu, 01 Oct 2020 22:19:13 +0200
    Ready:          False
    Restart Count:  5
    Requests:
      cpu:      100m
    Liveness:   http-get https://:admin/live delay=20s timeout=60s period=5s #success=1 #failure=3
    Readiness:  http-get https://:admin/ready delay=20s timeout=60s period=5s #success=1 #failure=3
    Environment:
      ENGINE_PREDICTOR:                
      DEPLOYMENT_NAME:                 pipeline
      DEPLOYMENT_NAMESPACE:            seldon-system
      ENGINE_SERVER_PORT:              8000
      ENGINE_SERVER_GRPC_PORT:         5001
      JAVA_OPTS:                       -server
      SELDON_LOG_MESSAGES_EXTERNALLY:  false
    Mounts:
      /etc/podinfo from seldon-podinfo (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-5ztmq (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  seldon-podinfo:
    Type:  DownwardAPI (a volume populated by information about the pod)
    Items:
      metadata.annotations -> annotations
  default-token-5ztmq:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-5ztmq
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason     Age                   From                               Message
  ----     ------     ----                  ----                               -------
  Normal   Scheduled  6m9s                  default-scheduler                  Successfully assigned seldon-system/seldon-fe3d02ba53e29f068d3499250bbcdd5f-7f85df48c7-ccpz2 to aks-nodepool1-12337826-1
  Normal   Pulling    6m8s                  kubelet, aks-nodepool1-12337826-1  Pulling image "seldonacr.azurecr.io/demo/seldon_predict:latest"
  Normal   Pulled     6m8s                  kubelet, aks-nodepool1-12337826-1  Successfully pulled image "seldonacr.azurecr.io/demo/seldon_predict:latest"
  Normal   Created    6m8s                  kubelet, aks-nodepool1-12337826-1  Created container router
  Normal   Created    6m7s                  kubelet, aks-nodepool1-12337826-1  Created container model-a
  Normal   Pulling    6m7s                  kubelet, aks-nodepool1-12337826-1  Pulling image "seldonacr.azurecr.io/demo/seldon_model_a:latest"
  Normal   Pulled     6m7s                  kubelet, aks-nodepool1-12337826-1  Successfully pulled image "seldonacr.azurecr.io/demo/seldon_model_a:latest"
  Normal   Started    6m7s                  kubelet, aks-nodepool1-12337826-1  Started container router
  Normal   Started    6m7s                  kubelet, aks-nodepool1-12337826-1  Started container model-a
  Normal   Pulling    6m7s                  kubelet, aks-nodepool1-12337826-1  Pulling image "seldonacr.azurecr.io/demo/seldon_model_b:latest"
  Normal   Pulled     6m6s                  kubelet, aks-nodepool1-12337826-1  Successfully pulled image "seldonacr.azurecr.io/demo/seldon_model_b:latest"
  Normal   Created    6m6s                  kubelet, aks-nodepool1-12337826-1  Created container model-b
  Normal   Started    6m6s                  kubelet, aks-nodepool1-12337826-1  Started container model-b
  Normal   Pulled     6m6s                  kubelet, aks-nodepool1-12337826-1  Successfully pulled image "seldonacr.azurecr.io/demo/seldon_model_c:latest"
  Normal   Pulling    6m6s                  kubelet, aks-nodepool1-12337826-1  Pulling image "seldonacr.azurecr.io/demo/seldon_model_c:latest"
  Normal   Created    6m5s                  kubelet, aks-nodepool1-12337826-1  Created container model-c
  Normal   Started    6m5s                  kubelet, aks-nodepool1-12337826-1  Started container model-c
  Normal   Pulling    6m5s                  kubelet, aks-nodepool1-12337826-1  Pulling image "seldonacr.azurecr.io/demo/seldon_predict:latest"
  Normal   Pulled     6m5s                  kubelet, aks-nodepool1-12337826-1  Successfully pulled image "seldonacr.azurecr.io/demo/seldon_predict:latest"
  Normal   Created    6m5s                  kubelet, aks-nodepool1-12337826-1  Created container input-transformer
  Normal   Started    6m5s                  kubelet, aks-nodepool1-12337826-1  Started container input-transformer
  Normal   Pulling    6m5s                  kubelet, aks-nodepool1-12337826-1  Pulling image "seldonacr.azurecr.io/demo/seldon_predict:latest"
  Normal   Pulled     6m5s                  kubelet, aks-nodepool1-12337826-1  Successfully pulled image "seldonacr.azurecr.io/demo/seldon_predict:latest"
  Normal   Created    6m4s                  kubelet, aks-nodepool1-12337826-1  Created container output-transformer
  Normal   Started    6m4s                  kubelet, aks-nodepool1-12337826-1  Started container output-transformer
  Normal   Pulled     6m4s                  kubelet, aks-nodepool1-12337826-1  Container image "docker.io/seldonio/engine:1.3.0" already present on machine
  Warning  Unhealthy  67s (x36 over 5m42s)  kubelet, aks-nodepool1-12337826-1  Readiness probe failed: Get https://10.244.1.8:8082/ready: http: server gave HTTP response to HTTPS client

The container log of seldon-container-engine then looks like this:


  .   ____          _            __ _ _
 /\\ / ___'_ __ _ _(_)_ __  __ _ \ \ \ \
( ( )\___ | '_ | '_| | '_ \/ _` | \ \ \ \
 \\/  ___)| |_)| | | | | || (_| |  ) ) ) )
  '  |____| .__|_| |_|_| |_\__, | / / / /
 =========|_|==============|___/=/_/_/_/
 :: Spring Boot ::        (v2.2.0.RELEASE)

2020-10-01 20:18:23.839  WARN 6 --- [           main] o.s.h.c.j.Jackson2ObjectMapperBuilder    : For Jackson Kotlin classes support please add "com.fasterxml.jackson.module:jackson-module-kotlin" to the classpath

Any tips?

@RafalSkolasinski
Contributor Author

Hi @theone4ever, thanks for reporting that. This is quite unfortunate - could you please open a separate issue about it?

@theone4ever

@RafalSkolasinski sure, but before that I just want to double-check whether the Java engine is still supported in 1.3. The doc here says:
"You can continue to use the Java engine Service Orchestrator but this will be deprecated in release 1.2.": https://docs.seldon.io/projects/seldon-core/en/v1.1.0/graph/svcorch.html#using-the-java-engine

@RafalSkolasinski
Contributor Author

RafalSkolasinski commented Oct 1, 2020

It's considered a last-resort fallback if for some reason one really, really cannot run with the executor. This issue seems to be the only missing piece, and we believe it matters only in the context of multi-armed bandits...

That said, we don't really develop the Java engine anymore and are seriously considering dropping it fully (probably with 1.4, though that is not decided).

@theone4ever

OK, then in 1.3 the Java engine should at least still work, right @RafalSkolasinski? In that case, I will report another issue.

@RafalSkolasinski
Contributor Author

@theone4ever, yes, please. We briefly looked at your logs and believe the problem is that your liveness and readiness probes are configured to use an https endpoint when in reality the endpoints are http ones. But it's better to discuss this in a separate issue and keep this one for routing metadata only.

@axsaucedo modified the milestones: 1.4, 1.5 on Oct 15, 2020
@adriangonz
Contributor

Hey @theone4ever @RafalSkolasinski, I've opened a new issue for the Java engine problem you described. We can continue the discussion there: #2586
