
Allow operator webhook port to be configured #849

Closed
amigniox opened this issue Sep 8, 2019 · 4 comments · Fixed by #992
amigniox commented Sep 8, 2019

Context:
Our production cluster is a private GKE cluster in a shared VPC. We installed Seldon under the seldon-system namespace. The Seldon Helm charts were applied by a cluster admin, and the seldon-operator-controller-manager pod/service/statefulset and webhook-server-service rolled out successfully.

Then we created the some-tenant-ns namespace with the label istio-injection=enabled, under which I (as a tenant of bluenose) have the admin role to create/update/delete Seldon custom resources.

However, we got the following error when trying to create a SeldonDeployment under the some-tenant-ns namespace:

kubectl apply -f model.json -n some-tenant-ns

Error body:

Error from server (InternalError): error when creating "model.json": Internal error occurred: failed calling admission webhook "mutating-create-update-seldondeployment.seldon.io": Post https://webhook-server-service.seldon-system.svc:443/mutating-create-update-seldondeployment?timeout=30s: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

What we expected to see:
seldondeployment.machinelearning.seldon.io/seldon-model created

Troubleshooting findings:
The Kubernetes API server has trouble connecting to this webhook service: webhook-server-service

Namespace: seldon-system

Type of webhook service: ClusterIP

Describe service:

Name:              webhook-server-service
Namespace:         seldon-system
Labels:            <none>
Annotations:       <none>
Selector:          control-plane=seldon-controller-manager
Type:              ClusterIP
IP:                192.168.40.88
Port:              <unset>  443/TCP
TargetPort:        9876/TCP
Endpoints:         10.71.133.185:9876
Session Affinity:  None
Events:            <none>

Current diagnosis:

0. Seldon charts were applied successfully; checked all Seldon-related API resources and RBAC rules, no issues found.

1. The service is running.

2. The service hostname is DNS-resolvable (checked with a DNS lookup, and we can curl the service's IP from a Pod).

3. The service endpoint is correct (the pod IP of seldon-operator-controller-manager-0, and this Pod is working).
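The checks above can be reproduced with commands along these lines (a sketch; the throwaway debug pod names and images are assumptions, while the service and pod names come from the describe output below):

```shell
# Confirm the webhook service, its endpoint, and the operator pod
kubectl -n seldon-system get svc webhook-server-service
kubectl -n seldon-system get endpoints webhook-server-service
kubectl -n seldon-system get pod seldon-operator-controller-manager-0

# From a throwaway pod, check DNS resolution of the service hostname
kubectl run dns-debug --rm -it --image=busybox --restart=Never -- \
  nslookup webhook-server-service.seldon-system.svc

# Curl the service from inside the cluster (-k: the webhook serves a
# self-signed certificate)
kubectl run curl-debug --rm -it --image=curlimages/curl --restart=Never -- \
  curl -sk https://webhook-server-service.seldon-system.svc:443/
```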

Steps to reproduce the error:

I set up a private cluster with the same network configuration as the production cluster, and reproduced the error.

Details of cluster setup:

  1. Create VPC network and subnet
    Create a network net-1 in region us-east1.
    Create a subnet, subnet-1, with the following specs:
    primary range: 10.71.20.128/25 for cluster nodes
    secondary address ranges:
    my-pods-1 for the Pod IP addresses: 10.71.128.0/20
    my-services-1 for the Service IP addresses: 192.168.40.0/22

  2. Create a private cluster using the network, subnet, and secondary ranges created above
    enable public endpoint access, enable private nodes, enable network policy, disable HTTP load balancing
    masterIpv4CidrBlock: 192.168.7.0/28
    enable master-authorized-networks: 0.0.0.0/0 (allow-all)
    apply IP masquerading as a ConfigMap:

Config:
nonMasqueradeCIDRs: 
- 169.254.0.0/16 
- 10.0.0.0/8 
- 162.53.38.203/32
resyncInterval: 60s
masqLinkLocal: false
  3. Create a NAT configuration using Cloud Router
    network: net-1
    region: us-east1
    NAT all subnet IP ranges
  4. Install Istio 1.2.0 with a custom configuration profile
#Gateways related config:
gateways:
  istio-ilbgateway:
    enabled: true
    loadBalancerIP: 10.71.20.204
    autoscaleEnabled: true
    autoscaleMin: 1
    autoscaleMax: 10
    ports:
    ## google ILB default quota is 5 ports
    # Add 80/443
    - port: 80
      name: http
    - port: 443
      name: https
    # Add 2003 for graphite data-ingestion
    - port: 2003
      name: graphite-data-ingest
  istio-ingressgateway:
    enabled: false
  5. Install Seldon-core 0.3.1
curl -X GET \
    -o "seldon-core-operator-0.3.1.tgz" \
    "https://www.googleapis.com/storage/v1/b/seldon-charts/o/seldon-core-operator-0.3.1.tgz?alt=media"

helm template seldon-core-operator*tgz --name seldon-core \
    --set istio.enabled=true \
    --set usageMetrics.enabled=false \
    --namespace seldon-system | kubectl apply -n seldon-system -f -
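For reference, steps 1–3 above can be sketched with gcloud roughly as follows. This is a sketch under the stated specs, not the exact invocation used: the names private-cluster-1, nat-router-1, and nat-config-1 are made up, and the flags should be checked against your gcloud version.

```shell
# 1. VPC network and subnet with secondary ranges (specs from above)
gcloud compute networks create net-1 --subnet-mode=custom
gcloud compute networks subnets create subnet-1 \
  --network=net-1 --region=us-east1 \
  --range=10.71.20.128/25 \
  --secondary-range=my-pods-1=10.71.128.0/20,my-services-1=192.168.40.0/22

# 2. Private cluster using those ranges (public endpoint stays enabled)
gcloud container clusters create private-cluster-1 \
  --region=us-east1 \
  --network=net-1 --subnetwork=subnet-1 \
  --enable-ip-alias \
  --cluster-secondary-range-name=my-pods-1 \
  --services-secondary-range-name=my-services-1 \
  --enable-private-nodes \
  --master-ipv4-cidr=192.168.7.0/28 \
  --enable-master-authorized-networks \
  --master-authorized-networks=0.0.0.0/0 \
  --enable-network-policy

# Disable the HTTP load balancing add-on
gcloud container clusters update private-cluster-1 --region=us-east1 \
  --update-addons=HttpLoadBalancing=DISABLED

# 3. Cloud NAT on a Cloud Router, covering all subnet IP ranges
gcloud compute routers create nat-router-1 --network=net-1 --region=us-east1
gcloud compute routers nats create nat-config-1 \
  --router=nat-router-1 --region=us-east1 \
  --auto-allocate-nat-external-ips \
  --nat-all-subnet-ip-ranges
```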

Then I created a bastion host VM in the same VPC network, SSHed to the cluster nodes, and did some quick checks:

  • checked that kube-proxy is running and read its log on the node to learn which mode kube-proxy is working in (iptables mode)
  • checked the iptables rules for the service (but need help to check whether there's any problem in them)
  • tried updating the ip-masq-agent ConfigMap to add 192.168.0.0/16 to the non-masquerade CIDRs; not working
  • tried modifying the iptables rules, e.g. iptables -t nat -A POSTROUTING -j MASQUERADE; the error stayed the same
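The node-level checks above can be sketched like this (run on a node via the bastion host; the ClusterIP 192.168.40.88 is taken from the service describe output earlier in this issue):

```shell
# Confirm kube-proxy is running on the node
ps aux | grep '[k]ube-proxy'

# Dump the NAT rules kube-proxy programmed for the webhook service's
# ClusterIP; in iptables mode there should be a DNAT to the pod endpoint
sudo iptables-save -t nat | grep '192.168.40.88'

# Inspect the KUBE-SERVICES chain for seldon-system service entries
sudo iptables -t nat -L KUBE-SERVICES -n | grep seldon
```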

amigniox commented Sep 10, 2019

Note: even if we install Istio using its built-in configuration profiles (with istio-ingressgateway enabled), for example,

helm template install/kubernetes/helm/istio --name istio --namespace istio-system \
    --values install/kubernetes/helm/istio/values-istio-demo.yaml | kubectl apply -f -

we will still have the same error.

We also tried using Ambassador instead of Istio on the private GKE cluster; the error remains the same. I don't think this error is related to the Istio configs. To my understanding, the problem occurs when the Kubernetes API server talks to the Seldon webhook service and then to the Seldon operator; Istio or Ambassador has not kicked in yet at that stage.
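The claim that the failure happens before any ingress is involved can be checked by looking at the webhook registration the API server uses (a sketch; the exact object name varies by chart version, so this greps for the service reference instead):

```shell
# Find the MutatingWebhookConfiguration whose clientConfig points at
# webhook-server-service in seldon-system. The API server calls this
# endpoint directly on SeldonDeployment create/update, before any
# ingress (Istio/Ambassador) sees the request.
kubectl get mutatingwebhookconfigurations -o yaml \
  | grep -B4 -A6 'webhook-server-service'
```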

@ukclivecox ukclivecox added this to the 0.5.x milestone Sep 17, 2019
@amigniox

@cliveseldon after our troubleshooting with GCP support, we identified that the root cause of this issue is the port (9876) of the admission webhook endpoint. I understand that this port is hardwired in the Seldon operator code (L74).

On private GKE clusters it is a known issue that only traffic between the master and the node pools on ports 443 and 10250 is allowed. The current workaround is to add a firewall rule to allow the traffic on the custom port 9876.
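A sketch of that firewall rule with gcloud: the rule name is made up, and the master CIDR 192.168.7.0/28 and network net-1 come from the cluster setup described above. In practice you would also scope the rule to your nodes with --target-tags, using the tag GKE assigned to the node pool.

```shell
# Allow the GKE master to reach the operator's webhook port 9876 on the nodes
gcloud compute firewall-rules create allow-master-to-seldon-webhook \
  --network=net-1 \
  --direction=INGRESS \
  --action=ALLOW \
  --rules=tcp:9876 \
  --source-ranges=192.168.7.0/28
```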


ukclivecox commented Sep 25, 2019

That's great that you found the root cause. We can see if we can make the webhook port configurable in the Kubebuilder v2 version of the operator: #841

@amigniox

> That's great that you found the root cause. We can see if we can make the webhook port configurable in the Kubebuilder v2 version of the operator: #841

That will be great! And I think it would be very helpful if this extra network setting were documented in the Seldon installation guide, because in production environments GKE clusters are most likely set up under a shared VPC (I can help organize these installation notes if needed).

@ukclivecox ukclivecox self-assigned this Oct 10, 2019
@ukclivecox ukclivecox changed the title unable to create SeldonDeployment in private GKE clusters Allow operator webhook port to be configured Oct 25, 2019