Allow operator webhook port to be configured #849
Note: even if we install Istio using one of its built-in configuration profiles (for example, with istio-ingressgateway enabled), we still get the same error. We also tried Ambassador instead of Istio on the private GKE cluster and the error remained the same, so I don't think this error is related to the Istio configs. To my understanding, the problem occurs when the kube API server talks to the Seldon webhook service, which forwards to the Seldon operator; Istio or Ambassador has not kicked in yet at this stage.
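As a quick check of which service and port the API server actually calls, the admission webhook configurations can be inspected with generic kubectl (the exact configuration names depend on the installed Seldon version):

```
# List the admission webhook configurations registered by the operator
kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations

# Show which service/namespace/port each webhook points at
kubectl get validatingwebhookconfigurations -o yaml | grep -B2 -A6 'clientConfig:'
```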
@cliveseldon after troubleshooting with GCP support, we identified the root cause of this issue: the port (9876) of the admission webhook endpoint. I understand that this port is hardwired in the Seldon operator code (L74). On private GKE clusters it is a known issue that only traffic between the master and the node pools on ports 443 and 10250 is allowed. The current workaround is to add a firewall rule to allow the traffic on the custom port 9876.
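For reference, a minimal sketch of that firewall rule with gcloud, assuming the net-1 network and the 192.168.7.0/28 master CIDR from the cluster details further down; the rule name and the node target tag are placeholders:

```
# Allow the GKE control plane (master CIDR) to reach the Seldon webhook port on the nodes
gcloud compute firewall-rules create allow-seldon-webhook \
  --network=net-1 \
  --direction=INGRESS \
  --allow=tcp:9876 \
  --source-ranges=192.168.7.0/28 \
  --target-tags=<gke-node-target-tag>
```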
That's great you found the root cause. We can see if we can make the webhook port configurable in the Kubebuilder v2 version of the Operator: #841
That would be great! And I think it would be very helpful if this extra network setting were documented in the Seldon installation guide, because in a production environment the GKE clusters are most likely set up under a shared VPC (I can help organize these installation notes if needed).
Context:
Our production cluster is a private GKE cluster in a shared VPC. We installed Seldon under the seldon-system namespace; the Seldon Helm charts were applied by a cluster admin. The seldon-operator-controller-manager pod/service/statefulset and the webhook-server-service were shown as successfully rolled out. Then we created the some-tenant-ns namespace with the label istio-injection=enabled, under which I (as a tenant of bluenose) have an admin role to create/update/delete Seldon custom resources. However, we got the following errors when we tried to create a SeldonDeployment under the some-tenant-ns namespace:

```
kubectl apply -f model.json -n some-tenant-ns
```
Error body:
What we expected to see:
seldondeployment.machinelearning.seldon.io/seldon-model created
Troubleshooting findings:
The kube API server has a problem connecting to this webhook service: webhook-server-service
Namespace: seldon-system
Type of webhook service: ClusterIP
Describe service:
Current diagnosis:
0. Seldon charts were successfully applied; checked all Seldon-related api-resources and RBACs, no issues found.
1. service is running
2. service hostname is DNS-resolvable (checked with a DNS lookup, and we can curl the service's IP from a Pod)
3. service endpoint is correct (it is the podIP of seldon-operator-controller-manager-0, and this Pod is working); the equivalent commands are sketched below
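A sketch of the commands behind checks 1-3, using the names from this issue (the busybox image for the DNS check is an arbitrary choice):

```
# 1. service is running
kubectl -n seldon-system get svc webhook-server-service

# 2. service hostname resolves from inside the cluster (throwaway pod)
kubectl run dns-test --rm -it --restart=Never --image=busybox -- \
  nslookup webhook-server-service.seldon-system.svc.cluster.local

# 3. endpoints list the pod IP of seldon-operator-controller-manager-0
kubectl -n seldon-system get endpoints webhook-server-service
kubectl -n seldon-system get pod seldon-operator-controller-manager-0 -o wide
```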
Steps to reproduce the error:
I set up a private cluster with the same network configuration as the production cluster and reproduced the error.
Details of cluster setup (a gcloud sketch of these steps follows the list):
Create VPC network and subnet
create a network net-1 in region us-east1.
create a subnet, subnet-1, with the following specs:
primary range: 10.71.20.128/25 for cluster nodes.
secondary address ranges:
my-pods-1 for the Pod IP addresses 10.71.128.0/20
my-services-1 for the Service IP addresses 192.168.40.0/22
Create private cluster using the network, subnet, and secondary ranges I created.
enable public endpoint access, enable private nodes, enable network policy, disable HTTP load balancing
masterIpv4CidrBlock: 192.168.7.0/28
enable master-authorized-networks: 0.0.0.0/0 (allow-all)
apply IP masquerading as a configmap:
network: net-1
region: us-east1
nat all subnet ip ranges
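A gcloud sketch of the setup steps above (the cluster name and some flags are assumptions; the "nat all subnet ip ranges" step is sketched as Cloud NAT):

```
# VPC and subnet with secondary ranges for Pods and Services
gcloud compute networks create net-1 --subnet-mode=custom
gcloud compute networks subnets create subnet-1 \
  --network=net-1 --region=us-east1 \
  --range=10.71.20.128/25 \
  --secondary-range=my-pods-1=10.71.128.0/20,my-services-1=192.168.40.0/22

# Private cluster using those ranges (public endpoint left enabled,
# private nodes, network policy on, HTTP load balancing addon omitted)
gcloud container clusters create repro-cluster \
  --region=us-east1 \
  --network=net-1 --subnetwork=subnet-1 \
  --enable-ip-alias \
  --cluster-secondary-range-name=my-pods-1 \
  --services-secondary-range-name=my-services-1 \
  --enable-private-nodes \
  --master-ipv4-cidr=192.168.7.0/28 \
  --enable-master-authorized-networks \
  --master-authorized-networks=0.0.0.0/0 \
  --enable-network-policy \
  --addons=HorizontalPodAutoscaling

# NAT for all subnet IP ranges in net-1 / us-east1
gcloud compute routers create nat-router --network=net-1 --region=us-east1
gcloud compute routers nats create nat-config \
  --router=nat-router --region=us-east1 \
  --nat-all-subnet-ip-ranges --auto-allocate-nat-external-ips
```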
Then I created a bastion host VM in the same VPC network, ssh'd to the cluster nodes, and did some quick checks:
tried updating the ip-masq-agent ConfigMap to add 192.168.0.0/16 to the non-masquerade CIDRs; this did not work
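For reference, the ip-masq-agent change described above looks roughly like this (a sketch; the existing ConfigMap contents will differ and should be merged, not replaced):

```
# Add 192.168.0.0/16 to the non-masquerade CIDRs of the ip-masq-agent ConfigMap
cat <<'EOF' | kubectl -n kube-system apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: ip-masq-agent
  namespace: kube-system
data:
  config: |
    nonMasqueradeCIDRs:
      - 10.71.128.0/20
      - 192.168.0.0/16
    masqLinkLocal: false
    resyncInterval: 60s
EOF
```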