Solr Operator seems very picky about the Kubernetes environment it's using (guessing networking/dns) #498
Comments
I see the same on macOS on Apple M1, using Docker Desktop 4.9.1. There is some communication issue between the zk nodes, so I always run with only 1 zk when debugging on my dev MacBook.
Thanks. Adding
@ramayer can you maybe catch any of the error logs from:
and
and any other cases where you seem to have an idea of what is happening. I know you shared how to reproduce (and that is helpful) - any log messages from ZK or Solr would potentially help as well.
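If it helps, something along these lines should pull the relevant logs (pod names below are hypothetical, based on the defaults for a SolrCloud named `example`, and may differ in your cluster):

```bash
# Hypothetical pod names for a SolrCloud named "example"; adjust to your cluster.
# ZooKeeper logs, including the previous container if it crashed or restarted:
kubectl logs example-solrcloud-zookeeper-1
kubectl logs example-solrcloud-zookeeper-1 --previous

# Solr logs from one of the pods involved in the timeout:
kubectl logs example-solrcloud-0
```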
The error messages don't show much except for timeouts that look like this
but this is an easy way to reproduce a similar timeout on Minikube on Ubuntu using the docker runtime, and on Colima on macOS: this command
works fine (returning the Solr admin page as you'd expect), but this
times out. I think similar timeouts happen when a collection has multiple shards on the same node the client connected to.
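Roughly, the shape of the two requests is something like the following sketch (pod and service names are hypothetical, based on the operator's defaults for a SolrCloud named `example`; assumes curl is available in the Solr image):

```bash
# Illustrative only; hypothetical names for a 3-node SolrCloud named "example".
# Works: a Solr pod reaching a *different* pod through the headless service name.
kubectl exec example-solrcloud-0 -- \
  curl -sS "http://example-solrcloud-1.example-solrcloud-headless:8983/solr/"

# Times out: the same pod reaching *itself* through the service name (hairpin traffic).
kubectl exec example-solrcloud-0 -- \
  curl -sS "http://example-solrcloud-0.example-solrcloud-headless:8983/solr/"
```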
For the Minikube failure mode mentioned above, I think it's related to this minikube GitHub issue: kubernetes/minikube#13370; the comments there include a workaround. I still didn't have any luck finding workarounds for the other failure mode. Some comments there suggest that iptables-based implementations of Services can have trouble when a pod tries to connect to itself through a service URL. I wonder if this operator could be tweaked to have pods not try to reach themselves through the service name, but I don't understand enough to know if that even makes sense.
Solr Operator's working great in about half of the Kubernetes environments I'm testing, but fails in the other half.
It fails for me on macOS using a Kubernetes environment created with:
where it seems like the ZooKeeper cluster never reaches a quorum, apparently timing out when the second ZooKeeper node attempts to connect to example-solrcloud-zookeeper-client:2181. It seems as if Colima's Kubernetes (k3s, I think) default networking is not allowing connections to that service until the service is ready (which never seems to happen), but I don't know how to debug this further.
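One way to poke at this, for whatever it's worth (names assume the default `example` SolrCloud; the commands also assume getent, bash, and timeout exist in the ZooKeeper image, otherwise run the same checks from an ephemeral busybox pod):

```bash
# Hypothetical pod/service names; adjust to your SolrCloud name.
# Does the client service resolve from inside the second ZooKeeper pod?
kubectl exec example-solrcloud-zookeeper-1 -- \
  getent hosts example-solrcloud-zookeeper-client

# Can that pod open a TCP connection to the service? (uses bash's /dev/tcp)
kubectl exec example-solrcloud-zookeeper-1 -- \
  timeout 5 bash -c 'echo > /dev/tcp/example-solrcloud-zookeeper-client/2181' \
  && echo reachable || echo "timed out"
```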
It works fine for me on the same macOS host using a Kubernetes environment created with:
It fails for me on Ubuntu 22.04 using a Kubernetes environment started with:
where it seems each Solr instance can communicate with the other two just fine, but appears to hit a network timeout when it attempts to communicate with another shard on the same host. I can create some collections, but am unable to create any collection that has as many shards as there are Solr pod instances.
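For reference, the kind of collection that fails is one shard per pod, created with something like this (service name and numbers are illustrative, assuming a 3-pod SolrCloud named `example` and the operator's default common service on port 80):

```bash
# Illustrative; adjust the service name and shard count to your cluster.
kubectl port-forward service/example-solrcloud-common 8983:80 &
curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=test&numShards=3&replicationFactor=1"
```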
It works fine for me on the same Ubuntu 22.04 host using:
It works fine for me on Microsoft Azure's AKS using the instructions here.
In all cases, after creating the Kubernetes environment, I'm attempting to create the Solr cluster with
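Roughly, that setup looks like the following sketch (illustrative values; chart name and SolrCloud fields per the upstream docs, not necessarily the exact commands I ran):

```bash
# Sketch of the install plus SolrCloud creation; versions/values illustrative.
# (Depending on operator version, CRDs may need to be installed separately.)
helm repo add apache-solr https://solr.apache.org/charts
helm repo update
helm install solr-operator apache-solr/solr-operator

# A small SolrCloud using the operator-provided ZooKeeper ensemble:
kubectl apply -f - <<EOF
apiVersion: solr.apache.org/v1beta1
kind: SolrCloud
metadata:
  name: example
spec:
  replicas: 3
  zookeeperRef:
    provided:
      replicas: 3
EOF
```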
I think most of the failure modes seem to be related to when, during the startup process, Kubernetes exposes enough information (DNS? IP addresses?) to the nodes -- but I don't quite know Kubernetes networking well enough to debug this.