Solr Operator seems very picky about the Kubernetes environment it's using (guessing networking/dns) #498

Open
ramayer opened this issue Nov 10, 2022 · 5 comments

ramayer commented Nov 10, 2022

Solr Operator is working great in about half of the Kubernetes environments I'm testing, but fails in the other half.

It fails for me on MacOS using a kubernetes environment created with:

colima start --cpu 4 --memory 8 --kubernetes

where it seems the ZooKeeper cluster never reaches a quorum, apparently timing out when the second ZooKeeper node attempts to connect to example-solrcloud-zookeeper-client:2181. It looks as if the default networking in Colima's Kubernetes (k3s, I think) does not allow connections to that service until the service is ready, which never seems to happen, but I don't know how to debug this further.
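
For anyone who wants to dig in, a starting point (assuming the operator's default ZooKeeper naming, e.g. example-solrcloud-zookeeper-1) is to check whether the client service ever gets any ready endpoints and what the ZooKeeper pods are logging:

kubectl get endpoints example-solrcloud-zookeeper-client
kubectl get pods | grep zookeeper
kubectl logs example-solrcloud-zookeeper-1 --tail=50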

It works fine for me on the same MacOS host using a kubernetes environment created with:

podman machine init -m 16000 --cpus 4 -v "$HOME:$HOME" --rootful
podman machine start
minikube start --driver=podman --cpus 4 --memory 12000 --profile=minikube-on-podman

It fails for me on Ubuntu 22.04 using a kubernetes environment started with:

minikube start

where it seems each Solr instance can communicate with the other two just fine, but hits a network timeout when it attempts to communicate with another shard on the same host. I can create some collections, but I am unable to create any collection that has as many shards as there are Solr pods.
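
A rough way to reproduce that last failure, once the pods are up, is to ask the Collections API for as many shards as there are Solr pods (the collection name, credentials, and 8983 port below are placeholders):

kubectl exec example-solrcloud-0 -- curl -u "$SOLR_USER:$SOLR_PASS" \
  "http://localhost:8983/solr/admin/collections?action=CREATE&name=shards-test&numShards=3&replicationFactor=1"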

It works fine for me on the same Ubuntu 22.04 host using:

 minikube start --container-runtime=containerd --cpus 4 --mount-string=$HOME/proj/kube/persistent_volumes:/mnt/host --mount 

It works fine for me on Microsoft Azure's AKS using the instructions here.

In all cases, after creating the Kubernetes environment, I'm attempting to create the Solr cluster with:

kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/master/deploy/static/provider/cloud/deploy.yaml
kubectl create -f https://solr.apache.org/operator/downloads/crds/v0.6.0/all-with-dependencies.yaml
helm install solr-operator apache-solr/solr-operator --version 0.6.0
helm install example-solr apache-solr/solr --version 0.6.0 \
  --set image.tag=9.0 \
  --set solrOptions.security.authenticationType="Basic" \
  --set solrOptions.javaMemory="-Xms300m -Xmx300m" \
  --set addressability.external.method=Ingress \
  --set addressability.external.domainName="ing.local.domain" \
  --set addressability.external.useExternalAddress="true" \
  --set ingressOptions.ingressClassName="nginx"
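
A quick way to sanity-check the install afterwards (the solrclouds resource type comes from the CRDs installed above):

kubectl get solrclouds
kubectl get pods
kubectl get ingress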

I think most of the failure modes are related to when, during the startup process, Kubernetes exposes enough information (DNS? IP addresses?) to the nodes -- but I don't know Kubernetes networking well enough to debug this.

janhoy (Contributor) commented Nov 10, 2022

...where it seems the ZooKeeper cluster never reaches a quorum, apparently timing out when the second ZooKeeper node attempts to connect to example-solrcloud-zookeeper-client:2181

I see the same on macOS on Apple M1, using Docker Desktop 4.9.1. There is some communication issue between the ZooKeeper nodes, so I always run with only 1 ZK node when debugging on my dev MacBook.

ramayer (Author) commented Nov 11, 2022

@janhoy

Thanks. Adding --set zk.provided.replicas=1 worked for me in the macOS/Colima environment. However, the minikube-on-Ubuntu-without-containerd case I mentioned above still fails, and there seems to be a different communication failure between the Solr nodes when trying to create a 3-shard collection on a 3-node cluster.
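
For anyone else hitting this, the workaround is just the same helm install with one extra flag (plus whichever of the other --set flags from the original command you still want):

helm install example-solr apache-solr/solr --version 0.6.0 \
  --set zk.provided.replicas=1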

risdenk commented Nov 11, 2022

@ramayer can you maybe catch any of the error logs from:

where it seems the ZooKeeper cluster never reaches a quorum, apparently timing out when the second ZooKeeper node attempts to connect to example-solrcloud-zookeeper-client:2181

and

where it seems each Solr instance can communicate with the other two just fine, but hits a network timeout when it attempts to communicate with another shard on the same host. I can create some collections, but I am unable to create any collection that has as many shards as there are Solr pods.

and any other cases where you seem to have an idea of what is happening. I know you shared how to reproduce (and that is helpful) - any log messages from ZK or Solr would potentially help as well.

ramayer (Author) commented Dec 8, 2022

The error messages don't show much except for timeouts that look like this:

2022-12-08 21:22:07.442 WARN  (qtp1221991240-60) [] o.a.s.h.a.AdminHandlersProxy Timeout when fetching result from node default-example-solrcloud-1.ing.local.domain:80_solr => java.util.concurrent.TimeoutException
        at java.base/java.util.concurrent.FutureTask.get(Unknown Source)
java.util.concurrent.TimeoutException: null
        at java.util.concurrent.FutureTask.get(Unknown Source) ~[?:?]
        at org.apache.solr.handler.admin.AdminHandlersProxy.maybeProxyToNodes(AdminHandlersProxy.java:112) ~[?:?]
        at org.apache.solr.handler.admin.SystemInfoHandler.handleRequestBody(SystemInfoHandler.java:137) ~[?:?]
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:224) ~[?:?]
        at org.apache.solr.handler.admin.InfoHandler.handle(InfoHandler.java:94) ~[?:?]
        at org.apache.solr.handler.admin.InfoHandler.handleRequestBody(InfoHandler.java:82) ~[?:?]
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:224) ~[?:?]
        at org.apache.solr.servlet.HttpSolrCall.handleAdmin(HttpSolrCall.java:941) ~[?:?]
        at org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:893) ~[?:?]
        at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:584) ~[?:?]
        at org.apache.solr.servlet.SolrDispatchFilter.dispatch(SolrDispatchFilter.java:250) ~[?:?]
        at org.apache.solr.servlet.SolrDispatchFilter.lambda$doFilter$0(SolrDispatchFilter.java:218) ~[?:?]
        at org.apache.solr.servlet.ServletUtils.traceHttpRequestExecution2(ServletUtils.java:257) ~[?:?]
  ...

But there is an easy way to reproduce a similar timeout on minikube on Ubuntu (Docker runtime) and on Colima on macOS. This command

kubectl exec -it pod/example-solrcloud-1 -- curl default-example-solrcloud-0.ing.local.domain/solr/

works fine (returning the Solr admin page as you'd expect), but this

kubectl exec -it pod/example-solrcloud-1 -- curl default-example-solrcloud-1.ing.local.domain/solr/

times out. I think similar timeouts happen when a collection has multiple shards on the same node the client connected to.
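
To confirm whether a problem collection really did land multiple shards on the node the client connected to, the Collections API CLUSTERSTATUS output shows the replica placement (the credentials are placeholders for whatever basic-auth users were created):

kubectl exec -it pod/example-solrcloud-1 -- curl -u "$SOLR_USER:$SOLR_PASS" \
  "http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS"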

ramayer (Author) commented Jan 12, 2023

For the minikube failure mode mentioned above, I think it's related to this minikube GitHub issue: kubernetes/minikube#13370. The workaround from those comments is to add --cni=bridge to the minikube start command.
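
In other words, something like:

minikube start --cni=bridge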

I still haven't had any luck finding a workaround for the Kubernetes distribution that colima --kubernetes starts.

Some comments there suggest that iptables-based service implementations can have trouble when a pod tries to connect to itself through a service URL. I wonder if this operator could be tweaked so that pods don't try to reach themselves through the service name -- but I don't understand it well enough to know whether that even makes sense.
