Kubernetes Stateful Clustering not working properly #304
Actually, killing the pods does not work at all right now… The log output of vernemq-1 does say, though:
@Robbilie It's all happening in line 80 in 5b2f21c, so I guess that's where we have to look/double-check.
I"ll be looking into the Helm stuff within the coming days. |
Lovely, if you need any assistance or want me to test something, let me know :)
I was having the same issue and narrowed it down to the image I was building, so this may not apply if you are using the public image. It should look something like this:
Hope that helps.
The join command is part of the start script: https://github.com/vernemq/docker-vernemq/blob/master/bin/vernemq.sh
I have the same issue. Two nodes in the K8s cluster behave as if they are separate clusters: running vmq-admin cluster show on node0 and on node1 shows each node as its own single-node cluster. It is very strange because I have the same helm chart with the same values deployed in another namespace, and there it works fine - I have 2 nodes in the cluster. Here is the values.yaml file:
In the pod log I see that it was going to join the cluster: "Will join an existing Kubernetes cluster with discovery node at vernemq-development-1.vernemq-development-headless.vernemq-development.svc.cluster.local"
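A quick way to confirm whether the pods actually formed one cluster is to run the status command on each pod and compare the output. A minimal sketch, assuming the pod names vernemq-development-0 and vernemq-development-1 implied by the hostname above (pod names and namespace are assumptions, adjust to your StatefulSet):

# Assumed pod and namespace names
kubectl -n vernemq-development exec vernemq-development-0 -- vmq-admin cluster show
kubectl -n vernemq-development exec vernemq-development-1 -- vmq-admin cluster show
# A healthy cluster lists both VerneMQ@... node names in each output;
# a split cluster shows only the local node on each pod.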
It is strange... I upgraded VerneMQ to the latest version (1.12.3) using helm chart 1.6.12 and my cluster is showing 2 nodes, but clients can't connect. So I downgraded it again to v1.11.0 (helm chart 1.6.6) and now everything is fine - there are 2 nodes in the cluster and the clients can connect. I have no idea what happened here.
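For anyone wanting to try the same downgrade, a rough sketch of the helm command, assuming a release named vernemq and the chart referenced as vernemq/vernemq (the release name, chart reference, and the image.tag values key are assumptions - check your chart's values.yaml):

# Pin the chart to 1.6.6 and the broker image to 1.11.0 (names are assumptions)
helm upgrade vernemq vernemq/vernemq \
  --namespace vernemq \
  --version 1.6.6 \
  --set image.tag=1.11.0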
Another workaround, guys: I left only one running node in the cluster, and for the other one I deleted the persistent volume (by deleting the persistent volume claim for that node) and then started node 2 again. It successfully connected to the cluster and synced all the data. Now the cluster is up and running correctly with 2 nodes as desired. So it looks like the issue was that the nodes were not able to sync their data.
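A sketch of that workaround with kubectl, assuming a StatefulSet and namespace both named vernemq and the usual data-<pod> PVC naming (all names are assumptions - check kubectl get pvc first):

# Scale down to one node, drop the second node's PVC, then scale back up
kubectl -n vernemq get pvc                              # find the exact PVC name first
kubectl -n vernemq scale statefulset vernemq --replicas=1
kubectl -n vernemq delete pvc data-vernemq-1            # assumed PVC name for pod vernemq-1
kubectl -n vernemq scale statefulset vernemq --replicas=2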
If I'm reading this right, the default PV mode is …
I am currently running into this issue using …
Oddly enough, if I exec into the pod, extract the discovery node IP from the end of the …
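For reference, the manual join described here (and confirmed to work in the reply below) looks roughly like this; pod, namespace, and service names are placeholders based on the naming pattern earlier in the thread:

# Exec into the pod that failed to join (names are placeholders)
kubectl -n vernemq exec -it vernemq-1 -- /bin/sh
# Inside the container: join via the other node's headless-service DNS name,
# then verify that both nodes show up in the cluster view
vmq-admin cluster join discovery-node="VerneMQ@vernemq-0.vernemq-headless.vernemq.svc.cluster.local"
vmq-admin cluster show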
@mgagliardo91 Hm, thanks. I don't see why this happens. Either it is a connectivity issue (the point against that being that a manual cluster join works), or there is no join issued at all.
@ioolkos thanks for the response.
Then, after the manual join request:
@mgagliardo91 Thanks, strange. The interesting part is the first few lines; they come from the …
@ioolkos I am using the default … I added a second script that pulls a lot of logic from the main script and attempts to retry the join until it succeeds. I updated join_cluster.sh:
#!/usr/bin/env bash
SECRETS_KUBERNETES_DIR="/var/run/secrets/kubernetes.io/serviceaccount"
DOCKER_VERNEMQ_KUBERNETES_CLUSTER_NAME=${DOCKER_VERNEMQ_KUBERNETES_CLUSTER_NAME:-cluster.local}

if [ -d "${SECRETS_KUBERNETES_DIR}" ] ; then
    # Let's get the namespace if it isn't set
    DOCKER_VERNEMQ_KUBERNETES_NAMESPACE=${DOCKER_VERNEMQ_KUBERNETES_NAMESPACE:-$(cat "${SECRETS_KUBERNETES_DIR}/namespace")}
fi

insecure=""
if env | grep "DOCKER_VERNEMQ_KUBERNETES_INSECURE" -q; then
    echo "Using curl with \"--insecure\" argument to access kubernetes API without matching SSL certificate"
    insecure="--insecure"
fi

function k8sCurlGet () {
    local urlPath=$1
    local hostname="kubernetes.default.svc.${DOCKER_VERNEMQ_KUBERNETES_CLUSTER_NAME}"
    local certsFile="${SECRETS_KUBERNETES_DIR}/ca.crt"
    local token=$(cat ${SECRETS_KUBERNETES_DIR}/token)
    local header="Authorization: Bearer ${token}"
    local url="https://${hostname}/${urlPath}"

    curl -sS ${insecure} --cacert ${certsFile} -H "${header}" ${url} \
        || ( echo "### Error on accessing URL ${url}" )
}

try_join() {
    local exit_code=0
    if env | grep "DOCKER_VERNEMQ_DISCOVERY_KUBERNETES" -q; then
        # List the VerneMQ pods via the Kubernetes API
        # https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.19/#list-pod-v1-core
        podList=$(k8sCurlGet "api/v1/namespaces/${DOCKER_VERNEMQ_KUBERNETES_NAMESPACE}/pods?labelSelector=${DOCKER_VERNEMQ_KUBERNETES_LABEL_SELECTOR}")
        kube_pod_names=$(echo ${podList} | jq '.items[].spec.hostname' | sed 's/"//g' | tr '\n' ' ')
        VERNEMQ_KUBERNETES_SUBDOMAIN=${DOCKER_VERNEMQ_KUBERNETES_SUBDOMAIN:-$(echo ${podList} | jq '.items[0].spec.subdomain' | tr '\n' '"' | sed 's/"//g')}

        for kube_pod_name in $kube_pod_names; do
            if [[ $kube_pod_name == "null" ]]; then
                echo "Kubernetes discovery selected, but no pods found. Maybe we're the first?"
                echo "Anyway, we won't attempt to join any cluster."
                exit 0
            fi

            if [[ $kube_pod_name != "$MY_POD_NAME" ]]; then
                discoveryHostname="${kube_pod_name}.${VERNEMQ_KUBERNETES_SUBDOMAIN}.${DOCKER_VERNEMQ_KUBERNETES_NAMESPACE}.svc.${DOCKER_VERNEMQ_KUBERNETES_CLUSTER_NAME}"
                echo "Will join an existing Kubernetes cluster with discovery node at ${discoveryHostname}"

                vmq-admin cluster show | grep "VerneMQ@${discoveryHostname}" > /dev/null || exit_code=$?
                if [ $exit_code -eq 0 ]; then
                    echo "We have already joined the cluster - no extra work required."
                    exit 0
                else
                    echo "We have yet to join the cluster - attempting manual join..."
                    vmq-admin cluster join discovery-node="VerneMQ@${discoveryHostname}"
                    sleep 2
                fi
                break
            fi
        done
    else
        exit 0
    fi
}

# Keep retrying until the join succeeds (try_join exits the script once joined).
while true
do
    try_join
done
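How this retry script gets launched was not shown in the comment. One possible way to wire it in is a small wrapper entrypoint that runs the join loop in the background before handing control to the regular start script - purely a sketch, and the paths /usr/sbin/join_cluster.sh and /usr/sbin/start_vernemq are assumptions about where the scripts live in the image:

#!/usr/bin/env bash
# Hypothetical wrapper entrypoint: start the join retry loop in the background,
# then exec the regular VerneMQ start script (paths are assumptions).
/usr/sbin/join_cluster.sh &
exec /usr/sbin/start_vernemq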
@mgagliardo91 Oh, that's brilliant. When you say 10 retries: how long was that in seconds?
Up to 20 seconds for us. I can add it, sure |
@mgagliardo91 Oh, ok, so that's 20 seconds to get a PVC.
The helm chart sets up a StatefulSet, and while that should guarantee a certain level of stability, it apparently isn't enough to properly cluster in all cases. I am currently running two nodes, and when I connect with a subscribing client and then start to publish using another, I only get 50% of the published messages on the first client. If I kill one of the two nodes using kubectl, it seems to fix it; I am unsure if that is permanent, because I don't recall recreating the nodes and right now it happens again, but they may have been rescheduled.
To me it seems the nodes don't always properly form a cluster.
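One way to check whether the two nodes actually route messages to each other is to pin the subscriber to one pod and the publisher to the other. A sketch, assuming pod names vernemq-0/vernemq-1, namespace vernemq, and MQTT on port 1883 (all names are assumptions):

# Forward each pod separately so the two clients land on different nodes
kubectl -n vernemq port-forward pod/vernemq-0 1883:1883 &
kubectl -n vernemq port-forward pod/vernemq-1 1884:1883 &
# Subscribe via node 0
mosquitto_sub -h 127.0.0.1 -p 1883 -t demo/topic &
# Publish via node 1; if the nodes never clustered, none of these messages
# reach the subscriber, while a healthy cluster delivers all of them
for i in $(seq 1 5); do mosquitto_pub -h 127.0.0.1 -p 1884 -t demo/topic -m "msg $i"; done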