[Task] CreateRecording form does not recognize when targets reappear #294
Hmm... I deployed the cryostat operator with the latest images of [...] where it just continues to GET from that target.
It's been a while since I filed this issue. It looks like the template dropdown is always populated now.
If you fill out the Create Recording form until the 'Create' button is blue, then restart the pod, Cryostat will see the new target as you've noticed. But when you click the 'Create' button, I don't think there's an error message indicating that the target you're trying to start a recording on is unreachable.
Sounds good, thanks!
I've been trying various things, but it seems to me that the back-end cryostat k8Client never notices that a pod disappears through the [...] or [...]; it falls into the [...]
There definitely should be a [...]. I wonder if perhaps the LOST/FOUND WebSocket notifications are being sent in the wrong order, and that causes the visual bug of the outdated target definition remaining on the frontend?
I just did a simple [...]. After doing that, [...]. The operator is aware of the sample applications because [...] by using [...]. However, the Cryostat container logs do not show that it ever noticed the target applications appearing or disappearing - there are no logs of WebSocket messages being sent, and the web-client instance also does not show any target discovery notifications. The target selection dropdown does not dynamically update at all, but clicking the refresh icon to cause a re-query does work.
    Containers:
      cryostat-sample:
        Container ID:  cri-o://f510ce16cf5c70a8a2242299c61ed7c652cb7f0397acf60ce871acc3dc3f3d49
        Image:         quay.io/cryostat/cryostat:latest
        Image ID:      quay.io/cryostat/cryostat@sha256:01e0dc1c020318e2ef448036dac987cf90f5e231400b900c4760264e6e1cd013

That sha256 corresponds to the [...] So it does look like something in or around [...]
I'm working on a [...]
Current theory/explanation: it isn't a regression in Cryostat itself. What appears to be happening is that after the OAuth integration in Cryostat 2.1.0, deploying in CRC results in a temporary OpenShift/k8s API server outage, since we modify the API server config to add ourselves for CORS. In a cluster-bot instance or other real OpenShift deployments there are multiple replicas of the API server and a rolling redeploy occurs with no overall downtime, but in CRC there is only one replica, so the config update results in an outage. During this outage the Cryostat container is already running and has a k8s Watch open with an underlying WebSocket to the API server. In CRC this gets interrupted, whereas in the cluster-bot instance it survives. We're not yet sure why that Watch WebSocket sometimes manages to reconnect to the API server and allow Cryostat to continue receiving Endpoints resource updates, and sometimes not. In Cryostat 2.0.0 or prior there was no OAuth integration, no API server CORS config, and therefore no API server update rollout/outage, so the Watch remains uninterrupted.
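For context, the kind of watch in question looks roughly like this (a minimal sketch using the fabric8 kubernetes-client; the namespace and handler bodies are illustrative, not Cryostat's actual code):

```java
import io.fabric8.kubernetes.api.model.Endpoints;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;
import io.fabric8.kubernetes.client.Watch;
import io.fabric8.kubernetes.client.Watcher;
import io.fabric8.kubernetes.client.WatcherException;

public class EndpointsWatchSketch {
    public static void main(String[] args) {
        KubernetesClient client = new KubernetesClientBuilder().build();
        // Long-lived watch backed by a WebSocket to the API server. If the
        // single API server replica in crc goes down during a config rollout,
        // this underlying connection is the one that gets interrupted.
        Watch watch = client.endpoints().inNamespace("cryostat") // namespace is illustrative
            .watch(new Watcher<Endpoints>() {
                @Override
                public void eventReceived(Action action, Endpoints endpoints) {
                    // ADDED/DELETED Endpoints events are what would drive the
                    // FOUND/LOST target discovery notifications.
                    System.out.printf("%s %s%n", action, endpoints.getMetadata().getName());
                }

                @Override
                public void onClose(WatcherException cause) {
                    // Invoked when the watch terminates and the client gives up
                    // reconnecting; nothing restarts the watch unless we do.
                    System.err.println("watch closed: " + cause.getMessage());
                }
            });
    }
}
```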
So, this is really just a crc issue then, since real OpenShift deployments usually have multiple replicas of the API server. If the theory is true, this particular bug shouldn't be that important, right?
The bugged behaviour where the target discovery LOST/FOUND notifications aren't coming through seems like it's just a consequence of that single API server replica in crc. But it may still be a bug on our end that our resource watcher isn't reconnecting to the API server when it does come back online. I'm not sure if this has the same cause as the issue you described where target FOUND notifications came through but LOST did not - that might be something separate, since both of those originate from the same watcher, so it seems you should either be getting both kinds or neither. It also sounds like there is still a frontend bug where the recording creation form lets you click the blue creation button even if the selected target has disappeared after you selected it.
Thinking more about it now, I'm not sure how the Watch WebSocket wouldn't be interrupted in quicklab/cluster-bot. After all, even if there are multiple replicas, it seems the long-lived WebSocket connection would get routed to just one of those replicas at a time through the load balancing service. Eventually that replica will get replaced during the rolling update, so the WebSocket connection would still be interrupted. However, if the watcher on the Cryostat side does try to reconnect here, then it should find another live API server replica through the load balancer and be able to immediately reconnect on its next attempt. In crc, with its limited resources and single replica, the watcher reconnection would probably fail multiple times over a relatively extended period of time until the API server replica is available again. Digging through the fabric8 kubernetes-client sources I do see that the watcher should retry connections and should do so indefinitely, but it has an exponential backoff interval, where the retry time doubles on each reconnection failure. So maybe even in crc the watcher will eventually come back online, it just might take some time to do so?
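To make the backoff arithmetic concrete, a doubling schedule grows quickly (a trivial sketch; the 1-second base interval is an assumption, not fabric8's actual constant):

```java
// Exponential backoff: the delay doubles after each consecutive failure.
// With an assumed 1s base interval the retries fire after roughly
// 1s, 2s, 4s, 8s, 16s, ... so after ~10 straight failures the watcher
// would already be waiting many minutes before its next attempt.
long baseMillis = 1_000L;
for (int failures = 0; failures < 10; failures++) {
    long delayMillis = baseMillis * (1L << failures); // base * 2^failures
    System.out.printf("attempt %d: wait %d ms%n", failures + 1, delayMillis);
}
```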
Here's a recent log of 2.2.0-dev in crc. Cryostat initially had a working watcher connection, and soon after startup it even detected itself coming online and logged its own WebSocket notifications about that (no client was connected to actually send those messages to, but it logs regardless). Then we see some repeated [...]. Then we get: [...]
After this there are no more logged exceptions that look like watcher reconnection attempts. We can (and do) catch these exceptions, but currently they are only logged and nothing else happens. The [...]
Seems relevant: https://stackoverflow.com/questions/61409596/kubernetes-too-old-resource-version

@ebaron I think the fix for upstream Cryostat here is just to detect what kind of watcher exception occurred and whether the watcher will be retrying. If not, close it (it's probably already auto-closed, but to be safe) and schedule a new one to be opened. WDYT?
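Something along these lines, perhaps (a hedged sketch, not Cryostat's actual code: the class, the 5-second delay, and the handler bodies are hypothetical, and whether `WatcherException.isHttpGone()` covers every non-retried case is exactly what we'd need to verify):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import io.fabric8.kubernetes.api.model.Endpoints;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.Watcher;
import io.fabric8.kubernetes.client.WatcherException;

// Illustrative self-reconnecting watcher: when the watch dies for good
// (e.g. HTTP 410 "too old resource version"), schedule a replacement
// instead of just logging the exception.
class ReconnectingEndpointsWatcher implements Watcher<Endpoints> {
    private final KubernetesClient client;
    private final String namespace;
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    ReconnectingEndpointsWatcher(KubernetesClient client, String namespace) {
        this.client = client;
        this.namespace = namespace;
    }

    void start() {
        client.endpoints().inNamespace(namespace).watch(this);
    }

    @Override
    public void eventReceived(Action action, Endpoints endpoints) {
        // translate ADDED/DELETED into FOUND/LOST target notifications here
    }

    @Override
    public void onClose(WatcherException cause) {
        // The client will not retry once onClose(cause) fires, e.g. on a 410
        // when the cached resourceVersion is too old, so reopen it ourselves.
        scheduler.schedule(this::start, 5, TimeUnit.SECONDS); // delay is arbitrary
    }
}
```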
If a target becomes briefly unreachable while filling out the 'Create Recording' form, clicking the 'Create' button triggers an expected HTTP 500 error. When the target becomes available again, the Create button sometimes stays greyed out and the template dropdown remains empty. Clicking the 'refresh targets' button doesn't update the template dropdown. If you exit and re-enter the Create Recording form, you can then successfully create a recording.
Steps to reproduce: [...]