
fix(discovery): discovery synchronization for stale lost targets #689

Open: wants to merge 38 commits into main from the k8s-discovery-sync branch

Conversation

@andrewazores (Member) commented Oct 9, 2024

Welcome to Cryostat! 👋

Before contributing, make sure you have:

  • Read the contributing guidelines
  • Linked a relevant issue which this PR resolves
  • Linked any other relevant issues, PRs, or documentation, if any
  • Resolved all conflicts, if any
  • Rebased your branch PR on top of the latest upstream main branch
  • Attached at least one of the following labels to the PR: [chore, ci, docs, feat, fix, test]
  • Signed all commits using a GPG signature

To recreate commits with a GPG signature: git fetch upstream && git rebase --force --gpg-sign upstream/main


Fixes: #634

Description of the change:

  1. Adjusts logic within the KubeApiDiscovery class to help ensure that discovery events are handled serially.
  2. Adds logic to the DiscoveryNode utility methods to retrieve existing entities when they already exist, rather than potentially creating duplicates. When many events are processed at once there can be multiple transactions which interrupt each other, and each may try to create the same entities. With this change, later executions retrieve and reuse the previously created entities instead of erroneously inserting duplicates (see the first sketch after this list).
  3. Uses the existing k8s discovery resync period to perform a periodic "full resync". This consists of checking the Informer store for each target namespace and verifying that the database matches the expected set of targets. It should not be strictly necessary if everything else is implemented properly, but if anything does go wrong it helps keep the database up to date and eventually consistent, even if an Informer notification is missed or handled incorrectly.
  4. (Not directly related to the original change) Adds periodic retry logic to the S3 storage bucket existence/creation check at startup. Previously this check ran only once when the Cryostat container first started, but there is no guarantee that the storage container is up and ready before Cryostat is. Related to fix(startup): improve startup detection for bucket creation cryostat-storage#23.
  5. Extracts the JVM ID updating logic into a separate utility service class and decouples it from the Target persistence lifecycle. JVM IDs are now updated asynchronously by a separate worker, so Targets can be persisted in the database without blocking the transaction on opening a network connection to retrieve the JVM ID. The initial connection attempt occurs a short delay after the target is first discovered, and a scheduled task fires periodically to connect to any known target which does not yet have a JVM ID and update it (see the second sketch after this list).
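
As a rough illustration of the get-or-create behaviour in item 2 (this is only a sketch: the entity fields, class name, and helper method shown here are assumptions based on the description above, not the actual PR code):

import java.util.List;

import jakarta.enterprise.context.ApplicationScoped;
import jakarta.inject.Inject;
import jakarta.persistence.EntityManager;
import jakarta.transaction.Transactional;

@ApplicationScoped
public class DiscoveryNodeLookup {

    @Inject EntityManager entityManager;

    // Look up an existing node before creating one, so that later or interleaved
    // transactions reuse the row created earlier instead of inserting a duplicate.
    @Transactional
    public DiscoveryNode getOrCreate(String name, String nodeType) {
        List<DiscoveryNode> existing =
                entityManager
                        .createQuery(
                                "from DiscoveryNode where name = :name and nodeType = :nodeType",
                                DiscoveryNode.class)
                        .setParameter("name", name)
                        .setParameter("nodeType", nodeType)
                        .getResultList();
        if (!existing.isEmpty()) {
            return existing.get(0);
        }
        DiscoveryNode node = new DiscoveryNode();
        node.name = name;         // assumed public Panache-style fields
        node.nodeType = nodeType;
        entityManager.persist(node);
        return node;
    }
}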

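And a similar sketch of the decoupled JVM ID worker from item 5, using a Quarkus scheduled job; the JPQL query, the jvmId field, the interval, and the connectAndGetJvmId helper are hypothetical stand-ins rather than the real implementation:

import java.util.List;

import io.quarkus.scheduler.Scheduled;
import jakarta.enterprise.context.ApplicationScoped;
import jakarta.inject.Inject;
import jakarta.persistence.EntityManager;
import jakarta.transaction.Transactional;

@ApplicationScoped
public class JvmIdUpdater {

    @Inject EntityManager entityManager;

    // Periodically revisit targets that were persisted without a JVM ID, so that
    // discovery itself never blocks on opening a network connection to the target.
    @Scheduled(every = "60s")
    @Transactional
    void updateMissingJvmIds() {
        List<Target> pending =
                entityManager
                        .createQuery("from Target where jvmId is null", Target.class)
                        .getResultList();
        for (Target target : pending) {
            try {
                target.jvmId = connectAndGetJvmId(target); // hypothetical helper
            } catch (Exception e) {
                // Leave jvmId null; the next scheduled run will retry this target.
            }
        }
    }

    private String connectAndGetJvmId(Target target) throws Exception {
        // Placeholder for the real connection logic.
        throw new UnsupportedOperationException("illustrative only");
    }
}
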
Motivation for the change:

See #634

How to manually test:

  1. Check out and build the PR, or use quay.io/andrewazores/cryostat:k8s-discovery-30
  2. Deploy in Kubernetes (or OpenShift) using the Operator or Helm chart. I have been using Helm: helm install cryostat --set authentication.openshift.enabled=true --set core.route.enabled=true --set core.discovery.kubernetes.enabled=true --set core.image.repository=quay.io/andrewazores/cryostat --set core.image.tag=k8s-discovery-30 ./charts/cryostat/
  3. Open UI and go to Topology view to visually observe what Cryostat has discovered
  4. See comments below about using oc rollout restart and oc scale deployment to try to trigger the bad behaviour.

@andrewazores (Member, Author) commented Oct 9, 2024

Deployed on OpenShift with:

$ helm install cryostat \
  --set authentication.openshift.enabled=true \
  --set core.route.enabled=true \
  --set core.image.repository=quay.io/andrewazores/cryostat \
  --set core.image.tag=k8s-discovery-2 \
  ./charts/cryostat/

Manually testing with various ways to mess around with a sample application deployment:

$ oc rollout restart deployment/quarkus-test
$ oc exec -it deployment/quarkus-test -- /bin/bash -c 'kill 1'
$ oc scale deployment quarkus-test --replicas=3
$ oc scale deployment quarkus-test --replicas=1
$ oc scale deployment quarkus-test --replicas=0

With or without this PR I can eventually get into a bad state as described in the original issue - either a stale discovered Pod that is not really there anymore, or else some Pods that exist but are not discovered.

cryostat-6659dd8598-d2n75-cryostat.log

After getting Cryostat into this state, with the various discovery exceptions logged above, I can no longer get it to discover other sample applications - even fully undeploying the sample application and deploying it fresh, or restarting or re-scaling the deployment.

@andrewazores (Member, Author) commented:

$ oc scale deployment quarkus-test --replicas=3
$ oc rollout restart deployment/quarkus-test

Running the above and repeatedly restarting the deployment to cause rollouts of multiple replicas seems to be a good way to eventually trigger the buggy behaviour. It isn't deterministic and there are definitely multiple worker threads involved, so this looks like a thread synchronization issue and/or race condition.

@andrewazores (Member, Author) commented:

/build_test

Workflow started at 10/17/2024, 12:03:01 PM. View Actions Run.

No OpenAPI schema changes detected.

No GraphQL schema changes detected.

CI build and push: All tests pass ✅
https://github.com/cryostatio/cryostat/actions/runs/11388705321

@andrewazores (Member, Author) commented:

/build_test

Workflow started at 10/18/2024, 10:35:06 AM. View Actions Run.

@andrewazores andrewazores marked this pull request as ready for review October 18, 2024 14:35
No OpenAPI schema changes detected.

No GraphQL schema changes detected.

CI build and push: All tests pass ✅
https://github.com/cryostatio/cryostat/actions/runs/11405720776

@andrewazores (Member, Author) commented:

Seems like this is not a proper fix yet, only a mitigation - the problem occurs less frequently, but I can still get it to happen on occasion. I'll keep working on it.

@andrewazores andrewazores removed the request for review from Josh-Matsuoka October 22, 2024 17:56
@andrewazores andrewazores marked this pull request as draft October 22, 2024 17:56
@andrewazores andrewazores force-pushed the k8s-discovery-sync branch 2 times, most recently from 6805f84 to 6b4a902 Compare October 23, 2024 14:37
@andrewazores (Member, Author) commented:

/build_test

Workflow started at 10/23/2024, 10:45:31 AM. View Actions Run.

No GraphQL schema changes detected.

No OpenAPI schema changes detected.

CI build and push: All tests pass ✅
https://github.com/cryostatio/cryostat/actions/runs/11482383330

@andrewazores (Member, Author) commented:

/build_test

Workflow started at 10/23/2024, 11:19:56 AM. View Actions Run.

No OpenAPI schema changes detected.

… worker threads, attempt reconnections periodically
…ogic for periodic updates vs on-discovery updates
@andrewazores (Member, Author) commented:

/build_test

Workflow started at 11/25/2024, 2:04:39 PM. View Actions Run.

No OpenAPI schema changes detected.

No GraphQL schema changes detected.

CI build and push: All tests pass ✅
https://github.com/cryostatio/cryostat/actions/runs/12016993150

Successfully merging this pull request may close these issues.

[Bug] Kubernetes discovery - targets not being removed after pods tear down