Node 2 never returns to ACTIVE after reconnect in FCM-VM-MultiSBReconnect-3R-1k-20m JRS Test #17180
Comments
FCM-VM-MultiSBReconnect-3R-1k-20m.pdf (this is the stats pdf)
Analysis:
It feels like there is a logical error here somewhere. Assuming the heartbeats were working and no queues were backing up, the |
Possible Future System Monitoring:
|
@edward-swirldslabs |
I agree. I just don't have an explanation for why this happened. I suspect the heartbeat that pumps the unhealthyduration metric to the event creator's platform health rule died or seized up. It doesn't look like any of the queues backed up. I don't know what would cause it to die or how we can detect and mitigate this.
I believe the proposed theory is sound. Here is a scenario in which the health status could be misreported:
|
😱 That's not what the documentation says or what I assume when I hear the word "sequential". I assume order preserving. Lines 23 to 34 in ef25861
I see. The order of ENQUEUE is not guaranteed. Once enqueued, order is guaranteed. So if there are multiple threads enqueueing, the threads can be out of order with respect to each other, but the successive enqueueing by the same thread preserves order with respect to the subsequence of enqueues made by that thread. Lines 129 to 140 in bc0e229
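The ordering property described above (no ordering between enqueueing threads, but each thread's own enqueues stay in order) can be checked with a small standalone sketch. This is only an illustration under stated assumptions: it uses `java.util.concurrent.ConcurrentLinkedQueue` as a stand-in for the platform's sequential scheduler, and `EnqueueOrderDemo`, `Item`, `runProducers`, and `perProducerOrderHolds` are hypothetical names, not code from the repository.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentLinkedQueue;

public class EnqueueOrderDemo {

    /** Hypothetical queued item: which producer enqueued it, and its per-producer sequence number. */
    public static final class Item {
        public final int producer;
        public final int seq;

        public Item(int producer, int seq) {
            this.producer = producer;
            this.seq = seq;
        }
    }

    /** Runs several producer threads against one shared queue and returns the items in dequeue order. */
    public static List<Item> runProducers(int producers, int itemsPerProducer) {
        ConcurrentLinkedQueue<Item> queue = new ConcurrentLinkedQueue<>();
        List<Thread> threads = new ArrayList<>();
        for (int p = 0; p < producers; p++) {
            final int id = p;
            Thread t = new Thread(() -> {
                // Each producer enqueues its own items strictly in ascending seq order.
                for (int s = 0; s < itemsPerProducer; s++) {
                    queue.add(new Item(id, s));
                }
            });
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for producers", e);
            }
        }
        // Drain in FIFO order once all producers have finished.
        List<Item> drained = new ArrayList<>();
        Item item;
        while ((item = queue.poll()) != null) {
            drained.add(item);
        }
        return drained;
    }

    /** Returns true if, for every producer, its items appear with strictly increasing seq. */
    public static boolean perProducerOrderHolds(List<Item> drained) {
        Map<Integer, Integer> lastSeen = new HashMap<>();
        for (Item item : drained) {
            Integer prev = lastSeen.put(item.producer, item.seq);
            if (prev != null && prev >= item.seq) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<Item> drained = runProducers(4, 1000);
        System.out.println("items drained: " + drained.size());
        System.out.println("per-producer order preserved: " + perProducerOrderHolds(drained));
    }
}
```

The interleaving between different producers varies from run to run, which is why the check only looks at each producer's own subsequence.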
I am curious why we did not use a non-blocking queue like: https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ConcurrentLinkedQueue.html
|
@edward-swirldslabs There are no locks in |
@OlegMazurov You are absolutely right and I was wrong. Thank you for educating me and sorry for my mistake. Both the ConcurrentLinkedQueue and AtomicReference are using |
My task reordering theory above was not correct. Tasks submitted by one |
Pasting Oleg's theory from slack which makes perfect sense to me: Here is a new theory.
If it's correct, the proposed fix for |
Node 2 appears to not have its events chosen as parents by other nodes after it reconnects.
JRS Data Folder: http://35.247.76.217:8095/swirlds-automation/release/0.58/10N/MultiSBReconnect/20241228-094044-GCP-Daily-MultiSBReconnect-3R-10N/FCM-VM-MultiSBReconnect-3R-1k-20m/
Node 2 did not reach ACTIVE state after reconnect (Node 2 logs)
FCM-Restart-Stake-2.5k-10m observed ISS #11254
Looking at the stats pdf, confirmed: node 2 remained in CHECKING after reconnect.