
Node 2 never returns to ACTIVE after reconnect in FCM-VM-MultiSBReconnect-3R-1k-20m JRS Test #17180

Open
edward-swirldslabs opened this issue Dec 30, 2024 · 11 comments · May be fixed by #17658
Labels: Bug, Platform


edward-swirldslabs commented Dec 30, 2024

Node 2 appears to not have its events chosen as parents by other nodes after it reconnects.

JRS Data Folder: http://35.247.76.217:8095/swirlds-automation/release/0.58/10N/MultiSBReconnect/20241228-094044-GCP-Daily-MultiSBReconnect-3R-10N/FCM-VM-MultiSBReconnect-3R-1k-20m/

Node 2 did not reach ACTIVE state after reconnect (Node 2 logs)

  • 13m 53.138s - Reconnect completed and entered CHECKING state.
  • 13m 56.346s - init judges found.
  • 14m 3.196s - warning from Shadowgraph, missing non-expired self-parent.
  • 18m 33.903s - 22m 44.618s - repeated entries saying Copying round X, transactions ignored by expected map #. This log was added to debug issue #11254 (Debug FCM-Restart-Stake-2.5k-10m observed ISS).

Looking at the stats PDF confirms that node 2 remained in CHECKING after the reconnect.

  • secC2C is maxed at 250 min after reconnect (indicating a self event never comes to consensus after reconnect).
  • bufferedPriorityTransactions steadily climbs unbounded after the reconnect.
  • TIPSET: Node 1 chose Node 2 as a pity parent near the reconnect; no other node chooses Node 2 afterwards.
  • TIPSET: Node 2 stops choosing pity parents for other nodes after the reconnect.
  • secC2RC, secOR2T, secR2C all spike during the reconnect, but return to normal levels afterwards.
  • secSc2T flatlines at twice the value of all other nodes after the reconnect.
  • TIPSET: selfishness score flatlines at 0.1, less than half that of the other nodes.
  • TIPSET: tipsetAdvancement flatlines at around 0.375, while all other nodes are at 0.2 after the reconnect.
  • TIPSET: parentage selection of node 2 by other nodes drops to 0 after the reconnect and never recovers.
  • cEvents_per_second: node 2 drops to 0 after reconnect.
  • eventCreation_fraction_OVERLOADED is maxed out at 1 for node 2.
@edward-swirldslabs edward-swirldslabs added Bug An error that causes the feature to behave differently than what was expected based on design. Platform Tickets pertaining to the platform labels Dec 30, 2024
@edward-swirldslabs

FCM-VM-MultiSBReconnect-3R-1k-20m.pdf

This is the stats pdf.

@edward-swirldslabs

Analysis:

  • Node 2 stops creating and sending self events due to the PlatformHealthRule which is permanently reporting OVERLOADED status for event creation fraction.
  • This status is cleared when all the event creation rules indicate that event creation is permitted.
  • These rules are an Aggregate of MaximumRateRule, PlatformStatusRule, and PlatformHealthRule.
  • The aggregate's isEventCreationPermitted() must be firing repeatedly, and the PlatformHealthRule must be failing each call, since that is what sets the most recent status the aggregate reports.
  • This is taking place in the EventCreationManager.maybeCreateEvent() which is triggered by a heartbeat based on the event creation frequency configuration.
  • The PlatformHealthRule queries unhealthy durations from the DefaultEventCreationManager, which stores the unhealthy durations delivered via heartbeat from the HealthMonitor.
  • The Node 2 CSV indicates the HealthMonitor would be reporting 0 as unhealthy duration after the reconnect.

It feels like there is a logical error here somewhere. Assuming the heartbeats were working and no queues were backing up, the PlatformHealthRule would be getting unhealthyDuration updates of 0 and that would cause it to pass its isEventCreationPermitted() call.
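The rule aggregation described above can be sketched as follows. `EventCreationRule` and `AggregateRule` here are hypothetical stand-ins for illustration, not the platform's actual classes; the point is that a single permanently failing sub-rule blocks event creation indefinitely.

```java
import java.util.List;

// Hypothetical minimal sketch of the rule aggregation described above.
// The real MaximumRateRule, PlatformStatusRule, and PlatformHealthRule
// carry more state than this.
interface EventCreationRule {
    boolean isEventCreationPermitted();
}

final class AggregateRule implements EventCreationRule {
    private final List<EventCreationRule> rules;

    AggregateRule(final List<EventCreationRule> rules) {
        this.rules = rules;
    }

    // Creation is permitted only when every sub-rule permits it. A health
    // rule stuck on a stale unhealthyDuration therefore vetoes event
    // creation forever, which matches the observed OVERLOADED behavior.
    @Override
    public boolean isEventCreationPermitted() {
        return rules.stream().allMatch(EventCreationRule::isEventCreationPermitted);
    }
}
```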

@edward-swirldslabs

Possible Future System Monitoring:

  1. Metrics to report the frequency of sending data down individual wires.
  2. Running a thread dump on every node at the end of a JRS test.

@lpetrovic05

@edward-swirldslabs
From what I see, it looks like the EventCreationManager did not get the unhealthyDuration, but the metrics did. The healthy metric is set to 1 on the problematic node, even though events are not created because the platform is supposedly unhealthy.

@edward-swirldslabs

> @edward-swirldslabs From what I see, it looks like the EventCreationManager did not get the unhealthyDuration, but the metrics did. The healthy metric is set to 1 on the problematic node, even though events are not created because the platform is supposedly unhealthy.

I agree. I just don't have an explanation for why this happened. I suspect the heartbeat that pumps the unhealthyDuration metric to the event creator's platform health rule died or seized up. It doesn't look like any of the queues backed up. I don't know what would cause it to die or how we can detect and mitigate this.

@OlegMazurov

OlegMazurov commented Jan 23, 2025

I believe the proposed theory is sound, and here is a scenario for how the health status misreporting could happen:

  • HealthMonitor.checkSystemHealth() is called periodically from the heartbeat and reports the longest unhealthy duration, which may be zero, to EventCreationManager::reportUnhealthyDuration
  • if there is no change in the duration since last report, no duration is reported (null); that means zero is reported only once
  • the health monitor scheduler is built by StandardWiringModel as SEQUENTIAL
  • eventCreationManager is also configured as SEQUENTIAL in PlatformSchedulersConfig
  • SEQUENTIAL schedulers guarantee that no two submitted tasks can be executed simultaneously but they don't guarantee any particular order of execution (see compareAndSet in SequentialTaskScheduler.scheduleTask())
  • the HealthMonitor_unhandled_task_count metric indicated that we had contention (more than one scheduled task) for the health monitor scheduler and thus potential reordering of its tasks
  • ditto for EventCreationManager_unhandled_task_count

So, the problem may be caused by asynchronous interaction between the health monitor and the event creation manager and task reordering due to high contention in the platform ForkJoinPool, to which all tasks are submitted.

@edward-swirldslabs

edward-swirldslabs commented Jan 24, 2025

> SEQUENTIAL schedulers guarantee that no two submitted tasks can be executed simultaneously but they don't guarantee any particular order of execution (see compareAndSet in SequentialTaskScheduler.scheduleTask())

😱 That's not what the documentation says or what I assume when I hear the word "sequential". I assume order preserving.

```java
public enum TaskSchedulerType {
    /**
     * Tasks are executed in a fork join pool one at a time in the order they were enqueued. There is a happens before
     * relationship between each task.
     */
    SEQUENTIAL,
    /**
     * Tasks are executed on a dedicated thread one at a time in the order they were enqueued. There is a happens before
     * relationship between each task. This scheduler type has very similar semantics as {@link #SEQUENTIAL}, although
     * the implementation and performance characteristics are not identical.
     */
    SEQUENTIAL_THREAD,
```

I see. The order of ENQUEUE is not guaranteed. Once enqueued, order is guaranteed. So if there are multiple threads enqueueing, the threads can interleave in any order with respect to each other, but successive enqueues by the same thread preserve order with respect to that thread's own subsequence of enqueues.
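A small illustration of these semantics, using a single-threaded executor as a stand-in for a SEQUENTIAL scheduler (this is demo code, not platform code): each submitter's own submission order is preserved, but the interleaving between the two submitters is unspecified.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public final class SequentialOrderDemo {
    // Two threads race to submit tasks to one serial executor. The result
    // always contains A0 < A1 < A2 and B0 < B1 < B2, but how the A and B
    // subsequences interleave differs from run to run.
    public static List<String> run() {
        final List<String> log = new CopyOnWriteArrayList<>();
        final ExecutorService seq = Executors.newSingleThreadExecutor();
        final Thread a = new Thread(() -> {
            for (int i = 0; i < 3; i++) {
                final int n = i;
                seq.submit(() -> log.add("A" + n));
            }
        });
        final Thread b = new Thread(() -> {
            for (int i = 0; i < 3; i++) {
                final int n = i;
                seq.submit(() -> log.add("B" + n));
            }
        });
        a.start();
        b.start();
        try {
            a.join();
            b.join();
            seq.shutdown();
            seq.awaitTermination(5, TimeUnit.SECONDS);
        } catch (final InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return log;
    }
}
```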

```java
private void scheduleTask(@NonNull final Consumer<Object> handler, @NonNull final Object data) {
    // This method may be called by many threads, but actual execution is required to happen serially. This method
    // organizes tasks into a linked list. Tasks in this linked list are executed one at a time in order.
    // When execution of one task is completed, execution of the next task is scheduled on the pool.
    final SequentialTask nextTask = new SequentialTask(pool, offRamp, busyTimer, uncaughtExceptionHandler, false);
    SequentialTask currentTask;
    do {
        currentTask = nextTaskPlaceholder.get();
    } while (!nextTaskPlaceholder.compareAndSet(currentTask, nextTask));
    currentTask.send(nextTask, handler, data);
}
```

I am curious why we did not use a non-blocking queue like: https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ConcurrentLinkedQueue.html

The AtomicReference used in the existing implementation is a non-order preserving synchronization lock. If 5 threads hit it in the order 1, 2, 3, 4, 5, such that 1 gets the lock and 2, 3, 4, 5 block, there is no guarantee of order on who wakes up next to get the lock once 1 frees it.

In my opinion the implementation would be more correct to use a non-blocking concurrent queue.
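A sketch of what the suggested queue-based alternative could look like (an assumption, not the platform's implementation): enqueue is lock-free via ConcurrentLinkedQueue, and a CAS-guarded drain flag ensures tasks execute serially in queue order.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of a ConcurrentLinkedQueue-backed sequential
// scheduler, not the platform's SequentialTaskScheduler.
final class QueueBackedSequentialScheduler {
    private final ConcurrentLinkedQueue<Runnable> queue = new ConcurrentLinkedQueue<>();
    private final AtomicBoolean draining = new AtomicBoolean(false);
    private final Executor pool;

    QueueBackedSequentialScheduler(final Executor pool) {
        this.pool = pool;
    }

    void schedule(final Runnable task) {
        queue.add(task);
        // Only one thread wins the CAS and becomes the drainer; everyone
        // else just enqueues and returns without blocking.
        if (draining.compareAndSet(false, true)) {
            pool.execute(this::drain);
        }
    }

    private void drain() {
        Runnable task;
        while ((task = queue.poll()) != null) {
            task.run();
        }
        draining.set(false);
        // Re-check: a task may have been enqueued after the final poll but
        // before the flag was cleared.
        if (!queue.isEmpty() && draining.compareAndSet(false, true)) {
            pool.execute(this::drain);
        }
    }
}
```

With a direct (same-thread) executor this drains synchronously, which makes the FIFO behavior easy to verify.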

The current implementation introduces the non-trivial opportunity for the following race condition:

1. Thread 1 grabs the get and holds the lock on the atomic reference.
2. Thread 2 tries to grab the lock through get and blocks.
3. Thread 1 finishes the get and releases the lock. Thread 1 performs a context switch because its time slice is up before it can grab the lock again in the compare and set.
4. Thread 2 wakes up and grabs the lock to perform the get.
5. Thread 1 is put back on the processor and calls compareAndSet but blocks because the lock is held by thread 2.
6. Thread 2 completes the get, releases the lock, but still has time on its slice to grab the lock with the compare and set.
7. Thread 2 sets the value in the atomic reference before Thread 1.
8. Thread 2 releases the lock from the compare and set, this notifies Thread 1 to perform its compare and set.
9. Thread 1 fails and has to go through the while loop to get the value that Thread 2 just set.

@OlegMazurov

@edward-swirldslabs There are no locks in the AtomicReference implementation. The entire wiring framework is essentially lock-free and non-blocking.
If we remove locking from your scenario, it describes how compareAndSet resolves conflicts between competing threads. The outcome was expected and even desired by the initial requirements.
There is an example where a deterministic order of task execution is preserved, see ParallelTask/SequentialTask in BlockStreamManagerImpl. So, the problem is not in a particular algorithm implementation but rather in selecting a wrong algorithm for the job.

@edward-swirldslabs

@OlegMazurov You are absolutely right and I was wrong. Thank you for educating me and sorry for my mistake. Both ConcurrentLinkedQueue and AtomicReference use VarHandle.compareAndSet() tricks to avoid blocking. The race condition I hypothesized due to blocking doesn't exist.
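For illustration, the classic lock-free CAS retry loop that both of those classes rely on (demo code, not taken from either class): a losing thread is never parked on a monitor; its CAS simply fails and it retries against the new value.

```java
import java.util.concurrent.atomic.AtomicReference;

final class CasRetryDemo {
    // Atomically replaces the referenced value and returns the value that
    // was replaced. compareAndSet is a single hardware CAS instruction, not
    // a lock: if another thread updates ref between get() and the CAS, the
    // CAS fails and we loop, with no thread ever blocking.
    static <T> T swapInNext(final AtomicReference<T> ref, final T next) {
        T current;
        do {
            current = ref.get();
        } while (!ref.compareAndSet(current, next));
        return current;
    }
}
```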

@OlegMazurov

My task reordering theory above was not correct. Tasks submitted by one SEQUENTIAL scheduler to another are guaranteed to preserve order despite the CAS in scheduleTask().

@lpetrovic05

Pasting Oleg's theory from Slack, which makes perfect sense to me:

Here is a new theory.

  • the unhealthyDuration metric is high and DefaultEventCreationManager is in the OVERLOADED phase

  • node gets BEHIND and initiates reconnect

  • PlatformCoordinator.clear() is called to prepare the wiring framework for reconnect

  • the following calls are made when in clear():
    - eventCreationManagerWiring.startSquelching();
    - eventCreationManagerWiring.flush();
    - eventCreationManagerWiring.stopSquelching();
    - eventCreationManagerWiring.getInputWire(EventCreationManager::clear).inject(NoInput.getInstance());

  • eventCreationManager is configured as SQUELCHABLE

  • the health monitor observes that the system becomes healthy and reports zero unhealthy duration to all consumers

  • if the DefaultEventCreationManager.reportUnhealthyDuration(healthy) task hits the startSquelching()..stopSquelching() window, its handler is not executed in the BindableInputWire.bindConsumer() lambda and DefaultEventCreationManager.unhealthyDuration stays unchanged

  • while the system remains healthy no more reportUnhealthyDuration(healthy) tasks are submitted due to HealthMonitor::checkSystemHealth logic

  • DefaultEventCreationManager is stuck in the OVERLOADED phase

If it's correct, the proposed fix for HealthMonitor::checkSystemHealth would work. I still believe that asynchronous control tasks are bad (over-engineered) and schedulers should be defined for data wires, not universally for wiring components.

P.S. When analyzing performance or functional issues with the wiring framework I always feel like a boy who bought a bucket of Lego pieces at a garage sale and tries to build the Millennium Falcon without instructions.
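The report-only-on-change behavior at the heart of this theory can be distilled into a toy model (names and logic here are assumptions for illustration, not the actual HealthMonitor code): the transition back to healthy is reported exactly once, so if that single zero report is dropped during the squelching window, the consumer never learns the system recovered.

```java
import java.time.Duration;
import java.util.function.Consumer;

// Hypothetical distillation of the failure mode described above.
final class ReportOnChangeMonitor {
    private Duration lastReported = null;
    private final Consumer<Duration> consumer;

    ReportOnChangeMonitor(final Consumer<Duration> consumer) {
        this.consumer = consumer;
    }

    // Reports the current unhealthy duration only when it differs from the
    // previous report. While the system stays healthy (duration ZERO), the
    // zero value is delivered exactly once; if that one delivery lands in a
    // startSquelching()..stopSquelching() window and is discarded, the
    // consumer is stuck with its stale, non-zero unhealthy duration.
    void checkSystemHealth(final Duration unhealthyDuration) {
        if (!unhealthyDuration.equals(lastReported)) {
            lastReported = unhealthyDuration;
            consumer.accept(unhealthyDuration);
        }
    }
}
```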
