
Node 2 never returns to ACTIVE after reconnect in FCM-VM-MultiSBReconnect-3R-1k-20m JRS Test #17180

Open
edward-swirldslabs opened this issue Dec 30, 2024 · 11 comments · May be fixed by #17658
Labels: Bug, Platform


edward-swirldslabs commented Dec 30, 2024

Node 2 appears to not have its events chosen as parents by other nodes after it reconnects.

JRS Data Folder: http://35.247.76.217:8095/swirlds-automation/release/0.58/10N/MultiSBReconnect/20241228-094044-GCP-Daily-MultiSBReconnect-3R-10N/FCM-VM-MultiSBReconnect-3R-1k-20m/

Node 2 did not reach ACTIVE state after reconnect (Node 2 logs)

  • 13m 53.138s - Reconnect completed and entered CHECKING state.
  • 13m 56.346s - init judges found.
  • 14m 3.196s - warning from Shadowgraph, missing non-expired self-parent.
  • 18m 33.903s - 22m 44.618s - repeated entries saying Copying round X, transactions ignored by expected map #. This log was added to debug issue #11254 (Debug FCM-Restart-Stake-2.5k-10m observed ISS).

Looking at the stats PDF confirms that node 2 remained in CHECKING after the reconnect.

  • secC2C is maxed at 250 min after reconnect (indicating a self event never comes to consensus after reconnect).
  • bufferedPriorityTransactions steadily climbs unbounded after the reconnect.
  • TIPSET: Node 1 chose Node 2 as a pity parent near the reconnect; no other node chooses Node 2 afterwards.
  • TIPSET: Node 2 stops choosing pity parents for other nodes after the reconnect.
  • secC2RC, secOR2T, secR2C all spike during the reconnect, but return to normal levels afterwards.
  • secSc2T flatlines at twice the value of all other nodes after the reconnect.
  • TIPSET: selfishness score flatlines at 0.1, less than half that of the other nodes.
  • TIPSET: tipsetAdvancement flatlines at around 0.375, while all other nodes are at 0.2 after the reconnect.
  • TIPSET: parentage selection of node 2 by other nodes drops to 0 after the reconnect and never recovers.
  • cEvents_per_second: node 2 drops to 0 after reconnect.
  • eventCreation_fraction_OVERLOADED is maxed out at 1 for node 2.
@edward-swirldslabs edward-swirldslabs added Bug An error that causes the feature to behave differently than what was expected based on design. Platform Tickets pertaining to the platform labels Dec 30, 2024
@edward-swirldslabs

FCM-VM-MultiSBReconnect-3R-1k-20m.pdf

This is the stats pdf.

@edward-swirldslabs

Analysis:

  • Node 2 stops creating and sending self events due to the PlatformHealthRule which is permanently reporting OVERLOADED status for event creation fraction.
  • This status is cleared when all the event creation rules indicate that event creation is permitted.
  • These rules are an Aggregate of MaximumRateRule, PlatformStatusRule, and PlatformHealthRule.
  • The aggregate's isEventCreationPermitted() must be firing repeatedly, and the PlatformHealthRule must be failing each call, since that is what sets the most recent status the aggregate reports.
  • This is taking place in the EventCreationManager.maybeCreateEvent() which is triggered by a heartbeat based on the event creation frequency configuration.
  • The PlatformHealthRule queries unhealthy durations from the DefaultEventCreationManager, which stores the unhealthy durations delivered via heartbeat from the HealthMonitor.
  • The Node 2 CSV indicates the HealthMonitor would be reporting 0 as unhealthy duration after the reconnect.

It feels like there is a logical error here somewhere. Assuming the heartbeats were working and no queues were backing up, the PlatformHealthRule would be getting unhealthyDuration updates of 0 and that would cause it to pass its isEventCreationPermitted() call.
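The rule aggregation described above can be sketched as follows. `EventCreationRule` and `AggregateRule` here are hypothetical stand-ins for illustration, not the platform's actual classes; the point is that a single permanently failing sub-rule blocks event creation indefinitely.

```java
import java.util.List;

// Hypothetical minimal sketch of the rule aggregation described above.
// The real MaximumRateRule, PlatformStatusRule, and PlatformHealthRule
// carry more state than this.
interface EventCreationRule {
    boolean isEventCreationPermitted();
}

final class AggregateRule implements EventCreationRule {
    private final List<EventCreationRule> rules;

    AggregateRule(final List<EventCreationRule> rules) {
        this.rules = rules;
    }

    // Creation is permitted only when every sub-rule permits it. A health
    // rule stuck on a stale unhealthyDuration therefore vetoes event
    // creation forever, which matches the observed OVERLOADED behavior.
    @Override
    public boolean isEventCreationPermitted() {
        return rules.stream().allMatch(EventCreationRule::isEventCreationPermitted);
    }
}
```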

@edward-swirldslabs

Possible Future System Monitoring:

  1. Metrics to report the frequency of sending data down individual wires.
  2. Running a thread dump on every node at the end of a JRS test.

@lpetrovic05

@edward-swirldslabs
From what I see, it looks like the EventCreationManager did not get the unhealthyDuration, but the metrics did. The healthy metric is set to 1 on the problematic node, even though events are not created because the platform is supposedly unhealthy.

@edward-swirldslabs

> @edward-swirldslabs From what I see, it looks like the EventCreationManager did not get the unhealthyDuration, but the metrics did. The healthy metric is set to 1 on the problematic node, even though events are not created because the platform is supposedly unhealthy.

I agree. I just don't have an explanation for why this happened. I suspect the heartbeat that pumps the unhealthyDuration metric to the event creator's platform health rule died or seized up. It doesn't look like any of the queues backed up. I don't know what would cause it to die or how we can detect and mitigate this.

@OlegMazurov

OlegMazurov commented Jan 23, 2025

I believe the proposed theory is sound, and here is a scenario for how the health status misreporting could happen:

  • HealthMonitor.checkSystemHealth() is called periodically from the heartbeat and reports the longest unhealthy duration, which may be zero, to EventCreationManager::reportUnhealthyDuration
  • if there is no change in the duration since last report, no duration is reported (null); that means zero is reported only once
  • the health monitor scheduler is built by StandardWiringModel as SEQUENTIAL
  • eventCreationManager is also configured as SEQUENTIAL in PlatformSchedulersConfig
  • SEQUENTIAL schedulers guarantee that no two submitted tasks can be executed simultaneously but they don't guarantee any particular order of execution (see compareAndSet in SequentialTaskScheduler.scheduleTask())
  • the HealthMonitor_unhandled_task_count metric indicated that we had contention (more than one scheduled task) for the health monitor scheduler and thus potential reordering of its tasks
  • ditto for EventCreationManager_unhandled_task_count

So, the problem may be caused by asynchronous interaction between the health monitor and the event creation manager and task reordering due to high contention in the platform ForkJoinPool, to which all tasks are submitted.

@edward-swirldslabs

edward-swirldslabs commented Jan 24, 2025

> SEQUENTIAL schedulers guarantee that no two submitted tasks can be executed simultaneously but they don't guarantee any particular order of execution (see compareAndSet in SequentialTaskScheduler.scheduleTask())

😱 That's not what the documentation says or what I assume when I hear the word "sequential". I assume order preserving.

```java
public enum TaskSchedulerType {
    /**
     * Tasks are executed in a fork join pool one at a time in the order they were enqueued. There is a happens before
     * relationship between each task.
     */
    SEQUENTIAL,
    /**
     * Tasks are executed on a dedicated thread one at a time in the order they were enqueued. There is a happens before
     * relationship between each task. This scheduler type has very similar semantics as {@link #SEQUENTIAL}, although
     * the implementation and performance characteristics are not identical.
     */
    SEQUENTIAL_THREAD,
```

I see. The order of ENQUEUE is not guaranteed. Once enqueued, order is guaranteed. So if there are multiple threads enqueueing, the threads can interleave in any order with respect to each other, but successive enqueues by the same thread preserve order with respect to that thread's own subsequence of enqueues.
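A small illustration of these semantics, using a single-threaded executor as a stand-in for a SEQUENTIAL scheduler (this is demo code, not platform code): each submitter's own submission order is preserved, but the interleaving between the two submitters is unspecified.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public final class SequentialOrderDemo {
    // Two threads race to submit tasks to one serial executor. The result
    // always contains A0 < A1 < A2 and B0 < B1 < B2, but how the A and B
    // subsequences interleave differs from run to run.
    public static List<String> run() {
        final List<String> log = new CopyOnWriteArrayList<>();
        final ExecutorService seq = Executors.newSingleThreadExecutor();
        final Thread a = new Thread(() -> {
            for (int i = 0; i < 3; i++) {
                final int n = i;
                seq.submit(() -> log.add("A" + n));
            }
        });
        final Thread b = new Thread(() -> {
            for (int i = 0; i < 3; i++) {
                final int n = i;
                seq.submit(() -> log.add("B" + n));
            }
        });
        a.start();
        b.start();
        try {
            a.join();
            b.join();
            seq.shutdown();
            seq.awaitTermination(5, TimeUnit.SECONDS);
        } catch (final InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return log;
    }
}
```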

```java
private void scheduleTask(@NonNull final Consumer<Object> handler, @NonNull final Object data) {
    // This method may be called by many threads, but actual execution is required to happen serially. This method
    // organizes tasks into a linked list. Tasks in this linked list are executed one at a time in order.
    // When execution of one task is completed, execution of the next task is scheduled on the pool.
    final SequentialTask nextTask = new SequentialTask(pool, offRamp, busyTimer, uncaughtExceptionHandler, false);
    SequentialTask currentTask;
    do {
        currentTask = nextTaskPlaceholder.get();
    } while (!nextTaskPlaceholder.compareAndSet(currentTask, nextTask));
    currentTask.send(nextTask, handler, data);
}
```

I am curious why we did not use a non-blocking queue like: https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ConcurrentLinkedQueue.html

The AtomicReference used in the existing implementation is a non-order preserving synchronization lock. If 5 threads hit it in the order 1, 2, 3, 4, 5, such that 1 gets the lock and 2, 3, 4, 5 block, there is no guarantee of order on who wakes up next to get the lock once 1 frees it.

In my opinion the implementation would be more correct to use a non-blocking concurrent queue.
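A sketch of what the suggested queue-based alternative could look like (an assumption, not the platform's implementation): enqueue is lock-free via ConcurrentLinkedQueue, and a CAS-guarded drain flag ensures tasks execute serially in queue order.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of a ConcurrentLinkedQueue-backed sequential
// scheduler, not the platform's SequentialTaskScheduler.
final class QueueBackedSequentialScheduler {
    private final ConcurrentLinkedQueue<Runnable> queue = new ConcurrentLinkedQueue<>();
    private final AtomicBoolean draining = new AtomicBoolean(false);
    private final Executor pool;

    QueueBackedSequentialScheduler(final Executor pool) {
        this.pool = pool;
    }

    void schedule(final Runnable task) {
        queue.add(task);
        // Only one thread wins the CAS and becomes the drainer; everyone
        // else just enqueues and returns without blocking.
        if (draining.compareAndSet(false, true)) {
            pool.execute(this::drain);
        }
    }

    private void drain() {
        Runnable task;
        while ((task = queue.poll()) != null) {
            task.run();
        }
        draining.set(false);
        // Re-check: a task may have been enqueued after the final poll but
        // before the flag was cleared.
        if (!queue.isEmpty() && draining.compareAndSet(false, true)) {
            pool.execute(this::drain);
        }
    }
}
```

With a direct (same-thread) executor this drains synchronously, which makes the FIFO behavior easy to verify.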

The current implementation introduces the non-trivial opportunity for the following race condition:

1. Thread 1 grabs the get and holds the lock on the atomic reference.
2. Thread 2 tries to grab the lock through get and blocks.
3. Thread 1 finishes the get and releases the lock. Thread 1 performs a context switch because its time slice is up before it can grab the lock again in the compare and set.
4. Thread 2 wakes up and grabs the lock to perform the get.
5. Thread 1 is put back on the processor and calls compareAndSet but blocks because the lock is held by thread 2.
6. Thread 2 completes the get, releases the lock, but still has time on its slice to grab the lock with the compare and set.
7. Thread 2 sets the value in the atomic reference before Thread 1.
8. Thread 2 releases the lock from the compare and set, this notifies Thread 1 to perform its compare and set.
9. Thread 1 fails and has to go through the while loop to get the value that Thread 2 just set.

@OlegMazurov

@edward-swirldslabs There are no locks in the AtomicReference implementation. The entire wiring framework is essentially lock-free and non-blocking.
If we remove locking from your scenario, it describes how compareAndSet resolves conflicts between competing threads. The outcome was expected and even desired by the initial requirements.
There is an example where a deterministic order of task execution is preserved, see ParallelTask/SequentialTask in BlockStreamManagerImpl. So, the problem is not in a particular algorithm implementation but rather in selecting a wrong algorithm for the job.

@edward-swirldslabs

@OlegMazurov You are absolutely right and I was wrong. Thank you for educating me and sorry for my mistake. Both ConcurrentLinkedQueue and AtomicReference use VarHandle.compareAndSet() tricks to avoid blocking. The race condition I hypothesized due to blocking doesn't exist.
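For illustration, the classic lock-free CAS retry loop that both of those classes rely on (demo code, not taken from either class): a losing thread is never parked on a monitor; its CAS simply fails and it retries against the new value.

```java
import java.util.concurrent.atomic.AtomicReference;

final class CasRetryDemo {
    // Atomically replaces the referenced value and returns the value that
    // was replaced. compareAndSet is a single hardware CAS instruction, not
    // a lock: if another thread updates ref between get() and the CAS, the
    // CAS fails and we loop, with no thread ever blocking.
    static <T> T swapInNext(final AtomicReference<T> ref, final T next) {
        T current;
        do {
            current = ref.get();
        } while (!ref.compareAndSet(current, next));
        return current;
    }
}
```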

@OlegMazurov

My task reordering theory above was not correct. Tasks submitted by one SEQUENTIAL scheduler to another are guaranteed to preserve order despite the CAS in scheduleTask().

@lpetrovic05

Pasting Oleg's theory from Slack, which makes perfect sense to me:

Here is a new theory.

  • the unhealthyDuration metric is high and DefaultEventCreationManager is in the OVERLOADED phase

  • node gets BEHIND and initiates reconnect

  • PlatformCoordinator.clear() is called to prepare the wiring framework for reconnect

  • the following calls are made when in clear():
    - eventCreationManagerWiring.startSquelching();
    - eventCreationManagerWiring.flush();
    - eventCreationManagerWiring.stopSquelching();
    - eventCreationManagerWiring.getInputWire(EventCreationManager::clear).inject(NoInput.getInstance());

  • eventCreationManager is configured as SQUELCHABLE

  • the health monitor observes that the system becomes healthy and reports zero unhealthy duration to all consumers

  • if the DefaultEventCreationManager.reportUnhealthyDuration(healthy) task hits the startSquelching()..stopSquelching() window, its handler is not executed in the BindableInputWire.bindConsumer() lambda and DefaultEventCreationManager.unhealthyDuration stays unchanged

  • while the system remains healthy no more reportUnhealthyDuration(healthy) tasks are submitted due to HealthMonitor::checkSystemHealth logic

  • DefaultEventCreationManager is stuck in the OVERLOADED phase

If it's correct, the proposed fix for HealthMonitor::checkSystemHealth would work. I still believe that asynchronous control tasks are bad (over-engineered) and schedulers should be defined for data wires, not universally for wiring components.

P.S. When analyzing performance or functional issues with the wiring framework I always feel like a boy who bought a bucket of Lego pieces at a garage sale and tries to build the Millennium Falcon without instructions.
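The report-only-on-change behavior at the heart of this theory can be distilled into a toy model (names and logic here are assumptions for illustration, not the actual HealthMonitor code): the transition back to healthy is reported exactly once, so if that single zero report is dropped during the squelching window, the consumer never learns the system recovered.

```java
import java.time.Duration;
import java.util.function.Consumer;

// Hypothetical distillation of the failure mode described above.
final class ReportOnChangeMonitor {
    private Duration lastReported = null;
    private final Consumer<Duration> consumer;

    ReportOnChangeMonitor(final Consumer<Duration> consumer) {
        this.consumer = consumer;
    }

    // Reports the current unhealthy duration only when it differs from the
    // previous report. While the system stays healthy (duration ZERO), the
    // zero value is delivered exactly once; if that one delivery lands in a
    // startSquelching()..stopSquelching() window and is discarded, the
    // consumer is stuck with its stale, non-zero unhealthy duration.
    void checkSystemHealth(final Duration unhealthyDuration) {
        if (!unhealthyDuration.equals(lastReported)) {
            lastReported = unhealthyDuration;
            consumer.accept(unhealthyDuration);
        }
    }
}
```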
