
[Core] Enable concurrent connections between nodes #2664

Merged

merged 2 commits into main on Feb 8, 2025
Conversation

AhmedSoliman
Contributor

@AhmedSoliman AhmedSoliman commented Feb 7, 2025

This enables nodes to maintain multiple concurrent connections to a peer, each over a separate TCP connection, to increase message-processing concurrency. This is controlled by a new configuration option in the `[networking]` section.

This also wraps a few operations in `tokio::task::unconstrained` to reduce unnecessary coop-driven yields on some hot paths.

```
// intentionally empty
```

Stack created with Sapling. Best reviewed with ReviewStack.
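As a hedged sketch of how the new option might appear in a node's configuration file: the option name and value below are assumptions for illustration (the PR text does not spell them out), so check the released configuration reference for the actual name.

```toml
# Illustrative only: the option name below is a guess, not confirmed by this PR.
[networking]
# Hypothetical: number of concurrent connections to maintain per peer.
num-concurrent-connections = 3
```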


github-actions bot commented Feb 7, 2025

Test Results

7 files ±0, 7 suites ±0; runtime 2m 55s ⏱️ (−1m 41s)
47 tests ±0: 46 ✅ ±0, 1 💤 ±0, 0 ❌ ±0
182 runs ±0: 179 ✅ ±0, 3 💤 ±0, 0 ❌ ±0

Results for commit 96b61a0. ± Comparison against base commit 765636b.

♻️ This comment has been updated with latest results.

Contributor

@tillrohrmann tillrohrmann left a comment


LGTM. +1 for merging :-)

Comment on lines +219 to +221
```rust
// If we are already at the metadata version, avoid tokio's yielding to
// improve tail latencies when this is used in latency-sensitive operations.
let v = tokio::task::unconstrained(recv.wait_for(|v| *v >= min_version))
```
Contributor


TIL

```diff
@@ -669,8 +687,8 @@ where
     global::get_text_map_propagator(|propagator| propagator.extract(span_ctx))
 });

-if let Err(e) = router
-    .call(
+if let Err(e) = tokio::task::unconstrained(
```
Contributor


Maybe add a comment why this task should be unconstrained for my future self.

```diff
@@ -421,7 +421,7 @@ impl<M: Targeted + WireEncode> Outgoing<M, HasConnection> {
     let connection = bail_on_error!(self, self.try_upgrade());
     let permit = bail_on_none!(
         self,
-        connection.reserve().await,
+        tokio::task::unconstrained(connection.reserve()).await,
```
Contributor


Did you observe an effect of cooperative scheduling w/o unconstrained?

Contributor Author


Yes, although it of course varies depending on the workload. I was able to observe a 30% improvement in tail latencies of `GetSequencerState` processing when the system is under heavy load.

Comment on lines +83 to +85
```diff
+    #[strum(props(runtime = "default"))]
+    SocketHandler,
-    #[strum(props(OnError = "log"))]
+    #[strum(props(OnError = "log", runtime = "default"))]
```
Contributor


Did you observe an effect of cross runtime overhead (global task queue delays)?

Contributor Author


There are two reasons for this change:
1. We don't want connections to drop when a partition processor is terminated.
2. The majority of writers to the network run in the default runtime (bifrost, metadata syncing, cluster controller); the only exception is ingress. The receiving end has a somewhat similar, albeit slightly different, story.

Contributor Author


To answer your question: the effect I measured was that the unnecessary connection drops during leadership changes, which may cause extra retries, have disappeared.

Comment on lines 69 to 70
```rust
/// This is used as a guiding value for how many connections every node can
/// maintain with its peers. With more connections, concurrency of network message
```
Contributor


Maybe "maintain with each peer" to make it clearer that it's the number of connections per peer. Maybe num_concurrent_connections_per_peer (a bit of a mouthful)?

Contributor Author


I'll update the documentation, but I'll probably stick with the short name.

@AhmedSoliman AhmedSoliman merged commit 45e4664 into main Feb 8, 2025
34 checks passed
@AhmedSoliman AhmedSoliman deleted the pr2664 branch February 8, 2025 12:47