451 reduce the "already-connected peer" error logs from appearing #454

b-yap · 2023-11-27T10:50:01Z

closes #451

General overview of the changes:

This PR does not entirely eliminate the already-connected peer; it's more on how to handle it.
It is explained in users.rust-lang.org and stackexchange that closing the stream does not necessarily mean the connection is immediately cut.
A certain waiting time has to be allowed before reconnecting with the same tuple (address, port, etc).

I started with 10 or 15 seconds but both were too early. I've tested 20 seconds and reconnection was successful.
This wait time increases by 10 seconds, if the reconnect action is called too often (less than 30 minutes). It becomes 30 seconds, then 40, 50, etc.
However should reconnection happen after a long time (more than 30 minutes), waiting time reverts back to 20 seconds.

tokio::select is disruptive:

Waits on multiple concurrent branches, returning when the first branch completes, cancelling the remaining branches.

spacewalk/clients/vault/src/oracle/agent.rs

Lines 102 to 111 in 13d25ae

    
           tokio::select! { 
        
           	// runs the stellar-relay and listens to data to collect the scp messages and txsets. 
        
           	Some(msg) = overlay_conn.listen() => { 
        
           		handle_message(msg, collector_clone.clone(), &sender).await?; 
        
           	}, 
        
           	Some(msg) = receiver.recv() => { 
        
           		// We received the instruction to send a message to the overlay network by the receiver 
        
           		overlay_conn.send(msg).await?; 
        
           	}

The current implementation is waiting on a listen() which constantly/rapidly receives messages from the Stellar Node.
The recv() happens when user (or the inner workings inside the agent) wants to send a message to the Node, and it does not occur as plenty as listen(). On every loop, the listen() completes too fast for the recv() to be acknowledged.
Here are where the messages (which recv() should handle) are sent:

spacewalk/clients/vault/src/oracle/collector/proof_builder.rs

Lines 96 to 99 in 13d25ae

    
           if let Err(e) = sender.send(StellarMessage::GetScpState(slot)).await { 
        
           	tracing::error!( 
        
           		"ask_node_for_envelopes(): Proof Building for slot {slot}: failed to send `GetScpState` message: {e:?}" 
        
           	);

spacewalk/clients/vault/src/oracle/collector/handler.rs

Line 39 in 13d25ae

message_sender.send(StellarMessage::GetTxSet(txset_hash)).await?;

These I found, are one of the major results of tokio::select in the integration tests:
Proof should be available: ProofTimeout("Timeout elapsed for building proof of slot...
could not find event: Redeem::ExecuteRedeem..

Since the tokio::select block of code is already inside the loop, it makes sense to just await each of these calls, without a need to choose between them.

Overhaul of the stellar-relay-lib is required, to simplify message sending/receiving between the user/agent and the Stellar Node.

the REMOVAL of message wrappers / extra message types

spacewalk/clients/stellar-relay-lib/src/connection/connector/mod.rs

Lines 12 to 17 in 13d25ae

    
           pub enum ConnectorActions { 
        
           	SendHello, 
        
           	SendMessage(Box<StellarMessage>), 
        
           	HandleMessage(Xdr), 
        
           	Disconnect, 
        
           }

spacewalk/clients/stellar-relay-lib/src/connection/mod.rs

Lines 30 to 45 in 13d25ae

    
           pub enum StellarRelayMessage { 
        
           	/// Successfully connected to the node 
        
           	Connect { 
        
           		pub_key: PublicKey, 
        
           		node_info: NodeInfo, 
        
           	}, 
        
           	/// Stellar messages from the node 
        
           	Data { 
        
           		p_id: u32, 
        
           		msg_type: MessageType, 
        
           		msg: Box<StellarMessage>, 
        
           	}, 
        
           	Error(String), 
        
           	/// The amount of time to wait for Stellar Node messages 
        
           	Timeout, 
        
           }

the REMOVAL of extra fields in Connector struct:

spacewalk/clients/stellar-relay-lib/src/connection/connector/connector.rs

Lines 18 to 40 in 13d25ae

    
           pub struct Connector { 
        
           	local: LocalInfo, 
        
           	remote_info: Option<RemoteInfo>, 
        
           	hmac_keys: Option<HMacKeys>, 
        
           	pub(crate) connection_auth: ConnectionAuth, 
        
           	pub(crate) timeout_in_secs: u64, 
        
           	pub(crate) retries: u8, 
        
           	remote_called_us: bool, 
        
           	receive_tx_messages: bool, 
        
           	receive_scp_messages: bool, 
        
           	handshake_state: HandshakeState, 
        
           	flow_controller: FlowController, 
        
           	/// a channel for writing xdr messages to stream. 
        
           	actions_sender: mpsc::Sender<ConnectorActions>, 
        
           	/// a channel for communicating back to the caller 
        
           	relay_message_sender: mpsc::Sender<StellarRelayMessage>, 
        
           }

retries -> removed. Reconnection should be implemented outside this struct.
actions_sender -> Removed. This is replaced with a new field: write_stream_overlay which is the write half of the TcpStream
relay_message_sender -> Removed. Relaying message to user should be implemented outside this struct.

redistribute code from services.rs to the message_reader.rs (directly reading messages from the stream).
replace overlay_connection.rs to overlay.rs found outside the connector mod, as this will be for PUBLIC use.

How to begin the review:

clients/service/src/lib.rs -> implementation as mentioned on the 1st point.
clients/vault/src/oracle/agent.rs
- handle_message() -> accepts StellarMessage instead of StellarRelayMessage. On the 3rd point: No more extra enums.
- start_oracle_agent()
  - the StellarOverlayConnection has a sender that will send messages to Stellar Node. Instead of creating new sender/receiver channels, we utilize a direct sender.
  - On the 2nd point: since a direct sender is used, we don't need to do a tokio::select.
  - The disconnect_signal_sender and receiver are to signal the overlay connection inside the thread to DISCONNECT from the Stellar Node, if a shutdown is triggered.
clients/stellar-relay-lib/src/overlay.rs
- replaces the current overlay_connection.rs
- StellarOverlayConnection
  - has 2 fields:
    * sender - to send messages to Stellar Node
    * receiver - receive messages from Stellar Node
  - the connect() replaces the fn connect() of clients/stellar-relay-lib/src/connection/overlay_connection.rs
    * instead of 2 spawned threads, there is only 1: poll_messages_from_stellar().
  - contains the listen() function called by start_oracle_agent() to listen to messages from Stellar Node
clients/stellar-relay-lib/src/connection/connector/message_reader.rs
- poll_messages_from_stellar()
  - send_to_node_receiver.try_recv() -> listens for messages from the user, and then send that message to Stellar
  - match read_message_from_stellar(... -> listens for messages from Stellar; process it, and then send to user.
  - outside the loop, make sure to close all channels and the TcpStream

The following changes do not need to be reviewed in order:

clients/stellar-relay-lib/examples/connect.rs
- interpret immediately from StellarMessage, since StellarRelayMessage is removed.
clients/stellar-relay-lib/src/config.rs
- stellar-relay-lib will NOT handle retries anymore, hence it is removed.
clients/stellar-relay-lib/src/connection/error.rs
- additional defined errors: AuthSignatureFailed, AuthFailed, Timeout, and VersionStrTooLong
clients/stellar-relay-lib/src/connection/connector/message_handler.rs
- it will return an Option<StellarMessage>, instead of sending it through a sender ( send_to_user() is removed, mentioned on the 3rd point).
clients/stellar-relay-lib/src/connection/connector/message_sender.rs
- send_to_node(...) accepts a StellarMessage, converts to xdr and write directly to the write half of the stream (which is a field of the Connector)
clients/stellar-relay-lib/src/connection/connector/mod.rs
- removal of ConnectorActions
clients/stellar-relay-lib/src/connection/helper.rs
- remove of log_error macro since it's not being used anymore.
- the create_stream() is a function from the clients/stellar-relay-lib/src/connection/services.rs
clients/stellar-relay-lib/src/connection/mod.rs
- removal of StellarRelayMessage
- removal of retries field in ConnectionInfo. Retry is performed in clients/service/src/lib.rs

clients/stellar-relay-lib/src/connection/helper.rs

clients/stellar-relay-lib/src/connection/connector/connector.rs

clients/stellar-relay-lib/src/connection/overlay_connection.rs

introduce "disconnect" message

…llar-overlay-when-restarting

by passing the same shutdown sender to the oracle agent

b-yap · 2023-12-07T08:19:13Z

clients/stellar-relay-lib/src/connection/connector/connector.rs

-	/// a channel for communicating back to the caller
-	relay_message_sender: mpsc::Sender<StellarRelayMessage>,
+	/// for writing xdr messages to stream.
+	pub(crate) write_stream_overlay: OwnedWriteHalf,


instead of interpreting the message because of its wrapper (ConnectorActions, StellarRelayMesssage), the Connector will receive a StellarMessage from the user and send that immediately to the Node.

ebma

Thanks for all your efforts @b-yap. I really like the refactorings and I think you improved it significantly, especially removing the overhead with the extra senders, the ConnectorActions and tokio::select!.
Also, nice catch with the disconnect and the backoff delay when restarting the service. 👍

I tested it locally and it seems to work fine. I can confirm that the client restarts properly when it encounters the 'already connected' error.

But I encountered an issue with the test_get_proof_for_current_slot() though as it again timed out when running the tests locally. Can you confirm that this test is shaky? Maybe we should increase the latest_slot variable here by 3 or more instead of using 2?

clients/service/src/lib.rs

clients/vault/src/oracle/agent.rs

clients/stellar-relay-lib/src/overlay.rs

clients/stellar-relay-lib/src/connection/connector/message_reader.rs

clients/stellar-relay-lib/src/overlay.rs

clients/stellar-relay-lib/src/connection/error.rs

…319850, #454 (comment), #454 (comment), #454 (comment), #454 (comment), #454 (comment), #454 (comment), #454 (comment), #454 (comment), #454 (review)

ebma

Thanks for addressing my comments 👍
Let's see if the CI passes, otherwise it's good to go from my side

clients/vault/src/oracle/agent.rs

…ob/19521307640#step:13:66

b-yap · 2023-12-12T08:50:39Z

@ebma I had to revert from 3 to 2 for the test test_get_proof_for_current_slot because it kept failing on the CI.

…ring (#454)" This reverts commit f03cae0.

b-yap added 2 commits November 27, 2023 18:46

reduce the "already-connected peer" error logs from appearing

db45d0a

drop stream and connector before returning

6be1485

b-yap commented Nov 27, 2023

View reviewed changes

clients/stellar-relay-lib/src/connection/helper.rs Show resolved Hide resolved

b-yap commented Nov 27, 2023

View reviewed changes

clients/stellar-relay-lib/src/connection/connector/connector.rs Show resolved Hide resolved

b-yap commented Nov 27, 2023

View reviewed changes

clients/stellar-relay-lib/src/connection/overlay_connection.rs Outdated Show resolved Hide resolved

b-yap and others added 16 commits November 28, 2023 00:02

remove timeout;

c40a5d0

introduce "disconnect" message

new connection change

c192eb0

new connection change

142f551

new changes

0bec097

cleanup

ee24e88

added more meaningful errors

56a6cfe

new update

d2487e4

Fix compilation

33e1a00

Small refactoring

a88c162

fix test cases and run cargo fmt

5a3d55e

merge from remote

9dcb618

Merge branch 'main' into 451-vault-client-unable-to-re-connect-to-ste…

1c9eb18

…llar-overlay-when-restarting

Fix shutdown not propagated

3fdc156

by passing the same shutdown sender to the oracle agent

add retry counter and remove extra channel

eb664ec

cargo fmt and fix warnings

bfde71a

fix failing test cases due to time elapsed with building proofs

461337d

b-yap commented Dec 7, 2023

View reviewed changes

remove never-ending test and additional cleanup

e49b266

b-yap requested a review from a team December 7, 2023 12:52

b-yap marked this pull request as ready for review December 7, 2023 12:52

b-yap mentioned this pull request Dec 7, 2023

Refactor stellar-relay-lib and vault/oracle crate implementations #372

Open

2 tasks

support multi-thread in tests

fd5b72f

ebma reviewed Dec 7, 2023

View reviewed changes

https://github.com/pendulum-chain/spacewalk/pull/454#discussion_r1419…

df86a24

…319850, #454 (comment), #454 (comment), #454 (comment), #454 (comment), #454 (comment), #454 (comment), #454 (comment), #454 (comment), #454 (review)

ebma approved these changes Dec 11, 2023

View reviewed changes

clients/vault/src/oracle/agent.rs Outdated Show resolved Hide resolved

b-yap added 3 commits December 11, 2023 20:53

remove env logger

956da84

reverting back to 2. Maybe 3 is too forward-looking

3c967a0

https://github.com/pendulum-chain/spacewalk/actions/runs/7169885052/j…

2d1e6b3

…ob/19521307640#step:13:66

b-yap merged commit f03cae0 into main Dec 12, 2023
2 checks passed

b-yap deleted the 451-vault-client-unable-to-re-connect-to-stellar-overlay-when-restarting branch December 12, 2023 08:49

b-yap added a commit that referenced this pull request Dec 14, 2023

Revert "451 reduce the "already-connected peer" error logs from appea…

905c3a6

…ring (#454)" This reverts commit f03cae0.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

451 reduce the "already-connected peer" error logs from appearing #454

451 reduce the "already-connected peer" error logs from appearing #454

b-yap commented Nov 27, 2023 •

edited

Loading

b-yap Dec 7, 2023

ebma left a comment

ebma left a comment

b-yap commented Dec 12, 2023

	tokio::select! {
	// runs the stellar-relay and listens to data to collect the scp messages and txsets.
	Some(msg) = overlay_conn.listen() => {
	handle_message(msg, collector_clone.clone(), &sender).await?;
	},

	Some(msg) = receiver.recv() => {
	// We received the instruction to send a message to the overlay network by the receiver
	overlay_conn.send(msg).await?;
	}

	if let Err(e) = sender.send(StellarMessage::GetScpState(slot)).await {
	tracing::error!(
	"ask_node_for_envelopes(): Proof Building for slot {slot}: failed to send `GetScpState` message: {e:?}"
	);

	pub enum ConnectorActions {
	SendHello,
	SendMessage(Box<StellarMessage>),
	HandleMessage(Xdr),
	Disconnect,
	}

	pub enum StellarRelayMessage {
	/// Successfully connected to the node
	Connect {
	pub_key: PublicKey,
	node_info: NodeInfo,
	},
	/// Stellar messages from the node
	Data {
	p_id: u32,
	msg_type: MessageType,
	msg: Box<StellarMessage>,
	},
	Error(String),
	/// The amount of time to wait for Stellar Node messages
	Timeout,
	}

	pub struct Connector {
	local: LocalInfo,

	remote_info: Option<RemoteInfo>,
	hmac_keys: Option<HMacKeys>,

	pub(crate) connection_auth: ConnectionAuth,
	pub(crate) timeout_in_secs: u64,
	pub(crate) retries: u8,

	remote_called_us: bool,
	receive_tx_messages: bool,
	receive_scp_messages: bool,

	handshake_state: HandshakeState,
	flow_controller: FlowController,

	/// a channel for writing xdr messages to stream.
	actions_sender: mpsc::Sender<ConnectorActions>,

	/// a channel for communicating back to the caller
	relay_message_sender: mpsc::Sender<StellarRelayMessage>,
	}

451 reduce the "already-connected peer" error logs from appearing #454

451 reduce the "already-connected peer" error logs from appearing #454

Conversation

b-yap commented Nov 27, 2023 • edited Loading

General overview of the changes:

How to begin the review:

b-yap Dec 7, 2023

Choose a reason for hiding this comment

ebma left a comment

Choose a reason for hiding this comment

ebma left a comment

Choose a reason for hiding this comment

b-yap commented Dec 12, 2023

b-yap commented Nov 27, 2023 •

edited

Loading