Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

451 reduce the "already-connected peer" error logs from appearing #454

Conversation

b-yap
Copy link
Contributor

@b-yap b-yap commented Nov 27, 2023

closes #451

General overview of the changes:

  1. This PR does not entirely eliminate the already-connected peer; it's more on how to handle it.
    It is explained in users.rust-lang.org and stackexchange that closing the stream does not necessarily mean the connection is immediately cut.
    A certain waiting time has to be allowed before reconnecting with the same tuple (address, port, etc).

    I started with 10 or 15 seconds but both were too early. I've tested 20 seconds and reconnection was successful.
    This wait time increases by 10 seconds, if the reconnect action is called too often (less than 30 minutes). It becomes 30 seconds, then 40, 50, etc.
    However should reconnection happen after a long time (more than 30 minutes), waiting time reverts back to 20 seconds.

  2. tokio::select is disruptive:

    Waits on multiple concurrent branches, returning when the first branch completes, cancelling the remaining branches.

    tokio::select! {
    // runs the stellar-relay and listens to data to collect the scp messages and txsets.
    Some(msg) = overlay_conn.listen() => {
    handle_message(msg, collector_clone.clone(), &sender).await?;
    },
    Some(msg) = receiver.recv() => {
    // We received the instruction to send a message to the overlay network by the receiver
    overlay_conn.send(msg).await?;
    }

    The current implementation is waiting on a listen() which constantly/rapidly receives messages from the Stellar Node.
    The recv() happens when user (or the inner workings inside the agent) wants to send a message to the Node, and it does not occur as plenty as listen(). On every loop, the listen() completes too fast for the recv() to be acknowledged.
    Here are where the messages (which recv() should handle) are sent:
    if let Err(e) = sender.send(StellarMessage::GetScpState(slot)).await {
    tracing::error!(
    "ask_node_for_envelopes(): Proof Building for slot {slot}: failed to send `GetScpState` message: {e:?}"
    );

    message_sender.send(StellarMessage::GetTxSet(txset_hash)).await?;

    These I found, are one of the major results of tokio::select in the integration tests:
    Proof should be available: ProofTimeout("Timeout elapsed for building proof of slot...
    could not find event: Redeem::ExecuteRedeem..

    Since the tokio::select block of code is already inside the loop, it makes sense to just await each of these calls, without a need to choose between them.

  3. Overhaul of the stellar-relay-lib is required, to simplify message sending/receiving between the user/agent and the Stellar Node.

    • the REMOVAL of message wrappers / extra message types
    • the REMOVAL of extra fields in Connector struct:
      • pub struct Connector {
        local: LocalInfo,
        remote_info: Option<RemoteInfo>,
        hmac_keys: Option<HMacKeys>,
        pub(crate) connection_auth: ConnectionAuth,
        pub(crate) timeout_in_secs: u64,
        pub(crate) retries: u8,
        remote_called_us: bool,
        receive_tx_messages: bool,
        receive_scp_messages: bool,
        handshake_state: HandshakeState,
        flow_controller: FlowController,
        /// a channel for writing xdr messages to stream.
        actions_sender: mpsc::Sender<ConnectorActions>,
        /// a channel for communicating back to the caller
        relay_message_sender: mpsc::Sender<StellarRelayMessage>,
        }
        • retries -> removed. Reconnection should be implemented outside this struct.
        • actions_sender -> Removed. This is replaced with a new field: write_stream_overlay which is the write half of the TcpStream
        • relay_message_sender -> Removed. Relaying message to user should be implemented outside this struct.
    • redistribute code from services.rs to the message_reader.rs (directly reading messages from the stream).
    • replace overlay_connection.rs to overlay.rs found outside the connector mod, as this will be for PUBLIC use.

How to begin the review:

  1. clients/service/src/lib.rs -> implementation as mentioned on the 1st point.
  2. clients/vault/src/oracle/agent.rs
    • handle_message() -> accepts StellarMessage instead of StellarRelayMessage. On the 3rd point: No more extra enums.
    • start_oracle_agent()
      • the StellarOverlayConnection has a sender that will send messages to Stellar Node. Instead of creating new sender/receiver channels, we utilize a direct sender.
      • On the 2nd point: since a direct sender is used, we don't need to do a tokio::select.
      • The disconnect_signal_sender and receiver are to signal the overlay connection inside the thread to DISCONNECT from the Stellar Node, if a shutdown is triggered.
  3. clients/stellar-relay-lib/src/overlay.rs
    • replaces the current overlay_connection.rs
    • StellarOverlayConnection
      • has 2 fields:
        * sender - to send messages to Stellar Node
        * receiver - receive messages from Stellar Node
      • the connect() replaces the fn connect() of clients/stellar-relay-lib/src/connection/overlay_connection.rs
        * instead of 2 spawned threads, there is only 1: poll_messages_from_stellar().
      • contains the listen() function called by start_oracle_agent() to listen to messages from Stellar Node
  4. clients/stellar-relay-lib/src/connection/connector/message_reader.rs
    • poll_messages_from_stellar()
      • send_to_node_receiver.try_recv() -> listens for messages from the user, and then send that message to Stellar
      • match read_message_from_stellar(... -> listens for messages from Stellar; process it, and then send to user.
      • outside the loop, make sure to close all channels and the TcpStream

The following changes do not need to be reviewed in order:

  • clients/stellar-relay-lib/examples/connect.rs
    • interpret immediately from StellarMessage, since StellarRelayMessage is removed.
  • clients/stellar-relay-lib/src/config.rs
    • stellar-relay-lib will NOT handle retries anymore, hence it is removed.
  • clients/stellar-relay-lib/src/connection/error.rs
    • additional defined errors: AuthSignatureFailed, AuthFailed, Timeout, and VersionStrTooLong
  • clients/stellar-relay-lib/src/connection/connector/message_handler.rs
    • it will return an Option<StellarMessage>, instead of sending it through a sender ( send_to_user() is removed, mentioned on the 3rd point).
  • clients/stellar-relay-lib/src/connection/connector/message_sender.rs
    • send_to_node(...) accepts a StellarMessage, converts to xdr and write directly to the write half of the stream (which is a field of the Connector)
  • clients/stellar-relay-lib/src/connection/connector/mod.rs
    • removal of ConnectorActions
  • clients/stellar-relay-lib/src/connection/helper.rs
    • remove of log_error macro since it's not being used anymore.
    • the create_stream() is a function from the clients/stellar-relay-lib/src/connection/services.rs
  • clients/stellar-relay-lib/src/connection/mod.rs
    • removal of StellarRelayMessage
    • removal of retries field in ConnectionInfo. Retry is performed in clients/service/src/lib.rs

/// a channel for communicating back to the caller
relay_message_sender: mpsc::Sender<StellarRelayMessage>,
/// for writing xdr messages to stream.
pub(crate) write_stream_overlay: OwnedWriteHalf,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of interpreting the message because of its wrapper (ConnectorActions, StellarRelayMesssage), the Connector will receive a StellarMessage from the user and send that immediately to the Node.

@b-yap b-yap requested a review from a team December 7, 2023 12:52
@b-yap b-yap marked this pull request as ready for review December 7, 2023 12:52
Copy link
Member

@ebma ebma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all your efforts @b-yap. I really like the refactorings and I think you improved it significantly, especially removing the overhead with the extra senders, the ConnectorActions and tokio::select!.
Also, nice catch with the disconnect and the backoff delay when restarting the service. 👍

I tested it locally and it seems to work fine. I can confirm that the client restarts properly when it encounters the 'already connected' error.

But I encountered an issue with the test_get_proof_for_current_slot() though as it again timed out when running the tests locally. Can you confirm that this test is shaky? Maybe we should increase the latest_slot variable here by 3 or more instead of using 2?

clients/service/src/lib.rs Outdated Show resolved Hide resolved
clients/service/src/lib.rs Outdated Show resolved Hide resolved
clients/vault/src/oracle/agent.rs Outdated Show resolved Hide resolved
clients/vault/src/oracle/agent.rs Outdated Show resolved Hide resolved
clients/stellar-relay-lib/src/overlay.rs Outdated Show resolved Hide resolved
clients/stellar-relay-lib/src/overlay.rs Outdated Show resolved Hide resolved
clients/stellar-relay-lib/src/overlay.rs Outdated Show resolved Hide resolved
clients/stellar-relay-lib/src/connection/error.rs Outdated Show resolved Hide resolved
Copy link
Member

@ebma ebma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing my comments 👍
Let's see if the CI passes, otherwise it's good to go from my side

clients/vault/src/oracle/agent.rs Outdated Show resolved Hide resolved
@b-yap b-yap merged commit f03cae0 into main Dec 12, 2023
2 checks passed
@b-yap b-yap deleted the 451-vault-client-unable-to-re-connect-to-stellar-overlay-when-restarting branch December 12, 2023 08:49
@b-yap
Copy link
Contributor Author

b-yap commented Dec 12, 2023

@ebma I had to revert from 3 to 2 for the test test_get_proof_for_current_slot because it kept failing on the CI.

b-yap added a commit that referenced this pull request Dec 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Vault client unable to re-connect to Stellar overlay when restarting
2 participants