451 reduce the "already-connected peer" error logs from appearing #454
introduce "disconnect" message
…llar-overlay-when-restarting
by passing the same shutdown sender to the oracle agent
/// a channel for communicating back to the caller
relay_message_sender: mpsc::Sender<StellarRelayMessage>,
/// for writing xdr messages to stream.
pub(crate) write_stream_overlay: OwnedWriteHalf,
Instead of interpreting the message based on its wrapper (`ConnectorActions`, `StellarRelayMessage`), the `Connector` will receive a `StellarMessage` from the user and send that immediately to the Node.
Thanks for all your efforts @b-yap. I really like the refactorings and I think you improved it significantly, especially removing the overhead with the extra senders, the `ConnectorActions` and `tokio::select!`.
Also, nice catch with the disconnect and the backoff delay when restarting the service. 👍
I tested it locally and it seems to work fine. I can confirm that the client restarts properly when it encounters the 'already connected' error.
But I encountered an issue with the `test_get_proof_for_current_slot()` test, as it again timed out when running the tests locally. Can you confirm that this test is flaky? Maybe we should increase the `latest_slot` variable here by `3` or more instead of using `2`?
clients/stellar-relay-lib/src/connection/connector/message_reader.rs
Thanks for addressing my comments 👍
Let's see if the CI passes, otherwise it's good to go from my side
@ebma I had to revert from 3 to 2 for the test
closes #451
General overview of the changes:
1. This PR does not entirely eliminate the `already-connected peer` error; it is more about how to handle it. It is explained on users.rust-lang.org and stackexchange that closing the stream does not necessarily mean the connection is immediately cut.
A certain waiting time has to be allowed before reconnecting with the same tuple (address, port, etc.).
I started with 10 or 15 seconds, but both were too short. I've tested 20 seconds and reconnection was successful.
This wait time increases by 10 seconds if the reconnect action is called too often (less than 30 minutes apart). It becomes 30 seconds, then 40, 50, etc.
However, should reconnection happen after a long time (more than 30 minutes), the waiting time reverts to 20 seconds.
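The waiting rule described above can be sketched as follows. This is only an illustration of the arithmetic, not the code in this PR: `RestartBackoff`, `next_wait` and the constant names are hypothetical, and only the 20 second base, the 10 second step and the 30 minute threshold come from the description.

```rust
use std::time::Duration;

// Values taken from the description above; the names are hypothetical.
const BASE_WAIT: Duration = Duration::from_secs(20);
const WAIT_STEP: Duration = Duration::from_secs(10);
const RESET_AFTER: Duration = Duration::from_secs(30 * 60);

/// Hypothetical helper that tracks how long to wait before reconnecting.
struct RestartBackoff {
    current: Duration,
}

impl RestartBackoff {
    fn new() -> Self {
        Self { current: BASE_WAIT }
    }

    /// `since_last` is the time elapsed since the previous reconnect,
    /// or `None` for the very first one.
    fn next_wait(&mut self, since_last: Option<Duration>) -> Duration {
        self.current = match since_last {
            // Reconnecting again within 30 minutes: wait 10 seconds longer.
            Some(elapsed) if elapsed < RESET_AFTER => self.current + WAIT_STEP,
            // First reconnect, or a long quiet period: back to 20 seconds.
            _ => BASE_WAIT,
        };
        self.current
    }
}
```

A caller would sleep for the returned duration before opening a new connection with the same address and port.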
2. `tokio::select` is disruptive:
spacewalk/clients/vault/src/oracle/agent.rs
Lines 102 to 111 in 13d25ae
The current implementation waits on a `listen()`, which constantly and rapidly receives messages from the Stellar Node. The `recv()` happens when the user (or the inner workings of the agent) wants to send a message to the Node, and it does not occur as often as `listen()`. On every loop, the `listen()` completes too fast for the `recv()` to be acknowledged.
Here is where the messages (which `recv()` should handle) are sent:
spacewalk/clients/vault/src/oracle/collector/proof_builder.rs
Lines 96 to 99 in 13d25ae
spacewalk/clients/vault/src/oracle/collector/handler.rs
Line 39 in 13d25ae
These, I found, are some of the major results of `tokio::select` in the integration tests:
Proof should be available: ProofTimeout("Timeout elapsed for building proof of slot...
could not find event: Redeem::ExecuteRedeem..
Since the `tokio::select` block of code is already inside the loop, it makes sense to just `await` each of these calls, without a need to choose between them.
3. Overhaul of the `stellar-relay-lib` is required, to simplify message sending/receiving between the user/agent and the Stellar Node.
spacewalk/clients/stellar-relay-lib/src/connection/connector/mod.rs
Lines 12 to 17 in 13d25ae
spacewalk/clients/stellar-relay-lib/src/connection/mod.rs
Lines 30 to 45 in 13d25ae
`Connector` struct:
spacewalk/clients/stellar-relay-lib/src/connection/connector/connector.rs
Lines 18 to 40 in 13d25ae
* `retries` -> Removed. Reconnection should be implemented outside this struct.
* `actions_sender` -> Removed. This is replaced with a new field, `write_stream_overlay`, which is the write half of the `TcpStream`.
* `relay_message_sender` -> Removed. Relaying messages to the user should be implemented outside this struct.
`connector` mod, as this will be for PUBLIC use.
How to begin the review:
clients/service/src/lib.rs -> implementation as mentioned on the 1st point.
clients/vault/src/oracle/agent.rs
* `handle_message()` -> accepts `StellarMessage` instead of `StellarRelayMessage`. On the 3rd point: No more extra enums.
* `start_oracle_agent()`
* `StellarOverlayConnection` has a `sender` that will send messages to the Stellar Node. Instead of creating new sender/receiver channels, we utilize a direct sender.
* Since `sender` is used, we don't need to do a `tokio::select`.
* `disconnect_signal_sender` and receiver signal the overlay connection inside the thread to DISCONNECT from the Stellar Node, if a shutdown is triggered.
clients/stellar-relay-lib/src/overlay.rs
overlay_connection.rs
`StellarOverlayConnection`
* `sender` - to send messages to Stellar Node
* `receiver` - receive messages from Stellar Node
`connect()` replaces the `fn connect()` of clients/stellar-relay-lib/src/connection/overlay_connection.rs
* instead of 2 spawned threads, there is only 1: `poll_messages_from_stellar()`.
`listen()` function called by `start_oracle_agent()` to listen to messages from Stellar Node
clients/stellar-relay-lib/src/connection/connector/message_reader.rs
`poll_messages_from_stellar()`
* `send_to_node_receiver.try_recv()` -> listens for messages from the user, and then sends that message to Stellar
* `match read_message_from_stellar(...` -> listens for messages from Stellar, processes them, and then sends them to the user.
The following changes do not need to be reviewed in order:
clients/stellar-relay-lib/examples/connect.rs
* `StellarMessage`, since `StellarRelayMessage` is removed.
clients/stellar-relay-lib/src/config.rs
* `stellar-relay-lib` will NOT handle retries anymore, hence it is removed.
clients/stellar-relay-lib/src/connection/error.rs
* `AuthSignatureFailed`, `AuthFailed`, `Timeout`, and `VersionStrTooLong`
clients/stellar-relay-lib/src/connection/connector/message_handler.rs
* `Option<StellarMessage>`, instead of sending it through a sender (`send_to_user()` is removed, mentioned on the 3rd point).
clients/stellar-relay-lib/src/connection/connector/message_sender.rs
* `send_to_node(...)` accepts a `StellarMessage`, converts it to xdr and writes directly to the write half of the stream (which is a field of the `Connector`)
clients/stellar-relay-lib/src/connection/connector/mod.rs
* `ConnectorActions`
clients/stellar-relay-lib/src/connection/helper.rs
* `log_error` macro since it's not being used anymore.
* `create_stream()` is a function from the clients/stellar-relay-lib/src/connection/services.rs
clients/stellar-relay-lib/src/connection/mod.rs
* `StellarRelayMessage`
* `retries` field in `ConnectionInfo`. Retry is performed in clients/service/src/lib.rs
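
For reviewers, the shape of the single polling loop in `poll_messages_from_stellar()` described above can be sketched roughly like this. It is a simplified, synchronous illustration using `std::sync::mpsc` rather than the async code in this PR: `Message` stands in for `StellarMessage`, the `Vec` stands in for the write half of the stream, the iterator stands in for reading from the Stellar Node, and `poll_once` is a hypothetical name.

```rust
use std::sync::mpsc::{Receiver, Sender, TryRecvError};

/// Stand-in for the real `StellarMessage` type (assumption for this sketch).
#[derive(Debug, Clone, PartialEq)]
struct Message(String);

/// One iteration of the loop: forward at most one pending user message to
/// the node, then relay at most one node message back to the user.
/// Returns `false` once the user side has hung up.
fn poll_once(
    from_user: &Receiver<Message>,
    to_node: &mut Vec<Message>, // stand-in for the write half of the stream
    from_node: &mut impl Iterator<Item = Message>, // stand-in for the read half
    to_user: &Sender<Message>,
) -> bool {
    // Non-blocking check, so a quiet user never stalls reading from the node.
    match from_user.try_recv() {
        Ok(msg) => to_node.push(msg),
        Err(TryRecvError::Empty) => {}
        Err(TryRecvError::Disconnected) => return false,
    }

    // Read the next message from the node, if any, and pass it to the user.
    if let Some(msg) = from_node.next() {
        if to_user.send(msg).is_err() {
            return false;
        }
    }
    true
}
```

Because both steps run in sequence on every iteration, neither direction depends on winning a `select` race against the other.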