-
Notifications
You must be signed in to change notification settings - Fork 94
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(p2p): ensure time synchronization in the network #2255
base: dev
Are you sure you want to change the base?
Conversation
b675759
to
afe2212
Compare
Signed-off-by: onur-ozkan <work@onurozkan.dev>
afe2212
to
4343c0e
Compare
Signed-off-by: onur-ozkan <work@onurozkan.dev>
f9dd830
to
5acfe8d
Compare
Signed-off-by: onur-ozkan <work@onurozkan.dev>
da8335c
to
82ae6d2
Compare
Signed-off-by: onur-ozkan <work@onurozkan.dev>
Signed-off-by: onur-ozkan <work@onurozkan.dev>
Signed-off-by: onur-ozkan <work@onurozkan.dev>
/// TODO: This should be called `PeerInfoRequest` instead. However, renaming it | ||
/// will introduce a breaking change in the network and is not worth it. Do this | ||
/// renaming when there is already a breaking change in the release. | ||
NetworkInfo(network_info::NetworkInfoRequest), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR already leads to breaking change. It's perfect time to rename this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One way to avoid breaking changes is to have a timeout on the request for peer timestamp and allow them to connect if there is no response from the peer. We later remove this when most peers update. Another way is to get the mm2 version and do the timestamp request depending on the version. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One way to avoid breaking changes is to have a timeout on the request for peer timestamp and allow them to connect if there is no response from the peer. We later remove this when most peers update.
This sounds like a good plan!
Another way is to get the mm2 version and do the timestamp request depending on the version. What do you think?
I think the first idea was better, but then I need to revert the peer-info renaming part. What about bundling these changes in some branch, and then releasing multiple breaking changes at once (like removing mm2 binaries and some other stuff) ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about bundling these changes in some branch, and then releasing multiple breaking changes at once (like removing mm2 binaries and some other stuff) ?
I am fine with this or the first approach.
Signed-off-by: onur-ozkan <work@onurozkan.dev>
Signed-off-by: onur-ozkan <work@onurozkan.dev>
5ba1ac5
to
e86a166
Compare
Signed-off-by: onur-ozkan <work@onurozkan.dev>
e86a166
to
c1a6eaf
Compare
Signed-off-by: onur-ozkan <work@onurozkan.dev>
Signed-off-by: onur-ozkan <work@onurozkan.dev>
Signed-off-by: onur-ozkan <work@onurozkan.dev>
… into time-synced-network
Signed-off-by: onur-ozkan <work@onurozkan.dev>
/// Used in time validation logic for each peer which runs immediately after the | ||
/// `ConnectionEstablished` event. | ||
const MAX_TIME_GAP_FOR_CONNECTED_PEER: u64 = 30; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure if this is a good default. Should we make it like 20 or 25?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
30 seconds are fine since it's half the max allowed difference for swaps..
const MAX_STARTED_AT_DIFF: u64 = 60; |
However, An edge case that can happen in theory is time differences accumulating across the network. Consider a scenario with peers A, B, C and D:
- Peers B and C are relay/seed nodes
- Peers A and D are starting a swap as taker/maker
- If time differences exist as follows:
- Peer A is 30 secs off peer B but they allow them to connect
- Peer D is also 30 secs off peer C but they allow them to connect
- Peer B and C are out of sync as well (any number of seconds less than 30)
- Peer A and D can have more than 60 seconds difference and the swap fails.
This is an edge case that we should keep in mind in case we got time difference errors in a swap after this fix, but I think it will happen in very rare cases if any.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
P.S. let's make it 20 secs to reduce the probability of this edge case happening.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One last note :)
We should probably make time differences dependent, for instance MAX_STARTED_AT_DIFF = MAX_TIME_GAP_FOR_CONNECTED_PEER * 3
in case a dev wants to change one of them, they should be aware of how this affects or is affected by other time differences.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The edge case can only happen if seed nodes are not synced in time, which shouldn't be the case I believe? Since we are the maintainers of the seed network.
P.S. let's make it 20 secs to reduce the probability of this edge case happening.
I can do that, but I am not sure if that's a good idea. I had ~9 seconds diff between my laptop and desktop computer, so I believe in some cases it could be even more than that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a way to inform the peer that their clock is off so that they can adjust it?
Just read internal discussions about this, the currently connected peers is one solution to this but not optimal, will think of something if there are any other solutions. A problem we have is the no login mode, GUIs should be able to show the orderbooks and other info for no login mode even if the peer clock is off, it will be strange for someone who wants to have a look at our dex to get a warning about system clock when not logged in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO it's totally reasonable to tell someone to sync their time if their system isn't (which shouldn't be the case for general-purpose OS users). There are so many apps doing this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are so many apps doing this.
Never seen it for web apps like komodo wallet though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess maybe you never run your OS unsynced in time :D We can provide some REST api to provide some general data for no-login mode. If we want to keep the network synced in time, I don't know what else we can do other than not connecting to unsynced ones.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess maybe you never run your OS unsynced in time :D
You are probably right :D
If we want to keep the network synced in time, I don't know what else we can do other than not connecting to unsynced ones.
I would leave the PR as is and if we got something from GUI team about this we can look for workarounds for this particular case. We need to mention it though while they are testing.
Signed-off-by: onur-ozkan <work@onurozkan.dev>
Could you clarify please, we compare which peer time with what? |
Current peer's time against to the remote peer. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks so much for the detailed PR desc :)
First review iter, will prolly need just one more.
const MAX_TIME_GAP_FOR_CONNECTED_PEER: u64 = 30; | ||
|
||
/// Used for storing peers in [`RECENTLY_DIALED_PEERS`]. | ||
const DIAL_RETRY_DELAY: Duration = Duration::from_secs(60 * 5); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I would write this 5 * 60
:)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
explanation: it's just semantically clearer to put the conversion factors at the end (and ordered, so for 2 days converted to seconds u would use 2 * 24 * 60 * 60
)
mm2src/mm2_p2p/src/application/request_response/network_info.rs
Outdated
Show resolved
Hide resolved
|
||
match request_one_peer(peer, encoded_request, rp_sender).await { | ||
PeerResponse::Ok { res } => { | ||
if let Ok(timestamp) = decode_message::<u64>(&res) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ummm, this is decoded as u64
, but the response here...
mm2src/mm2_main/src/lp_stats.rs
Outdated
fn process_get_peer_utc_timestamp_request() -> Result<Vec<u8>, String> { | ||
let timestamp = common::get_utc_timestamp(); | ||
encode_message(×tamp).map_err(|e| e.to_string()) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
...encodes it as a i64
. we should stick to only one. u64
makes the most sense application-wise as we don't really need the negative side? right?
but curious, is the encoding of i64
and u64
compatible?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should stick to only one. u64 makes the most sense application-wise as we don't really need the negative side? right?
Why do you want to cast the value again when you can simply encode it?
but curious, is the encoding of i64 and u64 compatible?
They are
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you want to cast the value again when you can simply encode it?
didn't know the encoding for both is compatible, still not sure how the negative side is encoded (probably as rust's then, will checkout).
so thing is, you actually do some kind of an implicit cast here using data encoding and decoding, you encode the number as i64
, send it, decode it as u64
from the receiver. i assume a negative i64
will decode fine as a u64
just like what the as
keyword does but give a garbage data (will check more on that encoding lib).
what i am suggesting is, we can encode it as i64
and decode it as i64
as well, we check abs_diff
without the need for any fallible conversions what so ever. (the suggestion stems from the fact that i64
encodes just like u64
over the network, i.e. if we ever change this to u64
encoding we will still be backward compatible. will have to double check this fact still :) ).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
still not sure how the negative side is encoded
Theoretically, it should be fine on encoding but should panic on decoding, which is the main purpose on the decoding side (as negative ones are invalid).
Signed-off-by: onur-ozkan <work@onurozkan.dev>
Signed-off-by: onur-ozkan <work@onurozkan.dev>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR! I have a few suggestion in addition to these 2 comments #2255 (comment), #2255 (comment)
if !pre_dial_check(&mut recently_dialed_peers, addr) { | ||
continue; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this check be part of the filtering above? This is to maintain the intended
connect_bootstrap_num
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If peer is not synced in time, we will keep trying to connect it in the maintain logic, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we will keep trying to connect it in the maintain logic, no?
Not really as I suggested to filter them out. For example, if we need 4 peers and have 3 recently dialed ones (to get them in this list this means they were disconnected):
Current: We select 4 and we might get 3 recently dialed ones in the list, then pre_dial_check
filters out 3, leaving us with 1 actual attempt
With filtering: We'd select 4 non-recently-dialed peers from the start
While the maintain cycle every 30 seconds can solve this, but we might keep selecting those same 3 recently dialed peers every cycle just to remove them in pre_dial_check
, a peer might find it difficult to maintain the minimum needed peers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, makes sense.
if !pre_dial_check(&mut recently_dialed_peers, &addr) { | ||
continue; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same for this, it should be a criteria
when choosing random peers. We can even redial some of the recently dialed peers if we can't find enough fresh peers to connect to for this part and the above comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
Signed-off-by: onur-ozkan <work@onurozkan.dev>
Signed-off-by: onur-ozkan <work@onurozkan.dev>
mm2src/mm2_main/src/lp_stats.rs
Outdated
let timestamp = common::get_utc_timestamp(); | ||
encode_message(×tamp).map_err(|e| e.to_string()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we don't wanna be explicit with casting, let's at least leave a comment that i64 will encode just as u64.
it's so confusing seeing one type sent from this end and another one received.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that explanation alone makes it more complicated than it should be. I didn't think it would be this confusing 🤔. I would rather cast it to u64
before sending it instead of explaining it like "we do this because of encoding numbers..."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
me too tbh, i favor explicit casting all the way :/
/// Determines if a dial attempt to the remote should be made. | ||
/// | ||
/// Returns `false` if a dial attempt to the given address has already been made, | ||
/// in which case the caller must skip the dial attempt. | ||
fn pre_dial_check(recently_dialed_peers: &mut MutexGuard<TimedMap<StdClock, Multiaddr, ()>>, addr: &Multiaddr) -> bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Neither the doc comment nor the func name signify how this function changes the timedmap (only visible from the mutability of the passed map). Let's change this to something like check_and_mark_dialed
or something.
Also, no need for the MutexGuard
annotation here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Neither the doc comment nor the func name signify how this function changes the timedmap (only visible from the mutability of the passed map)
"only visible from the mutability of the passed map" that's visible everywhere, both in the function signature and on the caller side!
I struggle to see how check_and_mark_dialed
describes the function better than pre_dial_check
but whatever, I am okay with it.
Also, no need for the MutexGuard annotation here.
It gives/documents the background context how this map is utilized from caller.
.any(|addr| pre_dial_check(&mut recently_dialed_peers, addr)) | ||
&& !connected_relays.contains(peer) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should move the connected_relays check first as there is no need to maker the peer as dialed then find out we are already connected.
also this should be just a check, it shouldn't mutate the recent dialed peers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense
@@ -824,11 +926,17 @@ fn maintain_connection_to_relays(swarm: &mut AtomicDexSwarm, bootstrap_addresses | |||
if swarm.behaviour().core.gossipsub.is_connected_to_addr(&addr) { | |||
continue; | |||
} | |||
|
|||
if !pre_dial_check(&mut recently_dialed_peers, &addr) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and here we check the dialing one more time, if the peer had one address and it wasn't dialed yet, we would have marked it as dialed in the filer we did above and we won't be able to connect to it here.
we should also not update the dial expiry if we didn't dial (due to dial check failing)!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and here we check the dialing one more time
Was leftover, thanks.
we should also not update the dial expiry if we didn't dial (due to dial check failing)!
I don't understand what you mean, but if you are saying that we shouldn't add peers to recently dialed peers I think that's exactly what we should do. If we fail to connect some peer, we shouldn't try again for sometime.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand what you mean, but if you are saying that we shouldn't add peers to recently dialed peers I think that's exactly what we should do. If we fail to connect some peer, we shouldn't try again for sometime.
e.g. for clarification:
we connected to some node, say bootstrap one, marked it as dialed with some expiry, then disconnected for some reason (you name it). Now when we try to connect back to maintain the min_connected, we will find that we dialed this node recently, but instead of just going with our day we would re-mark it is dialed again (even though we didn't dial it) and reset the expiry timer. in a bad situation where we can't find enough peers, the node will just be stuck there forever and ever resetting the timer.
Signed-off-by: onur-ozkan <work@onurozkan.dev>
Implementation Details
Once a connection is established with a peer, validation check is immediately performed to confirm the peer's clock. If the peer's time exceeds the maximum allowed cap, the connection is terminated. This approach ensures that all peers performing operations in the network are synchronized in terms of time. Current time cap is set to 30 seconds, which can be changed if it doesn't fit the requirements.
Each peer is temporarily added to the
RECENTLY_DIALED_PEERS
map for 5 minutes before any connection attempt. This prevents repeated connection attempts to peers that are unavailable or out-of-sync, which reduces unnecessary reconnection overheads.Breaking Changes
A new request-response payload (
GetPeerUtcTimestamp
) has been added. This request is sent at each connection attempt, which makes it mandatory for all seed nodes and GUI applications in the network to be updated to this version. Meaning, this version is incompatible with any older version.The
NetworkInfoRequest
has been renamed toPeerInfoRequest
. This likely affects the encoded payload ofGetMm2Version
(not yet verified but highly likely).Resolves #1115 and #1683
Blocked by #2256