Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(p2p): ensure time synchronization in the network #2255

Open
wants to merge 21 commits into
base: dev
Choose a base branch
from

Conversation

onur-ozkan
Copy link
Member

@onur-ozkan onur-ozkan commented Oct 28, 2024

Implementation Details

Once a connection is established with a peer, validation check is immediately performed to confirm the peer's clock. If the peer's time exceeds the maximum allowed cap, the connection is terminated. This approach ensures that all peers performing operations in the network are synchronized in terms of time. Current time cap is set to 30 seconds, which can be changed if it doesn't fit the requirements.

Each peer is temporarily added to the RECENTLY_DIALED_PEERS map for 5 minutes before any connection attempt. This prevents repeated connection attempts to peers that are unavailable or out-of-sync, which reduces unnecessary reconnection overheads.

                      +---------------------+
                      |  Connection Attempt |
                      |       On a Peer     |
                      +---------------------+
                                |
                                v
                   +------------------------+
                   |   Is already dialed    |
                   |       recently?        |
                   +------------------------+
                      |                    |
                      Yes                  No
                      |                    |
                      v                    v
        +---------------------+       +-----------------------+
        |   Skip Connection   |       |  Make the Connection  |
        +---------------------+       +-----------------------+
                                             |
                                             v
                            +---------------------------------+
                            |     Check Peer Time Validity    |
                            +---------------------------------+
                                    |              |
                               Time Valid     Time Invalid
                                    |              |
                                    v              v
                                    |       +-------------------------+
                                    |       | Disconnect from Peer    |
                                    |       +-------------------------+
                                    |
                                    |
                                    v
                +------------------------------------+
                |       Validation completed         |
                +------------------------------------+

Breaking Changes

  1. A new request-response payload (GetPeerUtcTimestamp) has been added. This request is sent at each connection attempt, which makes it mandatory for all seed nodes and GUI applications in the network to be updated to this version. Meaning, this version is incompatible with any older version.

  2. The NetworkInfoRequest has been renamed to PeerInfoRequest. This likely affects the encoded payload of GetMm2Version (not yet verified but highly likely).

Resolves #1115 and #1683

Blocked by #2256

@onur-ozkan onur-ozkan added enhancement New feature or request in progress Changes will be made from the author labels Oct 28, 2024
Signed-off-by: onur-ozkan <work@onurozkan.dev>
Signed-off-by: onur-ozkan <work@onurozkan.dev>
Signed-off-by: onur-ozkan <work@onurozkan.dev>
Signed-off-by: onur-ozkan <work@onurozkan.dev>
Signed-off-by: onur-ozkan <work@onurozkan.dev>
Signed-off-by: onur-ozkan <work@onurozkan.dev>
@onur-ozkan onur-ozkan added the breaking-change might require breaking backward compatibility label Oct 31, 2024
Signed-off-by: onur-ozkan <work@onurozkan.dev>
Comment on lines -17 to -20
/// TODO: This should be called `PeerInfoRequest` instead. However, renaming it
/// will introduce a breaking change in the network and is not worth it. Do this
/// renaming when there is already a breaking change in the release.
NetworkInfo(network_info::NetworkInfoRequest),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR already leads to breaking change. It's perfect time to rename this.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One way to avoid breaking changes is to have a timeout on the request for peer timestamp and allow them to connect if there is no response from the peer. We later remove this when most peers update. Another way is to get the mm2 version and do the timestamp request depending on the version. What do you think?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One way to avoid breaking changes is to have a timeout on the request for peer timestamp and allow them to connect if there is no response from the peer. We later remove this when most peers update.

This sounds like a good plan!

Another way is to get the mm2 version and do the timestamp request depending on the version. What do you think?

I think the first idea was better, but then I need to revert the peer-info renaming part. What about bundling these changes in some branch, and then releasing multiple breaking changes at once (like removing mm2 binaries and some other stuff) ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about bundling these changes in some branch, and then releasing multiple breaking changes at once (like removing mm2 binaries and some other stuff) ?

I am fine with this or the first approach.

Signed-off-by: onur-ozkan <work@onurozkan.dev>
Signed-off-by: onur-ozkan <work@onurozkan.dev>
Signed-off-by: onur-ozkan <work@onurozkan.dev>
Signed-off-by: onur-ozkan <work@onurozkan.dev>
Signed-off-by: onur-ozkan <work@onurozkan.dev>
Signed-off-by: onur-ozkan <work@onurozkan.dev>
Signed-off-by: onur-ozkan <work@onurozkan.dev>
@shamardy shamardy added breaking-change: p2p and removed breaking-change might require breaking backward compatibility labels Nov 4, 2024
@onur-ozkan onur-ozkan marked this pull request as ready for review November 5, 2024 05:22
@onur-ozkan onur-ozkan removed the in progress Changes will be made from the author label Nov 5, 2024
Comment on lines 58 to 60
/// Used in time validation logic for each peer which runs immediately after the
/// `ConnectionEstablished` event.
const MAX_TIME_GAP_FOR_CONNECTED_PEER: u64 = 30;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if this is a good default. Should we make it like 20 or 25?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

30 seconds are fine since it's half the max allowed difference for swaps..

const MAX_STARTED_AT_DIFF: u64 = 60;

However, An edge case that can happen in theory is time differences accumulating across the network. Consider a scenario with peers A, B, C and D:

  • Peers B and C are relay/seed nodes
  • Peers A and D are starting a swap as taker/maker
  • If time differences exist as follows:
    • Peer A is 30 secs off peer B but they allow them to connect
    • Peer D is also 30 secs off peer C but they allow them to connect
    • Peer B and C are out of sync as well (any number of seconds less than 30)
    • Peer A and D can have more than 60 seconds difference and the swap fails.

This is an edge case that we should keep in mind in case we got time difference errors in a swap after this fix, but I think it will happen in very rare cases if any.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P.S. let's make it 20 secs to reduce the probability of this edge case happening.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One last note :)
We should probably make time differences dependent, for instance MAX_STARTED_AT_DIFF = MAX_TIME_GAP_FOR_CONNECTED_PEER * 3 in case a dev wants to change one of them, they should be aware of how this affects or is affected by other time differences.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The edge case can only happen if seed nodes are not synced in time, which shouldn't be the case I believe? Since we are the maintainers of the seed network.

P.S. let's make it 20 secs to reduce the probability of this edge case happening.

I can do that, but I am not sure if that's a good idea. I had ~9 seconds diff between my laptop and desktop computer, so I believe in some cases it could be even more than that.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to inform the peer that their clock is off so that they can adjust it?

Just read internal discussions about this, the currently connected peers is one solution to this but not optimal, will think of something if there are any other solutions. A problem we have is the no login mode, GUIs should be able to show the orderbooks and other info for no login mode even if the peer clock is off, it will be strange for someone who wants to have a look at our dex to get a warning about system clock when not logged in.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO it's totally reasonable to tell someone to sync their time if their system isn't (which shouldn't be the case for general-purpose OS users). There are so many apps doing this.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are so many apps doing this.

Never seen it for web apps like komodo wallet though.

Copy link
Member Author

@onur-ozkan onur-ozkan Nov 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess maybe you never run your OS unsynced in time :D We can provide some REST api to provide some general data for no-login mode. If we want to keep the network synced in time, I don't know what else we can do other than not connecting to unsynced ones.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess maybe you never run your OS unsynced in time :D

You are probably right :D

If we want to keep the network synced in time, I don't know what else we can do other than not connecting to unsynced ones.

I would leave the PR as is and if we got something from GUI team about this we can look for workarounds for this particular case. We need to mention it though while they are testing.

Signed-off-by: onur-ozkan <work@onurozkan.dev>
@laruh
Copy link
Member

laruh commented Nov 6, 2024

@onur-ozkan

Once a connection is established with a peer, validation check is immediately performed to confirm the peer's clock. If the peer's time exceeds the maximum allowed cap

Could you clarify please, we compare which peer time with what?

@onur-ozkan
Copy link
Member Author

@onur-ozkan

Once a connection is established with a peer, validation check is immediately performed to confirm the peer's clock. If the peer's time exceeds the maximum allowed cap

Could you clarify please, we compare which peer time with what?

Current peer's time against to the remote peer.

@shamardy shamardy added the P2 label Nov 11, 2024
Copy link
Collaborator

@mariocynicys mariocynicys left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much for the detailed PR desc :)
First review iter, will prolly need just one more.

mm2src/mm2_p2p/src/behaviours/atomicdex.rs Outdated Show resolved Hide resolved
const MAX_TIME_GAP_FOR_CONNECTED_PEER: u64 = 30;

/// Used for storing peers in [`RECENTLY_DIALED_PEERS`].
const DIAL_RETRY_DELAY: Duration = Duration::from_secs(60 * 5);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I would write this 5 * 60 :)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

explanation: it's just semantically clearer to put the conversion factors at the end (and ordered, so for 2 days converted to seconds u would use 2 * 24 * 60 * 60)

mm2src/mm2_p2p/src/behaviours/atomicdex.rs Outdated Show resolved Hide resolved
mm2src/mm2_p2p/src/behaviours/atomicdex.rs Show resolved Hide resolved

match request_one_peer(peer, encoded_request, rp_sender).await {
PeerResponse::Ok { res } => {
if let Ok(timestamp) = decode_message::<u64>(&res) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ummm, this is decoded as u64, but the response here...

Comment on lines 178 to 181
fn process_get_peer_utc_timestamp_request() -> Result<Vec<u8>, String> {
let timestamp = common::get_utc_timestamp();
encode_message(&timestamp).map_err(|e| e.to_string())
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...encodes it as a i64. we should stick to only one. u64 makes the most sense application-wise as we don't really need the negative side? right?

but curious, is the encoding of i64 and u64 compatible?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should stick to only one. u64 makes the most sense application-wise as we don't really need the negative side? right?

Why do you want to cast the value again when you can simply encode it?

but curious, is the encoding of i64 and u64 compatible?

They are

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you want to cast the value again when you can simply encode it?

didn't know the encoding for both is compatible, still not sure how the negative side is encoded (probably as rust's then, will checkout).
so thing is, you actually do some kind of an implicit cast here using data encoding and decoding, you encode the number as i64, send it, decode it as u64 from the receiver. i assume a negative i64 will decode fine as a u64 just like what the as keyword does but give a garbage data (will check more on that encoding lib).

what i am suggesting is, we can encode it as i64 and decode it as i64 as well, we check abs_diff without the need for any fallible conversions what so ever. (the suggestion stems from the fact that i64 encodes just like u64 over the network, i.e. if we ever change this to u64 encoding we will still be backward compatible. will have to double check this fact still :) ).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still not sure how the negative side is encoded

Theoretically, it should be fine on encoding but should panic on decoding, which is the main purpose on the decoding side (as negative ones are invalid).

Signed-off-by: onur-ozkan <work@onurozkan.dev>
Signed-off-by: onur-ozkan <work@onurozkan.dev>
Copy link
Collaborator

@shamardy shamardy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! I have a few suggestion in addition to these 2 comments #2255 (comment), #2255 (comment)

mm2src/mm2_p2p/src/behaviours/atomicdex.rs Outdated Show resolved Hide resolved
mm2src/mm2_p2p/src/behaviours/atomicdex.rs Outdated Show resolved Hide resolved
mm2src/mm2_p2p/src/behaviours/atomicdex.rs Outdated Show resolved Hide resolved
mm2src/mm2_p2p/src/behaviours/atomicdex.rs Outdated Show resolved Hide resolved
Comment on lines 907 to 909
if !pre_dial_check(&mut recently_dialed_peers, addr) {
continue;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this check be part of the filtering above? This is to maintain the intended
connect_bootstrap_num.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If peer is not synced in time, we will keep trying to connect it in the maintain logic, no?

Copy link
Collaborator

@shamardy shamardy Nov 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we will keep trying to connect it in the maintain logic, no?

Not really as I suggested to filter them out. For example, if we need 4 peers and have 3 recently dialed ones (to get them in this list this means they were disconnected):

Current: We select 4 and we might get 3 recently dialed ones in the list, then pre_dial_check filters out 3, leaving us with 1 actual attempt
With filtering: We'd select 4 non-recently-dialed peers from the start

While the maintain cycle every 30 seconds can solve this, but we might keep selecting those same 3 recently dialed peers every cycle just to remove them in pre_dial_check, a peer might find it difficult to maintain the minimum needed peers.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, makes sense.

Comment on lines 922 to 924
if !pre_dial_check(&mut recently_dialed_peers, &addr) {
continue;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same for this, it should be a criteria when choosing random peers. We can even redial some of the recently dialed peers if we can't find enough fresh peers to connect to for this part and the above comment.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Signed-off-by: onur-ozkan <work@onurozkan.dev>
Signed-off-by: onur-ozkan <work@onurozkan.dev>
Comment on lines 179 to 180
let timestamp = common::get_utc_timestamp();
encode_message(&timestamp).map_err(|e| e.to_string())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we don't wanna be explicit with casting, let's at least leave a comment that i64 will encode just as u64.
it's so confusing seeing one type sent from this end and another one received.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that explanation alone makes it more complicated than it should be. I didn't think it would be this confusing 🤔. I would rather cast it to u64 before sending it instead of explaining it like "we do this because of encoding numbers..."

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

me too tbh, i favor explicit casting all the way :/

Comment on lines 185 to 189
/// Determines if a dial attempt to the remote should be made.
///
/// Returns `false` if a dial attempt to the given address has already been made,
/// in which case the caller must skip the dial attempt.
fn pre_dial_check(recently_dialed_peers: &mut MutexGuard<TimedMap<StdClock, Multiaddr, ()>>, addr: &Multiaddr) -> bool {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Neither the doc comment nor the func name signify how this function changes the timedmap (only visible from the mutability of the passed map). Let's change this to something like check_and_mark_dialed or something.

Also, no need for the MutexGuard annotation here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Neither the doc comment nor the func name signify how this function changes the timedmap (only visible from the mutability of the passed map)

"only visible from the mutability of the passed map" that's visible everywhere, both in the function signature and on the caller side!

I struggle to see how check_and_mark_dialed describes the function better than pre_dial_check but whatever, I am okay with it.

Also, no need for the MutexGuard annotation here.

It gives/documents the background context how this map is utilized from caller.

Comment on lines 903 to 904
.any(|addr| pre_dial_check(&mut recently_dialed_peers, addr))
&& !connected_relays.contains(peer)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should move the connected_relays check first as there is no need to maker the peer as dialed then find out we are already connected.

also this should be just a check, it shouldn't mutate the recent dialed peers.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense

@@ -824,11 +926,17 @@ fn maintain_connection_to_relays(swarm: &mut AtomicDexSwarm, bootstrap_addresses
if swarm.behaviour().core.gossipsub.is_connected_to_addr(&addr) {
continue;
}

if !pre_dial_check(&mut recently_dialed_peers, &addr) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and here we check the dialing one more time, if the peer had one address and it wasn't dialed yet, we would have marked it as dialed in the filer we did above and we won't be able to connect to it here.

we should also not update the dial expiry if we didn't dial (due to dial check failing)!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and here we check the dialing one more time

Was leftover, thanks.

we should also not update the dial expiry if we didn't dial (due to dial check failing)!

I don't understand what you mean, but if you are saying that we shouldn't add peers to recently dialed peers I think that's exactly what we should do. If we fail to connect some peer, we shouldn't try again for sometime.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand what you mean, but if you are saying that we shouldn't add peers to recently dialed peers I think that's exactly what we should do. If we fail to connect some peer, we shouldn't try again for sometime.

e.g. for clarification:
we connected to some node, say bootstrap one, marked it as dialed with some expiry, then disconnected for some reason (you name it). Now when we try to connect back to maintain the min_connected, we will find that we dialed this node recently, but instead of just going with our day we would re-mark it is dialed again (even though we didn't dial it) and reset the expiry timer. in a bad situation where we can't find enough peers, the node will just be stuck there forever and ever resetting the timer.

Signed-off-by: onur-ozkan <work@onurozkan.dev>
@shamardy shamardy removed the P2 label Nov 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants