gbn: set resend timeout dynamically #88

Conversation

ViktorTigerstrom (Contributor)

Based on #87

This PR updates the resend timeout to be dynamically set based on how long it took for the other party to respond during the handshake process.
The timeout is set to the time it took for the server to respond multiplied by the resendMultiplier, unless that duration is shorter than the default resend timeout, in which case the default is used. If the resend timeout has been manually set, the resend timeout will always be set to that value, and won't be dynamically set.

Prior to this PR, the timeout before a client resends the queue of packets was always a fixed value. This fixed timeout isn't suitable for all clients as the latency for different clients varies.
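
As a rough sketch of the computation described above (names such as dynamicResendTimeout are illustrative; the merged code is structured differently):

package gbn

import "time"

// defaultResendTimeout doubles as the minimum resend timeout in this
// sketch.
const defaultResendTimeout = time.Second

func dynamicResendTimeout(responseTime time.Duration,
	resendMultiplier int, staticTimeout time.Duration) time.Duration {

	// A manually configured timeout always wins and is never
	// dynamically updated.
	if staticTimeout != 0 {
		return staticTimeout
	}

	timeout := time.Duration(resendMultiplier) * responseTime

	// Never go below the default resend timeout.
	if timeout < defaultResendTimeout {
		return defaultResendTimeout
	}

	return timeout
}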

@ViktorTigerstrom ViktorTigerstrom force-pushed the 2023-11-set-resend-timeout-dynamically branch 2 times, most recently from 13d35f1 to 6a54977 Compare November 9, 2023 10:52
@ViktorTigerstrom (Contributor, Author) left a comment

Leaving a review comment regarding the queue handshakeTimeout below.

Also, in general: when making this PR, one alternative approach I considered was to continuously update the resend timeout throughout the lifespan of the connection, rather than only setting it during the handshake process. I.e. we could instead continuously set it based on how long it takes to get a response for random packets we send at some set frequency. Do you think that approach would make more sense? The benefit of that approach would be that the resend timeout reflects latency changes throughout the lifetime of the connection, which might be good if we have really long-lived connections.

ViktorTigerstrom (Contributor, Author) commented Nov 10, 2023

Rebased on the latest version of #87.

@ellemouton (Member) left a comment

I think we can make things a bit more modular and generic here.

I think it would be cool to have something like:

type TimeoutManager struct {
	...
}

func NewTimeOutManager(defaultTimeout, useStaticTimeout, multiplier, ...)

func (m *TimeoutManager) Sent() {}
func (m *TimeoutManager) Received() {}
func (m *TimeoutManager) GetTimeout() time.Duration {}

Then we can use it as follows:

  • Everywhere in GBN where we want to start a new timer, query the new manager's GetTimeout method to get what time to use.
  • As is done in this PR, do one Sent/Received cycle during the handshake.
  • Then, for each new sequence number we send, and when we get the corresponding ACK, do a Sent/Received cycle.

This will be nice cause then the GBN doesn't need to have all this logic in it and can just trust what the timeout manager says :)
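
A rough sketch of how GBN would consume such a manager (the constructor parameters above are left open in the suggestion, so this only assumes the three methods it names):

package gbn

import "time"

// exampleUsage sketches the suggested call pattern; mgr is assumed to
// be a TimeoutManager with the Sent/Received/GetTimeout methods from
// the comment above.
func exampleUsage(mgr *TimeoutManager) {
	// Everywhere GBN starts a resend timer, query the manager
	// rather than using a fixed constant.
	resendTimer := time.NewTimer(mgr.GetTimeout())
	defer resendTimer.Stop()

	// One Sent/Received cycle during the handshake, and one per
	// sequence number and its corresponding ACK, keeps the timeout
	// tracking the observed latency.
	mgr.Sent()
	// ... later, when the corresponding ACK arrives:
	mgr.Received()
}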

@ViktorTigerstrom ViktorTigerstrom force-pushed the 2023-11-set-resend-timeout-dynamically branch 5 times, most recently from 6f87bfa to fcc81ed Compare November 15, 2023 13:21
ViktorTigerstrom (Contributor, Author)

Great idea regarding modularising it to a Timeout manager @ellemouton!
I've updated this PR to use a new TimeoutManager struct, which handles all the different timeouts throughout the gbn package.
Currently it only updates the resend timeout based on the handshake messages, but it can easily be extended to also update it when we send/receive the other messages.
I hope this addresses things the way you meant in your feedback :).

@ViktorTigerstrom ViktorTigerstrom force-pushed the 2023-11-set-resend-timeout-dynamically branch 2 times, most recently from b5a2195 to fd1f61b Compare November 15, 2023 13:27
gbn/gbn_conn.go Outdated
@@ -118,51 +103,46 @@ func newGoBackNConn(ctx context.Context, sendFunc sendBytesFunc,

	ctxc, cancel := context.WithCancel(ctx)

	timeoutManager := NewTimeOutManager(isServer)
ViktorTigerstrom (Contributor, Author):

Let me know if you think I should dependency inject the timeout manager rather than instantiating it here!

ViktorTigerstrom (Contributor, Author)

Rebased on #87

@ellemouton (Member) left a comment

Nice addition! Gonna be a game changer!

Left a few comments. Also:

  • does this need to be based on gbn: Sender side only packeting resend bug fix #87? afaict it doesn't depend on that change - so perhaps better to decouple them, since it isn't clear which one will get in first yet.
  • A test for TimeoutMgr should please be added :)

Comment on lines 40 to 136
	resendMultiplier   int
	resendMultiplierMu sync.RWMutex

ellemouton (Member):

I think you can use functional options for this instead & then don't need the mutex to set it

ViktorTigerstrom (Contributor, Author):

That's a good idea, thanks! I'm learning a lot from this :)! I'll work on fixing that tomorrow.
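
For context, a minimal sketch of the functional-options pattern being suggested (option and field names here are assumptions; WithStaticResendTimeout does appear later in the PR, but the exact signatures may differ):

package gbn

import "time"

// TimeoutOptions mutates the TimeoutManager at construction time, so
// no mutex is needed just to set these fields afterwards.
type TimeoutOptions func(*TimeoutManager)

// WithResendMultiplier overrides the default resend multiplier.
func WithResendMultiplier(multiplier int) TimeoutOptions {
	return func(m *TimeoutManager) {
		m.resendMultiplier = multiplier
	}
}

// WithStaticResendTimeout pins the resend timeout to a fixed value
// and disables dynamic updates.
func WithStaticResendTimeout(timeout time.Duration) TimeoutOptions {
	return func(m *TimeoutManager) {
		m.resendTimeout = timeout
		m.useStaticTimeout = true
	}
}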


	multipliedTimeout := time.Duration(m.resendMultiplier) * responseTime

	if multipliedTimeout > defaultResendTimeout {
ellemouton (Member):

feels like default & minimum should be separate values

ViktorTigerstrom (Contributor, Author):

Good idea, agree! Let me know if you'd like me to change the minimum value as well, as I've set both to 1 sec for now.


@ViktorTigerstrom ViktorTigerstrom force-pushed the 2023-11-set-resend-timeout-dynamically branch 4 times, most recently from f39b75d to 95cb429 Compare November 21, 2023 01:03
ViktorTigerstrom (Contributor, Author)

> Nice addition! Gonna be a game changer!

Thanks 🔥🚀!! I've addressed your feedback with the latest push 🚀!

I'll provide a more detailed answer to this tomorrow, but here's a short answer for now. I should probably have mentioned this in the PR description previously, but the reason I've based this on #87 is that I noticed some potential compatibility issues when manually testing this, if only one of the two counterparties updated to a version of LNC that set the resend timeout to a different value than the default. It led to a resend loop of the packets that seemed potentially even worse than the original packeting resend bug, but the fix in #87 largely fixed that, provided that party had also implemented that fix. I'll do more thorough testing tomorrow though, and see if my initial tests were correct.

  • A test for TimeoutMgr should please be added :)

This has been added now 🔥

ViktorTigerstrom (Contributor, Author) commented Jan 9, 2024

Ok so there turned out to be a few more issues affecting the itests, some of which were pre-existing and some of which were made a bit worse by this PR now that it increases the resend timeout:

(1)
Now with this PR, we may back off and wait for the sync for a potentially longer duration than the ping ticker interval. If we have just sent a previous ping, both the pingTicker and pongTicker may have ticked while waiting to sync. That may cause us to continuously resend a ping without ever timing out on the pongTicker tick. I addressed this with fa371d8.
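
A minimal sketch of the prioritization idea (ticker names follow the discussion; errPongTimeout and the exact control flow are assumptions, not the code in fa371d8):

package gbn

import (
	"errors"
	"time"
)

// errPongTimeout is a hypothetical error for this sketch.
var errPongTimeout = errors.New("no pong received in time")

// prioritizePong illustrates the fix: if both tickers fired while we
// were blocked on the sync, handle the pong timeout first, so that
// resending yet another ping can never starve the pong timeout.
func prioritizePong(pingTicker, pongTicker *time.Ticker) error {
	// Non-blocking check of the pongTicker before anything else.
	select {
	case <-pongTicker.C:
		return errPongTimeout
	default:
	}

	select {
	case <-pingTicker.C:
		// The pong wait hasn't expired; safe to send the next
		// ping.
		return nil
	case <-pongTicker.C:
		return errPongTimeout
	}
}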

(2)
This is the issue that happened earlier above; it was pre-existing, and I'd like to ensure that you agree that my solution is the correct way to handle it:
We had a bug where the server could end up in a scenario in which it would not complete the handshake, if the server restarted the handshake after sending its SYN but before receiving the SYNACK response back from the client, e.g. because it timed out due to latency or packet loss. If the client then received the SYN, it would proceed to send the SYNACK and complete the handshake, regardless of whether the server received the SYNACK in time or not.
If this happened, the server would then get stuck in the handshake while the client would leave the handshake phase.

I therefore changed the handshake logic for the server, so that if the server receives a SYNACK or PacketData after having sent its SYN but timing out in the handshake, it treats those as the completion of the handshake. This was updated with b447950. Do you agree that this is correct handling of the handshake?
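
A simplified sketch of that server-side change (the message type names are stand-ins based on the description above, and the flow is heavily condensed compared to b447950):

package gbn

// Hypothetical stand-ins for the gbn message types discussed above.
type message interface{}

type packetSYNACK struct{}
type packetData struct{}

// serverHandshakeSketch illustrates the fix: once the server has sent
// its SYN, receiving either a SYNACK or a data packet counts as
// handshake completion, since both prove the client has already left
// the handshake phase.
func serverHandshakeSketch(sendSYN func() error,
	recv func() (message, error)) error {

	for {
		if err := sendSYN(); err != nil {
			return err
		}

		msg, err := recv()
		if err != nil {
			// For this sketch, treat any receive error as a
			// timeout and restart by resending the SYN.
			continue
		}

		switch msg.(type) {
		case *packetSYNACK, *packetData:
			// The client considers the handshake complete,
			// so the server treats it as complete too.
			return nil
		}
	}
}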

(3)
In itests, due to the client closing its context before closing the connection, any FIN message sent by the client would always fail. I didn't really see any itest failures due to this, but I corrected it to ensure it doesn't cause any errors in the future, with 6b5ca8b.

@ellemouton, given the discussion in #87 (comment), is this way of ensuring that c.cancel() gets executed even if c.grpcConn.Close() errors OK, as it closely resembles how we normally handle returnErr in itests? Or should I explicitly check the error from c.grpcConn.Close() and run c.cancel() regardless of whether an error was thrown?
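
A minimal sketch of the ordering in question (clientHarness here is a hypothetical stand-in for the itest client; the actual change is in 6b5ca8b):

package itest

import "google.golang.org/grpc"

// clientHarness is a hypothetical stand-in for the itest client.
type clientHarness struct {
	grpcConn *grpc.ClientConn
	cancel   func()
}

// stop closes the connection before cancelling the context, and makes
// the cancel unconditional via defer so that a Close error can't skip
// it.
func (c *clientHarness) stop() error {
	// Deferred, so it runs even if Close returns an error.
	defer c.cancel()

	// While the context is still live, the client can still deliver
	// its FIN message as part of the graceful close.
	return c.grpcConn.Close()
}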

(4)
Finally, I bumped the itest timeout to 60 seconds as a precaution for users who have a really bad connection, in 256546e. This was done as I've realised that if a user gets a resendTimeout with boosting set to more than 10 secs, the user is bound to time out on the reconnection tests if unlucky. This is because when the other party (server or client) is down during the test, if the syncer has just resent the packets, it will take 30 seconds (10 seconds x 3) for the syncing to time out and complete. That will then time out the itest.

I should probably add that I haven't really had latency that bad, so it hasn't been an issue for me yet, but I've been close to it when testing. I therefore changed this as a precaution, but will drop this commit if you think it's an unnecessary addition.

@guggero (Member) left a comment

Great investigations and bug hunting! Very excited to see the itests being stable with this PR as well.
Just one comment about a mutex, otherwise this LGTM 🎉

Comment on lines 320 to 337
// SetStaticResendTimeout sets the resend timeout to the given value, and also
// marks the timeout manager as using a static resend timeout, which will mean
// that the resend timeout will not be updated dynamically.
func (m *TimeoutManager) SetStaticResendTimeout(resendTimeout time.Duration) {
	m.timeoutManagerMu.Lock()
	defer m.timeoutManagerMu.Unlock()

	m.resendTimeout = resendTimeout
	m.useStaticTimeout = true
}

// SetHandshakeTimeout sets the handshake timeout.
func (m *TimeoutManager) SetHandshakeTimeout(handshakeTimeout time.Duration) {
	m.timeoutManagerMu.Lock()
	defer m.timeoutManagerMu.Unlock()

	m.handshakeTimeout = handshakeTimeout
}
ellemouton (Member):

These should not be methods. It doesn't make sense to change them during the lifetime of the timeout manager. These should be functional options instead.

ellemouton (Member):

I mean specifically multipliedTimeout and SetStaticResendTimeout

ViktorTigerstrom (Contributor, Author):

Those functions have now been removed!

@@ -435,6 +493,8 @@ func (m *TimeoutManager) SetStaticResendTimeout(resendTimeout time.Duration) {
	defer m.timeoutManagerMu.Unlock()

	m.resendTimeout = resendTimeout
	m.resendBooster.SetOriginalTimeout(resendTimeout)
ellemouton (Member):

Instead, set this timeout at the time of initialising the resendBooster. You can do this by just re-shuffling the order of things in NewTimeOutManager (see the sketch below):

1. Construct the TimeoutMgr.
2. Apply its functional options.
3. Initialise the boosters.
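
A rough sketch of that re-shuffled constructor (the defaults, the resendBoostPercent field and the newTimeoutBooster helper are assumptions; a booster in this spirit is sketched further below):

func NewTimeOutManager(isServer bool,
	options ...TimeoutOptions) *TimeoutManager {

	// 1. Construct the manager with default values.
	m := &TimeoutManager{
		isServer:           isServer,
		resendTimeout:      defaultResendTimeout,
		handshakeTimeout:   defaultHandshakeTimeout,
		resendBoostPercent: defaultBoostPercent,
	}

	// 2. Apply the functional options, which may override any of
	// the defaults above.
	for _, option := range options {
		option(m)
	}

	// 3. Only now initialise the boosters, so they start from the
	// final (possibly overridden) timeout values and no later
	// SetOriginalTimeout call is needed.
	m.resendBooster = newTimeoutBooster(
		m.resendTimeout, m.resendBoostPercent,
	)

	return m
}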

ViktorTigerstrom (Contributor, Author):

Thanks! Good idea, I've implemented this with the latest push. It required me to add the resendBoostPercent & handshakeTimeout fields to the TimeoutManager though, which isn't ideal. But I think the new version is much better than the last one, so I definitely think it's worth it.

@ViktorTigerstrom ViktorTigerstrom force-pushed the 2023-11-set-resend-timeout-dynamically branch from 256546e to 549c11a Compare January 15, 2024 01:53
ViktorTigerstrom (Contributor, Author)

Thanks a lot for the reviews @guggero & @ellemouton!
I've addressed your feedback with the latest push!

@ellemouton, please specifically check my response to #88 (comment), as that wasn't addressed with this latest push!

@ellemouton (Member) left a comment

LGTM pending one last comment regarding a possible race.

Also tested this with old server & new client & a restarting mailbox 👍

Really great work 🎉

@ViktorTigerstrom ViktorTigerstrom force-pushed the 2023-11-set-resend-timeout-dynamically branch from 549c11a to 9e6babd Compare January 16, 2024 11:53
ViktorTigerstrom (Contributor, Author)

Thanks a lot for the review and testing @ellemouton 🚀🔥!!

Let me know if you disagree with #88 (comment), else this should be good to go 🎉🚀!!

@ViktorTigerstrom ViktorTigerstrom force-pushed the 2023-11-set-resend-timeout-dynamically branch from 9e6babd to 97d5f64 Compare January 16, 2024 23:36
@ellemouton (Member) left a comment

Awesome stuff @ViktorTigerstrom!!!

Ok this is basically good to go. Added 2 final thoughts :)

@ViktorTigerstrom ViktorTigerstrom force-pushed the 2023-11-set-resend-timeout-dynamically branch 3 times, most recently from 0eee1d2 to 14ae529 Compare January 17, 2024 16:53
ViktorTigerstrom (Contributor, Author)

After discussing this with @ellemouton offline, I've updated the TimeoutManager's Received function so that it grabs the main mu at the beginning of the function. This ensures that any GetResendTimeout call we receive concurrently, after one thread has just gotten a Received call that will update the resend timeout, will return the updated timeout!
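
A simplified sketch of that locking order (the Received signature and the update rule are condensed; the merged code does more):

// Received holds the main mutex for the entire update, so a
// concurrent GetResendTimeout call blocks until the new value has
// been written and therefore always observes the updated timeout.
func (m *TimeoutManager) Received(sentAt time.Time) {
	m.timeoutManagerMu.Lock()
	defer m.timeoutManagerMu.Unlock()

	// A manually configured timeout is never updated dynamically.
	if m.useStaticTimeout {
		return
	}

	timeout := time.Duration(m.resendMultiplier) * time.Since(sentAt)
	if timeout > m.resendTimeout {
		m.resendTimeout = timeout
	}
}

// GetResendTimeout takes the same mutex, so it can never observe a
// half-finished update.
func (m *TimeoutManager) GetResendTimeout() time.Duration {
	m.timeoutManagerMu.Lock()
	defer m.timeoutManagerMu.Unlock()

	return m.resendTimeout
}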

The timeout manager is a manager for all timeout values used by the
different components of the gbn package. The timeout manager enables
the functionality of dynamically changing the resend timeout value based
on how long it takes to receive a response from the counterparty.

To make the gbn package more modular, we'll move the responsibility of
managing the timeout values of the gbn package to the timeout manager in
the next commit.

This commit moves the responsibility of managing the timeout values
throughout the gbn package to the `TimeoutManager` struct.

Even though the `TimeoutManager` is now used to manage timeouts, the
resend timeout is not yet set dynamically for the connections of the gbn
package. Support and usage of that functionality will be added in the
upcoming commits.
This commit changes the gbn `Config` struct to no longer contain any
specific timeout fields, but to instead contain functional options that
modify the `TimeoutManager` struct. These options are then applied to
the `TimeoutManager`.

This commit adds support for setting the resend timeout dynamically for
a connection. Prior to this commit, the timeout before a client or
server resends the queue of packets was always set to a fixed value.
A fixed timeout isn't suitable for all connections though, as the
latency for different clients varies.

With this commit, we instead add support for setting the resend timeout
based on:
* How long it took for the other party to respond during the
handshake process.
* How long it took for the other party to respond with the
corresponding ACK for a sent data packet.

The timeout is then set to the time it took for the server to respond,
multiplied by the resendMultiplier, unless the duration is shorter than
the default resend timeout.

Note though that if a connection's resend timeout is manually set
through WithStaticResendTimeout, the resend timeout will always be set
to that value, and won't be dynamically updated.

With this commit, the resend timeout will now be set dynamically for the
connections of the gbn package.
TimeoutBooster is a type that can be used to boost the value of a
timeout by a specified percentage. The timeout can be boosted multiple
times, and the number of boosts is then cumulative.

When we need to resend a data packet, or when we need to resend the SYN
message during the handshake because the other side did not respond
within the given timeout, we will boost the timeout by 50% each time we
need to resend without receiving any response.
This ensures that if the original timeouts are set to too short a
duration given the current network latency, we will eventually boost the
timeouts to a duration long enough to allow the other side to respond
within the timeout.
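
A minimal sketch of a booster in that spirit (a simplified model; the merged TimeoutBooster differs in detail):

package gbn

import (
	"sync"
	"time"
)

// timeoutBooster models a timeout that is cumulatively boosted by a
// fixed percentage each time the peer fails to respond in time.
type timeoutBooster struct {
	mu              sync.Mutex
	originalTimeout time.Duration
	boostPercent    float64 // e.g. 0.5 for a 50% boost per resend
	boostCount      int
}

func newTimeoutBooster(timeout time.Duration,
	boostPercent float64) *timeoutBooster {

	return &timeoutBooster{
		originalTimeout: timeout,
		boostPercent:    boostPercent,
	}
}

// Boost is called on every resend that got no response.
func (b *timeoutBooster) Boost() {
	b.mu.Lock()
	defer b.mu.Unlock()

	b.boostCount++
}

// GetCurrentTimeout returns the boosted timeout, i.e.
// originalTimeout * (1 + boostPercent*boostCount).
func (b *timeoutBooster) GetCurrentTimeout() time.Duration {
	b.mu.Lock()
	defer b.mu.Unlock()

	factor := 1 + b.boostPercent*float64(b.boostCount)

	return time.Duration(float64(b.originalTimeout) * factor)
}
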
If we have awaited a sync after sending the previous ping, both the
pingTicker and pongTicker may have ticked while waiting to sync.
In that case, we need to ensure that the pongTicker gets prioritized,
which this commit ensures.

This fixes a bug where the server would not complete the handshake if
the server restarted the handshake after sending its SYN, which the
client then received and used to complete the handshake on its side.
This could previously happen if the server timed out while waiting for
the client's SYN response, or if the client's SYNACK was lost due to
packet loss.
@ViktorTigerstrom ViktorTigerstrom force-pushed the 2023-11-set-resend-timeout-dynamically branch from 14ae529 to 931a7c3 Compare January 17, 2024 17:01
@ellemouton (Member) left a comment

Awesome work!

@ellemouton ellemouton merged commit d8c9f92 into lightninglabs:master Jan 18, 2024
5 checks passed