-
-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Send peer deltas in MapResponse #1390
Conversation
e459101
to
1d74d18
Compare
machine.go
Outdated
@@ -588,6 +597,8 @@ func (h *Headscale) DeleteMachine(machine *Machine) error { | |||
return err | |||
} | |||
|
|||
h.setLastStateChangeToNow() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not so sure about this change, but I was surprised to see setLastStateChangeToNow
called in all other changes to machines, but not in DeleteMachine
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes makes sense, it kind of highlight how fragile the current "remember to update the state" approach is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice job here ! I added some comments to help you pas the tests. Your only real mistake here is forgetting dots at the end of your comments. (I think lint don't like it, but we do xD)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My bad GIT didn't show me the lint commit at first, thanks for your implication
api_common.go
Outdated
@@ -103,6 +94,62 @@ func (h *Headscale) generateMapResponse( | |||
}, | |||
} | |||
|
|||
// Send only deltas if this is the next map response in a streamed sequence |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you split this out to a function and add some tests?
nil (full update), node added, node removed, no change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please write table driven tests, and do not take inspiration from our old terrible tests :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good... can you point me in the right direction in terms of where to put this test and and preferred framework for table driven tests? I've only written table driven tests in JS
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need a framework, but a structure like this should suffice https://github.com/kradalby/headscale/blob/acl-refactor/acls_test.go#L1698C1-L1834
Looks sensible 👍 |
just a reminder, need to ensure there's no delta may miss if there's any network unstable. |
@gps949, this approach should work as the state that determines if we should send the delta is stored in a function that only lives while the connection is alive. That said, it would be have released as testing for a while. |
@kev-the-dev cna you rebase to main? |
a8cb587
to
d013811
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this looks reasonable, I think there is no point in nitting to much on how we want to strucutre the already silly large mapresponse function as I am about to start work on restructuring it to support some needed changes for ACLs.
If you resolve the few nits, and the tests passes, I'm happy to merge this, and then we can urge users to help test it.
api_common_test.go
Outdated
type TestParameters struct { | ||
// Metadata | ||
Name string | ||
|
||
// Setup | ||
FullUpdate bool | ||
PreviousNodes Machines | ||
CurrentNodes Machines | ||
|
||
// Assertions | ||
ExpectedPeers []*tailcfg.Node | ||
ExpectedPeersChanges []*tailcfg.Node | ||
ExpectedPeersRemoved []tailcfg.NodeID |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: these should start with lowercase names
api_common_test.go
Outdated
ExpectedPeers []*tailcfg.Node | ||
ExpectedPeersChanges []*tailcfg.Node | ||
ExpectedPeersRemoved []tailcfg.NodeID |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: for consistency, please use "want" instead of "Expected"
if you can rebase this of master and fix the lint error, that should be good. |
19a3c4e
to
d7cff01
Compare
Not sure about this integration test failure. It is suspiciously related to the change but has been flaky throughout this PR's lifecycle (sometimes passes). |
Oh I think it's related to the integration test testing multiple versions of tailscale client, some of which don't have support for the map deltas. We'll need to either drop support for these versions or introduce code to do different things depending on client version |
When a maintainer gets the chance, I'm hoping the next run of the integration tests with new debug lines will clarify what versions of tailscale this is failing for |
Most of them seem to pass, so that would be strange, I would expect to see more of them failing. It would have to be done with CapVersion |
be4800f
to
cf83e69
Compare
Given it was 1.36.2 that failed, I think it wasn't an issue with Tailscale versions. My current guess is it's just a timing issue.. the test sleeps for a fixed 30 seconds. Depending on CPU load of CI, etc that may not work. I updated test to explicitly wait for the condition we're looking for. |
b986f95
to
c7ade76
Compare
hscontrol/api_common.go
Outdated
toNodes func(Machines) ([]*tailcfg.Node, error)) (tailcfg.MapResponse, error, | ||
) { | ||
// Full update | ||
if streamState == nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems streamState won't be nil when the applyMapResponseDelta called first time in the "case <-updateChan" branch. But there's no peers has been stored in streamState.peersByID. So that may cause the 2nd map response send as delta but generated from an empty state. May that cause some duplicated peers and some peers won't be remove forever in this poll stream?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I think you're right! Good find. This could explain the integration test failure
c7ade76
to
bb6969d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kradalby sorry for all the churn... I believe I now fixed the integration test as it passes locally for me
@@ -466,7 +466,8 @@ func (t *TailscaleInContainer) WaitForLogout() error { | |||
return fmt.Errorf("failed to fetch tailscale status: %w", err) | |||
} | |||
|
|||
if status.CurrentTailnet == nil { | |||
// Catch case where node is never logged in, or has a tailnet but was logged out /expired | |||
if status.CurrentTailnet == nil || status.BackendState == "NeedsLogin" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found that when node is expired, CurrentTailnet is not nil, but the BackendState changes
// | ||
// streamState is persistent state maintained across a stream of mapResponse messages. | ||
// For first message in stream, send the address of a zero-initialized struct, | ||
// then send the same address in subsequent messages to enable sending of delta-only messages. | ||
func (h *Headscale) getMapResponseData( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed streamState
no longer be optional. It will be sent in each call, including the first call. For a full update, you can send a zero-init struct
Hm now failures with message:
I can't imagine this being related to the PR... perhaps a re-run? |
@kradalby any more changes to request here? would be great to have this for the next release given it fixes at least 1 bug / security concern |
It is not the best timing, we started quite a large code reorg, and we will likely not cut a new version for quite some time. We will have to discuss it a bit. The second part of the code reorg is in #1458, and that brings in the biggest changes. However, the changes touches so much of the code that I suspect we will run it as beta for quite some time, if not alpha. |
Summary
Implement the sending of peer list deltas in MapResponse, rather then sending the full list each time. Newer versions of tailscale support this, but it had not yet been implemented in Headscale and was left as a TODO. Sending peer list deltas reduces data usage of the control protocol, especially for larger tailnets.
This also fixes a known bug described in #1383, where a peer would not be removed from the network if it is deleted and there are only 2 peers in the network.
This is implemented in a way intended to later be expanded to other deltas being sent in MapResponse. It introduces a new struct
mapResponseStreamState
that is forwarded to sequential calls to generate a MapResponse. Over time, this struct can evolve to track more state to move more of the MapResponse to supporting the deltas.Closes #1383
Testing
This was tested locally by creating a small tailnet with 2 peers, and deleting one to verify that the other peer is notified of the deleted peer.
CI unit and integration tests should give reasonable assurance of no other regressions with this change.