Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Send peer deltas in MapResponse #1390

Closed

Conversation

kev-the-dev
Copy link

Summary

Implement the sending of peer list deltas in MapResponse, rather then sending the full list each time. Newer versions of tailscale support this, but it had not yet been implemented in Headscale and was left as a TODO. Sending peer list deltas reduces data usage of the control protocol, especially for larger tailnets.

This also fixes a known bug described in #1383, where a peer would not be removed from the network if it is deleted and there are only 2 peers in the network.

This is implemented in a way intended to later be expanded to other deltas being sent in MapResponse. It introduces a new struct mapResponseStreamState that is forwarded to sequential calls to generate a MapResponse. Over time, this struct can evolve to track more state to move more of the MapResponse to supporting the deltas.

Closes #1383

Testing

This was tested locally by creating a small tailnet with 2 peers, and deleting one to verify that the other peer is notified of the deleted peer.

CI unit and integration tests should give reasonable assurance of no other regressions with this change.

machine.go Outdated
@@ -588,6 +597,8 @@ func (h *Headscale) DeleteMachine(machine *Machine) error {
return err
}

h.setLastStateChangeToNow()
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not so sure about this change, but I was surprised to see setLastStateChangeToNow called in all other changes to machines, but not in DeleteMachine

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes makes sense, it kind of highlight how fragile the current "remember to update the state" approach is.

Copy link
Contributor

@loprima-l loprima-l left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice job here ! I added some comments to help you pas the tests. Your only real mistake here is forgetting dots at the end of your comments. (I think lint don't like it, but we do xD)

api_common.go Outdated Show resolved Hide resolved
api_common.go Outdated Show resolved Hide resolved
api_common.go Outdated Show resolved Hide resolved
machine.go Outdated Show resolved Hide resolved
machine.go Outdated Show resolved Hide resolved
Copy link
Contributor

@loprima-l loprima-l left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad GIT didn't show me the lint commit at first, thanks for your implication

api_common.go Outdated
@@ -103,6 +94,62 @@ func (h *Headscale) generateMapResponse(
},
}

// Send only deltas if this is the next map response in a streamed sequence
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you split this out to a function and add some tests?

nil (full update), node added, node removed, no change.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please write table driven tests, and do not take inspiration from our old terrible tests :)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good... can you point me in the right direction in terms of where to put this test and and preferred framework for table driven tests? I've only written table driven tests in JS

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need a framework, but a structure like this should suffice https://github.com/kradalby/headscale/blob/acl-refactor/acls_test.go#L1698C1-L1834

@kradalby
Copy link
Collaborator

kradalby commented May 1, 2023

Looks sensible 👍

@gps949
Copy link

gps949 commented May 1, 2023

just a reminder, need to ensure there's no delta may miss if there's any network unstable.

@kradalby
Copy link
Collaborator

kradalby commented May 1, 2023

just a reminder, need to ensure there's no delta may miss if there's any network unstable.

@gps949, this approach should work as the state that determines if we should send the delta is stored in a function that only lives while the connection is alive.

That said, it would be have released as testing for a while.

@kev-the-dev kev-the-dev requested a review from kradalby May 4, 2023 19:12
@juanfont
Copy link
Owner

juanfont commented May 4, 2023

@kev-the-dev cna you rebase to main?

@kev-the-dev kev-the-dev force-pushed the kev-the-dev/peer-delta branch from a8cb587 to d013811 Compare May 5, 2023 01:08
Copy link
Collaborator

@kradalby kradalby left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks reasonable, I think there is no point in nitting to much on how we want to strucutre the already silly large mapresponse function as I am about to start work on restructuring it to support some needed changes for ACLs.

If you resolve the few nits, and the tests passes, I'm happy to merge this, and then we can urge users to help test it.

Comment on lines 12 to 24
type TestParameters struct {
// Metadata
Name string

// Setup
FullUpdate bool
PreviousNodes Machines
CurrentNodes Machines

// Assertions
ExpectedPeers []*tailcfg.Node
ExpectedPeersChanges []*tailcfg.Node
ExpectedPeersRemoved []tailcfg.NodeID
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: these should start with lowercase names

Comment on lines 22 to 24
ExpectedPeers []*tailcfg.Node
ExpectedPeersChanges []*tailcfg.Node
ExpectedPeersRemoved []tailcfg.NodeID
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: for consistency, please use "want" instead of "Expected"

@kradalby
Copy link
Collaborator

kradalby commented May 7, 2023

if you can rebase this of master and fix the lint error, that should be good.

@kev-the-dev kev-the-dev force-pushed the kev-the-dev/peer-delta branch from 19a3c4e to d7cff01 Compare May 8, 2023 14:40
@kev-the-dev
Copy link
Author

Not sure about this integration test failure. It is suspiciously related to the change but has been flaky throughout this PR's lifecycle (sometimes passes).

@kev-the-dev
Copy link
Author

Oh I think it's related to the integration test testing multiple versions of tailscale client, some of which don't have support for the map deltas. We'll need to either drop support for these versions or introduce code to do different things depending on client version

@kev-the-dev
Copy link
Author

When a maintainer gets the chance, I'm hoping the next run of the integration tests with new debug lines will clarify what versions of tailscale this is failing for

@kradalby
Copy link
Collaborator

Oh I think it's related to the integration test testing multiple versions of tailscale client, some of which don't have support for the map deltas. We'll need to either drop support for these versions or introduce code to do different things depending on client version

Most of them seem to pass, so that would be strange, I would expect to see more of them failing. It would have to be done with CapVersion

@kev-the-dev kev-the-dev force-pushed the kev-the-dev/peer-delta branch from be4800f to cf83e69 Compare May 10, 2023 15:25
@kev-the-dev
Copy link
Author

Given it was 1.36.2 that failed, I think it wasn't an issue with Tailscale versions. My current guess is it's just a timing issue.. the test sleeps for a fixed 30 seconds. Depending on CPU load of CI, etc that may not work. I updated test to explicitly wait for the condition we're looking for.

@kev-the-dev kev-the-dev force-pushed the kev-the-dev/peer-delta branch 3 times, most recently from b986f95 to c7ade76 Compare May 11, 2023 02:20
toNodes func(Machines) ([]*tailcfg.Node, error)) (tailcfg.MapResponse, error,
) {
// Full update
if streamState == nil {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems streamState won't be nil when the applyMapResponseDelta called first time in the "case <-updateChan" branch. But there's no peers has been stored in streamState.peersByID. So that may cause the 2nd map response send as delta but generated from an empty state. May that cause some duplicated peers and some peers won't be remove forever in this poll stream?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I think you're right! Good find. This could explain the integration test failure

@kev-the-dev kev-the-dev force-pushed the kev-the-dev/peer-delta branch from c7ade76 to bb6969d Compare May 15, 2023 15:20
Copy link
Author

@kev-the-dev kev-the-dev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kradalby sorry for all the churn... I believe I now fixed the integration test as it passes locally for me

@@ -466,7 +466,8 @@ func (t *TailscaleInContainer) WaitForLogout() error {
return fmt.Errorf("failed to fetch tailscale status: %w", err)
}

if status.CurrentTailnet == nil {
// Catch case where node is never logged in, or has a tailnet but was logged out /expired
if status.CurrentTailnet == nil || status.BackendState == "NeedsLogin" {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found that when node is expired, CurrentTailnet is not nil, but the BackendState changes

//
// streamState is persistent state maintained across a stream of mapResponse messages.
// For first message in stream, send the address of a zero-initialized struct,
// then send the same address in subsequent messages to enable sending of delta-only messages.
func (h *Headscale) getMapResponseData(
Copy link
Author

@kev-the-dev kev-the-dev May 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed streamState no longer be optional. It will be sent in each call, including the first call. For a full update, you can send a zero-init struct

@kev-the-dev
Copy link
Author

Hm now failures with message:

=== FAIL: integration TestTailscaleNodesJoiningHeadcale/create-tailscale (unknown)
2023/05/16 07:38:36 failed to create tailscale node: could not start tailscale container (version: 1.30.2): : The command '/bin/sh -c apt-get update     && apt-get install -y gnupg curl ssh dnsutils ca-certificates     && adduser --shell=/bin/bash ssh-it-user' returned a non-zero code: 100
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0xecb78b]

I can't imagine this being related to the PR... perhaps a re-run?

@kev-the-dev kev-the-dev requested a review from kradalby May 16, 2023 18:18
@kev-the-dev
Copy link
Author

@kradalby any more changes to request here? would be great to have this for the next release given it fixes at least 1 bug / security concern

@kradalby
Copy link
Collaborator

It is not the best timing, we started quite a large code reorg, and we will likely not cut a new version for quite some time. We will have to discuss it a bit.

The second part of the code reorg is in #1458, and that brings in the biggest changes. However, the changes touches so much of the code that I suspect we will run it as beta for quite some time, if not alpha.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

FR: Send Peers delta in MapResponse
5 participants