Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make it possible to test gateway opening/closing in Connect #14135

Merged
merged 11 commits into from
Jul 7, 2022

Conversation

ravicious
Copy link
Member

Open() and Close() used to not return any error and Open() used to start the gateway in a goroutine, making it rather hard to write tests for it.

This commit makes it so that Open() and Close() return errors and Open() blocks.

Adjustments have been made to other places in lib/teleterm to account for that missing goroutine and returned errors.

ravicious added 2 commits July 6, 2022 11:56
Open() and Close() used to not return any error and Open() used to start
the gateway in a goroutine, making it rather hard to write tests for it.

This commit makes it so that Open() and Close() return errors and Open()
blocks.

Adjustments have been made to other places in lib/teleterm to account
for that missing goroutine and returned errors.
While writing tests for the gateways, I was relying heavily on tests for
the local proxy. I noticed that it starts the server but doesn't close it
so I added an appropriate call to the cleanup function.
Comment on lines 75 to 90
wait := make(chan error)

go func() {
err := gateway.Open()
wait <- err
}()

defer func() {
// Make sure Open() is called.
time.Sleep(time.Millisecond * 200)

err := gateway.Close()
require.NoError(t, err)
require.NoError(t, <-wait)
}()
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I already talked with Zac about this pattern under #14074 (comment). time.Sleep is bad but I don't know what would be a better approach with this API. Maybe I should just reword this comment so that it doesn't suggest that sleeping actually makes sure that Open() is called.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to block until Open is called then perhaps Gateway should have a chan struct{} field that gets closed before calling localProxy.Start. From what I'm seeing, there's no real threshold in the middle of (*Gateway).Open or (*LocalProxy).Start in which the outer behavior changes, anyway - inbound connections get buffered as soon as the listener is opened in gateway.New, even before Open, and will just be processed by the accept loop in LocalProxy whenever it gets the chance.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sidenote: (*Gateway).Open and (*LocalProxy).Start should probably both be called Serve or something along those lines, IMO

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

second sidenote: why not just net.Dial the address to confirm that the other side has actually accepted the connection?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@espadolini Oh that's a good idea, would something like what I pushed in 1efd5ef work? The APIs in the net package are one level lower than what I'm used to.

sidenote: (*Gateway).Open and (*LocalProxy).Start should probably both be called Serve or something along those lines, IMO

I don't really want to rewrite ALPN proxy code right now but I can certainly take care of Gateway.Open.

lib/teleterm/daemon/daemon.go Show resolved Hide resolved
Comment on lines 75 to 90
wait := make(chan error)

go func() {
err := gateway.Open()
wait <- err
}()

defer func() {
// Make sure Open() is called.
time.Sleep(time.Millisecond * 200)

err := gateway.Close()
require.NoError(t, err)
require.NoError(t, <-wait)
}()
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to block until Open is called then perhaps Gateway should have a chan struct{} field that gets closed before calling localProxy.Start. From what I'm seeing, there's no real threshold in the middle of (*Gateway).Open or (*LocalProxy).Start in which the outer behavior changes, anyway - inbound connections get buffered as soon as the listener is opened in gateway.New, even before Open, and will just be processed by the accept loop in LocalProxy whenever it gets the chance.

Comment on lines 75 to 90
wait := make(chan error)

go func() {
err := gateway.Open()
wait <- err
}()

defer func() {
// Make sure Open() is called.
time.Sleep(time.Millisecond * 200)

err := gateway.Close()
require.NoError(t, err)
require.NoError(t, <-wait)
}()
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sidenote: (*Gateway).Open and (*LocalProxy).Start should probably both be called Serve or something along those lines, IMO

lib/teleterm/daemon/daemon.go Show resolved Hide resolved
Comment on lines 75 to 90
wait := make(chan error)

go func() {
err := gateway.Open()
wait <- err
}()

defer func() {
// Make sure Open() is called.
time.Sleep(time.Millisecond * 200)

err := gateway.Close()
require.NoError(t, err)
require.NoError(t, <-wait)
}()
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

second sidenote: why not just net.Dial the address to confirm that the other side has actually accepted the connection?

@ravicious ravicious requested a review from espadolini July 6, 2022 15:52
Comment on lines 84 to 85
_, err = net.DialTimeout("tcp", gatewayAddress, time.Second*1)
require.NoError(t, err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
_, err = net.DialTimeout("tcp", gatewayAddress, time.Second*1)
require.NoError(t, err)
c, err = net.DialTimeout("tcp", gatewayAddress, time.Second*1)
require.NoError(t, err)
t.Cleanup(func() { c.Close() })

Copy link
Contributor

@ibeckermayer ibeckermayer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM granted that I'm not familiar with this part of the codebase.

I notice that with these changes, Serve() (previously Open()) is always called in a go func() {} block, so making it block rather than launch a goroutine itself is immaterial. There's precedent for both designs, your call which you think is best.

@ravicious
Copy link
Member Author

The change to make Serve() block was done mostly so that I can verify in tests that no error was returned from this function. Previously it swallowed all errors returned by localProxy.Start():

// Open opens a gateway to Teleport proxy
func (g *Gateway) Open() {
go func() {
g.Log.Info("Gateway is open.")
if err := g.localProxy.Start(g.closeContext); err != nil {
g.Log.WithError(err).Warn("Failed to open a connection.")
}
g.Log.Info("Gateway has closed.")
}()
}

And it looks like most stuff that Connect uses behaves this way (ALPN proxy, gRPC server) so it made sense for me to make Gateway behave like this as well.

I didn't know there was a precedent for Serve()-like method launching a goroutine itself! Unless you're counting gateway.Open() as a precedent. :D

Comment on lines 84 to 85
conn, err := net.DialTimeout("tcp", gatewayAddress, time.Second*1)
require.NoError(t, err)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@espadolini One thing that worries me is that this check fails only if I don't start the listener on that address. It doesn't fail if I never actually start the local proxy or if I start it after 1.5 second.

It turns out that localProxy.Start(), unlike gRPC's Server.Serve(), doesn't error if it's called after Close().

So the test not failing if Serve() is called after Close() is okay I guess. But why doesn't the test fail if never start the local proxy?

Copy link
Member Author

@ravicious ravicious Jul 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to adapt this to teleterm_test.go to replace time.Sleep with Dial but it seems that I misunderstood how Dial works. It doesn't actually wait for the other side to accept the connection since, as you said, the listener buffers those connections before they're accepted by the server.

So I cannot really use Dial there to make sure the server is accepting connections.

Diff of teleterm_test.go
diff --git a/lib/teleterm/teleterm.go b/lib/teleterm/teleterm.go
index b5356fdc9..123a5be95 100644
--- a/lib/teleterm/teleterm.go
+++ b/lib/teleterm/teleterm.go
@@ -16,8 +16,10 @@ package teleterm
 
 import (
 	"context"
+	"fmt"
 	"os"
 	"os/signal"
+	"time"
 
 	"github.com/gravitational/teleport/lib/teleterm/apiserver"
 	"github.com/gravitational/teleport/lib/teleterm/clusters"
@@ -60,7 +62,10 @@ func Start(ctx context.Context, cfg Config) error {
 
 	serverAPIWait := make(chan error)
 	go func() {
+		time.Sleep(time.Second * 2)
+		fmt.Println("apiServer.Serve()")
 		err := apiServer.Serve()
+		fmt.Println("apiServer.Serve() returned")
 		serverAPIWait <- err
 	}()
 
diff --git a/lib/teleterm/teleterm_test.go b/lib/teleterm/teleterm_test.go
index 51be1b619..649bac66f 100644
--- a/lib/teleterm/teleterm_test.go
+++ b/lib/teleterm/teleterm_test.go
@@ -16,7 +16,11 @@ package teleterm_test
 
 import (
 	"context"
+	"errors"
 	"fmt"
+	"net"
+	"os"
+	"path/filepath"
 	"testing"
 	"time"
 
@@ -27,27 +31,38 @@ import (
 
 func TestStart(t *testing.T) {
 	homeDir := t.TempDir()
+	sockPath := filepath.Join(homeDir, "teleterm.sock")
+	addr := fmt.Sprintf("unix://%v", sockPath)
 	cfg := teleterm.Config{
-		Addr:    fmt.Sprintf("unix://%v/teleterm.sock", homeDir),
+		Addr:    addr,
 		HomeDir: fmt.Sprintf("%v/", homeDir),
 	}
 
 	ctx, cancel := context.WithCancel(context.Background())
 	defer cancel()
 
-	wait := make(chan error)
+	serveErr := make(chan error)
 	go func() {
 		err := teleterm.Start(ctx, cfg)
-		wait <- err
+		serveErr <- err
 	}()
 
-	defer func() {
-		// Make sure Start() is called.
-		time.Sleep(time.Millisecond * 500)
+	require.Eventually(t, func() bool {
+		_, err := os.Stat(sockPath)
 
-		// Stop the server.
-		cancel()
-		require.NoError(t, <-wait)
-	}()
+		return !errors.Is(err, os.ErrNotExist)
+	}, time.Millisecond*500, time.Millisecond*50)
+
+	// Dial to make sure Start was called.
+	fmt.Println("net.DialTimeout()")
+	conn, err := net.DialTimeout("unix", sockPath, time.Second*1)
+	fmt.Println("net.DialTimeout() returned")
+	require.NoError(t, err)
 
+	// Stop the server.
+	fmt.Println("cancel()")
+	cancel()
+	conn.Close()
+	fmt.Println("waiting for serveErr")
+	require.NoError(t, <-serveErr)
 }

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would conn.Read() block until the server accepts the connection?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I think conn.Read() does the trick. I just pushed 3e34e02 which fails if the connection isn't accepted within a second.

I was also able to improve teleterm tests in a similar way, I'll push those changes in a separate PR.

@ravicious ravicious requested a review from espadolini July 7, 2022 11:23
lib/teleterm/gateway/gateway_test.go Outdated Show resolved Hide resolved
@ravicious ravicious enabled auto-merge (squash) July 7, 2022 12:38
@ravicious ravicious merged commit 1543922 into master Jul 7, 2022
@github-actions
Copy link

github-actions bot commented Jul 7, 2022

@ravicious See the table below for backport results.

Branch Result
branch/v10 Create PR

@ibeckermayer
Copy link
Contributor

I didn't know there was a precedent for Serve()-like method launching a goroutine itself! Unless you're counting gateway.Open() as a precedent. :D

Yeah you're right, I glossed over some of the other implementations, but now that I look more closely its standard practice to block.

ravicious added a commit that referenced this pull request Jul 11, 2022
Edoardo asked about it some time ago. I forgot to add the explanation to
the code.

#14135 (comment)
ravicious added a commit that referenced this pull request Jul 19, 2022
* daemon: Put gateway-related methods next to each other

* Remove unused fields from daemon.Config

* Make Config a private field instead of embedding it

* Add tests for gateway CRUD

* Remove unnecessary ctx and error from daemon.Service gateway methods

* Refactor daemon.Service.gateways to a hash map

* Add comment explaining error handling in removeGateway

Edoardo asked about it some time ago. I forgot to add the explanation to
the code.

#14135 (comment)

* Do not return pointers from ListGateways()

* Remove FindGateway, fix lock issues in RestartGateway() and RemoveGateway()
github-actions bot pushed a commit that referenced this pull request Jul 19, 2022
Edoardo asked about it some time ago. I forgot to add the explanation to
the code.

#14135 (comment)
ravicious added a commit that referenced this pull request Jul 20, 2022
* daemon: Put gateway-related methods next to each other

* Remove unused fields from daemon.Config

* Make Config a private field instead of embedding it

* Add tests for gateway CRUD

* Remove unnecessary ctx and error from daemon.Service gateway methods

* Refactor daemon.Service.gateways to a hash map

* Add comment explaining error handling in removeGateway

Edoardo asked about it some time ago. I forgot to add the explanation to
the code.

#14135 (comment)

* Fix license

* Rename ClusterResolver to Resolver

* Do not return pointers from ListGateways()

* Remove FindGateway, fix lock issues in RestartGateway() and RemoveGateway()
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants