Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

First big step towards solving issues 2217 and 2465 #3037

Merged
merged 61 commits into from
May 30, 2017

Conversation

magiconair
Copy link
Contributor

not done yet

@magiconair
Copy link
Contributor Author

TestAgent looks better now. Still one test failing. Need to fix the httpTest code.

@magiconair magiconair force-pushed the issue-2217-multiple-bind-addr branch 2 times, most recently from c66eaa5 to 5fb0ce5 Compare May 21, 2017 08:02
@magiconair
Copy link
Contributor Author

Tests are fixed and the agent tests now run fully in parallel. Going from 108 sec locally down to 15 ...

@magiconair magiconair force-pushed the issue-2217-multiple-bind-addr branch from 5fb0ce5 to c0e249d Compare May 21, 2017 15:22
@magiconair
Copy link
Contributor Author

TestAgent refactor is done and the tests pass. However, quite regularly one of the tests hangs because the leader election fails. This behavior can be reduced by reducing the parallelism of go test but it is something we should look at.

@magiconair magiconair force-pushed the issue-2217-multiple-bind-addr branch from 5bae6bc to 138d16f Compare May 21, 2017 19:00
@magiconair magiconair changed the title WIP: work towards solving issue 2217 WIP: work towards solving issues 2217 and 2465 May 21, 2017
Copy link
Contributor

@slackpad slackpad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noted a few small things but overall looks good. I like the approach, lack of callbacks and associated de-nesting, and elimination of several test frameworks :-)

a.wgServers.Add(1)
go func() {
defer a.wgServers.Done()
up.Done()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does this ensure that Serve() actually got called? It seems like this can still bail out too early since it's before the actual Serve() call.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does not guarantee it. Until the Go team fixes the bug I think this is the best we can do with the minimal amount of overhead. This guarantees that the go routine that executes Serve gets at least scheduled once before Start returns. My understanding is that it yields again on the next blocking call which should happen within Serve. I don't see how it would yield between up.Done() and Serve until I got it wrong.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough - I'm not sure how you'd do better than this given the current situation w/o polling an endpoint, which would be truly horrible to do.

srv.Shutdown()
}
for _, srv := range a.httpServers {
a.logger.Println("[INFO] agent: Stopping HTTP endpoint", srv.Addr)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be plural "endpoints" too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, since it now logs one line per endpoint with the specific address of the endpoint.

err := a.delegate.Shutdown()
a.logger.Print("[INFO] agent: delegate down")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This message probably won't make sense to users. We could probably remove this, or move it to debug level.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True

@@ -285,7 +277,10 @@ func TestAgent_Reload(t *testing.T) {
}()

retry.Run(t, func(r *retry.R) {
if got, want := len(cmd.httpServers), 1; got != want {
if cmd == nil || cmd.agent == nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think cmd can be nil here. Would this fail the race detector on cmd.agent since it looks like the cmd.Run() in a separate goroutine sets it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I check.

in, _, err := c.Exchange(m, addr.String())
if err != nil {
t.Fatalf("err: %v", err)
}

// Only 1 is passing, so we should only get 1 answer
for _, a := range in.Answer {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like leftover debug code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a look

}

func ListenUnix(addr string, perm FilePermissions) (net.Listener, error) {
// todo(fs): move this somewhere else
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This did get moved to startHTTP().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will remove

)

func init() {
rand.Seed(time.Now().UnixNano()) // seed random number generator
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've got https://github.com/hashicorp/consul/tree/master/vendor/github.com/sean-/seed vendored and can use that, though this is probably cheaper and fine for unit tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need a lib for that?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's shared across projects so there's some justification, though it could be paste-able pretty easily. There's some nuance in doing it correctly (doesn't matter for unit tests), so that probably tips in favor vs. pasting.

// during creation or startup instead of returning errors. It manages a
// temporary data directory which is removed after shutdown.
//
// todo(fs): do we need the temp data dir if we run in dev mode?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't, though when I tried to use dev mode here I got some failures (some of the tests use features that don't work in dev mode). We will need to be able to tell the test agent whether or not to use dev mode and default that to true most likely, and opt-out in a few places.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like something for a different PR though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can check what the impact is but since we're not using all CPUs on the agent test yet I think there is more potential somewhere else. I got at most half of the cores used. With go test --parallel 24 test execution went from 15 to 11 sec so not such a huge gain.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it's probably better coverage-wise to activate non-dev as well. Dev mode bypasses some key places that are worth kicking during our unit tests.

}

// BoolTrue and BoolFalse exist to create a *bool value.
var BoolTrue = true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can these be const?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, since you can't create a reference to a const. Tried that, too :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sometimes I miss (a few small parts of) C++ (every once in a while).


func nextConfig() *agent.Config {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we kill this stuff?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm working on that right now. That and then the code in command/server_test.go

@magiconair magiconair force-pushed the issue-2217-multiple-bind-addr branch 3 times, most recently from fd0843e to 2191f6a Compare May 23, 2017 19:06
@magiconair magiconair force-pushed the issue-2217-multiple-bind-addr branch from 2e8f77d to f0401ee Compare May 24, 2017 09:44
@magiconair
Copy link
Contributor Author

rebased to master to pick up Azure RetryJoin and latest go-sockaddr

Copy link
Contributor

@slackpad slackpad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noted some small things but otherwise looks good to merge!

if err == nil || !strings.Contains(err.Error(), rootDenied) {
t.Fatalf("err: %v", err)
}

// The ACL master token should also not call the server, but should give
// us a working agent token.
acl, err := agent.resolveToken("towel")
// us a working a token.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should still be "agent".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

search and replace. You've got eagle eyes.

@@ -75,16 +78,20 @@ type clientServer interface {
// mode, it runs a full Consul server. In client-only mode, it only forwards
// requests to other Consul servers.
type Agent struct {
// id is an optional log prefix.
id string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be confusing w/the existing node ID. Maybe we can call this logPrefix so it's more explicit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it makes more sense to just provide a properly configured logger which is used everywhere. Since I've fixed the logger passing we can move this into the TestAgent

}

// wait for servers to be up
// todo(fs): not sure whether this is the right approach.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks pretty reasonable - if the server doesn't start then we should hit the timeout case and abort the startup, which I think is the minimum we should do. Failing the agent later if a server error occurs seems good for robustness, but should be done separately in a future PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(that'll let the process supervisor try to clean it up)

break
}
l, err = tls.Listen("tcp", p.Addr, tlscfg)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should add a default here so we don't just panic if we mis-configure the addrs in the future.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

// Shutdown is called before the Serve go routine was scheduled then
// the Serve go routine never returns. This deadlocks the agent
// shutdown for some tests since it will wait forever.
if strings.Contains("*tls.listener", fmt.Sprintf("%T", l)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this pattern vs. a type assertion?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Type assertions don't work with non-exported types :/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll update the comment to make that clear since you're not the first one to ask.

// srv.Close()

// graceful shutdown
ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to log anything if we timeout here? Right now I don't think we'd know if we were hitting this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do

// connections to consul.
func (c *Config) IncomingTLSConfig() (*tls.Config, error) {
tc := &tlsutil.Config{
VerifyIncoming: c.VerifyIncoming || c.VerifyIncomingHTTPS,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is only set up properly for HTTPS listeners (not RPC listeners), so we should rename it to be more specific and update the comment.

time.Sleep(50 * time.Millisecond)
p := &UserEvent{Name: "second"}
if err := a.UserEvent("dc1", "root", p); err != nil {
t.Fatalf("err: %v", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This won't work inside a goroutine, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which part? the time.Sleep or the t.Fatal ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The t.Fatal

}
list = ln
} else {
list = tcpKeepAliveListener{ln.(*net.TCPListener)}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this detail got lost in the refactor, was that intentional?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought the tcpKeepAliveListener would be used by default in the http package - which it is but only if you use the ListenAndServe methods - which we don't. I'll add it back in. Forgot to do that.

waitForLeader(t, a1.httpAddr)
t.Parallel()
cfg := agent.TestConfig()
cfg.DisableRemoteExec = agent.Bool(false)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be &BoolFalse.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

* use agent logger for consul/serf/raft/dns/agent/...
* support optional id for concurrent tests
* don't use retry to try restarting the agent
  this caused some issues when the startup would fail in
  a separate go routine

* clear out the data directory on every retry since the ports
  are stored in the raft data files

* set a unique id for every agent to allow for tracking of
  concurrent output
* capture all output to test.log only and print only failures
* always print output on travis
* test github.com/hashicorp/consul/consul separately until we've
  found the interactions with the other packages during testing.
We need to init the flagset that cli uses to generate the help
outside of the Run method since Run isn't called anymore for
printing help.
* refactor DNS server to be ready for multiple bind addresses
* drop tcpKeepAliveListener since it is default for the HTTP servers
* add startup timeout watcher for HTTP servers identical to DNS server
@magiconair magiconair force-pushed the issue-2217-multiple-bind-addr branch from f673590 to ec72969 Compare May 30, 2017 23:09
@magiconair magiconair merged commit ec72969 into master May 30, 2017
@magiconair magiconair deleted the issue-2217-multiple-bind-addr branch May 30, 2017 23:12
@magiconair magiconair changed the title WIP: work towards solving issues 2217 and 2465 First big step towards solving issues 2217 and 2465 May 30, 2017
@@ -31,7 +31,7 @@ func TestExecCommand_implements(t *testing.T) {
func TestExecCommandRun(t *testing.T) {
t.Parallel()
cfg := agent.TestConfig()
cfg.DisableRemoteExec = agent.Bool(false)
cfg.DisableRemoteExec = &agent.BoolFalse
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There were a bunch more of these in this file (I didn't tag them all) :-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll revert to the Bool() function since it is more flexible in situations where you need to derive the pointer from another value. The main issue is the use of *bool in the first place IMO.

select {
case <-done:
// server down within timeout
case <-ctx.Done():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can just check the error from ctx.Done() here, this is correct but you don't need the extra channel.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, and a good idea as well since we're timing out every time. The server doesn't shut down cleanly and I need to figure out why. I've refactored it to not use the second channel and will investigate the code that hangs the server during shutdown.

agent.LogWriter = a.LogWriter
agent.logger = log.New(logOutput, id, log.LstdFlags)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see the log prefix anymore in test.log.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm they are there now that I'm looking. Perhaps just some of the tests are missing the prefix.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only tests that use the TestAgent set the prefix.

1yhud3gtn56jf   2017/05/31 09:42:26 [INFO] raft: Node at 127.0.0.1:50636 [Candidate] entering Candidate state in term 2
1yhud3gtn56jf   2017/05/31 09:42:26 [DEBUG] raft: Votes needed: 1
1yhud3gtn56jf   2017/05/31 09:42:26 [DEBUG] raft: Vote granted from 127.0.0.1:50636 in term 2. Tally: 1
1yhud3gtn56jf   2017/05/31 09:42:26 [INFO] raft: Election won. Tally: 1
1yhud3gtn56jf   2017/05/31 09:42:26 [INFO] raft: Node at 127.0.0.1:50636 [Leader] entering Leader state

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll use the test name as an id as you suggested earlier. That makes more sense even if the logs are a bit misaligned. grep TestName test.log will then show you everything you need to know.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

$ grep TestDNS_Recurse_Truncation test.log
=== RUN   TestDNS_Recurse_Truncation
TestDNS_Recurse_Truncation - 2017/05/31 09:51:15 [INFO] raft: Initial configuration (index=1): [{Suffrage:Voter ID:127.0.0.1:54916 Address:127.0.0.1:54916}]
TestDNS_Recurse_Truncation - 2017/05/31 09:51:15 [INFO] raft: Node at 127.0.0.1:54916 [Follower] entering Follower state (Leader: "")
TestDNS_Recurse_Truncation - 2017/05/31 09:51:15 [INFO] serf: EventMemberJoin: Node 77c888a2-7a88-a600-22a6-33064aa14734 127.0.0.1
TestDNS_Recurse_Truncation - 2017/05/31 09:51:15 [INFO] consul: Adding LAN server Node 77c888a2-7a88-a600-22a6-33064aa14734 (Addr: tcp/127.0.0.1:54916) (DC: dc1)
TestDNS_Recurse_Truncation - 2017/05/31 09:51:15 [INFO] serf: EventMemberJoin: Node 77c888a2-7a88-a600-22a6-33064aa14734.dc1 127.0.0.1

@magiconair
Copy link
Contributor Author

pushed changes to master

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants