Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

server: ensure first node status written within Server.Start() #37168

Merged
merged 1 commit into from
Apr 29, 2019

Conversation

darinpp
Copy link
Contributor

@darinpp darinpp commented Apr 26, 2019

The fix ensures that the status of the first node is written before
the end of Server.Start. Currently this is deferred
and sometimes the code doesn't execute by the time the tests start
running. For the rest of the nodes, the status is written as part of
the node bootstrap.

Fixes #33559, FIxes #36990.

Release note: None

The fix ensures that the status of the first node is written before
the end of `Server.Start`. Currently this is deferred
and sometimes the code doesn't execute by the time the tests start
running. For the rest of the nodes, the status is written as part of
the node bootstrap.

Fixes cockroachdb#33559 and cockroachdb#36990.

Release note: None
@darinpp darinpp requested a review from a team April 26, 2019 20:56
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@darinpp darinpp requested review from tbg, andreimatei and mberhault and removed request for a team April 27, 2019 06:18
@andreimatei
Copy link
Contributor

This looks good to me, but give me a bit to understand all the node status code and why exactly putting a writeNodeStatus() call here doesn't work (not that I'm convinced that that'd be cleaner; what you're doing in this patch is pretty clean), but I'd still like to understand.

if len(emptyEngines) > 0 {

I'll come back.

Btw, you've changed the "fixes" line in the PR message, but I think that doesn't matter. I think the one that matters is the commit message, which is still wrong (I'm not completely sure, though). There's also a typo, but I don't know if that matters.

@andreimatei
Copy link
Contributor

I did just a little bit of digging. What I see is that, if I were to add the following diff so that a writeNodeStatus() would be called for the first node around the same time when it's called for every other node (i.e. in Node.start()), it wouldn't just work.

--- a/pkg/server/node.go
+++ b/pkg/server/node.go
@@ -469,6 +469,13 @@ func (n *Node) start(
                if err := n.bootstrapStores(ctx, emptyEngines, n.stopper); err != nil {
                        return err
                }
+       } else {
+               if err := n.writeNodeStatus(ctx, 0 /* alertTTL */); err != nil {
+                       log.Warningf(ctx, "error writing node summary !!!: %s", err)
+               } 

It doesn't work because the n.writeNodeStatus() ends up doing nothing because, a couple of layers down, inside MetricsRecorder.GenerateNodeStatus(), this short-circuit is taken:

if mr.mu.nodeRegistry == nil {

mr.mu.nodeRegistry is not set yet. That field is set later by this call:

s.recorder.AddNode(s.registry, s.node.Descriptor, s.node.startedAt, s.cfg.AdvertiseAddr, s.cfg.HTTPAdvertiseAddr)

So now I wonder, is the n.writeNodeStatus() call inside bootstrapStores() ever doing anything (even for nodes other than the first one?)

if err := n.writeNodeStatus(ctx, 0 /* alertTTL */); err != nil {

Cause it seems to me like it wouldn't, for the same reason - because the MetricsRegistry is not properly initialized by that point. Is that call just useless and should we delete it? I'd be interested to know the answer here.
The MetricsRecorder logs this warning if GenerateNodeStatus() is called too early. It'd be fantastic if we could turn it into returning an error...
log.Warning(ctx, "attempt to generate status summary before NodeID allocation.")

Alternatively, I think it'd be nice if node.Start() could call node.writeNodeStatus(). I think that call to s.recorder.AddNode(...) that initializes the MetricsRecorder could be moved inside node.Start(). But I'm not trying to create work for you here if you'd rather be working on something else... But at least I want to know the answer to the question above - if we should delete this line:

if err := n.writeNodeStatus(ctx, 0 /* alertTTL */); err != nil {

@darinpp
Copy link
Contributor Author

darinpp commented Apr 29, 2019

This clears things up a bit. I tried moving s.recorder.AddNode(...) with node.writeNodeStatus() inside node.Start(). I'll need some time to understand how each class is being used and what it does to be able to make the change. This code will definitely benefit from a bit of a cleanup. How about merging this as is so we can resolve the two issues and then follow up with another PR for doing the refactoring?

@andreimatei
Copy link
Contributor

yeah, LGTM, merge at will
I'd still check if the n.writeNodeStatus() call inside bootstrapStores() does anything and if it doesn't, I'd remove it as part of this PR.

@darinpp
Copy link
Contributor Author

darinpp commented Apr 29, 2019

bors r+

craig bot pushed a commit that referenced this pull request Apr 29, 2019
37168: server: ensure first node status written within Server.Start() r=darinpp a=darinpp

The fix ensures that the status of the first node is written before
the end of `Server.Start`. Currently this is deferred
and sometimes the code doesn't execute by the time the tests start
running. For the rest of the nodes, the status is written as part of
the node bootstrap.

Fixes #33559, FIxes #36990.

Release note: None

Co-authored-by: Darin <darinp@gmail.com>
@craig
Copy link
Contributor

craig bot commented Apr 29, 2019

Build succeeded

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

cli: Example_zone failed under stress server: TestReportUsage failed under stress
3 participants