
tracing,testutils: detect span leaks #58902

Merged
merged 3 commits into from
Jan 19, 2021

Conversation

irfansharif (Contributor)

With always-on tracing, we're maintaining an in-memory registry of active
spans (#58490). Spans are added to and removed from this registry when
they're Start()-ed and Finish()-ed. Spans that are not explicitly finished
(typically using defer sp.Finish()) are now a resource leak, as they
take up room in the registry. We'll want to find instances of this leak
as soon as they crop up.

To that end, we add a check in our TestCluster.Stop codepaths that
asserts that the registry is empty. This should give us wide coverage
given its usage throughout. We expect this change to capture the cases
described in #58721.


This check is currently failing, given that we're actually leaking spans
in txnState.resetForNewSQLTxn. Fixing that doesn't look simple; it ties
into questions around draining SQL and rolling back open txns.

Release note: None
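To illustrate the kind of check the description is talking about, here is a minimal, self-contained sketch of a tracer with an active-span registry and a leak assertion suitable for a Stop codepath. The type and method names (Tracer, Span, CheckNoLeakedSpans) are illustrative stand-ins, not the actual cockroachdb/cockroach tracing API.

```go
package main

import (
	"fmt"
	"sync"
)

// Tracer maintains an in-memory registry of active spans. Names here are
// hypothetical; they mirror the idea, not the real tracing package.
type Tracer struct {
	mu       sync.Mutex
	registry map[*Span]struct{}
}

// Span is registered with its Tracer on Start and deregistered on Finish.
type Span struct {
	op string
	t  *Tracer
}

func NewTracer() *Tracer {
	return &Tracer{registry: make(map[*Span]struct{})}
}

// StartSpan adds the new span to the tracer's registry.
func (t *Tracer) StartSpan(op string) *Span {
	sp := &Span{op: op, t: t}
	t.mu.Lock()
	defer t.mu.Unlock()
	t.registry[sp] = struct{}{}
	return sp
}

// Finish removes the span from the registry.
func (sp *Span) Finish() {
	sp.t.mu.Lock()
	defer sp.t.mu.Unlock()
	delete(sp.t.registry, sp)
}

// CheckNoLeakedSpans is the kind of assertion a TestCluster.Stop codepath
// could run: after shutdown, the registry must be empty.
func (t *Tracer) CheckNoLeakedSpans() error {
	t.mu.Lock()
	defer t.mu.Unlock()
	for sp := range t.registry {
		return fmt.Errorf("leaked span: %s", sp.op)
	}
	return nil
}

func main() {
	tr := NewTracer()
	sp := tr.StartSpan("txn")
	sp.Finish() // forgetting this would trip the leak check
	fmt.Println(tr.CheckNoLeakedSpans() == nil)
}
```

A span that is started but never finished stays in the map, so the check fails with the leaked span's operation name, which is exactly the signal needed to find the offending codepath.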

@irfansharif irfansharif requested review from knz and tbg January 13, 2021 02:04
@irfansharif irfansharif requested a review from a team as a code owner January 13, 2021 02:04
@cockroach-teamcity (Member)

This change is Reviewable

@tbg tbg (Member) left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @knz)


pkg/sql/txn_state.go, line 161 at r2 (raw file):

	//
	// XXX: We'll need to address the above with always-on tracing; we're
	// leaking spans.

One thing we could do is finish all the spans registered with the tracer on server shutdown. But that would destroy exactly the ability to check that we're not leaking them.

@asubiotto can you chime in here and suggest/make a change? We need reliable semantics about span finishing in all places, or the registry for long-running operations won't be very useful. My naive assumption is that a txnState only exists in the context of a conn executor, so if that had a close method that is called reliably (which should be possible) that would take care of the issue.

@irfansharif to get this PR merged, you could introduce a (temporary) tracing.WithBypassRegistry() option to StartSpan{,Ctx} that would simply avoid putting the Span in the registry, and use it here. Then we can file an issue against SQL and move on.
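The suggestion above fits Go's functional-options pattern. Below is a minimal sketch of what a bypass option could look like; the names (SpanOption, WithBypassRegistry, the Tracer shape) are modeled on the idea floated in this comment, not the actual cockroachdb/cockroach API.

```go
package main

import "fmt"

// spanOptions collects the knobs StartSpan understands. Hypothetical names.
type spanOptions struct {
	bypassRegistry bool
}

// SpanOption mutates spanOptions; this is the standard functional-options idiom.
type SpanOption func(*spanOptions)

// WithBypassRegistry keeps the resulting span out of the active-span
// registry, so a known leak (like txnState.resetForNewSQLTxn) doesn't
// trip the Stop-time assertion while a real fix is pending.
func WithBypassRegistry() SpanOption {
	return func(o *spanOptions) { o.bypassRegistry = true }
}

type Tracer struct {
	registry map[string]struct{}
}

// StartSpan applies the options and registers the span unless bypassed.
func (t *Tracer) StartSpan(op string, opts ...SpanOption) string {
	var o spanOptions
	for _, opt := range opts {
		opt(&o)
	}
	if !o.bypassRegistry {
		t.registry[op] = struct{}{}
	}
	return op
}

func main() {
	tr := &Tracer{registry: make(map[string]struct{})}
	tr.StartSpan("normal")
	tr.StartSpan("known-leak", WithBypassRegistry())
	fmt.Println(len(tr.registry)) // only "normal" was registered
}
```

The option is deliberately opt-in and per-callsite, so the escape hatch stays greppable: every use of WithBypassRegistry marks a known leak to be filed and fixed.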

@asubiotto asubiotto (Contributor) left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @knz)


pkg/sql/txn_state.go, line 161 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…


I agree with adding a temporary way to bypass adding spans to the registry (maybe with a more explicit you-shouldn't-do-this indicator in the name) to get this PR merged.

With respect to the specific problem of leaking spans on server shutdown, I'm assuming this is specifically about txns that don't respond to a cancellation request; otherwise I would expect finishSQLTxn to be called. If not, we'd probably need to do something in the connExecutor's close method, although I'm not sure it gets called in these cases. It's been a while since I looked at the code, but I think the best thing is to add this bypass and create an issue with a reproduction so that we can tackle it on the SQL exec team.

@irfansharif (Contributor, Author)

I'll wait for the discussion on #58897 before landing this. I added the tracing.WithBypassRegistry() option, so this should be ready to go.

...in the import processor. The contract of `RowSource.Next` requires
that at most one of the return values be non-empty. This wasn't the
case here, which caused opaque failures in cockroachdb#58897. cockroachdb#58897 tries to enable
background tracing by default, which for us means that the trailing meta
can be non-empty (it'll contain span recordings). That behaviour ends
up tickling this bug, tripping up TestCSVImportCanBeResumed.

Release note: None
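The `RowSource.Next` contract described in this commit message can be sketched as follows. The types (Row, Metadata, importProcessor) are simplified stand-ins for the real execinfra types; the point is the invariant: each call returns a row, or trailing metadata, or neither (end of stream), never both.

```go
package main

import "fmt"

// Row and Metadata stand in for the real execinfra types.
type Row []int
type Metadata struct{ trace string }

// importProcessor is a hypothetical source: it drains its rows first,
// then emits its trailing metadata (e.g. span recordings) on its own call.
type importProcessor struct {
	rows []Row
	meta *Metadata
}

// Next upholds the contract: at most one of the two return values is
// non-nil per call. Returning a row and metadata together on the same
// call is the bug that tripped TestCSVImportCanBeResumed.
func (p *importProcessor) Next() (Row, *Metadata) {
	if len(p.rows) > 0 {
		row := p.rows[0]
		p.rows = p.rows[1:]
		return row, nil
	}
	if p.meta != nil {
		meta := p.meta
		p.meta = nil
		return nil, meta
	}
	return nil, nil // end of stream
}

func main() {
	p := &importProcessor{
		rows: []Row{{1}, {2}},
		meta: &Metadata{trace: "span recording"},
	}
	for {
		row, meta := p.Next()
		if row == nil && meta == nil {
			break
		}
		// Exactly one of the two is non-nil on each iteration.
		fmt.Println(row != nil, meta != nil)
	}
}
```

With always-on tracing the trailing metadata is routinely non-empty, so a source that piggybacks metadata onto its last row call silently violates the contract, which is why the failures only surfaced once background tracing was enabled.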
@irfansharif irfansharif requested review from a team and adityamaru and removed request for a team and adityamaru January 18, 2021 21:03
This is follow-up work from cockroachdb#58712, where we measured the overhead for
always-on tracing and found it to be minimal/acceptable. Let's switch
it on by default to shake out the implications of doing so. We can
reasonably expect two kinds of fallout:

(1) Unexpected blowup in memory usage due to resource leakage (which
can be a problem now that we're always maintaining open spans in an
internal registry, see cockroachdb#58721)

(2) Performance degradation due to per-request tracing overhead
(something cockroachdb#58712 was spot-checking for).

For (1) we'll introduce a test in a separate PR. For (2), we'll
monitor roachperf over the next few weeks.

---

Also moved some of the documentation for the cluster setting into a
comment form above. Looking at what's rendered in our other cluster
settings (`SHOW ALL CLUSTER SETTINGS`), we default to a very pithy,
unwrapped description.

Release note: None
@irfansharif irfansharif force-pushed the 210112.test-tracing-registry branch 3 times, most recently from 6ee172d to bc7ba4d Compare January 19, 2021 13:08
With always-on tracing, we're maintaining an in-memory registry of active
spans (cockroachdb#58490). Spans are added to and removed from this registry when
they're `Start()`-ed and `Finish()`-ed. Spans that are not explicitly finished
(typically using `defer sp.Finish()`) are now a resource leak, as they
take up room in the registry. We'll want to find instances of this leak
as soon as they crop up[*].

To that end, we add a check in our `TestCluster.Stop` codepaths that
asserts that the registry is empty. This should give us wide coverage
given its usage throughout. We expect this change to capture the cases
described in cockroachdb#58721.

---

[*] For known issues like `txnState.resetForNewSQLTxn`, we'll
temporarily introduce a `WithBypassRegistry` option. Fixing that doesn't
look simple; it ties into questions around draining SQL and rolling back
open txns.

Release note: None
@irfansharif (Contributor, Author)

Given how difficult CI has been recently, I'm just going to pile on and optimistically bors it. The second commit is from #58897, so either that fails and this goes through, or vice versa, or the commits get de-duped away.

bors r+


craig bot commented Jan 19, 2021

Build succeeded:

@craig craig bot merged commit 47eb9f3 into cockroachdb:master Jan 19, 2021
@irfansharif irfansharif deleted the 210112.test-tracing-registry branch January 19, 2021 22:07