Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: very frequent near identical lease acquisition log messages #28947

Closed
tbg opened this issue Aug 22, 2018 · 12 comments
Closed

storage: very frequent near identical lease acquisition log messages #28947

tbg opened this issue Aug 22, 2018 · 12 comments
Assignees
Labels
A-kv-distribution Relating to rebalancing and leasing. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
Milestone

Comments

@tbg
Copy link
Member

tbg commented Aug 22, 2018

Seen in the logs while debugging #28918.

There are 19 lease proposal messages for n2/r604 in a single second. The messages all have an identical lease, but different proposed timestamps. It's unclear to me why that should happen, so we likely have something to fix. And even in the unlikely case in which this is expected, the log is extremely spammy and needs to be throttled.

180821 13:58:11.423098 137 storage/replica_proposal.go:214  [n3,s3,r604/2:/Table/54/{2/945/1…-3}] new range lease repl=(n3,s3):2 seq=5 start=1534855664.670639431,0 epo=1 pro=1534859890.693085948,0 following repl=(n3,s3):2 seq=5 start=1534855664.670639431,0 epo=1 pro=1534859888.543800373,0
180821 14:27:27.092828 120 storage/replica_proposal.go:214  [n2,s2,r604/3:/Table/54/{2/945/1…-3}] new range lease repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861646.829038018,0 following repl=(n3,s3):2 seq=5 start=1534855664.670639431,0 epo=1 pro=1534859890.693085948,0
180821 14:27:40.704952 177 storage/replica_proposal.go:214  [n2,s2,r604/3:/Table/54/{2/945/1…-3}] new range lease repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861660.689998132,0 following repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861646.829038018,0
180821 14:27:40.717496 177 storage/replica_proposal.go:214  [n2,s2,r604/3:/Table/54/{2/945/1…-3}] new range lease repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861660.705193324,0 following repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861660.689998132,0
180821 14:27:40.736353 126 storage/replica_proposal.go:214  [n2,s2,r604/3:/Table/54/{2/945/1…-3}] new range lease repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861660.717712497,0 following repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861660.705193324,0
180821 14:27:40.751478 123 storage/replica_proposal.go:214  [n2,s2,r604/3:/Table/54/{2/945/1…-3}] new range lease repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861660.736836520,0 following repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861660.717712497,0
180821 14:27:40.763590 173 storage/replica_proposal.go:214  [n2,s2,r604/3:/Table/54/{2/945/1…-3}] new range lease repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861660.751758824,0 following repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861660.736836520,0
180821 14:27:40.777126 168 storage/replica_proposal.go:214  [n2,s2,r604/3:/Table/54/{2/945/1…-3}] new range lease repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861660.763842466,0 following repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861660.751758824,0
180821 14:27:40.791095 167 storage/replica_proposal.go:214  [n2,s2,r604/3:/Table/54/{2/945/1…-3}] new range lease repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861660.777457277,0 following repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861660.763842466,0
180821 14:27:40.808234 171 storage/replica_proposal.go:214  [n2,s2,r604/3:/Table/54/{2/945/1…-3}] new range lease repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861660.791419487,0 following repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861660.777457277,0
180821 14:27:40.822601 168 storage/replica_proposal.go:214  [n2,s2,r604/3:/Table/54/{2/945/1…-3}] new range lease repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861660.808601097,0 following repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861660.791419487,0
180821 14:27:40.837601 122 storage/replica_proposal.go:214  [n2,s2,r604/3:/Table/54/{2/945/1…-3}] new range lease repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861660.822825197,0 following repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861660.808601097,0
180821 14:27:40.864301 129 storage/replica_proposal.go:214  [n2,s2,r604/3:/Table/54/{2/945/1…-3}] new range lease repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861660.837791590,0 following repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861660.822825197,0
180821 14:27:40.885089 126 storage/replica_proposal.go:214  [n2,s2,r604/3:/Table/54/{2/945/1…-3}] new range lease repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861660.865708675,0 following repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861660.837791590,0
180821 14:27:40.901847 129 storage/replica_proposal.go:214  [n2,s2,r604/3:/Table/54/{2/945/1…-3}] new range lease repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861660.885296935,0 following repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861660.865708675,0
180821 14:27:40.915856 173 storage/replica_proposal.go:214  [n2,s2,r604/3:/Table/54/{2/945/1…-3}] new range lease repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861660.902606623,0 following repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861660.885296935,0
180821 14:27:40.929760 173 storage/replica_proposal.go:214  [n2,s2,r604/3:/Table/54/{2/945/1…-3}] new range lease repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861660.916097139,0 following repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861660.902606623,0
180821 14:27:40.942607 122 storage/replica_proposal.go:214  [n2,s2,r604/3:/Table/54/{2/945/1…-3}] new range lease repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861660.930587554,0 following repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861660.916097139,0
180821 14:27:40.959644 128 storage/replica_proposal.go:214  [n2,s2,r604/3:/Table/54/{2/945/1…-3}] new range lease repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861660.942824492,0 following repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861660.930587554,0
180821 14:27:40.973122 178 storage/replica_proposal.go:214  [n2,s2,r604/3:/Table/54/{2/945/1…-3}] new range lease repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861660.960146739,0 following repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861660.942824492,0
180821 14:27:40.988157 178 storage/replica_proposal.go:214  [n2,s2,r604/3:/Table/54/{2/945/1…-3}] new range lease repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861660.974917324,0 following repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861660.960146739,0
180821 14:27:41.001183 179 storage/replica_proposal.go:214  [n2,s2,r604/3:/Table/54/{2/945/1…-3}] new range lease repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861660.988554372,0 following repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861660.974917324,0
180821 14:27:41.015964 179 storage/replica_proposal.go:214  [n2,s2,r604/3:/Table/54/{2/945/1…-3}] new range lease repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861661.001552995,0 following repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861660.988554372,0
180821 14:27:41.031030 178 storage/replica_proposal.go:214  [n2,s2,r604/3:/Table/54/{2/945/1…-3}] new range lease repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861661.016203856,0 following repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861661.001552995,0
180821 14:27:41.047547 126 storage/replica_proposal.go:214  [n2,s2,r604/3:/Table/54/{2/945/1…-3}] new range lease repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861661.031225361,0 following repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861661.016203856,0
180821 14:27:41.061015 128 storage/replica_proposal.go:214  [n2,s2,r604/3:/Table/54/{2/945/1…-3}] new range lease repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861661.047872992,0 following repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861661.031225361,0
180821 14:27:41.074587 179 storage/replica_proposal.go:214  [n2,s2,r604/3:/Table/54/{2/945/1…-3}] new range lease repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861661.061302724,0 following repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861661.047872992,0
180821 14:27:41.087465 174 storage/replica_proposal.go:214  [n2,s2,r604/3:/Table/54/{2/945/1…-3}] new range lease repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861661.074901102,0 following repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861661.061302724,0
180821 14:27:41.102171 177 storage/replica_proposal.go:214  [n2,s2,r604/3:/Table/54/{2/945/1…-3}] new range lease repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861661.088295754,0 following repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861661.074901102,0
180821 14:27:41.119045 177 storage/replica_proposal.go:214  [n2,s2,r604/3:/Table/54/{2/945/1…-3}] new range lease repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861661.102585015,0 following repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861661.088295754,0
180821 14:27:41.135279 128 storage/replica_proposal.go:214  [n2,s2,r604/3:/Table/54/{2/945/1…-3}] new range lease repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861661.119410020,0 following repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861661.102585015,0
180821 14:27:41.170818 122 storage/replica_proposal.go:214  [n2,s2,r604/3:/Table/54/{2/945/1…-3}] new range lease repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861661.135530480,0 following repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861661.119410020,0
180821 14:27:41.188962 128 storage/replica_proposal.go:214  [n2,s2,r604/3:/Table/54/{2/945/1…-3}] new range lease repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861661.171110550,0 following repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861661.135530480,0
180821 14:27:41.801610 176 storage/replica_proposal.go:214  [n2,s2,r604/3:/Table/54/{2/945/1…-3}] new range lease repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861661.189245834,0 following repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861661.171110550,0
180821 20:41:48.882340 165 storage/replica_proposal.go:214  [n3,s3,r604/2:/Table/54/{2/945/1…-3}] new range lease repl=(n3,s3):2 seq=7 start=1534884105.354893619,0 epo=4 pro=1534884105.354897712,0 following repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861661.189245834,0
@tbg tbg added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-kv-client Relating to the KV client and the KV interface. labels Aug 22, 2018
@tbg tbg added this to the 2.1 milestone Aug 22, 2018
@tbg tbg changed the title storage: rapid chain of ela storage: very frequent near identical lease acquisition log messages Aug 22, 2018
@tbg
Copy link
Member Author

tbg commented Aug 22, 2018

Something I want to point out is that you could hope to explain this via Raft reproposals, but those aren't Raft reproposals - the ProposedTimestamp increases. So someone (presumably n2, assuming there are no lease transfers involved) is requesting leases in a hot loop for a bit.

@tbg
Copy link
Member Author

tbg commented Aug 22, 2018

I guess one way this could happen is:

  1. many proposals clear this line and end up with an invalid lease: https://github.com/cockroachdb/cockroach/blob/master/pkg/storage/replica.go#L1389
  2. the proposals each request a lease, but spaced out over an interval large enough to not have them all join on the same lease request. Say we have 1000 requests and on average 10 would join into the same inflight lease proposal. Then you'd get 100 lease successive lease requests.

@tbg
Copy link
Member Author

tbg commented Aug 22, 2018

This still seems weird and I wonder if there's a different way. I did guess that perhaps there was a contended mutex below that line that could've held up requests, but as far as I can tell we don't acquire any mutexes other than r.mu, and that's above the line and held throughout.

Another possibility is that for some period of time, all lease renewals failed to the client but actually applied. But this would need similar "cohorting" in that method because once the first such proposal applies, any future requests see it, and large portions of that code are under the replica mu. Hmm, I'm still scratching my head.

@jordanlewis
Copy link
Member

This is really easy to reproduce. Just run TPCC and you'll get these.

@tbg
Copy link
Member Author

tbg commented Aug 22, 2018

Looking at the logs more, it seems that this isn't an isolated incident. It's happening for lots and lots of ranges, and during server startup. (The log starts a minute before the incident).

$ cat cockroach.jordan-tpcc-0003.jordan.2018-08-21T14_26_48Z.003613.log | grep 'new range lease' | grep -F '14:27:40' | grep -oE 'r[0-9]+\/' | sort | uniq -c
  10 r101/
   8 r103/
  18 r104/
   4 r105/
  10 r126/
  19 r127/
  18 r128/
  15 r2619/
  14 r2747/
  15 r2997/
   9 r3043/
  15 r3056/
  10 r329/
   9 r331/
  10 r385/
   9 r403/
  15 r410/
   8 r419/
  16 r463/
  12 r464/
  18 r480/
  17 r485/
   8 r531/
   8 r542/
  19 r604/
  11 r606/
  19 r63/
  10 r639/
  11 r691/
   8 r77/
  11 r78/
  10 r86/

and that's just a single busy second. The activity transcends much of this.

The reproposals are spaced across 10-30ms intervals, and somehow it seems that it must be driven by something about replica ticks. My first instinct is that this has to do with our cancelling of lease proposals in this code:

if p.command.MaxLeaseIndex == 0 {
// Commands without a MaxLeaseIndex cannot be reproposed, as they might
// apply twice. We also don't want to ask the proposer to retry these
// special commands.
r.cleanupFailedProposalLocked(p)
log.VEventf(p.ctx, 2, "refresh (reason: %s) returning AmbiguousResultError for command "+
"without MaxLeaseIndex: %v", reason, p.command)
p.finishApplication(proposalResult{Err: roachpb.NewError(
roachpb.NewAmbiguousResultError(
fmt.Sprintf("unknown status for command without MaxLeaseIndex "+
"at refreshProposalsLocked time (refresh reason: %s)", reason)))})

However, you expect that to be called every RaftElectionTimeoutTicks ticks which defaults to 15, and a tick is (by default) every 200ms for a total of every 3s.
If something called this much more frequently you'd get lots of cancelled lease requests, but you would still not expect to compound these lease requests for over a second.

@tbg
Copy link
Member Author

tbg commented Aug 22, 2018

@jordanlewis you expect a flurry of them at startup, but are you sure you get these duplicates? I assume you will as there's no way this is something special you're doing. What TPCC were you running? I know it's three nodes.

@jordanlewis
Copy link
Member

warehouses=1000, no other special settings. I'm not sure if they're duplicates, but there's lots of "flurries" of them that aren't during startup, but later.

@nstewart nstewart added the S-3-erroneous-edge-case Database produces or stores erroneous data without visible error/warning, in rare edge cases. label Sep 18, 2018
@tbg
Copy link
Member Author

tbg commented Sep 20, 2018

@nstewart I'm not sure why you added S-3-erroneous-edge-case here. Can you elaborate?

@tbg tbg removed the S-3-erroneous-edge-case Database produces or stores erroneous data without visible error/warning, in rare edge cases. label Sep 20, 2018
@tbg
Copy link
Member Author

tbg commented Sep 29, 2018

@nvanbenschoten could I put this on your plate? Basically just keep an eye out for the lease acquisition log spam when you run tpcc the next time (which I assume you do all the time). There's hopefully nothing to fix, but perhaps we want to cut down on the number of these messages.

@nvanbenschoten
Copy link
Member

Sure thing, although I probably won't have time to investigate in the next week or so. I'm hoping there's an easy win here.

@tbg tbg added A-coreperf and removed A-kv-client Relating to the KV client and the KV interface. labels Oct 1, 2018
@tbg
Copy link
Member Author

tbg commented Oct 1, 2018

Sure, this isn't the highest priority. My main concern here is improving the logging UX (well, and making sure there isn't something terrible going on with these leases).

@nvanbenschoten nvanbenschoten added A-kv-distribution Relating to rebalancing and leasing. and removed A-coreperf labels Oct 12, 2018
@nvanbenschoten
Copy link
Member

Resolved by #31644.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv-distribution Relating to rebalancing and leasing. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
Projects
None yet
Development

No branches or pull requests

4 participants