Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kv: investigate kv95 performance after Feb 18 #62359

Closed
erikgrinaker opened this issue Mar 22, 2021 · 6 comments
Closed

kv: investigate kv95 performance after Feb 18 #62359

erikgrinaker opened this issue Mar 22, 2021 · 6 comments
Assignees
Labels
C-investigation Further steps needed to qualify. C-label will change. C-performance Perf of queries or internals. Solution not expected to change functional behavior. GA-blocker

Comments

@erikgrinaker
Copy link
Contributor

erikgrinaker commented Mar 22, 2021

kv95 performance dropped significantly on Feb 18, as seen here: https://roachperf.crdb.dev/?filter=&view=kv95%2Fenc%3Dfalse%2Fnodes%3D1%2Fcpu%3D32&tab=gce

The initial drop was investigated in #62148, and was due to tracing. This was solved by #61777 (verified by cherry-picking onto the culprit PR #59992), but performance did not fully recover, so there were other performance regressions in the period following Feb 18. We need to figure out what these are and get back to the pre-Feb 18 baseline.

@erikgrinaker erikgrinaker added C-investigation Further steps needed to qualify. C-label will change. C-performance Perf of queries or internals. Solution not expected to change functional behavior. GA-blocker labels Mar 22, 2021
@erikgrinaker
Copy link
Contributor Author

erikgrinaker commented Mar 23, 2021

I took some 30-second CPU profiles of kv95/enc=false/nodes=1/cpu=32 at about 1 minute in, spaced at ~1 week intervals (except 2020-02-19 since we had a particularly large drop then). Used the last commit before the date.

View with e.g. go tool pprof -http localhost:8000 <file> → View → Flame Graph.

A few clear offenders stand out to me:

  • 2021-02-19: tracing.Span.GetRecording (mostly gone again in next profile)
  • 2021-02-19: kvfollowerreadsccl.canSendToFollower (under DistSender)
  • 2021-03-15: colbuilder.NewColOperatortabledesc.NewBuilderprotoutil.Clone
  • 2021-03-15: colflow.batchInfoCollector.Next with kvfollowerreadsccl.canSendToFollower again

Except for tracing, these all remain in the final profile, and we should be able to get some significant gains by addressing them.

@jordanlewis
Copy link
Member

cc @postamar

@postamar
Copy link
Contributor

postamar commented Mar 23, 2021

Hi @erikgrinaker, I strongly suspect my offending change is responsible for a large part of the performance deterioration, to the point where I recommend rerunning the benchmarks once #62388 is merged and backported, which should be soon.

@angelapwen
Copy link

@postamar We came to the same conclusion ourselves a bit ago (thanks to @stevendanna). Thank you and @jordanlewis for looking out!

@erikgrinaker
Copy link
Contributor Author

#62465 and #62388 got us back to Feb 17-levels on kv95.

@jordanlewis
Copy link
Member

Awesome, great job everyone!!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-investigation Further steps needed to qualify. C-label will change. C-performance Perf of queries or internals. Solution not expected to change functional behavior. GA-blocker
Projects
None yet
Development

No branches or pull requests

4 participants