Issue3373 storage exit crash #3553

Merged

cangfengzhs
Contributor

@cangfengzhs cangfengzhs commented Dec 23, 2021

What type of PR is this?

  • bug
  • feature
  • enhancement

What does this PR do?

Use RCU to replace ThreadLocal in MetaClient.

Which issue(s)/PR(s) this PR relates to?

#3373
#3497

Special notes for your reviewer, ex. impact of this fix, etc:

At first, we found that storage would crash after running for a long time under a large number of concurrent insert edge operations. Storage's memory usage was also very high, so we suspected a memory leak that eventually caused an OOM. It later turned out that this was not the problem: even when storage does not hit an OOM, it crashes when it is stopped. Every coredump stack shows the destruction of a static thread-local variable at thread exit. The variable is of type folly::SingletonThreadLocal and was introduced in MetaClient.
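To make the timing described above concrete, here is a minimal sketch in plain standard C++ of when a thread-local object is destroyed. It does not use folly and does not reproduce the crash; `PerThreadCache`, `localCache`, `g_destroyed`, and `destroyedByOneWorker` are hypothetical names introduced only for illustration.

```cpp
#include <atomic>
#include <thread>

// Hypothetical counter so the effect is observable from outside the thread.
std::atomic<int> g_destroyed{0};

// Hypothetical stand-in for a per-thread cache object.
struct PerThreadCache {
  int hits = 0;
  ~PerThreadCache() { g_destroyed.fetch_add(1); }
};

// Each thread lazily constructs its own instance on first use.
PerThreadCache& localCache() {
  thread_local PerThreadCache cache;
  return cache;
}

// Starts a worker that touches its thread-local cache, waits for it to exit,
// and returns how many PerThreadCache destructors ran in the meantime.
int destroyedByOneWorker() {
  int before = g_destroyed.load();
  std::thread worker([] { localCache().hits++; });
  worker.join();  // the worker's thread-local object is destroyed as the thread exits
  return g_destroyed.load() - before;
}
```

The destructor runs on the exiting thread itself, so anything it touches must still be alive at that point. That is the general hazard class the coredump stacks point to, although the specific cause here was never identified.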

In another scenario, if compaction is triggered while storage is starting, storage crashes immediately, and the coredump stack is the same as in the stop case.

After a long investigation, we did not find the specific cause of this problem. We did find that it only appeared after folly::SingletonThreadLocal was introduced, so we chose to drop folly::SingletonThreadLocal and replace it with RCU.

With RCU, the crash no longer occurs in my tests. I am not sure whether it is really fixed, or whether the crash has only become less likely and I have not hit it.

In addition, RCU should perform better than ThreadLocal, because no read-write lock means no blocking.
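As a rough illustration of the read-copy-update idea, here is a minimal sketch that uses only the standard library rather than folly's RCU, so it is not the code in this PR. `MetaSnapshot` and `SnapshotHolder` are hypothetical names, and the sketch assumes a single writer thread. Note that `std::atomic_load` / `std::atomic_store` on `shared_ptr` may use an internal lock in some standard libraries (and are deprecated since C++20), so this shows the pattern, not the performance.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical, simplified stand-in for cached metadata.
struct MetaSnapshot {
  std::unordered_map<std::string, int> spaceIds;
};

class SnapshotHolder {
 public:
  SnapshotHolder() : current_(std::make_shared<const MetaSnapshot>()) {}

  // Reader side: grab the current immutable snapshot. The snapshot stays
  // valid for as long as the caller holds the returned pointer, even if a
  // newer one is published in the meantime.
  std::shared_ptr<const MetaSnapshot> read() const {
    return std::atomic_load(&current_);
  }

  // Writer side (assumed to be a single thread): copy the current snapshot,
  // modify the copy, then publish it with one atomic pointer swap.
  void addSpace(const std::string& name, int id) {
    std::shared_ptr<const MetaSnapshot> old = std::atomic_load(&current_);
    assert(old != nullptr);  // current_ is never null after construction
    auto next = std::make_shared<MetaSnapshot>(*old);  // copy
    next->spaceIds[name] = id;                         // update
    std::shared_ptr<const MetaSnapshot> published = std::move(next);
    std::atomic_store(&current_, published);           // publish
  }

 private:
  std::shared_ptr<const MetaSnapshot> current_;
};
```

Readers never see a half-updated snapshot, and the old snapshot is freed only after the last reader releases it. folly's RCU defers reclamation differently, but the copy, update, publish sequence is the same idea.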

Additional context/ Design document:

Checklist:

  • Documentation affected (Please add the label if documentation needs to be modified.)
  • Incompatibility (If it breaks the compatibility, please describe it and add the corresponding label.)
  • If it's needed to cherry-pick (If cherry-pick to some branches is required, please label the destination version(s).)
  • Performance impacted: Consumes more CPU/Memory

Release notes:

Please confirm whether to be reflected in release notes and how to describe:

@cangfengzhs cangfengzhs linked an issue Dec 23, 2021 that may be closed by this pull request
@cangfengzhs cangfengzhs force-pushed the issue3373-storage-exit-crash branch from 46dcfa5 to cbc618f Compare December 29, 2021 02:17
@cangfengzhs cangfengzhs added the ready-for-testing PR: ready for the CI test label Dec 29, 2021
@cangfengzhs cangfengzhs marked this pull request as ready for review December 29, 2021 02:21
@cangfengzhs cangfengzhs requested review from a team December 29, 2021 02:22
Contributor

@critical27 critical27 left a comment

Good job.

  1. I think localCacheLock_ is useless now, maybe we could remove it.
  2. move killedPlans_ and killedPlans_ to the same rcu

Contributor

@critical27 critical27 left a comment

Just for test

@Sophie-Xie Sophie-Xie added the cherry-pick-v3.0 PR: need cherry-pick to this version label Jan 4, 2022
@Sophie-Xie Sophie-Xie linked an issue Jan 4, 2022 that may be closed by this pull request
fix storage exit crash

format

address some comment
@cangfengzhs cangfengzhs force-pushed the issue3373-storage-exit-crash branch from bf32da3 to 7081904 Compare January 6, 2022 07:44
@cangfengzhs cangfengzhs requested review from Aiee and critical27 January 6, 2022 07:47
Contributor

@critical27 critical27 left a comment

A long story... Good job~~ LGTM

@Sophie-Xie Sophie-Xie removed the request for review from a team January 7, 2022 03:37
@Sophie-Xie Sophie-Xie removed request for yixinglu and Aiee January 7, 2022 03:37
@critical27 critical27 merged commit 32b5ce4 into vesoft-inc:master Jan 7, 2022
Sophie-Xie pushed a commit that referenced this pull request Jan 10, 2022
* use rcu replace thread local

fix storage exit crash

format

address some comment

* fix bug

* fix bug
critical27 added a commit that referenced this pull request Jan 10, 2022
* Fix typos (#3615)

Co-authored-by: kyle.cao <kyle.cao@vesoft.com>

* fix fetch edges tostring (#3613)

Co-authored-by: Sophie <84560950+Sophie-Xie@users.noreply.github.com>
Co-authored-by: Yichen Wang <18348405+Aiee@users.noreply.github.com>

* fix create space assign offline host (#3583)

* fix create space

* fix test case

Co-authored-by: Harris.Chu <1726587+HarrisChu@users.noreply.github.com>

* Disable ARM version docker image since related third party not ready (#3618)

* Unify raft error code (#3620)

* Meta upgrader v3 (#3540)

* Replace group when create space

* Support white list

* fix test case

* support zone operations

* fix

* Support meta upgrade v3

* add more check about parse host result (#3628)

* Ut fix (#3611)

* Enable ut and fix chaindelete

* Add mock server default worker

* fix service crash (#3616)

* Cleanup branch param in package script (#3622)

* fix crash when the expression exceed the depth (#3606)

* Enhance login password check (#3629)

* fix_batch_insert_problem (#3627)

* filter data before batch insert

* add test cases

* add more testcase

* add notifyStop() for metaClient (#3621)

* add notifyStop() for metaClient

* do clean

* Fix removeSession() (#3651)

Co-authored-by: Yee <2520865+yixinglu@users.noreply.github.com>

* Issue3373 storage exit crash (#3553)

* use rcu replace thread local

fix storage exit crash

format

address some comment

* fix bug

* fix bug

* Fix coalesce bug (#3653)

* fix coalesce

* fix test

* add test

* add tck

* fix

* fix

* fix

* delete double check agg in where clause (#3647)

Co-authored-by: Yee <2520865+yixinglu@users.noreply.github.com>
Co-authored-by: cpw <13495049+CPWstatic@users.noreply.github.com>

* fix meta crash after create space (#3660)

Co-authored-by: Yichen Wang <18348405+Aiee@users.noreply.github.com>

Co-authored-by: Yichen Wang <18348405+Aiee@users.noreply.github.com>
Co-authored-by: kyle.cao <kyle.cao@vesoft.com>
Co-authored-by: jimingquan <mingquan.ji@vesoft.com>
Co-authored-by: yaphet <4414314+darionyaphet@users.noreply.github.com>
Co-authored-by: Harris.Chu <1726587+HarrisChu@users.noreply.github.com>
Co-authored-by: Yee <2520865+yixinglu@users.noreply.github.com>
Co-authored-by: Doodle <13706157+critical27@users.noreply.github.com>
Co-authored-by: Alex Xing <90179377+SuperYoko@users.noreply.github.com>
Co-authored-by: endy.li <25311962+heroicNeZha@users.noreply.github.com>
Co-authored-by: lionel.liu@vesoft.com <52276794+liuyu85cn@users.noreply.github.com>
Co-authored-by: hs.zhang <22708345+cangfengzhs@users.noreply.github.com>
Co-authored-by: jakevin <30525741+jackwener@users.noreply.github.com>
Co-authored-by: cpw <13495049+CPWstatic@users.noreply.github.com>
yixinglu pushed a commit to yixinglu/nebula that referenced this pull request Mar 21, 2022
* use rcu replace thread local

fix storage exit crash

format

address some comment

* fix bug

* fix bug

fix bug

fix bug

Co-authored-by: hs.zhang <22708345+cangfengzhs@users.noreply.github.com>
Labels
cherry-pick-v3.0 PR: need cherry-pick to this version ready-for-testing PR: ready for the CI test
4 participants