Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix memory leak for ColumnFamily drop with live iterator #7749

Closed
wants to merge 2 commits into from

Conversation

pdillinger
Copy link
Contributor

Summary: Uncommon bug seen by ASAN with
ColumnFamilyTest.LiveIteratorWithDroppedColumnFamily, if the last two
references to a ColumnFamilyData are both SuperVersions (during
InstallSuperVersion). The fix is to use UnrefAndTryDelete even in
SuperVersion::Cleanup but with a parameter to avoid re-entering Cleanup
on the same SuperVersion being cleaned up.

ColumnFamilyData::Unref is considered unsafe so removed.

Test Plan: ./column_family_test --gtest_filter=LiveIter --gtest_repeat=100

Summary: Uncommon bug seen by ASAN with
ColumnFamilyTest.LiveIteratorWithDroppedColumnFamily, if the last two
references to a ColumnFamilyData are both SuperVersions (during
InstallSuperVersion). The fix is to use UnrefAndTryDelete even in
SuperVersion::Cleanup but with a parameter to avoid re-entering Cleanup
on the same SuperVersion being cleaned up.

Test Plan: ./column_family_test --gtest_filter=*LiveIter* --gtest_repeat=100
Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pdillinger has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@pdillinger has updated the pull request. You must reimport the pull request before landing.

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pdillinger has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@colin-ife-snyk
Copy link

Hi @pdillinger 👋
Getting in touch on behalf of Snyk. Could this memory leak represent a security risk? I.e. could a malicious user practically exploit it and trigger a DoS on the victim server?

@pdillinger
Copy link
Contributor Author

Hi @pdillinger 👋
Getting in touch on behalf of Snyk. Could this memory leak represent a security risk? I.e. could a malicious user practically exploit it and trigger a DoS on the victim server?

Dropping a column family is not normally done in response to service user data, but for maintenance or internal data migrations.

if (cfd->Unref()) {
delete cfd;
}
cfd->UnrefAndTryDelete(this);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a question: when does ColumnFamilyData have different SuperVersion?
I thought they're always point to each other for the same CF.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From new comment

// (But while installing a new SuperVersion, this
// cfd could be referenced only by two SuperVersions.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, I see, one is the new SV, another one is the old SV.

Comment on lines +671 to +675
// If called under SuperVersion::Cleanup, we should not re-enter Cleanup on
// the same SuperVersion. (But while installing a new SuperVersion, this
// cfd could be referenced only by two SuperVersions.)
if (old_refs == 2 && super_version_ != nullptr &&
super_version_ != sv_under_cleanup) {
Copy link
Contributor

@jay-zhuang jay-zhuang Dec 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If sv_under_cleanup is only used by SuperVersion::Cleanup(), should this logic also move to there? I feel having a SuperVersion a parameter for UnrefAndTryDelete() is a little bit confusing.

Copy link
Contributor Author

@pdillinger pdillinger Dec 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When sv_under_cleanup is non-null, we might exclude this code. We might need the code below from any caller. If you have a specific proposal to simplify, let me know. This code is not as simple or clear as I would prefer.

Copy link
Contributor

@jay-zhuang jay-zhuang Dec 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My suggestion would be changing SuperVersion::Cleanup() to like:

  if (cfd->GetSuperVersion() == this) {
    if (cfd->Unref()) {
      delete cfd;
    }
  } else {
    cfd->UnrefAndTryDelete();
  }

At least, it won't make UnrefAndTryDelete more complicate and checking the SuperVersion within SuperVersion::cleanup.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We would either need to keep the Unref() function around, which I fear invites further bugs, or make SuperVersion a friend of ColumnFamilyData. Making a friend suggests we would be violating the natural encapsulation for ColumnFamilyData.

Copy link
Contributor

@jay-zhuang jay-zhuang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@facebook-github-bot
Copy link
Contributor

@pdillinger merged this pull request in b1ee191.

codingrhythm pushed a commit to SafetyCulture/rocksdb that referenced this pull request Mar 5, 2021
Summary:
Uncommon bug seen by ASAN with
ColumnFamilyTest.LiveIteratorWithDroppedColumnFamily, if the last two
references to a ColumnFamilyData are both SuperVersions (during
InstallSuperVersion). The fix is to use UnrefAndTryDelete even in
SuperVersion::Cleanup but with a parameter to avoid re-entering Cleanup
on the same SuperVersion being cleaned up.

ColumnFamilyData::Unref is considered unsafe so removed.

Pull Request resolved: facebook#7749

Test Plan: ./column_family_test --gtest_filter=*LiveIter* --gtest_repeat=100

Reviewed By: jay-zhuang

Differential Revision: D25354304

Pulled By: pdillinger

fbshipit-source-id: e78f3a3f67c40013b8432f31d0da8bec55c5321c
facebook-github-bot pushed a commit that referenced this pull request Aug 3, 2021
Summary:
The `ColumnFamilyData::UnrefAndTryDelete` code currently on the trunk
unlocks the DB mutex before destroying the `ThreadLocalPtr` holding
the per-thread `SuperVersion` pointers when the only remaining reference
is the back reference from `super_version_`. The idea behind this was to
break the circular dependency between `ColumnFamilyData` and `SuperVersion`:
when the penultimate reference goes away, `ColumnFamilyData` can clean up
the `SuperVersion`, which can in turn clean up `ColumnFamilyData`. (Assuming there
is a `SuperVersion` and it is not referenced by anything else.) However,
unlocking the mutex throws a wrench in this plan by making it possible for another thread
to jump in and take another reference to the `ColumnFamilyData`, keeping the
object alive in a zombie `ThreadLocalPtr`-less state. This can cause issues like
#8440 ,
#8382 ,
and might also explain the `was_last_ref` assertion failures from the `ColumnFamilySet`
destructor we sometimes observe during close in our stress tests.

Digging through the archives, this unlocking goes way back to 2014 (or earlier). The original
rationale was that `SuperVersionUnrefHandle` used to lock the mutex so it can call
`SuperVersion::Cleanup`; however, this logic turned out to be deadlock-prone.
#3510 fixed the deadlock but left the
unlocking in place. #6147 then introduced
the circular dependency and associated cleanup logic described above (in order
to enable iterators to keep the `ColumnFamilyData` for dropped column families alive),
and moved the unlocking-relocking snippet to its present location in `UnrefAndTryDelete`.
Finally, #7749 fixed a memory leak but
apparently exacerbated the race by (otherwise correctly) switching to `UnrefAndTryDelete`
in `SuperVersion::Cleanup`.

The patch simply eliminates the unlocking and relocking, which has been unnecessary
ever since #3510 made `SuperVersionUnrefHandle` lock-free.
This closes the window during which another thread could increase the reference count,
and hopefully fixes the issues above.

Pull Request resolved: #8605

Test Plan: Ran `make check` and stress tests locally.

Reviewed By: pdillinger

Differential Revision: D30051035

Pulled By: ltamasi

fbshipit-source-id: 8fe559e4b4ad69fc142579f8bc393ef525918528
ltamasi added a commit that referenced this pull request Aug 4, 2021
Summary:
The `ColumnFamilyData::UnrefAndTryDelete` code currently on the trunk
unlocks the DB mutex before destroying the `ThreadLocalPtr` holding
the per-thread `SuperVersion` pointers when the only remaining reference
is the back reference from `super_version_`. The idea behind this was to
break the circular dependency between `ColumnFamilyData` and `SuperVersion`:
when the penultimate reference goes away, `ColumnFamilyData` can clean up
the `SuperVersion`, which can in turn clean up `ColumnFamilyData`. (Assuming there
is a `SuperVersion` and it is not referenced by anything else.) However,
unlocking the mutex throws a wrench in this plan by making it possible for another thread
to jump in and take another reference to the `ColumnFamilyData`, keeping the
object alive in a zombie `ThreadLocalPtr`-less state. This can cause issues like
#8440 ,
#8382 ,
and might also explain the `was_last_ref` assertion failures from the `ColumnFamilySet`
destructor we sometimes observe during close in our stress tests.

Digging through the archives, this unlocking goes way back to 2014 (or earlier). The original
rationale was that `SuperVersionUnrefHandle` used to lock the mutex so it can call
`SuperVersion::Cleanup`; however, this logic turned out to be deadlock-prone.
#3510 fixed the deadlock but left the
unlocking in place. #6147 then introduced
the circular dependency and associated cleanup logic described above (in order
to enable iterators to keep the `ColumnFamilyData` for dropped column families alive),
and moved the unlocking-relocking snippet to its present location in `UnrefAndTryDelete`.
Finally, #7749 fixed a memory leak but
apparently exacerbated the race by (otherwise correctly) switching to `UnrefAndTryDelete`
in `SuperVersion::Cleanup`.

The patch simply eliminates the unlocking and relocking, which has been unnecessary
ever since #3510 made `SuperVersionUnrefHandle` lock-free.
This closes the window during which another thread could increase the reference count,
and hopefully fixes the issues above.

Pull Request resolved: #8605

Test Plan: Ran `make check` and stress tests locally.

Reviewed By: pdillinger

Differential Revision: D30051035

Pulled By: ltamasi

fbshipit-source-id: 8fe559e4b4ad69fc142579f8bc393ef525918528
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants