
Fix allocation race #6439

Merged: 1 commit into openzfs:master from issue-6414, Aug 8, 2017

Conversation

behlendorf (Contributor)

Description

When performing concurrent object allocations using the new
multi-threaded allocator and large dnodes it's possible to
allocate overlapping large dnodes.

This case should have been handled by detecting an error
returned by dnode_hold_impl(). But that logic only checked
the returned dnp was not-NULL, and the dnp variable was not
reset to NULL when retying.

Resolve the issue by properly checking the return value of
dnode_hold_impl(). The dmu_object_next() function was also
updated to skip to the next whole block rather than increment
by one on a dmu_object_info() error. This isn't strictly with
the first fix in place, but it makes it less likely to return
an object number when references the interior of a large dnode.
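
For orientation, here is a minimal sketch of the retry pattern being fixed, assuming the dnode_hold_impl() signature from the large-dnode code; the control flow and the object-advance step are illustrative, not the actual patch:

	/*
	 * Illustrative sketch only, not the actual patch.  The bug class:
	 * the caller tested dnp != NULL instead of the return value, and
	 * never reset dnp before retrying, so a stale pointer from a
	 * failed attempt could be mistaken for success.
	 */
	dnode_t *dn = NULL;
	int error;

	for (;;) {
		error = dnode_hold_impl(os, object, DNODE_MUST_BE_FREE,
		    dn_slots, FTAG, &dn);
		if (error == 0)
			break;		/* the slot range is genuinely free */

		dn = NULL;		/* reset before retrying */
		object += dn_slots;	/* hypothetical advance to next candidate */
	}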

Motivation and Context

Issue #6414

How Has This Been Tested?

  • Local testing in progress using a ZFS + Lustre configuration.
  • Buildbot.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

  • My code follows the ZFS on Linux code style requirements.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.
  • All commit messages are properly formatted and contain Signed-off-by.
  • Change has been approved by a ZFS on Linux member.

@nedbass (Contributor) left a comment

commit message:
s/retying/retrying/
s/strictly/strictly necessary/
s/when references/which references/

@behlendorf force-pushed the issue-6414 branch 2 times, most recently from e95b8b2 to 63fa50c (August 2, 2017 00:05)
@behlendorf (Contributor, Author)

Refreshed.

@behlendorf (Contributor, Author)

Refreshed. The optional dmu_object_next() hunk was dropped since it had the side effect of causing zdb to skip allocated objects when traversing them for dumping. That could have been special-cased, but it only made things more confusing, and this optional change isn't needed to address the core issue. Additionally, the new ASSERT was dropped as redundant.

@behlendorf (Contributor, Author)

Additional testing revealed that the original patch didn't address all possible races. I've updated the PR with a revised patch, which includes a test case that was able to reproduce the issue. Please review.

When performing concurrent object allocations using the new
multi-threaded allocator and large dnodes it's possible to
allocate overlapping large dnodes.

This case should have been handled by detecting an error
returned by dnode_hold_impl().  But that logic only checked
the returned dnp was not-NULL, and the dnp variable was not
reset to NULL when retrying.  Resolve this issue by properly
checking the return value of dnode_hold_impl().

Additionally, it was possible that dnode_hold_impl() would
misreport a dnode as free when it was in fact in use.  This
could occur for two reasons:

* The per-slot zrl_lock must be held over the entire critical
  section which includes the alloc/free until the new dnode
  is assigned to children_dnodes.  Additionally, all of the
  zrl_lock's in the range must be held.

* The dn->dn_ot_type cannot be solely relied upon to check
  the type.  When allocating a new dnode its type will be
  DMU_OT_NONE after dnode_create().  Only latter when
  dnode_allocate() is caller will it transition to the new
  type.  This means there's a window when allocating where
  it can be mistaken for a free dnode.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
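
To make the first bullet of the commit message above concrete, here is a hedged sketch of holding every per-slot zrl_lock across the candidate range; zrl_add()/zrl_remove() and the dnode_children_t/dnode_handle_t fields exist in the ZFS sources, but these helper functions are hypothetical:

	/*
	 * Hypothetical helpers illustrating the locking rule: take and
	 * hold the zrl_lock of every slot in [idx, idx + slots) for the
	 * whole critical section, releasing only after the new dnode is
	 * visible in children_dnodes.
	 */
	static void
	dnode_slots_hold_sketch(dnode_children_t *children, int idx, int slots)
	{
		for (int i = idx; i < idx + slots; i++)
			zrl_add(&children->dnc_children[i].dnh_zrlock);
	}

	static void
	dnode_slots_rele_sketch(dnode_children_t *children, int idx, int slots)
	{
		for (int i = idx; i < idx + slots; i++)
			zrl_remove(&children->dnc_children[i].dnh_zrlock);
	}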
@nedbass (Contributor) commented Aug 5, 2017

Nice catch. Changes look reasonable to me. Couple of commit message typos (latter, caller). I guess this also fixes a NULL pointer deref in dmu_tx_sa_registration_hold()?

@@ -61,6 +61,7 @@ dmu_object_alloc_dnsize(objset_t *os, dmu_object_type_t ot, int blocksize,
 	boolean_t restarted = B_FALSE;
 	uint64_t *cpuobj = NULL;
 	int dnodes_per_chunk = 1 << dmu_object_alloc_chunk_shift;
+	int error;

 	kpreempt_disable();
 	cpuobj = &os->os_obj_next_percpu[CPU_SEQID %
Contributor:
I'm puzzled by the 'per-cpu' usage pattern here. This pointer is dereferenced later with preemption enabled, allowing other threads to use the same value if scheduled on the same CPU. Moreover, the 'per-cpu' value is updated (atomically?) later, which can race with other threads (see #6470). Are these caveats handled somewhere/somehow?

behlendorf (Contributor, Author):

The multi-threaded allocator, which introduced the atomics, is only responsible for making a best effort so that concurrent allocations usually occur in different dbufs. This minimizes contention on the dbuf lock. As you noted, since preemption isn't disabled over the entire allocation, it is possible for two CPUs to try to allocate the same object, or, in the case of this PR, different but overlapping objects. The dnode_hold_impl() function is intended to authoritatively resolve these races safely when they occur.
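
In other words (a condensed sketch using the field names from the PR's diff, not the full function):

	/*
	 * The per-CPU slot is read under kpreempt_disable() only to pick
	 * a CPU-local starting hint; correctness does not depend on it.
	 * If two threads land on the same hint, dnode_hold_impl() with
	 * DNODE_MUST_BE_FREE fails for the loser, which retries with
	 * another object number.
	 */
	kpreempt_disable();
	cpuobj = &os->os_obj_next_percpu[CPU_SEQID %
	    os->os_obj_next_percpu_len];
	kpreempt_enable();
	object = *cpuobj;	/* possibly stale; a bad hint is harmless */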

Contributor:

Thanks for the clarification. Regardless, the atomic update could use atomic_cas_64() to update the variable only if the value has not been changed by another thread in the meantime. That might make things less fragile.
Performance-wise, the cache lines of os_obj_next_percpu will be bounced a lot under heavy multithreaded load; this could easily be resolved by adding sufficient padding around each per-CPU value.
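
A sketch of both suggestions: atomic_cas_64() is the standard SPL/illumos primitive, but the padded layout, the helper function, and the 64-byte line size are assumptions for illustration, not code from this PR:

	/*
	 * Assumed 64-byte cache lines: give each per-CPU hint its own
	 * line so updates on different CPUs stop bouncing the same line.
	 */
	typedef struct os_obj_percpu {
		uint64_t	op_next;
		char		op_pad[64 - sizeof (uint64_t)];
	} os_obj_percpu_t;

	/*
	 * Hypothetical helper: advance the hint only if no other thread
	 * moved it in the meantime.  A lost race is fine, since it just
	 * means another thread already published a fresher hint.
	 */
	static void
	obj_hint_update(os_obj_percpu_t *pc, uint64_t object, int dn_slots)
	{
		uint64_t old = pc->op_next;

		(void) atomic_cas_64(&pc->op_next, old, object + dn_slots);
	}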

behlendorf (Contributor, Author):

Yes, there's definitely room for additional optimization here, and it should be tackled.

 	ot = dn_block[i].dn_type;
-	skip = dn_block[i].dn_extra_slots + 1;
+	if (dn_block[i].dn_type != DMU_OT_NONE)
+		skip = dn_block[i].dn_extra_slots + 1;
Contributor:

Is it necessary for dnode_destroy() to zero dn_type and dn_extra_slots? Otherwise, holes created by object destruction could leave nonzero values behind, causing skip to take an incorrect value when children[] is out of sync with the dbuf and dnh_dnode == NULL.

@ofaaland (Contributor) commented Aug 7, 2017:

@behlendorf pointed to dnode_sync_free(). It does indeed bzero() all the freed dnode slots.
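
For reference, the guarantee being pointed to amounts to something like the following (a paraphrase of the dnode_sync_free() behavior, not its actual body; dn_num_slots is the large-dnode slot count):

	/*
	 * Zero every on-disk slot the dnode occupied, so later scans of
	 * freed holes see dn_type == DMU_OT_NONE and dn_extra_slots == 0.
	 */
	bzero(dn->dn_phys, sizeof (dnode_phys_t) * dn->dn_num_slots);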

@behlendorf merged commit 9631681 into openzfs:master Aug 8, 2017
tonyhutter pushed a commit that referenced this pull request Aug 8, 2017
When performing concurrent object allocations using the new
multi-threaded allocator and large dnodes it's possible to
allocate overlapping large dnodes.

This case should have been handled by detecting an error
returned by dnode_hold_impl().  But that logic only checked
the returned dnp was not-NULL, and the dnp variable was not
reset to NULL when retrying.  Resolve this issue by properly
checking the return value of dnode_hold_impl().

Additionally, it was possible that dnode_hold_impl() would
misreport a dnode as free when it was in fact in use.  This
could occur for two reasons:

* The per-slot zrl_lock must be held over the entire critical
  section which includes the alloc/free until the new dnode
  is assigned to children_dnodes.  Additionally, all of the
  zrl_lock's in the range must be held to protect moving
  dnodes.

* The dn->dn_ot_type cannot be solely relied upon to check
  the type.  When allocating a new dnode its type will be
  DMU_OT_NONE after dnode_create().  Only later, when
  dnode_allocate() is called, will it transition to the new
  type.  This means there's a window during allocation where
  it can be mistaken for a free dnode.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6414
Closes #6439
Fabian-Gruenbichler pushed a commit to Fabian-Gruenbichler/zfs that referenced this pull request Sep 6, 2017, with the same commit message as above (Closes openzfs#6414, Closes openzfs#6439).
@jgottula mentioned this pull request Oct 6, 2017
FransUrbo pushed a commit to FransUrbo/zfs that referenced this pull request Apr 28, 2019, with the same commit message as above (Closes openzfs#6414, Closes openzfs#6439).
@behlendorf deleted the issue-6414 branch April 19, 2021 19:32