
Fix allocation race #6439

Merged: 1 commit into openzfs:master from issue-6414, Aug 8, 2017

Conversation

behlendorf (Contributor)

Description

When performing concurrent object allocations using the new
multi-threaded allocator and large dnodes it's possible to
allocate overlapping large dnodes.

This case should have been handled by detecting an error
returned by dnode_hold_impl(). But that logic only checked
the returned dnp was not-NULL, and the dnp variable was not
reset to NULL when retying.

Resolve the issue by properly checking the return value of
dnode_hold_impl(). The dmu_object_next() function was also
updated to skip to the next whole block rather than increment
by one on a dmu_object_info() error. This isn't strictly with
the first fix in place, but it makes it less likely to return
an object number when references the interior of a large dnode.
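
For orientation, here is a minimal sketch of the retry pattern being fixed, assuming the dnode_hold_impl() signature from the large-dnode code; the control flow and the object-advance step are illustrative, not the actual patch:

	/*
	 * Illustrative sketch only, not the actual patch.  The bug class:
	 * the caller tested dnp != NULL instead of the return value, and
	 * never reset dnp before retrying, so a stale pointer from a
	 * failed attempt could be mistaken for success.
	 */
	dnode_t *dn = NULL;
	int error;

	for (;;) {
		error = dnode_hold_impl(os, object, DNODE_MUST_BE_FREE,
		    dn_slots, FTAG, &dn);
		if (error == 0)
			break;		/* the slot range is genuinely free */

		dn = NULL;		/* reset before retrying */
		object += dn_slots;	/* hypothetical advance to next candidate */
	}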

Motivation and Context

Issue #6414

How Has This Been Tested?

  • Local testing in progress using a ZFS + Lustre configuration.
  • Buildbot.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

  • My code follows the ZFS on Linux code style requirements.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.
  • All commit messages are properly formatted and contain Signed-off-by.
  • Change has been approved by a ZFS on Linux member.

@nedbass (Contributor) left a comment

commit message:
s/retying/retrying/
s/strictly/strictly necessary/
s/when references/which references/

@behlendorf force-pushed the issue-6414 branch 2 times, most recently from e95b8b2 to 63fa50c (August 2, 2017 00:05)
@behlendorf (Contributor, Author)

Refreshed.

@behlendorf (Contributor, Author)

Refreshed. The optional dmu_object_next() hunk was dropped since it had the side effect of causing zdb to skip allocated objects when traversing them for dumping. That could have been special-cased, but it only made things more confusing, and this optional change isn't needed to address the core issue. Additionally, the new ASSERT was dropped as redundant.

@behlendorf (Contributor, Author)

Additional testing revealed that the original patch didn't address all possible races. I've updated the PR with a revised patch, which includes a test case that was able to reproduce the issue. Please review.

When performing concurrent object allocations using the new
multi-threaded allocator and large dnodes it's possible to
allocate overlapping large dnodes.

This case should have been handled by detecting an error
returned by dnode_hold_impl().  But that logic only checked
the returned dnp was not-NULL, and the dnp variable was not
reset to NULL when retrying.  Resolve this issue by properly
checking the return value of dnode_hold_impl().

Additionally, it was possible that dnode_hold_impl() would
misreport a dnode as free when it was in fact in use.  This
could occur for two reasons:

* The per-slot zrl_lock must be held over the entire critical
  section which includes the alloc/free until the new dnode
  is assigned to children_dnodes.  Additionally, all of the
  zrl_lock's in the range must be held.

* The dn->dn_ot_type cannot be solely relied upon to check
  the type.  When allocating a new dnode its type will be
  DMU_OT_NONE after dnode_create().  Only latter when
  dnode_allocate() is caller will it transition to the new
  type.  This means there's a window when allocating where
  it can be mistaken for a free dnode.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
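
To make the first bullet of the commit message above concrete, here is a hedged sketch of holding every per-slot zrl_lock across the candidate range; zrl_add()/zrl_remove() and the dnode_children_t/dnode_handle_t fields exist in the ZFS sources, but these helper functions are hypothetical:

	/*
	 * Hypothetical helpers illustrating the locking rule: take and
	 * hold the zrl_lock of every slot in [idx, idx + slots) for the
	 * whole critical section, releasing only after the new dnode is
	 * visible in children_dnodes.
	 */
	static void
	dnode_slots_hold_sketch(dnode_children_t *children, int idx, int slots)
	{
		for (int i = idx; i < idx + slots; i++)
			zrl_add(&children->dnc_children[i].dnh_zrlock);
	}

	static void
	dnode_slots_rele_sketch(dnode_children_t *children, int idx, int slots)
	{
		for (int i = idx; i < idx + slots; i++)
			zrl_remove(&children->dnc_children[i].dnh_zrlock);
	}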
@nedbass (Contributor) commented Aug 5, 2017

Nice catch. Changes look reasonable to me. Couple of commit message typos (latter, caller). I guess this also fixes a NULL pointer deref in dmu_tx_sa_registration_hold()?

@@ -61,6 +61,7 @@ dmu_object_alloc_dnsize(objset_t *os, dmu_object_type_t ot, int blocksize,
 	boolean_t restarted = B_FALSE;
 	uint64_t *cpuobj = NULL;
 	int dnodes_per_chunk = 1 << dmu_object_alloc_chunk_shift;
+	int error;

 	kpreempt_disable();
 	cpuobj = &os->os_obj_next_percpu[CPU_SEQID %
Contributor:
I'm puzzled by the 'per-cpu' usage pattern here. This pointer is dereferenced later with preemption enabled, allowing other threads to use the same value if scheduled on the same CPU. Moreover, the 'per-cpu' value is updated (atomically?) later, which can race with other threads (see #6470). Are these caveats handled somewhere/somehow?

behlendorf (Contributor, Author):

The multi-threaded allocator, which introduced the atomics, is only responsible for making a best effort so that concurrent allocations usually occur in different dbufs. This minimizes contention on the dbuf lock. As you noted, since preemption isn't disabled over the entire allocation, it is possible for two CPUs to try to allocate the same object, or, in the case of this PR, different but overlapping objects. The dnode_hold_impl() function is intended to authoritatively resolve these races safely when they occur.
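
In other words (a condensed sketch using the field names from the PR's diff, not the full function):

	/*
	 * The per-CPU slot is read under kpreempt_disable() only to pick
	 * a CPU-local starting hint; correctness does not depend on it.
	 * If two threads land on the same hint, dnode_hold_impl() with
	 * DNODE_MUST_BE_FREE fails for the loser, which retries with
	 * another object number.
	 */
	kpreempt_disable();
	cpuobj = &os->os_obj_next_percpu[CPU_SEQID %
	    os->os_obj_next_percpu_len];
	kpreempt_enable();
	object = *cpuobj;	/* possibly stale; a bad hint is harmless */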

Contributor:

Thanks for the clarification. Regardless, the atomic update could use atomic_cas_64() to update the variable only if the value has not been changed by another thread in the meantime. That might make things less fragile.
Performance-wise, the cache lines of os_obj_next_percpu will be bounced a lot under heavy multithreaded load; this could easily be resolved by adding sufficient padding around each per-CPU value.
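
A sketch of both suggestions: atomic_cas_64() is the standard SPL/illumos primitive, but the padded layout, the helper function, and the 64-byte line size are assumptions for illustration, not code from this PR:

	/*
	 * Assumed 64-byte cache lines: give each per-CPU hint its own
	 * line so updates on different CPUs stop bouncing the same line.
	 */
	typedef struct os_obj_percpu {
		uint64_t	op_next;
		char		op_pad[64 - sizeof (uint64_t)];
	} os_obj_percpu_t;

	/*
	 * Hypothetical helper: advance the hint only if no other thread
	 * moved it in the meantime.  A lost race is fine, since it just
	 * means another thread already published a fresher hint.
	 */
	static void
	obj_hint_update(os_obj_percpu_t *pc, uint64_t object, int dn_slots)
	{
		uint64_t old = pc->op_next;

		(void) atomic_cas_64(&pc->op_next, old, object + dn_slots);
	}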

behlendorf (Contributor, Author):

Yes, there's definitely room for additional optimization here, and it should be tackled.

 	ot = dn_block[i].dn_type;
-	skip = dn_block[i].dn_extra_slots + 1;
+	if (dn_block[i].dn_type != DMU_OT_NONE)
+		skip = dn_block[i].dn_extra_slots + 1;
Contributor:

Is it necessary for dnode_destroy() to zero dn_type and dn_extra_slots? Otherwise, holes created by object destruction could leave nonzero values behind, causing skip to take an incorrect value when children[] is out of sync with the dbuf and dnh_dnode == NULL.

@ofaaland (Contributor) commented Aug 7, 2017:

@behlendorf pointed to dnode_sync_free(). It does indeed bzero() all the freed dnode slots.
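
For reference, the guarantee being pointed to amounts to something like the following (a paraphrase of the dnode_sync_free() behavior, not its actual body; dn_num_slots is the large-dnode slot count):

	/*
	 * Zero every on-disk slot the dnode occupied, so later scans of
	 * freed holes see dn_type == DMU_OT_NONE and dn_extra_slots == 0.
	 */
	bzero(dn->dn_phys, sizeof (dnode_phys_t) * dn->dn_num_slots);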

@behlendorf merged commit 9631681 into openzfs:master Aug 8, 2017
tonyhutter pushed a commit that referenced this pull request Aug 8, 2017
When performing concurrent object allocations using the new
multi-threaded allocator and large dnodes it's possible to
allocate overlapping large dnodes.

This case should have been handled by detecting an error
returned by dnode_hold_impl().  But that logic only checked
the returned dnp was not-NULL, and the dnp variable was not
reset to NULL when retrying.  Resolve this issue by properly
checking the return value of dnode_hold_impl().

Additionally, it was possible that dnode_hold_impl() would
misreport a dnode as free when it was in fact in use.  This
could occur for two reasons:

* The per-slot zrl_lock must be held over the entire critical
  section which includes the alloc/free until the new dnode
  is assigned to children_dnodes.  Additionally, all of the
  zrl_lock's in the range must be held to protect moving
  dnodes.

* The dn->dn_ot_type cannot be solely relied upon to check
  the type.  When allocating a new dnode its type will be
  DMU_OT_NONE after dnode_create().  Only later, when
  dnode_allocate() is called, will it transition to the new
  type.  This means there's a window during allocation where
  it can be mistaken for a free dnode.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6414
Closes #6439
Fabian-Gruenbichler pushed a commit to Fabian-Gruenbichler/zfs that referenced this pull request Sep 6, 2017, with the same commit message as above (Closes openzfs#6414, Closes openzfs#6439).
@jgottula mentioned this pull request Oct 6, 2017
FransUrbo pushed a commit to FransUrbo/zfs that referenced this pull request Apr 28, 2019, with the same commit message as above (Closes openzfs#6414, Closes openzfs#6439).
@behlendorf deleted the issue-6414 branch April 19, 2021 19:32