Fix allocation race #6439
Conversation
commit message:
s/retying/retrying/
s/strictly/strictly necessary/
s/when references/which references/
e95b8b2 to 63fa50c
Refreshed.
Refreshed. The optional
Additional testing revealed that the original patch didn't address all possible races. I've updated the PR with a revised patch that includes a test case able to reproduce the issue. Please review.
When performing concurrent object allocations using the new multi-threaded allocator and large dnodes it's possible to allocate overlapping large dnodes.

This case should have been handled by detecting an error returned by dnode_hold_impl(). But that logic only checked that the returned dnp was not NULL, and the dnp variable was not reset to NULL when retrying. Resolve this issue by properly checking the return value of dnode_hold_impl().

Additionally, it was possible that dnode_hold_impl() would misreport a dnode as free when it was in fact in use. This could occur for two reasons:

* The per-slot zrl_lock must be held over the entire critical section, which includes the alloc/free until the new dnode is assigned to children_dnodes. Additionally, all of the zrl_locks in the range must be held.

* The dn->dn_type cannot be solely relied upon to check the type. When allocating a new dnode its type will be DMU_OT_NONE after dnode_create(). Only later, when dnode_allocate() is called, will it transition to the new type. This means there's a window during allocation where it can be mistaken for a free dnode.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
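For illustration only, here is a minimal sketch of the retry pattern described above. The helper name and the loop structure are hypothetical, and the dnode_hold_impl() signature with a slots argument is assumed; only the "check the return value and reset dnp before retrying" idea reflects the patch.

```c
/*
 * Hypothetical sketch, not the actual dmu_object_alloc_dnsize() code:
 * trust the return value of dnode_hold_impl() and clear the out pointer
 * before every attempt, so a stale non-NULL dnp from a previous
 * iteration cannot be mistaken for a successful hold.
 */
static int
try_allocate_object(objset_t *os, uint64_t *object, int slots, dnode_t **dnp)
{
	int error;

	for (;;) {
		*dnp = NULL;	/* reset before retrying */
		error = dnode_hold_impl(os, *object, DNODE_MUST_BE_FREE,
		    slots, FTAG, dnp);
		if (error == 0)
			return (0);	/* the requested slots are ours */

		/* Another thread won the race; advance and retry. */
		*object += slots;
	}
}
```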
Nice catch. Changes look reasonable to me. Couple of commit message typos (latter, caller). I guess this also fixes a NULL pointer deref in
```
@@ -61,6 +61,7 @@ dmu_object_alloc_dnsize(objset_t *os, dmu_object_type_t ot, int blocksize,
	boolean_t restarted = B_FALSE;
	uint64_t *cpuobj = NULL;
	int dnodes_per_chunk = 1 << dmu_object_alloc_chunk_shift;
	int error;

	kpreempt_disable();
	cpuobj = &os->os_obj_next_percpu[CPU_SEQID %
```
I'm puzzled by this 'per-cpu' usage pattern here. This pointer is dereferenced later on with preemption enabled, letting other threads use the same value if scheduled on the same CPU. Moreover, the 'per-cpu' value is updated (atomically?) later, which can race with other threads (see #6470). Are these caveats handled somewhere/somehow?
The multi-threaded allocator, which introduced the atomics, is only responsible for making a best effort so that concurrent allocations usually occur in different dbufs. This minimizes contention on the dbuf lock. As you noted, since preemption isn't disabled over the entire allocation it is possible two CPUs could try to allocate the same object, or in the case of this PR, different but overlapping objects. The dnode_hold_impl() function is intended to authoritatively resolve these races safely when they occur.
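To make the "best effort" idea concrete, here is a rough sketch of the per-CPU cursor pattern (the helper and the ncursors parameter are hypothetical); preemption is held only while choosing the cursor, so the collision described above can still happen and dnode_hold_impl() remains the arbiter.

```c
/*
 * Rough sketch, not the literal code in this PR: pick a per-CPU object
 * cursor as a starting hint.  Because preemption is re-enabled before
 * the cursor is used, two threads can still start from the same value;
 * that collision is later detected and resolved by dnode_hold_impl().
 */
static uint64_t
pick_starting_object(objset_t *os, int ncursors)
{
	uint64_t *cpuobj;

	kpreempt_disable();
	cpuobj = &os->os_obj_next_percpu[CPU_SEQID % ncursors];
	kpreempt_enable();

	/* A hint, not a reservation. */
	return (*cpuobj);
}
```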
Thanks for the clarification. Regardless, the atomic update could use atomic_cas_64() to only update the variable if the value has not been changed by another thread in the meantime. This might make things less fragile.

Performance-wise, cache lines of os_obj_next_percpu will be bounced a lot under heavy multithreaded load. This could be easily resolved by adding sufficient padding for each per-cpu value.
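A rough sketch of the two suggestions, using hypothetical names (neither change is part of this PR): a cache-line-padded per-CPU slot to limit false sharing, and a compare-and-swap update so a concurrent writer's value is not silently overwritten.

```c
/*
 * Illustrative only -- not code from this PR.  Assumes 64-byte cache
 * lines and the atomic_cas_64() interface provided by the SPL.
 */
typedef struct obj_cursor {
	volatile uint64_t oc_next;		/* next-object hint */
	char oc_pad[64 - sizeof (uint64_t)];	/* pad to a full cache line */
} obj_cursor_t;

static void
obj_cursor_advance(obj_cursor_t *oc, uint64_t old_obj, uint64_t new_obj)
{
	/*
	 * Install new_obj only if no other thread has already moved the
	 * cursor; if the CAS fails, keep the other thread's value.
	 */
	(void) atomic_cas_64(&oc->oc_next, old_obj, new_obj);
}
```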
Yes, there's definitely room for additional optimization here, which should be tackled.
```
	ot = dn_block[i].dn_type;
	skip = dn_block[i].dn_extra_slots + 1;
	if (dn_block[i].dn_type != DMU_OT_NONE)
		skip = dn_block[i].dn_extra_slots + 1;
```
Is it necessary for dnode_destroy() to zero dn_type and dn_extra_slots so that holes created by object destruction do not leave nonzero values that result in skip getting an incorrect value when children[] is out of sync with the dbuf and dnh_dnode == NULL?
@behlendorf pointed to dnode_sync_free(). It does indeed bzero() all the freed dnode slots.
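As an illustration of why that zeroing matters, here is a hypothetical helper (not the literal dnode_hold_impl() code) that walks a block of on-disk dnodes the same way the hunk above does; if freed slots were not bzero()ed, a stale dn_extra_slots would make skip jump over slots that are actually free.

```c
/*
 * Hypothetical example: count the slots consumed by allocated dnodes in
 * one dnode block.  Freed slots are zeroed by dnode_sync_free(), so a
 * DMU_OT_NONE slot can safely be stepped over one slot at a time.
 */
static int
dnode_block_used_slots(const dnode_phys_t *dn_block, int dnodes_per_block)
{
	int i, skip, used = 0;

	for (i = 0; i < dnodes_per_block; i += skip) {
		if (dn_block[i].dn_type != DMU_OT_NONE) {
			/* This dnode plus any extra slots it occupies. */
			skip = dn_block[i].dn_extra_slots + 1;
			used += skip;
		} else {
			/* Freed (zeroed) slot: advance one slot. */
			skip = 1;
		}
	}
	return (used);
}
```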
When performing concurrent object allocations using the new multi-threaded allocator and large dnodes it's possible to allocate overlapping large dnodes.

This case should have been handled by detecting an error returned by dnode_hold_impl(). But that logic only checked that the returned dnp was not NULL, and the dnp variable was not reset to NULL when retrying. Resolve this issue by properly checking the return value of dnode_hold_impl().

Additionally, it was possible that dnode_hold_impl() would misreport a dnode as free when it was in fact in use. This could occur for two reasons:

* The per-slot zrl_lock must be held over the entire critical section, which includes the alloc/free until the new dnode is assigned to children_dnodes. Additionally, all of the zrl_locks in the range must be held to protect moving dnodes.

* The dn->dn_type cannot be solely relied upon to check the type. When allocating a new dnode its type will be DMU_OT_NONE after dnode_create(). Only later, when dnode_allocate() is called, will it transition to the new type. This means there's a window during allocation where it can be mistaken for a free dnode.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6414
Closes #6439
Description
When performing concurrent object allocations using the new
multi-threaded allocator and large dnodes it's possible to
allocate overlapping large dnodes.

This case should have been handled by detecting an error
returned by dnode_hold_impl(). But that logic only checked that
the returned dnp was not NULL, and the dnp variable was not
reset to NULL when retrying.

Resolve the issue by properly checking the return value of
dnode_hold_impl(). The dmu_object_next() function was also
updated to skip to the next whole block rather than increment
by one on a dmu_object_info() error. This isn't strictly necessary
with the first fix in place, but it makes it less likely to return
an object number which references the interior of a large dnode.
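For illustration, the block-skipping arithmetic might look like the following hypothetical helper (assuming the dnodes-per-block count is a power of two):

```c
/*
 * Hypothetical sketch of "skip to the next whole block": on a
 * dmu_object_info() error, jump to the first object number of the next
 * dnode block instead of incrementing by one, so the result cannot
 * reference the interior of a large dnode.  Assumes dnodes_per_block
 * is a power of two.
 */
static uint64_t
next_block_first_object(uint64_t object, uint64_t dnodes_per_block)
{
	return ((object + dnodes_per_block) & ~(dnodes_per_block - 1));
}
```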
Motivation and Context
Issue #6414