Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixing gang ABD child removal race condition #10511

Merged
merged 1 commit into from
Jul 14, 2020

Conversation

bwatkinson
Copy link
Contributor

@bwatkinson bwatkinson commented Jun 29, 2020

On linux the list debug code has been setting off a failure when
checking that the node->next->prev value is pointing back at the node.
At times this check evaluates to 0xdead. When removing a child from a
gang ABD we must acquire the child's abd_mtx to make sure that the
same ABD is not being added to another gang ABD while it is being
removed from a gang ABD. This fixes a race condition when checking
if an ABDs link is already active and part of another gang ABD before
adding it to a gang.

Added additional debug code for the gang ABD in abd_verify() to make
sure each child ABD has active links. Also check to make sure another
gang ABD is not added to a gang ABD.

Signed-off-by: Brian Atkinson batkinson@lanl.gov

Currently when an ABD struct that was part of a gang ABD is removed with list_remove_head() in abd_free_gang_abd() we will get a failure in the linux kernel list debug code were the next->prev is pointing at LIST_POISON2 instead of the node address. There was a race condition that existed when removing a child from a gang ABD and the child's abd_mtx need to be acquired to make sure it was not being removed and added to gang ABD at the same time.

Motivation and Context

Fixes race condition in gang ABD between abd_gang_add() and abd_free_gang_abd().

#10401

Description

I have added an additional function, list_link_not_actve(), to avoid a short circuit in the && statement so both next and prev are verified to point at LIST_POISON(1/2) values. In abd_verify() I have added an additional ASSERT for verifying that a each child in a gang ABD has an active link.

How Has This Been Tested?

I have ran using zloop.sh on Centos 8

CentOS 8: Kernel 4.18.0-193

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

  • My code follows the ZFS on Linux code style requirements.
  • I have updated the documentation accordingly.
  • I have read the contributing document.
  • I have added tests to cover my changes.
  • I have run the ZFS Test Suite with this change applied.
  • All commit messages are properly formatted and contain Signed-off-by.

@behlendorf behlendorf added the Status: Code Review Needed Ready for review and testing label Jun 29, 2020
@codecov
Copy link

codecov bot commented Jun 30, 2020

Codecov Report

Merging #10511 into master will increase coverage by 0.62%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #10511      +/-   ##
==========================================
+ Coverage   79.51%   80.13%   +0.62%     
==========================================
  Files         393      292     -101     
  Lines      124638    84430   -40208     
==========================================
- Hits        99106    67661   -31445     
+ Misses      25532    16769    -8763     
Flag Coverage Δ
#kernel 80.13% <100.00%> (-0.02%) ⬇️
#user ?
Impacted Files Coverage Δ
include/os/linux/spl/sys/list.h 98.14% <100.00%> (+0.03%) ⬆️
module/zfs/abd.c 94.95% <100.00%> (+0.12%) ⬆️
module/icp/asm-x86_64/aes/aeskey.c 0.00% <0.00%> (-100.00%) ⬇️
module/icp/algs/modes/gcm_generic.c 11.11% <0.00%> (-88.89%) ⬇️
module/zfs/vdev_raidz_math_sse2.c 14.39% <0.00%> (-84.85%) ⬇️
module/icp/algs/aes/aes_impl_generic.c 0.82% <0.00%> (-69.55%) ⬇️
module/icp/algs/skein/skein_block.c 31.94% <0.00%> (-68.06%) ⬇️
module/icp/algs/aes/aes_impl_x86-64.c 40.00% <0.00%> (-60.00%) ⬇️
module/icp/algs/edonr/edonr.c 45.72% <0.00%> (-42.59%) ⬇️
module/zfs/ddt_zap.c 57.14% <0.00%> (-38.32%) ⬇️
... and 279 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e59a377...3a43a4a. Read the comment docs.

@bwatkinson bwatkinson force-pushed the gang_abd_debug branch 4 times, most recently from bae7ea9 to 5ae5354 Compare July 2, 2020 17:27
@bwatkinson bwatkinson force-pushed the gang_abd_debug branch 2 times, most recently from b4344e0 to f1a3753 Compare July 7, 2020 21:32
@behlendorf behlendorf requested review from behlendorf and sdimitro July 8, 2020 19:47
@ahrens ahrens changed the title Additional gang ABD debug code Fixing gang ABD child removal race condition Jul 8, 2020
module/zfs/abd.c Outdated
* adding it to the other gang ABD.
*/
mutex_enter(&cabd->abd_mtx);
ASSERT3B(list_link_active(&cabd->abd_gang_link), ==, B_TRUE);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
ASSERT3B(list_link_active(&cabd->abd_gang_link), ==, B_TRUE);
ASSERT(list_link_active(&cabd->abd_gang_link));

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated this.

module/zfs/abd.c Outdated
@@ -373,6 +384,7 @@ void
abd_gang_add(abd_t *pabd, abd_t *cabd, boolean_t free_on_free)
{
ASSERT(abd_is_gang(pabd));
ASSERT3B(abd_is_gang(cabd), ==, B_FALSE);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
ASSERT3B(abd_is_gang(cabd), ==, B_FALSE);
ASSERT(!abd_is_gang(cabd));

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated this.

Comment on lines 184 to 188
static inline int
list_link_not_active(list_node_t *node)
{
return ((node->next == LIST_POISON1) && (node->prev == LIST_POISON2));
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a little confusing that this is different from !list_link_active(). I am guessing this is to catch cases where only one of the pointers is POISON? And I'm also guessing that should never happen (only one of the pointers being POISON).

Instead of adding this function, could we make list_link_active() assert that either both or neither pointer is POISON? e.g.

EQUIV(node->next == LIST_POISON1, node->prev == LIST_POISON2);
return (node->next != LIST_POISON1);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did add this to avoid the logic short circuiting on the && statement in link_list_active(). We were observing, due to the race condition, that next would be pointing at a valid ABD, but prev was pointing at LIST_POISON2. I am fine with just updating list_link_active() to the EQUIV statement and return.

@sdimitro
Copy link
Contributor

I may be wrong but I believe there is still a race condition with this code. I came up with the hypothetical scenario below trying to highlight every step so others can point to them if I am wrong.

Setup: We've got a N-way mirror with devices A,B,C,D, ... N again. We issue a write to the mirror which splits into N ZIOs that end up in the I/O aggregate code in question.

Step 1: Thread for device A I/O creates its ABD with its gang chain, then grabs the lock for the child ABD adds it to the chain and drops it - this is according to the code in abd_gang_add().

Step 2: Same thread (device A) eventually is done and gets to abd_free_gang_abd(). Now with the updated code it correctly grabs the lock for the child, removes it from its current gang-chain, and drops its lock with mutex_exit(&cabd->abd_mtx) that was just added. Now let's say that this thread gets pre-empted or something and we are stuck for some time X. During that time X we have dropped the lock of the child but we haven't executed anything else yet from that function.

Step 3: Thread for device B I/O suddenly wakes up and calls abd_gang_add() for that same child ABD. It grabs the lock (it can because the other thread just dropped that lock), sees that the ABD is not part of another gang ABD and executes the following code before adding the ABD to its own gang chain:

	} else {
		child_abd = cabd;
		if (free_on_free)
			child_abd->abd_flags |= ABD_FLAG_GANG_FREE;
	}

Potential Race: Now this is the part where I may be wrong but if the device B thread was had free_on_free set to B_TRUE it will mutate the child ABD's flags, enabling ABD_FLAG_GANG_FREE. This action could potentially affect the device A thread as that thread is making a decision on its control flow based on that flag without holding any locks:

abd_free_gang_abd(abd_t *abd)
{
...
		mutex_exit(&cabd->abd_mtx);
		abd->abd_size -= cabd->abd_size;
		if (cabd->abd_flags & ABD_FLAG_GANG_FREE) {
			if (cabd->abd_flags & ABD_FLAG_OWNER)
				abd_free(cabd);
			else
				abd_put(cabd);
		}
...
}

Again, I may be wrong but my point being that device B thread may be forcing device A thread to call abd_free() or abd_put() in the above if-clause by enabling its ABD_FLAG_GANG_FREE flag which originally was unset by thread A.

Does this seem possible?

@bwatkinson
Copy link
Contributor Author

bwatkinson commented Jul 10, 2020

I may be wrong but I believe there is still a race condition with this code. I came up with the hypothetical scenario below trying to highlight every step so others can point to them if I am wrong.

Setup: We've got a N-way mirror with devices A,B,C,D, ... N again. We issue a write to the mirror which splits into N ZIOs that end up in the I/O aggregate code in question.

Step 1: Thread for device A I/O creates its ABD with its gang chain, then grabs the lock for the child ABD adds it to the chain and drops it - this is according to the code in abd_gang_add().

Step 2: Same thread (device A) eventually is done and gets to abd_free_gang_abd(). Now with the updated code it correctly grabs the lock for the child, removes it from its current gang-chain, and drops its lock with mutex_exit(&cabd->abd_mtx) that was just added. Now let's say that this thread gets pre-empted or something and we are stuck for some time X. During that time X we have dropped the lock of the child but we haven't executed anything else yet from that function.

Step 3: Thread for device B I/O suddenly wakes up and calls abd_gang_add() for that same child ABD. It grabs the lock (it can because the other thread just dropped that lock), sees that the ABD is not part of another gang ABD and executes the following code before adding the ABD to its own gang chain:

	} else {
		child_abd = cabd;
		if (free_on_free)
			child_abd->abd_flags |= ABD_FLAG_GANG_FREE;
	}

Potential Race: Now this is the part where I may be wrong but if the device B thread was had free_on_free set to B_TRUE it will mutate the child ABD's flags, enabling ABD_FLAG_GANG_FREE. This action could potentially affect the device A thread as that thread is making a decision on its control flow based on that flag without holding any locks:

abd_free_gang_abd(abd_t *abd)
{
...
		mutex_exit(&cabd->abd_mtx);
		abd->abd_size -= cabd->abd_size;
		if (cabd->abd_flags & ABD_FLAG_GANG_FREE) {
			if (cabd->abd_flags & ABD_FLAG_OWNER)
				abd_free(cabd);
			else
				abd_put(cabd);
		}
...
}

Again, I may be wrong but my point being that device B thread may be forcing device A thread to call abd_free() or abd_put() in the above if-clause by enabling its ABD_FLAG_GANG_FREE flag which originally was unset by thread A.

Does this seem possible?

So while your case could potentially happen if the user was not using the free_on_free flag correctly, that is not the case in the aggregate function. It is important to note that only newly allocated ABD’s pass B_TRUE here. Any ABD that can potentially be part of multiple gang ABD’s always passes B_FALSE. We are really putting the onus on the coder here to correctly pass the right flag for free_on_free. We are taking care to always pass B_FALSE in the case of the same ABD possibly being part of two gang ABD’s.

On linux the list debug code has been setting off a failure when
checking that the node->next->prev value is pointing back at the node.
At times this check evaluates to 0xdead. When removing a child from a
gang ABD we must acquire the child's abd_mtx to make sure that the
same ABD is not being added to another gang ABD while it is being
removed from a gang ABD. This fixes a race condition when checking
if an ABDs link is already active and part of another gang ABD before
adding it to a gang.

Added additional debug code for the gang ABD in abd_verify() to make
sure each child ABD has active links. Also check to make sure another
gang ABD is not added to a gang ABD.

Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Jul 13, 2020
@behlendorf behlendorf merged commit e4d3d77 into openzfs:master Jul 14, 2020
@bwatkinson bwatkinson deleted the gang_abd_debug branch July 14, 2020 18:30
jsai20 pushed a commit to jsai20/zfs that referenced this pull request Mar 30, 2021
On linux the list debug code has been setting off a failure when
checking that the node->next->prev value is pointing back at the node.
At times this check evaluates to 0xdead. When removing a child from a
gang ABD we must acquire the child's abd_mtx to make sure that the
same ABD is not being added to another gang ABD while it is being
removed from a gang ABD. This fixes a race condition when checking
if an ABDs link is already active and part of another gang ABD before
adding it to a gang.

Added additional debug code for the gang ABD in abd_verify() to make
sure each child ABD has active links. Also check to make sure another
gang ABD is not added to a gang ABD.

Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes openzfs#10511
sempervictus pushed a commit to sempervictus/zfs that referenced this pull request May 31, 2021
On linux the list debug code has been setting off a failure when
checking that the node->next->prev value is pointing back at the node.
At times this check evaluates to 0xdead. When removing a child from a
gang ABD we must acquire the child's abd_mtx to make sure that the
same ABD is not being added to another gang ABD while it is being
removed from a gang ABD. This fixes a race condition when checking
if an ABDs link is already active and part of another gang ABD before
adding it to a gang.

Added additional debug code for the gang ABD in abd_verify() to make
sure each child ABD has active links. Also check to make sure another
gang ABD is not added to a gang ABD.

Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes openzfs#10511
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Accepted Ready to integrate (reviewed, tested)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants