race condition between spa async threads and export #9015
Labels
Type: Defect
Incorrect behavior (e.g. crash, hang)
sdimitro added a commit to sdimitro/zfs that referenced this issue (Jul 16, 2019):
In the past we've seen multiple race conditions that have to do with open-context threads async threads and concurrent calls to spa_export()/spa_destroy() (including the one referenced in issue openzfs#9015). This patch ensures that only one thread can execute the main body of spa_export_common() at a time, with subsequent threads returning with a new error code created just for this situation, eliminating this way any race condition bugs introduced by concurrent calls to this function. Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
TulsiJain pushed a commit to TulsiJain/zfs that referenced this issue (Jul 20, 2019):
In the past we've seen multiple race conditions that have to do with open-context threads async threads and concurrent calls to spa_export()/spa_destroy() (including the one referenced in issue openzfs#9015). This patch ensures that only one thread can execute the main body of spa_export_common() at a time, with subsequent threads returning with a new error code created just for this situation, eliminating this way any race condition bugs introduced by concurrent calls to this function. Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com> Closes openzfs#9015 Closes openzfs#9044
@shartse recently hit a ztest failure where the deadman timed out. The problem was a race condition between multiple threads attempting to export the pool, which deadlocked while suspending and resuming the SPA's async thread and zthrs. The specific scenario from Sara's core is the following:
There are 3 threads (A, B, C) that just entered spa_export_common() and one zthr (thread D, which in Sara's case was the livelist zthr but could just as easily have been a different one).
Timeline:
1. [Thread A calls] spa_async_suspend().
2. [Thread A grabs the] spa_namespace_lock.
3. [Thread B calls] spa_async_suspend() and gets stuck waiting for Thread A to drop the spa_namespace_lock in spa_export_common().
4. [Thread A hits an error and calls] spa_async_resume(), dropping the spa_namespace_lock and exiting.
5. [Thread B grabs the] spa_namespace_lock, hits the same error and is about to enter spa_async_resume().
6. [Thread C enters] spa_async_suspend(), calling zthr_cancel() in one of the zthrs, grabbing the zthr_request_lock and later waiting in cv_wait() for zthr_cv to be broadcast.
7. [Thread D, which should broadcast] zthr_cv, is stuck in dsl_sync_task()/spa_open_common() waiting for the spa_namespace_lock to be dropped by thread B.
8. [Thread B enters] spa_async_resume(), which calls zthr_resume() for thread D, at which point it gets stuck waiting to grab the zthr_request_lock mutex that is held by thread C in zthr_cancel().
Resulting Deadlock:
Thread B holds the spa_namespace_lock and is stuck waiting for the zthr_request_lock.
Thread C holds the zthr_request_lock and is stuck waiting for zthr_cv to receive a signal.
Thread D is stuck waiting for the spa_namespace_lock and it cannot signal zthr_cv.
Relevant Code paths:
In Sara's scenario the zthr was a livelist open-context thread, but since that code has not been upstreamed yet, I have the snippet of another zthr that could just as easily have caused this.
There are a couple of potential solutions here. After discussing it with Matt, we figured that serializing most of spa_export_common() by bailing early if there are other threads in that codepath may be the best solution for making export more stable overall.