Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes race condition in event wait logic of SDMMC driver. #2989

Merged

Conversation

antmerlino
Copy link
Contributor

Background

After some configuration changes, namely switching to tickless mode, I started to hit the DEBUGASSERTION here:
https://github.com/apache/incubator-nuttx/blob/master/arch/arm/src/stm32f7/stm32_sdmmc.c#L1460

Basically, the SDMMC driver timed out, but didn't expect to since wkupevents and waitevents are zero. Further debugging shows the event did finish and the timeout should not have occurred.

After @davids5 and I debugged further, we found 2 issues:

  1. The first is addressed here. There is a race condition where your transfer can finish before you start your watchdog resulting in the watchdog not being cancelled and you timing out later unexpectedly.
  2. The second is an issue @davids5 says he has some recollection of @xiaoxiang781216 identifying awhile back. Basically it's that enter_critical_section's are not being respected when calling into the watchdog in tickless mode. A task switch causes interrupts to be re-enabled and thus breaks the critical section. This sched switch and re-enabling of interrupts is actually the reason the watchdog wasn't getting started until after the the watchdog was started. Wrapping the wd_start in sched_lock() as a test fixed the issue. @xiaoxiang781216 Do you recall this? Is there an issue somewhere we could discuss solutions?

Summary

This change fixes a race condition that was noted in a comment in the code. The fundamental issue was that the watchdog was being started after the transfer is started. Instead of passing in a timeout and starting the watchdog inside of SDIO_EVENTWAIT, we can simply do that in SDIO_WAITENABLE.

Impact

  • All arch with SDMMC support
  • mmcsd_sdio driver
  • include/nuttx/sdio.h

The SDMMC driver interface has changed slightly. Moving the timeout parameter from the SDIO_EVENTWAIT call to the SDIO_WAITENABLE call.

Because the SDIO_WAITENABLE call starts the watchdog, the caller must ensure SDIO_CANCEL is called if the caller is not going to continue to calling SDIO_EVENTWAIT.

Testing

I have tested this on an STM32F765VI based platform and it addresses the race condition.

Kudos to @davids5 for helping debug this one!

@antmerlino antmerlino force-pushed the sdmmc-fix-event-race-condition branch 3 times, most recently from 54d3d9d to a4e26a6 Compare March 7, 2021 04:27
@xiaoxiang781216
Copy link
Contributor

  1. The second is an issue @davids5 says he has some recollection of @xiaoxiang781216 identifying awhile back. Basically it's that enter_critical_section's are not being respected when calling into the watchdog in tickless mode. A task switch causes interrupts to be re-enabled and thus breaks the critical section. This sched switch and re-enabling of interrupts is actually the reason the watchdog wasn't getting started until after the the watchdog was started. Wrapping the wd_start in sched_lock() as a test fixed the issue. @xiaoxiang781216 Do you recall this? Is there an issue somewhere we could discuss solutions?

Yes, here is the issue: #1138
Basically, enter_critical_section should stop any schedule until the thread initiatively give up the time slice by wait/sleep and should postone the schedule of the high priority thread waked up by the current thread.

@antmerlino antmerlino force-pushed the sdmmc-fix-event-race-condition branch 4 times, most recently from d5b147f to aa0259e Compare March 8, 2021 08:09
@davids5
Copy link
Contributor

davids5 commented Mar 8, 2021

I will test this today.

@davids5 davids5 self-requested a review March 8, 2021 12:07
Copy link
Contributor

@davids5 davids5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@antmerlino - At first I thought we lost 100 KB/S but I think it is card wear...

MCU: STM32H7[4|5]xxx, rev. Y The SD then dies

master (ish):

sd_bench
INFO  [sd_bench] Using block size = 4096 bytes, sync=0
INFO  [sd_bench]
INFO  [sd_bench] Testing Sequential Write Speed...
INFO  [sd_bench]   Run  0:   545.97 KB/s, max write time: 16 ms (= 250.00 KB/s), fsync: 5 ms
INFO  [sd_bench]   Run  1:   544.51 KB/s, max write time: 14 ms (= 285.71 KB/s), fsync: 4 ms
INFO  [sd_bench]   Run  2:   537.77 KB/s, max write time: 31 ms (= 129.03 KB/s), fsync: 5 ms
INFO  [sd_bench]   Run  3:   544.27 KB/s, max write time: 14 ms (= 285.71 KB/s), fsync: 4 ms
INFO  [sd_bench]   Run  4:   544.19 KB/s, max write time: 15 ms (= 266.67 KB/s), fsync: 5 ms
INFO  [sd_bench]   Avg   :   543.34 KB/s

nsh> sd_bench
INFO  [sd_bench] Using block size = 4096 bytes, sync=0
INFO  [sd_bench]
INFO  [sd_bench] Testing Sequential Write Speed...
INFO  [sd_bench]   Run  0:   545.07 KB/s, max write time: 15 ms (= 266.67 KB/s), fsync: 4 ms
INFO  [sd_bench]   Run  1:   544.49 KB/s, max write time: 16 ms (= 250.00 KB/s), fsync: 5 ms
INFO  [sd_bench]   Run  2:   545.58 KB/s, max write time: 13 ms (= 307.69 KB/s), fsync: 4 ms
INFO  [sd_bench]   Run  3:   545.90 KB/s, max write time: 13 ms (= 307.69 KB/s), fsync: 4 ms

this pr:

sd_bench
INFO  [sd_bench] Using block size = 4096 bytes, sync=0
INFO  [sd_bench]
INFO  [sd_bench] Testing Sequential Write Speed...
INFO  [sd_bench]   Run  0:   544.48 KB/s, max write time: 15 ms (= 266.67 KB/s), fsync: 4 ms
INFO  [sd_bench]   Run  1:   539.10 KB/s, max write time: 30 ms (= 133.33 KB/s), fsync: 5 ms
INFO  [sd_bench]   Run  2:   549.72 KB/s, max write time: 13 ms (= 307.69 KB/s), fsync: 6 ms
INFO  [sd_bench]   Run  3:   548.32 KB/s, max write time: 13 ms (= 307.69 KB/s), fsync: 6 ms
INFO  [sd_bench]   Run  4:   546.97 KB/s, max write time: 14 ms (= 285.71 KB/s), fsync: 4 ms
INFO  [sd_bench]   Avg   :   545.72 KB/s
nsh> sd_bench
INFO  [sd_bench] Using block size = 4096 bytes, sync=0
INFO  [sd_bench]
INFO  [sd_bench] Testing Sequential Write Speed...
ERROR [sd_bench] Write error

@davids5
Copy link
Contributor

davids5 commented Mar 8, 2021

@antmerlino - there are also some CI failures that need looking at.

@davids5
Copy link
Contributor

davids5 commented Mar 8, 2021

@antmerlino MCU: STM32F76xxx, rev. Z - passes.

@davids5
Copy link
Contributor

davids5 commented Mar 8, 2021

@antmerlino MCU: STM32F42x, rev. 3 passes.

@antmerlino antmerlino force-pushed the sdmmc-fix-event-race-condition branch from aa0259e to 922208d Compare March 8, 2021 16:42
@antmerlino
Copy link
Contributor Author

@antmerlino - there are also some CI failures that need looking at.

Yeah, still trying to get them fixed.

davids5 added a commit to nuttx-to-asf/incubator-nuttx that referenced this pull request Mar 12, 2021
   mmcsd:Remove CONFIG_ARCH_HAVE_SDIO_DELAYED_INVLDT
   stm32h7:sdmmc remove CONFIG_ARCH_HAVE_SDIO_DELAYED_INVLDT
   stm32f7:sdmmc remove CONFIG_ARCH_HAVE_SDIO_DELAYED_INVLDT
   stm32f7:sdmmc WRITE COMPLETE prevent false triggers
   stm32h7:sdmmc WRITE COMPLETE prevent false triggers

   While testing PR apache#2989 on the H7 I noticed that the cards
   were staying in 1-bit mode. The root cause was that the
   scr read path was using DMA without an invlidate.

   This was caused by CONFIG_ARCH_HAVE_SDIO_DELAYED_INVLDT,
   but the sdmmc driver, did not use the delayed invalidate
   nor would it work on 8 bytes.

   The driver fully supported dcache mgt on runt buffers, but
   the #ifdef CONFIG_ARCH_HAVE_SDIO_DELAYED_INVLDT blocked it.

   Reviewing the PR that added CONFIG_ARCH_HAVE_SDIO_DELAYED_INVLDT
   it may have been valid at the time. But after the dcache operations
   we fixed. It is not necessary and offers no benefit.
davids5 added a commit to nuttx-to-asf/incubator-nuttx that referenced this pull request Mar 12, 2021
   mmcsd:Remove CONFIG_ARCH_HAVE_SDIO_DELAYED_INVLDT
   stm32h7:sdmmc remove CONFIG_ARCH_HAVE_SDIO_DELAYED_INVLDT
   stm32f7:sdmmc remove CONFIG_ARCH_HAVE_SDIO_DELAYED_INVLDT
   stm32f7:sdmmc WRITE COMPLETE prevent false triggers
   stm32h7:sdmmc WRITE COMPLETE prevent false triggers

   While testing PR apache#2989 on the H7 I noticed that the cards
   were staying in 1-bit mode. The root cause was that the
   scr read path was using DMA without an invlidate.

   This was caused by CONFIG_ARCH_HAVE_SDIO_DELAYED_INVLDT,
   but the sdmmc driver, did not use the delayed invalidate
   nor would it work on 8 bytes.

   The driver fully supported dcache mgt on runt buffers, but
   the #ifdef CONFIG_ARCH_HAVE_SDIO_DELAYED_INVLDT blocked it.

   Reviewing the PR that added CONFIG_ARCH_HAVE_SDIO_DELAYED_INVLDT
   it may have been valid at the time. But after the dcache operations
   we fixed. It is not necessary and offers no benefit.
acassis pushed a commit that referenced this pull request Mar 12, 2021
   mmcsd:Remove CONFIG_ARCH_HAVE_SDIO_DELAYED_INVLDT
   stm32h7:sdmmc remove CONFIG_ARCH_HAVE_SDIO_DELAYED_INVLDT
   stm32f7:sdmmc remove CONFIG_ARCH_HAVE_SDIO_DELAYED_INVLDT
   stm32f7:sdmmc WRITE COMPLETE prevent false triggers
   stm32h7:sdmmc WRITE COMPLETE prevent false triggers

   While testing PR #2989 on the H7 I noticed that the cards
   were staying in 1-bit mode. The root cause was that the
   scr read path was using DMA without an invlidate.

   This was caused by CONFIG_ARCH_HAVE_SDIO_DELAYED_INVLDT,
   but the sdmmc driver, did not use the delayed invalidate
   nor would it work on 8 bytes.

   The driver fully supported dcache mgt on runt buffers, but
   the #ifdef CONFIG_ARCH_HAVE_SDIO_DELAYED_INVLDT blocked it.

   Reviewing the PR that added CONFIG_ARCH_HAVE_SDIO_DELAYED_INVLDT
   it may have been valid at the time. But after the dcache operations
   we fixed. It is not necessary and offers no benefit.
@acassis
Copy link
Contributor

acassis commented Mar 24, 2021

@antmerlino please fix the Conflicting files.

@davids5
Copy link
Contributor

davids5 commented Mar 24, 2021

@acassis - after the rebase this is not ready to be merged, as we still need to resolve the issue, as reported above. I plan to retest this week.

@davids5
Copy link
Contributor

davids5 commented Apr 5, 2021

@antmerlino
I am seeing corruption after a write on the H7/

nsh>  ls  /fs/microsd/
/fs/microsd:
 .Trashes
 etc/
 .metadata_never_index
 .fseventsd/
 .Trash-1000
 dataman
 ufw/
 uavcan.db/
 log/
nsh> echo > /fs/microsd/etc/foo.txt
nsh>  ls  /fs/microsd/
/fs/microsd:
 RRaA

Debugging it now....

This change makes it so that the timeout is set as part of the SDIO_WAITENABLE call instead of the SDIO_EVENTWAIT call. By doing so, you eliminate all opportunity for a race condition.

stm32h7:sdmmc Check if busy ended early
@davids5 davids5 force-pushed the sdmmc-fix-event-race-condition branch from 922208d to 8632a38 Compare April 5, 2021 20:42
@davids5
Copy link
Contributor

davids5 commented Apr 5, 2021

@antmerlino - Changes added - thank you for your guidance! Rebased on master, squashed and force pushed.

@davids5
Copy link
Contributor

davids5 commented Apr 5, 2021

H7 - Now working
F7 - Still working
F4 - Still working

@davids5 davids5 self-requested a review April 5, 2021 21:17
Copy link
Contributor

@davids5 davids5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@antmerlino - Thank you again!

Copy link
Contributor

@xiaoxiang781216 xiaoxiang781216 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@xiaoxiang781216 xiaoxiang781216 merged commit b21cb33 into apache:master Apr 6, 2021
jlaitine pushed a commit to tiiuae/nuttx that referenced this pull request May 25, 2021
…ELAYED_INVLDT

   mmcsd:Remove CONFIG_ARCH_HAVE_SDIO_DELAYED_INVLDT
   stm32h7:sdmmc remove CONFIG_ARCH_HAVE_SDIO_DELAYED_INVLDT
   stm32f7:sdmmc remove CONFIG_ARCH_HAVE_SDIO_DELAYED_INVLDT
   stm32f7:sdmmc WRITE COMPLETE prevent false triggers
   stm32h7:sdmmc WRITE COMPLETE prevent false triggers

   While testing PR apache#2989 on the H7 I noticed that the cards
   were staying in 1-bit mode. The root cause was that the
   scr read path was using DMA without an invlidate.

   This was caused by CONFIG_ARCH_HAVE_SDIO_DELAYED_INVLDT,
   but the sdmmc driver, did not use the delayed invalidate
   nor would it work on 8 bytes.

   The driver fully supported dcache mgt on runt buffers, but
   the #ifdef CONFIG_ARCH_HAVE_SDIO_DELAYED_INVLDT blocked it.

   Reviewing the PR that added CONFIG_ARCH_HAVE_SDIO_DELAYED_INVLDT
   it may have been valid at the time. But after the dcache operations
   we fixed. It is not necessary and offers no benefit.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants