fix for the RWW errors while using Storage driver for internal flash on K64F #2150

rgrover · 2016-07-12T14:09:58Z

fixes https://github.com/ARMmbed/configuration-store/issues/21

Read-while-write errors may occur when there is a read/fetch on a bank of
internal flash while an erase/program operation to the same bank is active.

The possibility of read-while-write errors depends on the application's span
of internal-flash.

In the trivial (and also the default) case when this boundary is set to lie at
the junction between the two banks of internal flash (i.e. BLOCK1_START_ADDR),
there is no possibility of read-while-write. In the more common case, when
the system doesn't reserve all of BLOCK1 for the flash driver, there exists
the possibility of read-while-write.

To avoid this problem for the common case (above), we enforce the following:

Force synchronous mode of execution in the storage_driver.
Disable interrupts during erase/program operations. This prevents the
execution of arbitrary user-code which might access flash.
Ensure all code and data structures used in the storage driver execute
out of RAM. Refer to __RAMFUNC which allows for this.

We assume that DEVICE_STORAGE_CONFIG_HARDWARE_MTD_START_ADDR is the boundary between application and this driver. This is used to make a compile-time choice to enforce the above-mentioned conditions.

@kjbracey-arm @simonqhughes @meriac

notify: @marcuschangarm

bogdanm · 2016-07-12T14:23:18Z

An unrelated observation: please note that the mbed coding style doesn't encourage the use of camelCase, see https://developer.mbed.org/teams/SDK-Development/wiki/mbed-sdk-coding-style for more details. Definitely not something that would prevent this PR from being merged, just something to take into account in the future.

rgrover · 2016-07-12T14:57:59Z

@mbed-bot: TEST

HOST_OSES=windows,linux
BUILD_TOOLCHAINS=GCC_ARM,ARM,IAR
TARGETS=K64F

mbed-bot · 2016-07-12T15:19:41Z

[Build 612]
SUCCESS: Building succeeded and tests were run! Be sure to check the test results

c1728p9 · 2016-07-12T15:21:26Z

hal/targets/hal/TARGET_Freescale/TARGET_KSDK2_MCUS/TARGET_K64F/storage_driver.c

+#elif defined (__CC_ARM)
+#define __RAMFUNC __attribute__ ((section(".data.ramfunc"), noinline))
+#elif defined ( __ICCARM__ )
+#define __RAMFUNC __ramfunc


Since you'll need these for most targets which support writing to internal flash you might consider moving this to a common location. The current method for handling toolchain specific attributes it to put them in toolchain.h.

moved __RAMFUNC to toolchain.h

rgrover · 2016-07-13T07:17:46Z

@mbed-bot: TEST

HOST_OSES=windows,linux
BUILD_TOOLCHAINS=GCC_ARM,ARM,IAR
TARGETS=K64F

mbed-bot · 2016-07-13T07:39:41Z

[Build 615]
SUCCESS: Building succeeded and tests were run! Be sure to check the test results

kjbracey · 2016-07-13T07:57:35Z

hal/api/toolchain.h

+ * startup sequence. */
+#ifndef __RAMFUNC
+#if defined(__GNUC__) || defined(__clang__) // GCC and llvm/clang
+#define __RAMFUNC __attribute__ ((section (".data#"), noinline)) /* The '#' following ".data" needs a bit of


This is still a horrendous hack - the toolchain equivalent of SQL injection.

Does ".data.ramfunc", like you did later for ARMCC, really not work?

without this 'hack', there will be compiler warnings, as the comment indicates. Is that acceptable?

Have you exhausted the possibilities for a proper fix?

I recall from the last PR that ".data.ramfunc" complains because it expects ".data.*" sections to be non-executable.

Have you tried a completely new section name, like ".ramfuncs"? Would mean adding it to the linker map.

".data.ramfunc" also produces a compiler warning from gcc

Actually, a completely separate name is the right thing to do for ARMCC too. You were previously worried about the issue of actually marking RAM sections executable. You do want some special naming to allow grouping together into an executable region if necessary.

Lumping this stuff in ".data.*" is just asking for it to end up in a non-executable region.

kjbracey · 2016-07-13T08:04:02Z

Code looks a lot clearer that the earlier attempt.

Only remaining issue with the actual implementation is the "fool the compiler into accidentally outputting a comment symbol".

My immediate response to seeing that is the desire to submit a GCC patch to stop you doing that. GCC's clearly at fault in not escaping or otherwise blocking the "#".

Otherwise implementation is fine - it's doing what is intended. So +1 for implementation if that can be cleaned up.

However, I do vote -1 overall, as I do not want any component in the system disabling interrupts for 5-115ms. It will break other components, and we're going to spend the next couple of years addressing arising from this breakage.

I believe we should not be permitting and encouraging flash storage if block 1 is in use by the program in a reference design.

And I think more engineers need input on this, as the submission of this patch potentially impacts every driver in the K64F system.

rgrover · 2016-07-13T08:50:33Z

@kjbracey-arm please review

rgrover · 2016-07-13T08:50:47Z

@mbed-bot: TEST

HOST_OSES=windows,linux
BUILD_TOOLCHAINS=GCC_ARM,ARM,IAR
TARGETS=K64F

rgrover · 2016-07-13T08:58:57Z

Regarding @kjbracey-arm 's other comment about disabling interrupts: this change will kick in the disabling of interrupts only for applications whose images are > 512KB in size. And in that case, it may be better for most applications to block interrupts briefly instead of suffering an arbitrary hard-fault.

I argue that this change should be allowed to go through as a safety guard for the benefit of the majority of users who may not be affected by brief interrupt blackouts. For the rest, we need to put in place warnings in our documentation.

kjbracey · 2016-07-13T09:03:38Z

hal/api/toolchain.h

+#if defined(__GNUC__) || defined(__clang__) // GCC and llvm/clang
+#define __RAMFUNC __attribute__ ((section (".ramfunc"), noinline))
+#elif defined (__CC_ARM)
+#define __RAMFUNC __attribute__ ((section(".ramfunc"), noinline))


These are now identical, so no separate block required.

kjbracey · 2016-07-13T09:10:55Z

+1 for implementation now. Continued -1 for doing it at all.

Ultimately, it needn't be a choice between disabling interrupts and an arbitrary hard fault. The toolchain could restrict image size to 512K at build time, or the storage driver could return a polite error if it detects that the image size has encroached into its storage zone.

I agree that there is merit in enabling this for users that can cope with interrupt blackout. But I'm not happy with that happening silently and implicitly. I can already envisage the complaints: "when I enable debug, my serial driver starts dropping characters". Whoops, image size exceeded 512K.

My feeling is that there should be some sort of explicit control to permit this behaviour. So that way the people who can't tolerate it have some way of avoiding accidentally triggering it.

Conversely, they may possibly want to force the behaviour. They may want to test resilience against the interrupt loss when their image size doesn't currently exceed 512K, but they know it ultimately will.

Although I guess it is sort of semi-explicit - they do have to manually set the base address at the moment. I guess with sufficient warnings in the documentation...

And the necessary toolchain interlock to stop the image size crossing over that base address.

mbed-bot · 2016-07-13T09:12:29Z

[Build 616]
SUCCESS: Building succeeded and tests were run! Be sure to check the test results

marcuschangarm · 2016-07-13T09:13:25Z

I see no problem in disabling interrupts for any amount of time.

If this breaks the application, then the application is too complex for the hardware and the user should upgrade to a different board anyway.

rgrover · 2016-07-13T09:13:25Z

@mbed-bot: TEST

HOST_OSES=windows,linux
BUILD_TOOLCHAINS=GCC_ARM,ARM,IAR
TARGETS=K64F

kjbracey · 2016-07-13T09:15:00Z

hal/api/toolchain.h

@@ -225,6 +225,21 @@
 #endif
 #endif

+/* Define '__RAMFUNC' as an attribute to mark a function as residing in RAM.
+ * Use of __RAMFUNC puts a function in the .data section--i.e. the


Micro-nit - ".data section" is now wrong. Exact details will depend on the toolchain/linker map. Better to just state that it will place the function into RAM.

kjbracey · 2016-07-13T09:20:36Z

If this breaks the application, then the application is too complex for the hardware and the user should upgrade to a different board anyway.

Kind of agree with that.

The catch is that we have to acknowledge that all our reference applications are now too complex for our main reference board.

rgrover · 2016-07-13T09:24:18Z

@mbed-bot: TEST

HOST_OSES=windows,linux
BUILD_TOOLCHAINS=GCC_ARM,ARM,IAR
TARGETS=K64F

marcuschangarm · 2016-07-13T09:29:18Z

If we can't fit our reference applications into 512 KiB of flash we got bigger problems! Once we get around to optimize for size, this shouldn't be an issue.

If the user wants all the bells and whistles then they should be prepared to pay the price.

If the user wants to use a complex API, like writing to flash, they must be prepared to handle the side-effects, like high latency during the flash operation, and I'm pretty sure that with proper documentation our users can handle it.

mbed-bot · 2016-07-13T09:34:28Z

[Build 617]
SUCCESS: Building succeeded and tests were run! Be sure to check the test results

kjbracey · 2016-07-13T09:35:22Z

If we can't fit our reference applications into 512 KiB of flash we got bigger problems! Once we get around to optimize for size, this shouldn't be an issue.

We are aiming at 512K release for our biggest stuff, but then we need 1M for the debug images.

If the user want's to use a complex API, like writing to flash, they must be prepared to handle the side-effects, like high latency during the flash operation, and I'm pretty sure that with proper documentation our user's can handle it.

The problem is that the components that can't cope with the latency (like radio drivers and SLIP drivers) don't get any direct say in when flash flushes get called. And some drivers may not be able to cope at all - I don't know if we've established that the Wi-Fi module can ever recover if a flush happens.

There should need to be some sort of serious OS mechanism here - some sort of "suspend" call going round to drivers to tell them to prepare for a brief system shutdown.

And I know for a fact that all our current applications are calling flush after every write. They're yet to take any defensive measures to even limit the frequency of them, yet alone try to deal with the necessary work to co-operate with low level drivers.

marcuschangarm · 2016-07-13T09:40:57Z

I think you just made my point. If our image is 1 MiB and it uses the flash for config storage, the K64F is not the right board to use because the application will hardfault everytime it writes to flash while executing from the second flash bank.

Either get a different board or reduce the code size. The last place to put the blame is the flash driver.

kjbracey · 2016-07-13T09:53:12Z

Either get a different board or reduce the code size.

Indeed. Or add a module with separate storage. The fundamental problem here is the attempt to use a bare FRDM-K64F as a "1MB development platform with non-volatile storage". It clearly is not suitable for that.

And I'd be happy leaving it at that - just put a check in the flash driver or build system to guarantee that the program image and flash area are not sharing a block, and politely refuse if they do. That would suit me, and we then can go find a better platform.

By making this change, we are implicitly suggesting we might be trying to limp on with this unsuitable platform.

The last place to put the blame is the flash driver.

Kind of depends on whether you view this as an OS or not. Any OS will normally have basic expectations about expected interrupt latency. Typically most OSes expect something in the 100us-500us range. It's an implicit or explicit part of the API that drivers shouldn't do things to break that. Not blocking out other interrupt users is as expected as not scribbling over their RAM.

I really don't think core drivers should not be doing this, at least not without a very specific permission slip from the application saying "I realise this may break the system, and am willing to accept responsibility."

mbed-bot · 2016-07-13T09:57:16Z

[Build 618]
SUCCESS: Building succeeded and tests were run! Be sure to check the test results

marcuschangarm · 2016-07-13T10:09:06Z

By making this change, we are implicitly suggesting we might be trying to limp on with this unsuitable platform.

I wouldn't call this a change. It is a bug fix. Without it your application will hardfault. At least now it will continue to run. That might lead to timing sensitive code doing things it wasn't intended to do, but that is a bug as well that should be fixed.

Kind of depends on whether you view this as an OS or not. Any OS will normally have basic expectations about expected interrupt latency.

I wouldn't expect anything from an OS unless it was explicitly stated in the features. So far mbed OS doesn't provide any latency guarantees.

simonqhughes · 2016-07-13T10:15:14Z

+1 to adopt these changes.

The changes are extremely beneficial to a wide range of system designs and applications:

the limitations are documented for developer/system architects/designers/etc to evaluate impact for their design.
it's the role of systems engineering to make the appropriate system level trade-offs between platform flash storage and software component constraints.
its the role of the software component designer/implementer to make the component meets its design goals i.e. to provide persistent data storage to highly constrained embedded systems (here storage_driver.c and flash-journal generally) in as broadly a useful way as possible.

Not withstanding the valid points of @kjbracey-arm above including:

However, I do vote -1 overall, as I do not want any component in the system disabling interrupts for 5-115ms. It will break other components, and we're going to spend the next couple of years addressing arising from this breakage.

I disagree with his -1 assessment overall:

The specific design case cited by @kjbracey-arm is a protocol stack app requiring low interrupt latency running out of flash where the system designer has elected to sacrifice a dedicated block for CFSTORE/Flash-journal data storage in preference for supporting a large firmware image (i.e. mbed-client/nanostack application on K64F).
This protocol-stack design case needs to be placed within the broad context of alternative designs that require persistent storage but are unaffected by the current storage_driver.c limitations.
For this reason the current storage_driver.c changes offered in these commits are of immediate benefit to a broad range of developers.

I suggest that the issues described by @kjbracey-arm are the point of departure for:

the next iteration of the storage_driver.c/flash-journal design that can meet a low interrupt latency requirement.
the nanostack system design trade-offs be re-evaluated as to whether the K64F is a suitable reference platform, and whether a design case with an image size ~900kB should be complemented with one where the image size is << 512kB, representative of many highly constrained embedded communication applications in the IOT space.

Please lets not delay these fixes further, which were first initiated by Rohit ~6days ago.

marcuschangarm · 2016-07-13T10:32:11Z

I suggest that the issues described by @kjbracey-arm are the point of departure for:

Using the SD card on the K64F. We could make it easy for our users to use that to store configuration data.

Please lets not delay these fixes further, which were first initiated by Rohit ~6days ago.

Indeed! I got a demo to build! 🎉

kjbracey · 2016-07-13T13:15:30Z

Not going to argue with the bulk of Simon's view. I'm certainly not "-2"ing this.

But to comment on two points:

"low interrupt latency"

I don't think "low" is a fair characterisation. It's normal interrupt latency in the hundreds of microsecond range that all conventional systems reach, and is required for typical hardware to work properly. That range has remained basically constant for about 30 years. For example, the K64F serial port buffers are sized on the assumption of sub-millisecond IRQ response, and will overrun if that's exceeded.

the current storage_driver.c changes offered in these commits are of immediate benefit to a broad range of developers.

I'm not sure who immediately benefits. We're the only team I'm aware of that has hit this problem, because some images exceeded 512K, and we're not intending to activate this patch, because we don't want to take the IRQ latency hit. From our point of view the response to the original issue is "can't fix - hardware limitation". We'll ensure CFStore is only activated on <512K images until we can find an alternative storage location.

marcuschangarm · 2016-07-13T15:27:22Z

I don't think "low" is a fair characterisation.

I think low is a fair characterization. You can't expect interrupt latency in the hundreds of microsecond range if you are on a busy system handling interrupts from multiple sources. The K64F serial port buffers are sized on the assumption that you use DMA on high-latency systems.

I'm not sure who immediately benefits.

My team need this for our client.

rgrover · 2016-07-14T07:20:52Z

@kjbracey-arm If you don't accept the consequences of this change, you can choose to disable your use of configuration-store against K64F-flash--it can still run out of RAM when needed.

We can and will educate users regarding the consequences. There are no alternatives as long as we're using the K64F. I propose that we proceed with this so that others may benefit, and for the sake of bringing this issue to a resolution.

We can take up the discussion about switching to other platforms elsewhere.

kjbracey · 2016-07-14T08:28:02Z

I'm not going to stop you proceeding with this. You're free to merge.

I'm only declaring Gerrit's "-1 - I'd rather you didn't", not a "-2 - you shan't". The patch itself is sound - no remaining objections to the implementation as it stands. (Although checks for image/store overlap should still be added later as an enhancement).

But there are other options for K64F:

keep program image size below 512K
use external flash (eg SD card) if you're using the whole 1MB of flash for program
separate erase from write - do an erase only during a reboot, then only ever write while running.

Those three are equivalent to the solutions being used by our other non-K64F deployed self-updating platforms, where we have experienced the issue of interrupt latency on flash erase and know it breaks our time-critical frequency-hopping networks.

Having this option for K64F will make it less likely that people explore those better options, no matter how many caveats we attach, because this is the "cheap and dirty" and transparent-seeming one. We're going to be asked to try to make it work - that's why I'm against the option existing, rather than just being totally neutral on it.

Once this goes in, I just have to start -1/-2 ing applications that try to rely on it, rather than improving their system configuration. :)

0xc0170 · 2016-07-18T09:03:11Z

@sg- Please review

rgrover mentioned this pull request Jul 12, 2016

avoid Read-While-Write errors #2117

Closed

c1728p9 reviewed Jul 12, 2016
View reviewed changes

kjbracey reviewed Jul 13, 2016
View reviewed changes

rgrover mentioned this pull request Jul 22, 2016

Improvements to k64 internal-flash storage driver #2219

Merged

rgrover closed this Jul 26, 2016

fix for the RWW errors while using Storage driver for internal flash on K64F #2150

fix for the RWW errors while using Storage driver for internal flash on K64F #2150

Uh oh!

Conversation

rgrover commented Jul 12, 2016

Uh oh!

bogdanm commented Jul 12, 2016

Uh oh!

rgrover commented Jul 12, 2016

Uh oh!

mbed-bot commented Jul 12, 2016

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rgrover commented Jul 13, 2016

Uh oh!

mbed-bot commented Jul 13, 2016

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rgrover Jul 13, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kjbracey commented Jul 13, 2016

Uh oh!

rgrover commented Jul 13, 2016

Uh oh!

rgrover commented Jul 13, 2016

Uh oh!

rgrover commented Jul 13, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kjbracey commented Jul 13, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mbed-bot commented Jul 13, 2016

Uh oh!

marcuschangarm commented Jul 13, 2016

Uh oh!

rgrover commented Jul 13, 2016

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kjbracey commented Jul 13, 2016

Uh oh!

rgrover commented Jul 13, 2016

Uh oh!

marcuschangarm commented Jul 13, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mbed-bot commented Jul 13, 2016

Uh oh!

kjbracey commented Jul 13, 2016

Uh oh!

marcuschangarm commented Jul 13, 2016

Uh oh!

kjbracey commented Jul 13, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mbed-bot commented Jul 13, 2016

Uh oh!

marcuschangarm commented Jul 13, 2016

Uh oh!

simonqhughes commented Jul 13, 2016

Uh oh!

marcuschangarm commented Jul 13, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

rgrover Jul 13, 2016 •

edited

Loading

rgrover commented Jul 13, 2016 •

edited

Loading

kjbracey commented Jul 13, 2016 •

edited

Loading

marcuschangarm commented Jul 13, 2016 •

edited

Loading

kjbracey commented Jul 13, 2016 •

edited

Loading

marcuschangarm commented Jul 13, 2016 •

edited

Loading

marcuschangarm commented Jul 13, 2016 •

edited

Loading

rgrover commented Jul 14, 2016 •

edited

Loading