[RFC] [multi-core] #[shared] and placement of code and data #211
Comments
@japaric As far as I can see, the RISC-V specs only say what to do if you have caches; they do not require them. However, real-world devices do have both I$ and D$ caches: K210, FU540 and even FE310.
The LPC55S6x User Manual mentions something called "FMC flash cache", but without any details.
@Disasm thanks for the info
I agree.
Is the I$ cache enabled by default in multi-core devices like the K210, or does it need some setup after a power-on reset? In any case, it seems that devices with caches would be OK with merging the input
(That chip is rather new and the manual is a bit lacking and has errors in some parts.) The ARMv8-M architecture does define caches and registers to perform cache operations, but it seems that only the Cortex-M35P cores have built-in caches; the LPC55S6x has 2 Cortex-M33 cores. Reading the CLIDR (Cache Level ID Register) on an actual device returned all zeros, which indicates that no (I or D) caches exist at any of the 7 possible levels. For the LPC54114 (heterogeneous, M4F + M0+), NXP recommends in one of their application notes that one of the cores runs all its code from RAM, as the device has only one Flash bank and no caches. I haven't found a similar recommendation / application note for the LPC55S6x.
Note that this RFC pertains only to items declared within the
Perhaps what you want is something like rust-embedded/cortex-m-rt#164 that can be used in libraries but that's not tied to a particular architecture? (And you can always use
@korken89 @TeXitoi any thoughts on this RFC? This RFC only involves the experimental multi-core API, which we are allowed to change in backwards-incompatible ways during the v0.5.x releases, so this may not be what we stabilize in the end, but I think the flexibility it allows will let us collect more data while the feature remains experimental (and it's required to get good performance on homogeneous, cache-less devices like the LPC55S69).
Overall I think this is a great addition! One question that I am quite sure is not an issue but that I have got stuck on, the
I think we can FCP (merge) this then. @korken89 the microamp framework does the heavy lifting. This blog post describes how it works but the TL;DR is that the
Even if I'm not aware of these kinds of devices, this proposal seems clean and flexible. OK for me.
Thanks for the clarification @japaric !
The FCP has passed; this RFC is now in the accepted state. Implementation is in PR #205.
205: rtfm-syntax refactor + heterogeneous multi-core support r=japaric a=japaric this PR implements RFCs #178, #198, #199, #200, #201, #203 (only the refactor part), #204, #207, #211 and #212. most cfail tests have been removed because the test suite of `rtfm-syntax` already tests what was being tested here. The `rtfm-syntax` crate also has tests for the analysis pass which we didn't have here -- that test suite contains a regression test for #183. the remaining cfail tests have been upgraded into UI test so we can more thoroughly check / test the error message presented to the end user. the cpass tests have been converted into plain examples EDIT: I forgot, there are some examples of the multi-core API for the LPC541xx in [this repository](https://github.com/japaric/lpcxpresso54114) people that would like to try out this API but have no hardware can try out the x86_64 [Linux port] which also has multi-core support. [Linux port]: https://github.com/japaric/linux-rtfm closes #178 #198 #199 #200 #201 #203 #204 #207 #211 #212 closes #163 cc #209 (documents how to deal with errors) Co-authored-by: Jorge Aparicio <jorge@japaric.io>
Done in PR #205
This RFC only affects the multi-core modes proposed in RFC #204.
Background
Multi-core Cortex-M devices usually have several memory regions, each one
connected to a different bus (AHB port in ARM Cortex-M terms), with the goal of
reducing memory contention. Different cores can access different memory regions
without contention and with predictable performance. Contention occurs when two
or more cores try to access the same memory region: one core is given priority
over the other, and the second core observes a delay in its memory accesses.
To keep performance predictable it is important that applications carefully
place their resources (code and data) in a way that minimizes contention. To
give an example of the importance of memory placement, consider the following
homogeneous multi-core RTFM application running on the LPC55S69, a device with
two Cortex-M33 cores, one Flash bank and 5 RAM regions.
If all code is placed in Flash, then the response time of task `a0` (in this
case measured as the time it takes to go from interrupt entry to the breakpoint)
varies depending on whether the second core is doing work or sleeping, because
the second core also loads its instructions from Flash.
Without contention (when `cfg(contention)` evaluates to `false`) the response
time is 26 clock cycles; with contention the response time is 31 clock cycles,
roughly 20% slower. These numbers correspond to a configuration where the
first core is given higher access priority to the Flash.
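The 20% figure is the extra 5 cycles relative to the 26-cycle baseline. A minimal Rust sketch of that arithmetic (`slowdown_percent` is a hypothetical helper for illustration, not part of RTFM):

```rust
/// Relative slowdown, in percent, of a contended response time over an
/// uncontended baseline. Both arguments are in clock cycles.
/// (Hypothetical helper, for illustration only.)
fn slowdown_percent(baseline_cycles: u32, contended_cycles: u32) -> f64 {
    let base = f64::from(baseline_cycles);
    (f64::from(contended_cycles) - base) / base * 100.0
}
```

`slowdown_percent(26, 31)` is about 19.2, which the text rounds to 20%.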
Proposal
Specify how the framework places functions and data in memory, in a way that
lets end users minimize memory contention in their applications.
Detailed design
Placement of resources and functions
This RFC proposes that we specify the location of functions and static
variables as follows:
All functions and `static mut` variables that need to be shared between the
cores will be placed in shared memory; everything else will be placed in
memory local to the core that uses it.
In other words, the default is to place items in core-local memory.
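The default rule can be modeled as a small function. This is a hedged sketch rather than framework code: `Placement` and `default_placement` are hypothetical names, and it assumes the set of cores that use an item is already known (for example from the framework's analysis of the `#[rtfm::app]` module).

```rust
/// Where an item ends up in memory (hypothetical model of the RFC's rule).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Placement {
    /// Memory local to the core with this index
    Local(u8),
    /// Memory shared by all cores
    Shared,
}

/// Default rule: an item used by a single core goes to that core's local
/// memory; an item used by more than one core goes to shared memory.
/// `users` lists the index of every core that uses the item (duplicates are
/// allowed). An empty list is conservatively treated as shared in this sketch.
fn default_placement(users: &[u8]) -> Placement {
    match users.split_first() {
        Some((&first, rest)) if rest.iter().all(|&c| c == first) => Placement::Local(first),
        _ => Placement::Shared,
    }
}
```

For instance, an item used only by core #1 maps to `Placement::Local(1)`, while one used by cores #0 and #1 maps to `Placement::Shared`.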
Examples of items placed in local memory:
- `#[rtfm::app]` module tasks and task dispatchers (these jump into user code)
- `static mut` resources (all of them are core local)
- `static` resources that are not shared between cores
- `static mut` variables that appear at the beginning of the body of a task, `#[init]` or `#[idle]`
Examples of items placed in shared memory:
- `static` resources shared between cores
#[shared]
A special `#[shared]` attribute will be added to the syntax. This attribute can
only be applied to `static [mut]` variables within the `#[rtfm::app]` module.
It overrides the placement rule defined in the previous section: this attribute
forces the variable to be located in shared memory.
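Extending the same kind of hypothetical model, `#[shared]` acts as an override that is checked before the default rule. `placement` and its `has_shared_attr` flag are illustrative names, not RTFM API:

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Placement {
    Local(u8),
    Shared,
}

/// Placement of a `static [mut]` declared in the `#[rtfm::app]` module.
/// `users`: indices of the cores that access it; `has_shared_attr`: whether
/// the user annotated it with `#[shared]`. The attribute overrides the default.
fn placement(users: &[u8], has_shared_attr: bool) -> Placement {
    if has_shared_attr {
        return Placement::Shared;
    }
    // Default rule: local to the single core that uses it, shared otherwise.
    match users.split_first() {
        Some((&first, rest)) if rest.iter().all(|&c| c == first) => Placement::Local(first),
        _ => Placement::Shared,
    }
}
```

The override only works in one direction: it can force a variable that would otherwise be core-local into shared memory, but never the reverse.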
The goal of this attribute is to reduce memory contention. Consider the following
contrived example:
Without the `#[shared]` attribute, the execution of task `a1` could result in
high memory contention. The reason is that its argument `y` would be located in
core #0 local memory, so any operation on `y` would cause contention on that
memory region, because that region is used by all tasks running on core #0 that
use stack-allocated variables, spill registers or access core-local static
variables.
Using the `#[shared]` attribute greatly reduces the memory contention that task
`a1` can cause, limiting it to those moments when tasks running on core #0
concurrently access shared memory.
This seemingly artificial scenario can easily arise when one uses a lock-free
memory pool or any other form of dynamic memory allocation. The backing storage
for these allocators should be placed in `#[shared]` memory if the allocation is
likely to cross the core boundary.
Implementation
The realization of the design varies depending on the multi-core mode being
used.
Homogeneous
In homogeneous multi-core mode all items are placed in shared memory by
default. This applies to all items declared outside the `#[rtfm::app]` module,
including external crates (dependencies). To override this default the framework
will make use of these custom core-local linker sections:
- `.text_{i}`
- `.uninit_{i}`
- `.bss_{i}`
- `.data_{i}`
Where `{i}` indicates in which core's local memory the section should be
placed. `.text_{i}` holds core-local functions, `.uninit_{i}` is used for
buffers (e.g. those holding message payloads), `.bss_{i}` for queues (free
queue, ready queue, timer queue) and `.data_{i}` for all core-local resources,
as there's no 100% reliable way to tell from the AST whether a constructor
evaluates to all zeros or not.
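The naming scheme is mechanical, so it can be sketched as a function from the kind of item and the core index to a section name. This is an assumption-laden model: `LocalKind` and `local_section` are hypothetical. Rust's `#[link_section = "..."]` attribute is the usual way to place a `static` or function in a named section, but the exact code the framework emits is not shown in this RFC.

```rust
/// Kinds of core-local items the framework emits in homogeneous mode
/// (hypothetical classification following the text).
#[derive(Clone, Copy)]
enum LocalKind {
    /// Tasks, task dispatchers and other core-local functions
    Code,
    /// Message payload buffers (no initialization needed)
    Buffer,
    /// Free, ready and timer queues (zero-initialized)
    Queue,
    /// Core-local resources (initializer may not be all zeros)
    Resource,
}

/// Custom section name for `kind` on core `core_id`, following the
/// `.text_{i}` / `.uninit_{i}` / `.bss_{i}` / `.data_{i}` scheme.
fn local_section(kind: LocalKind, core_id: u8) -> String {
    let base = match kind {
        LocalKind::Code => "text",
        LocalKind::Buffer => "uninit",
        LocalKind::Queue => "bss",
        LocalKind::Resource => "data",
    };
    format!(".{}_{}", base, core_id)
}
```

For example, the ready queue of core #0 would go to `.bss_0` and a resource local to core #1 to `.data_1`.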
With these linker sections, authors of linker scripts can control the placement
of functions and data. Using the LPC55S69 as an example, one could write the
following linker script to place all shared variables in region SRAM2 and
dedicate regions SRAM0 and SRAM1 to cores #0 and #1 respectively. To work
around the lack of additional Flash banks, the code executed by core #1 is
placed in the SRAM1 region.
#[shared]
The effect of the `#[shared]` attribute on code generation is that no custom
linker section is used for the annotated variable, so it stays in the default
(shared) sections.
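As a hedged stand-in for the LPC55S69 linker script, the layout described above can be written down as a lookup from input section to memory region. The mapping is reconstructed from the prose (shared data in SRAM2, SRAM0/SRAM1 dedicated to cores #0/#1, core #1's code in SRAM1, shared and core #0 code in Flash). `Region` and `region_for` are hypothetical names. Because a `#[shared]` variable carries no custom section, it lands in the default `.bss`/`.data` and therefore in SRAM2.

```rust
/// Memory regions of the LPC55S69 used by this layout.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Region {
    Flash,
    Sram0,
    Sram1,
    Sram2,
}

/// Region an input section is assigned to under the example layout.
/// Returns `None` for sections this sketch does not cover.
fn region_for(section: &str) -> Option<Region> {
    match section {
        // Shared code (default `.text`) and core #0 code run from Flash
        ".text" | ".text_0" => Some(Region::Flash),
        // Core #1 code is copied to SRAM1 to avoid the single Flash bank
        ".text_1" => Some(Region::Sram1),
        // Core-local data of core #0 and core #1
        ".uninit_0" | ".bss_0" | ".data_0" => Some(Region::Sram0),
        ".uninit_1" | ".bss_1" | ".data_1" => Some(Region::Sram1),
        // Default (shared) data, including `#[shared]` variables
        ".bss" | ".data" => Some(Region::Sram2),
        _ => None,
    }
}
```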
Heterogeneous
Heterogeneous multi-core mode is implemented on top of μAMP, and μAMP's default
is the opposite of the homogeneous multi-core mode's: all items, including
dependencies, are core-local, and to place something in shared memory one needs
to opt in using the `#[microamp::shared]` attribute.
No custom linker sections are used in heterogeneous multi-core mode.
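The two modes have opposite defaults, so a `#[shared]` resource is handled in opposite ways. The hypothetical codegen helper below (`Mode` and `placement_attr` are invented for illustration; the attribute strings follow the text, but the framework's real generated code may differ) returns the placement attribute, if any, that would accompany a generated `static mut` resource.

```rust
#[derive(Clone, Copy)]
enum Mode {
    Homogeneous,
    Heterogeneous,
}

/// Placement attribute for a generated `static mut` resource used by core
/// `core_id`; `shared` is whether the user wrote `#[shared]` on it.
fn placement_attr(mode: Mode, core_id: u8, shared: bool) -> Option<String> {
    match (mode, shared) {
        // Homogeneous: shared is the default, so local resources need a
        // custom core-local section...
        (Mode::Homogeneous, false) => Some(format!("#[link_section = \".data_{}\"]", core_id)),
        // ...while `#[shared]` resources keep the default (shared) sections.
        (Mode::Homogeneous, true) => None,
        // Heterogeneous (μAMP): core-local is the default...
        (Mode::Heterogeneous, false) => None,
        // ...and sharing is an explicit opt-in.
        (Mode::Heterogeneous, true) => Some(String::from("#[microamp::shared]")),
    }
}
```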
#[shared]
The effect of the `#[shared]` attribute is to add the `#[microamp::shared]`
attribute to the generated `static mut` variable.
Drawbacks
This complicates the process of writing linker scripts for homogeneous devices,
as the author would need to consider how best to map the many linker sections.
If the author wants to keep things as simple as possible, they can merge all
core-local sections into the default shared memory sections; however, this
will result in high memory contention.
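Merging amounts to treating every `_{i}` input section as its default counterpart. A hedged Rust model of that mapping (`merged_section` is hypothetical; in an actual linker script the same effect would come from listing the core-local input sections inside the default output sections):

```rust
/// Output section for `input` when all core-local sections are merged into
/// their default counterparts, e.g. `.data_1` -> `.data`.
/// Sections that are not of the `{base}_{i}` form are returned unchanged.
fn merged_section(input: &str) -> &str {
    for base in [".text", ".uninit", ".bss", ".data"] {
        if let Some(suffix) = input.strip_prefix(base) {
            // Core-local sections look like `.data_1`: base + '_' + core index.
            if let Some(index) = suffix.strip_prefix('_') {
                if !index.is_empty() && index.bytes().all(|b| b.is_ascii_digit()) {
                    return base;
                }
            }
        }
    }
    input
}
```

Everything then ends up in the same shared regions, which is exactly why this simple setup brings back the contention the custom sections were meant to avoid.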
Final remarks
Even with the help of the framework it is easy to run into unintended memory
contention when using the homogeneous multi-core mode, because code sharing is
the default. Consider this example for the LPC55S69, using the linker script
from the "Homogeneous" section:
`foo` will be placed in Flash because that's the default for this mode. If `foo`
is not inlined into `i1`, then core #1 will run some code from Flash, causing
memory contention with core #0, which runs all its code from Flash.
There are some ways around this issue, like placing the shared code (`.text`) in
SRAM3 so that it at least never causes contention on Flash, or using instruction
caches for shared code, if the device has them. However, the best way to solve
the issue may be to use the heterogeneous multi-core mode, even if the device is
homogeneous, as this mode doesn't allow sharing of code, only of data.
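Contention can be reasoned about as the overlap between the memory regions each core touches. A minimal sketch under the LPC55S69 layout from the "Homogeneous" section (`Region` and `contended` are hypothetical names, and the region sets used below are assumptions about what each core accesses):

```rust
use std::collections::BTreeSet;

/// Memory regions of the LPC55S69 named in this RFC.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Region {
    Flash,
    Sram0,
    Sram1,
    Sram2,
    Sram3,
}

/// Regions accessed by both cores; each one is a potential source of contention.
/// `core0` / `core1` list the regions each core fetches code or data from.
fn contended(core0: &[Region], core1: &[Region]) -> BTreeSet<Region> {
    let a: BTreeSet<Region> = core0.iter().copied().collect();
    let b: BTreeSet<Region> = core1.iter().copied().collect();
    a.intersection(&b).copied().collect()
}
```

If core #0 uses Flash, SRAM0 and SRAM2 and core #1 uses SRAM1 and SRAM2 plus Flash (because `foo` was not inlined), the overlap is Flash and SRAM2. With `foo` inlined, only SRAM2, which holds the deliberately shared data, remains.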
cc @Disasm this may be of interest to you. Out of curiosity, are
(instruction) caches mandatory on SMP RISC-V devices? Do devices like the K210
usually have caches?
Not familiar with (instruction) caches on Cortex-M devices (I think that in the
ARMv7-M line only Cortex-M7 devices have them; dunno if ARMv8-M devices have
them), but it seems to me that with the linker script from the "Homogeneous"
section one could (read-only) cache the whole `.text` section, which contains
all shared code (and probably also the `.rodata` section), on the second core
(only) to prevent all contention on Flash memory, at least in the case of the
LPC55S69.