feat(riscv+esp32c3): demonstrate gpio_interrupt example entirely in SRAM #534
Previously, the trap vector itself and its immediate callees (`_start_trap` and `_start_trap_rust_hal`) were located in the mapped instruction flash range `0x420..`, increasing cache pressure and adding variable latency to the very beginning of the interrupt/exception service flow. This change places those routines in IRAM directly:

```
  Num:    Value  Size Type    Bind   Vis      Ndx Name
48177: 40380280  2428 FUNC    GLOBAL DEFAULT    6 _start_trap_rust_hal
48197: 40380bfc    54 FUNC    GLOBAL DEFAULT    6 _start_trap_rust
48265: 40380200     0 FUNC    GLOBAL DEFAULT    6 _vector_table
48349: 40380100     0 NOTYPE  GLOBAL DEFAULT    6 default_start_trap
48350: 40380100     0 NOTYPE  GLOBAL DEFAULT    6 _start_trap
```

As seen via `readelf -W -s -C ./target/riscv32imc-unknown-none-elf/debug/examples/gpio_interrupt | grep -E _start_trap\|_vector\|Ndx`.
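For illustration, a minimal linker-script sketch of the kind of placement involved. The output-section and memory-region names (`.trap`, `RWTEXT`) are assumptions based on esp-hal's conventions; the scripts in the actual commit are the authoritative version:

```ld
/* Sketch: collect the trap machinery into a RAM-backed output section.
 * `RWTEXT` stands in for whichever SRAM-resident, executable region the
 * chip's memory.x defines. */
SECTIONS
{
  .trap : ALIGN(4)
  {
    KEEP(*(.trap));
    *(.trap.*);
  } > RWTEXT
}
```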
Previously, the `INTERRUPT_TO_PRIORITY` constant would compile down to a runtime lookup table called `anon.HASH.2.llvm.NUMBERS`, like so:

```asm
status & configured_interrupts[INTERRUPT_TO_PRIORITY[cpu_intr as usize - 1]];
...
40380fbc: 3c0095b7   lui    a1,0x3c009
40380fc0: 2d858593   addi   a1,a1,728 # 3c0092d8 <anon.73967c570b981c3046610ed0622f71c3.2.llvm.13737670072012269642>
40380fc4: 952e       add    a0,a0,a1
40380fc6: 4108       lw     a0,0(a0)
```

This means that in order to load the `a1`th slot from the LUT, we'll have to go out to flash to do it. One alternative is an `#[inline]`'d function, which is nice in the 'c3's case because it's the identity function and disappears at runtime. Alternatively, instead of using a `const`, `INTERRUPT_TO_PRIORITY` could be made a `static`, which would allow us to place it into the `.data` section via a linker directive.
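As a sketch of those two alternatives (the names, sizes, and values below are illustrative, not the actual HAL items):

```rust
// Option 1: an #[inline]'d function; on the 'c3 the mapping is the
// identity, so this disappears entirely at runtime.
#[inline(always)]
fn interrupt_to_priority(cpu_intr: usize) -> usize {
    cpu_intr
}

// Option 2: keep the table, but as a `static` whose section a linker
// directive can steer into SRAM (the section name here is hypothetical).
#[link_section = ".data.interrupt_to_priority"]
static INTERRUPT_TO_PRIORITY: [u8; 31] = [0; 31]; // placeholder values
```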
Prior, the code for `intrinsics::cttz` (e.g. `u128::trailing_zeros()`) was being inlined into the call site, but that code was referencing a lookup table from the `.rodata` section (placed in flash):

```
40380xxx: 3c0066b7   lui    a3,0x3c006
40380xxx: ed168293   addi   t0,a3,-303 # 3c005ed1 <str.1+0x81>
40380xxx: 03810813   addi   a6,sp,56
...
40380xxx: 41e00633   neg    a2,t5
40380xxx: 00cf7633   and    a2,t5,a2
40380xxx: 03160633   mul    a2,a2,a7
40380xxx: 826d       srli   a2,a2,0x1b
40380xxx: 9616       add    a2,a2,t0
40380xxx: 00064603   lbu    a2,0(a2)
```

Caution: the `str.1` symbol name there is a red herring. `str.1` has a length of just 0x21 bytes; there's seemingly no symbolic name associated with the jump table, and `str.1` happens to be the closest preceding label.

I can't claim credit for the `ctz32` algorithm here: all I've done is copy it from Wikipedia, translate the operations to the Rust equivalents, and then apply it four times to extend it to a `u128`. I have tested it both for `0b0..010..0` (single bit set) and `0xff...100` (all non-trailing bits set) for every bit position, which did catch at least two mistakes in the translation.
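For reference, here is a table-free `ctz32` in the spirit of the commit; this is the branch-free variant from Wikipedia's "Find first set" material translated to Rust, and not necessarily the exact variant the commit uses:

```rust
/// Count trailing zeros without touching a lookup table.
fn ctz32(v: u32) -> u32 {
    let v = v & v.wrapping_neg(); // isolate the lowest set bit
    let mut c: u32 = 32;
    if v != 0 {
        c -= 1;
    }
    if v & 0x0000_ffff != 0 {
        c -= 16;
    }
    if v & 0x00ff_00ff != 0 {
        c -= 8;
    }
    if v & 0x0f0f_0f0f != 0 {
        c -= 4;
    }
    if v & 0x3333_3333 != 0 {
        c -= 2;
    }
    if v & 0x5555_5555 != 0 {
        c -= 1;
    }
    c
}
```

Extending this to `u128` amounts to applying it to each 32-bit limb from the bottom up, adding 32 for every all-zero low limb.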
Previously, as RODATA this was getting placed in flash, so this lookup via `interrupt::riscv::vectored::handle_interrupt`:

```rust
let handler = peripherals::__EXTERNAL_INTERRUPTS[interrupt as usize]._handler;
```

was compiling to:

```
40380f1e: 3c006537   lui    a0,0x3c006
40380f22: 63c50c93   addi   s9,a0,1596 # 3c00663c <__EXTERNAL_INTERRUPTS>
...
40381112: 9566       add    a0,a0,s9
40381114: 410c       lw     a1,0(a0)
```

which dereferences a pointer into the flash segment `0x3c0...`. After this change to apply the linker directive from the corresponding [commit to the pac crate][pac-commit], the object now lives in RAM:

```
  Num:    Value  Size Type    Bind   Vis      Ndx Name
48188: 3fc80608   248 OBJECT  GLOBAL DEFAULT    9 __EXTERNAL_INTERRUPTS
```

As seen by `readelf -W -s -C ./esp32c3-hal/target/riscv32imc-unknown-none-elf/debug/examples/gpio_interrupt | grep -E __EXT\|Ndx`.

[pac-commit]: rustbox/esp-pacs@c8e7cbb
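The pac-side directive looks roughly like the sketch below. The `Vector` union follows svd2rust's usual shape, but the section name and array length here are assumptions, not the literal esp-pacs output:

```rust
#[derive(Clone, Copy)]
pub union Vector {
    pub _handler: unsafe extern "C" fn(),
    pub _reserved: usize,
}

// The added attribute steers the table out of `.rodata` (flash) and
// into a RAM-backed section that the HAL linker scripts can place.
#[link_section = ".trap.rodata"]
#[no_mangle]
pub static __EXTERNAL_INTERRUPTS: [Vector; 62] =
    [Vector { _reserved: 0 }; 62]; // the real entries point at handlers
```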
This change gets a fair bit less clear: removing `#[ram]` from `handle_interrupt` does indeed change the resulting binary (because it implicitly removes `#[inline(never)]`), but `get_core` is so small a function that LLVM will always inline it at every optimization level I've tried, whether it's marked `#[inline]` or not. So rather than a specific solution to an actual problem, this patch is more of an example and a warning for what's to come: lacking a precise mechanism to trace and identify properties of all callees along a particular path, everything from here on out is highly dependent on relatively distant effects. In attempting to assert whether "the trap handling flow is entirely located in SRAM," we necessarily have to qualify the answer with "given these optimization settings" (and, soon, much worse). If correctness here is "not touching flash," then, in the usual language of Rust, we're about to leave attempts at demonstrating soundness behind. Instead, we now shift our focus to showing safety under a particular set of (somewhat fragile) circumstances.
As with the prior change, the problem we're solving here is shrinking the state space: there are many branch points within the optimization machinery, and these settings align with what I've been looking at most thoroughly. NB: I have no idea what effect, if any, the Cargo.toml settings have on downstream crates. Forcing frame pointers in `.cargo/config.toml` definitely has no effect.
Removes some flash accesses that occur as the result of an optimization when mapping a sequence of integers to an action, such as calling handler `interrupt7` for interrupt code 7. Previously, this would compile to something similar to:

```asm
40380538: 34202573   csrr   a0,mcause
4038053c: 0506       slli   a0,a0,0x1
4038053e: 8105       srli   a0,a0,0x1
          match code {
40380540: 157d       addi   a0,a0,-1
40380542: 45f9       li     a1,30
40380544: 2ca5e163   bltu   a1,a0,40380806 <.LBB2_7+0x2ae>
40380548: 050a       slli   a0,a0,0x2
4038054a: 3c0075b7   lui    a1,0x3c007
4038054e: 1dc58593   addi   a1,a1,476 # 3c0071dc <.LJTI2_0>
40380552: 952e       add    a0,a0,a1
40380554: 4108       lw     a0,0(a0)
40380556: 8502       jr     a0
```

followed by a bunch of entries corresponding to various `interruptN` calls. `.LJTI2_0` there is stored in flash, and so this requires some amount of flash access and/or induces cache pressure. Unfortunately, LLVM's ELF lowering is fairly determined to emit these jump tables as part of the [`.rodata` section][elf-lowering-jt-sec], which makes them difficult to relocate to SRAM on their own while preserving the optimization. So this instead disables the optimization globally (for all crates and all paths), but only for examples built directly from the HAL (downstream crates would need to set this setting themselves, as with force-frame-pointers).

[elf-lowering-jt-sec]: https://llvm.org/doxygen/TargetLoweringObjectFileImpl_8cpp_source.html#l00942
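Concretely, the knob is a rustflag; a sketch assuming rustc's unstable `-Zno-jump-tables` flag (nightly-only) is the setting in question, which downstream crates would likewise need to add to their own `.cargo/config.toml`:

```toml
# .cargo/config.toml
[target.riscv32imc-unknown-none-elf]
rustflags = [
    # Ask LLVM not to lower dense matches into .rodata jump tables.
    "-Z", "no-jump-tables",
]
```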
Addresses two classes of icache thrash present in the interrupt service path, e.g.:

```asm
          let mut prios = [0u128; 16];
40380d44: ec840513   addi   a0,s0,-312
40380d48: 10000613   li     a2,256
40380d4c: ec840b93   addi   s7,s0,-312
40380d50: 4581       li     a1,0
40380d52: 01c85097   auipc  ra,0x1c85
40380d56: 11e080e7   jalr   286(ra) # 42005e70 <memset>
```

and

```asm
          prios
40380f9c: dc840513   addi   a0,s0,-568
40380fa0: ec840593   addi   a1,s0,-312
40380fa4: 10000613   li     a2,256
40380fa8: dc840493   addi   s1,s0,-568
40380fac: 01c85097   auipc  ra,0x1c85
40380fb0: eae080e7   jalr   -338(ra) # 42005e5a <memcpy>
```

As an added bonus, performance of the whole program improves dramatically with these routines 1) reimplemented for the esp32 RISC-V µarch and 2) in SRAM: `rustc` is quite happy to emit lots of implicit calls to these functions, and the versions that ship with compiler-builtins are [highly tuned] for other platforms. It seems like the expectation is that the compiler-builtins versions are "reasonable defaults," and they are [weakly linked] specifically to allow the kind of domain-specific overrides applied here. In the context of the 'c3, the default ends up producing a fairly large implementation that adds frequent cache pressure for minimal wins:

```readelf
  Num:    Value  Size Type    Bind   Vis      Ndx Name
27071: 42005f72    22 FUNC    LOCAL  HIDDEN     3 memcpy
27072: 42005f88    22 FUNC    LOCAL  HIDDEN     3 memset
28853: 42005f9e   186 FUNC    LOCAL  HIDDEN     3 compiler_builtins::mem::memcpy
28854: 42006058   110 FUNC    LOCAL  HIDDEN     3 compiler_builtins::mem::memset
```

NB: these implementations are broken when targeting unaligned loads/stores across the instruction bus; at least in my testing this hasn't been a problem, because they are simply never invoked in that context. Additionally, these are just about the simplest possible implementations, with word-sized copies being the only concession made to runtime performance. Even a small amount of additional effort would probably yield fairly massive wins, as three- or four-instruction hot loops like these are basically pathological for the 'c3's pipeline implementation, which seems to predict all branches as "never taken." However, there is a real danger in overtraining on the microbenchmarks here, too, as I would expect almost no one has a program whose runtime is dominated by these functions. Making these functions larger and more complex to eke out wins from architectural niches makes LLVM much less willing to inline them, costing additional function calls and preventing e.g. dead code elimination for always-aligned addresses, automatic loop unrolling, etc.

[highly tuned]: rust-lang/compiler-builtins#405
[weakly linked]: rust-lang/compiler-builtins#339 (comment)
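For a sense of scale, the whole override can be as small as the sketch below: word-sized copies only, non-overlapping `dest`/`src` assumed (as `memcpy` requires), and placed in SRAM via esp-hal's `#[ram]` attribute. This is an illustration of the approach, not the exact code in this change, and the import path for `#[ram]` is an assumption:

```rust
use esp_hal_common::ram; // assumed import path for esp-hal's #[ram] macro

// Word-at-a-time memcpy, overriding the weakly-linked compiler-builtins
// symbol so that the linker picks this SRAM-resident version instead.
#[ram]
#[no_mangle]
unsafe extern "C" fn memcpy(dest: *mut u8, src: *const u8, n: usize) -> *mut u8 {
    let mut i = 0;
    // The only concession to performance: copy four bytes at a time
    // when both pointers are word-aligned.
    if (dest as usize | src as usize) & 3 == 0 {
        while i + 4 <= n {
            (dest.add(i) as *mut u32).write((src.add(i) as *const u32).read());
            i += 4;
        }
    }
    while i < n {
        dest.add(i).write(src.add(i).read());
        i += 1;
    }
    dest
}
```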
The culmination of this entire patch series: we now place the interrupt handling code itself into SRAM. The `#[ram]` attribute is for the `GPIO` function itself (and `#[inline(never)]` works fine here because LLVM can't "see" through the ~indirect reference to this function), but that leaves the matter of the function body itself. Unfortunately, this is necessarily work specific to the details of each handler's implementation. Elsewhere, I got started down this road because we needed SPI to start up within a certain time bound, and so we needed to add attributes to much of the HAL's SPI implementation. And, in addition to `memset` and `memcpy`, we had to override `memmove` as well, because something about the data structures used for DMA made `rustc` use that in a few places.

As far as this particular example goes, touching `core::fmt` pulls in a lot of code from fairly far outside our known methods of control:

```
          write(&mut self, args)
40380f00: 3c006537   lui    a0,0x3c006
40380f04: 7a850513   addi   a0,a0,1960 # 3c0067a8 <.Lanon.6ccc2de621f5491d2d69154e15b03ba0.16>
40380f08: fca42c23   sw     a0,-40(s0)
40380f0c: 4505       li     a0,1
40380f0e: fca42e23   sw     a0,-36(s0)
40380f12: 3c006537   lui    a0,0x3c006
40380f16: 6a450513   addi   a0,a0,1700 # 3c0066a4 <anon.f0eba740dd2cb5e3ab050cbaac366ec5.84.llvm.5894916478550538780>
40380f1a: fea42023   sw     a0,-32(s0)
40380f1e: fe042223   sw     zero,-28(s0)
40380f22: fe042423   sw     zero,-24(s0)
40380f26: 3c006537   lui    a0,0x3c006
40380f2a: 5fc50593   addi   a1,a0,1532 # 3c0065fc <anon.7f39784d2cd27331f7ef17a88d595274.0.llvm.10122694886931673864>
40380f2e: fd440513   addi   a0,s0,-44
40380f32: fd840613   addi   a2,s0,-40
40380f36: 01c82097   auipc  ra,0x1c82
40380f3a: 92a080e7   jalr   -1750(ra) # 42002860 <core::fmt::write>
```

So, this change circumvents the convenience machinery to call `write_str` directly, thereby avoiding touching `core::fmt::write` and its anonymous helper data. Curiously, with the combination of optimization settings in use, doing so actually erases the `"GPIO Interrupt\n"` string entirely in favor of an inlined, unrolled loop emitting each byte in succession:

```asm
40380efe: 40000537   lui    a0,0x40000
40380f02: 06850493   addi   s1,a0,104 # 40000068 <ets_delay_us+0x18>
40380f06: 04700513   li     a0,71
40380f0a: 9482       jalr   s1
40380f0c: 05000513   li     a0,80
40380f10: 9482       jalr   s1
...
```

(For those, like me, who have not memorized the decimal→ASCII lookup table: 71 is 'G' and 80 is 'P'.)

NB: LTO is _required_ at this stage: turning it off produces calls to things like `_critical_section_1_0_{acquire,release}` and some `Mutex` interactions. The latter might be fixable, but I didn't look into it because the former is seemingly not: the `extern "Rust"` seems to form a boundary that currently can only be pierced by applying LTO.
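The circumvention itself is small; a sketch of the shape, written against any `core::fmt::Write` implementor rather than the example's concrete writer type (which isn't named above):

```rust
use core::fmt::Write;

// Sketch: emit the message via `write_str` on a concrete `Write`
// implementor, instead of `writeln!`, which would route through
// `core::fmt::write` and its flash-resident anonymous helper data.
fn report<W: Write>(out: &mut W) {
    let _ = out.write_str("GPIO Interrupt\n");
}
```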
relates to #43
@sethp thanks for looking into this! This is really useful experimentation. I won't address the whole discussion at once as there is a lot to digest.

Commits 1 & 3 seem like no-brainers, and I don't think they will have any issues being merged. Commit 2 looks okay, but the commit seems to have an undocumented change in that it is no longer subtracting one from the interrupt number.

@onsdagens has also been looking into improving interrupt performance for the C3 for the RTIC port. We found that we're not actually using CPU vectoring, which could eliminate this lookup & function call completely; have you looked into that at all?
Sure! We can start via Matrix if that works for you; feel free to DM me or @bjoernQ. If we feel the need to jump on a call later, we certainly can.
Ohhh yes, this is a very excellent point: a much better strategy than turning off jump tables is to use the existing hardware jump table (at least for the HAL entry bits; IIRC we still had more jump tables to deal with inside the ISR's callees). I have a prototype of this somewhere that I'll try to dust off this week.
Oh, good thought: I'll look into this too. It's a little trickier to override since it's not in
I wonder if it would be possible to place intrinsics code (if not inlined) and data into SRAM via some black magic in linker scripts?

Also, as a side note: probably all this should be behind a feature flag (maybe a default feature). The thing is that e.g. in the matter-rs experiment we are already hitting RAM limits (increasing some network buffers by just a few bytes makes it stop working currently). For sure there is a lot of potential for optimization there, but for things like that, interrupt latency isn't much of a concern while RAM usage is.
I'm not sure about the code, but it's inlined here anyway: the trouble is the lookup table that code uses. That table is seemingly a totally unnamed symbol, and there doesn't appear to be any linker invocation of the shape "referents from this named root" (which would be handy for making basically everything in this PR unnecessary!). The closest I could see from linker-land was "take everything in the `.rodata` section of the libcore object and put it in this output section," which I didn't really consider because I just assumed that would use too much RAM to be useful.

Speaking of, your addendum is well taken: I agree that putting these allocations behind some kind of conditional would make sense (especially as these chips seem quite capable of executing interrupt handlers entirely from flash). I'm a little hesitant to use a cargo feature for it since the granularity there is fairly coarse. I wonder if we could defer the question? It looks like the amount allocated here isn't very much in a relative sense (about 1% each of DRAM and IRAM):
vs the baseline from the hal's current main:
(these were generated with the very rough, very 'c3-specific tool I've been using for analyzing allocations; I threw it up here if you want to check it out: https://github.com/rustbox/supreme-train/blob/main/src/bin/mem.rs)
Happy Friday! I've got a brief update for you on two items here:

### Hardware Vectoring

A draft of the hardware vectoring PoC is up, which at this point is almost entirely in esp-riscv-rt: esp-rs/esp-riscv-rt#11. It works, but I was hoping to spend a little more time carrying the thought through to the esp-hal crate before merging, since there's a whole release thing involved. I'm fairly excited about how that's working out, though: I think it'll offer a nice way to sidestep some of the thornier issues in this PR, at least for the moment, if we do carry the vectoring all the way "through" instead of fanning back in.

### svd2rust

Also, I'm feeling particularly clever about moving a little away from svd2rust: I got the ball rolling on integrating the linker directive for `__EXTERNAL_INTERRUPTS`.

I'm also on Matrix now.
Jumping back into this thread after the initial merges + some travelling on my end, to address the other parts you brought up :).

I am on board with adding optimized memcpy routines (b511568); I think we'd see a net benefit in performance across all applications for little cost. Perhaps one experiment you may wish to try would be to use the built-in ROM memcpy routines: the CPU should have one-cycle access to ROM code (though I've asked internally for some clarification on this). I'm also not sure if ROM code is executed in place or is pulled into the cache. If it avoids hitting the cache whilst having the same execution performance as SRAM, that could be a win.
This pains me too, but sadly it seems we need language-level help to overcome this. I've seen a couple of folks poke around the rustc formatting machinery, and subsequently burn out pretty fast. I think the only person with recent success there was @m-ou-se, though I forget the exact PR they submitted.

You are very well versed in Rust and all its intricacies, so I presume you have tried this, but how does
I've confirmed that ROM code is executed in place and won't touch the cache.
@sethp any thoughts on my comments above?
Closing this due to inactivity. Please feel free to re-open this PR, or to open a new one, if you wish to return to this work in the future.
As mentioned over on #448, the current RISC-V HAL interrupt flow isn't memory-resident, with large portions of it living in flash.
This change set is not intended to be merged as-is, or probably at all: it's more of a "spike" or "steel thread" demonstrating the areas of work I've been able to identify in order to meet the goal of "[an] interrupt request is serviced without touching flash memory." My friend & I needed to do this because we have a hard-real-time interrupt that fires at ~480Hz (60Hz * 8), which made for some 🌶️ contention at the cache layer.
As usual with these things, a lot of the wins can be had for relatively little effort. I've tried to structure these changes in increasing order from least implementation/maintenance burden to most, with each commit message giving a detailed explanation of what problem I believed I was solving at each step. Happily, they're also largely in sync with the size of the wins (with the exception of the `mem*` friends, discussed later).

What I'm hoping for from this discussion is mostly just to share my findings, and to identify where we might already have consensus on what the next steps ought to be (if any). In an effort to apply Cunningham's Law, my current beliefs about the disposition and direction of these changes are below.
### ready for (early) review

- fa671bb is something I intend to replicate in all the other RISC-V linker scripts, at which point that change will be ready to be merged into the mainline HAL: feat(riscv): relocate `.trap` machinery to RAM #541
- Converting the `const` slices to inline-able functions. I don't yet have any 'c6s to try out the PLIC side of things, though I did order some for that purpose.
- The `__EXTERNAL_INTERRUPTS` relocation depends on the `esp-pacs` generator relocating that object to either `.trap.rodata` (my preference) or `.data`. If it's `.trap.rodata`, that means making a change at this point to the linker scripts that live here in `esp-hal` to emit objects from that section; if it's `.data`, the existing scripts already handle it and it's definitely not breaking.

### here be dragons
- The remaining changes belong, at most, in an `esp32c3/src/examples` directory. They do demonstrate just how tricky it is to 1) predict and 2) validate whether a particular call graph is memory resident or not. Initially, I believed inspecting the Rust functions and throwing a few `#[inline(always)]`s around would be enough to ensure that property under a wide range of conditions, but what I found was... not that.
- Replacing the `memcpy` implementation is significantly more difficult than it first appears. Regrettably, it's just so frickin' good: `memcpy` (and to a lesser degree, `memset`, `memmove`, and `memcmp`) show up a lot. Even at the highest optimization levels, there are 9 `memcpy` call sites in the `gpio_interrupt` example alone. In my downstream project, we've got ~50 `memcpy` calls that span the gamut from "one time, at startup" to "hottest of the hot path." Implementing that family of functions ourselves was a tremendous performance win across the board, and is well worth the high price of maintenance, at least for us.
- The `esp-hal` side of the IRQ flow only uses memset/memcpy when dealing with the `[u128; 16]` array that represents all configured interrupts, so it seems plausible that finding a different way to represent that data structure, or fetching the priorities a few at a time instead of all at once, might dodge the use of those functions, which here would allow us to meet the "nothing in flash" requirement without taking ownership of them.
- `core`. The first and most immediate place that arises is, as here, in string formatting. However, it's not actually possible to work around in that way, because formatting an error message is a huge part of calls to `panic!`, most of which get emitted automatically for e.g. bounds checks on arrays, `unwrap` calls, arithmetic overflow (when enabled), etc. Mostly this is 🔥 fine 🔥 because `panic!`s are generally non-recoverable, so taking extra time to load a string from flash right before the CPU goes into a sleep-until-reset loop isn't really a big deal. That said, it feels to me like a potentially large area of future work, with an uncertain scope and level of risk. I'm very curious what y'all's thoughts are on both how important it is to get the panic handling bits fully in RAM and whether there are any strategies that come to mind for how to do it.

### caveats
- Not addressed here at all: `EspDefaultHandler`, `ExceptionHandler`, and `DefaultInterruptHandler`. Those'd be separate PRs against esp-riscv-rt and esp-backtrace (probably), but I haven't put much thought yet into helping out downstream crates that might provide their own implementations.

Phew, that was a whole lot! Let me know if there's a time y'all want to hop on a video call or if there's a semi-synchronous chat I should join to go through this together! I don't expect this to be our last opportunity to connect, though, so no pressure from my end either way.
cc @MabezDev
Footnotes
1. I can't for the life of me identify the object in the ELF, but from looking at the hex dump of `.rodata` it's tucked in between two fairly obvious strings that at least provide an upper bound on the size. ↩